Author Topic: Vivado HLS in action.  (Read 5547 times)


Offline hamster_nz (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2829
  • Country: nz
Vivado HLS in action.
« on: December 12, 2015, 05:26:32 am »
Having installed a Vivado HLx WebPACK license, I thought I should at least try the HLS tool.

Given that I love fractals on FPGAs, I tried this:

Code: [Select]
unsigned char mandelbrot(double cx, double cy) {
  int i = 0;
  double x = cx, y = cy;
  while(i < 255 && x*x+y*y < 4.0) {
    double t = x;
    x = x*x-y*y+cx;
    y = 2*t*y+cy;
    i++;
  }
  return i;
}
I set the target clock to 100MHz.

It took 7 seconds to convert from C into both VHDL and Verilog.

Usage is:
0 memory blocks
25 DSP blocks
1273 Flipflops
1358 LUTs

Detailed stats were...
Latency: 30 to 7662 cycles
Iteration Latency: 30 cycles
Trip Count: 1-255 cycles
Pipelined: no

I changed the inputs to 'float's. Usage went to 22 DSPs, 1,611 FFs and 1,733 LUTs, total latency is 30 to 7407 cycles, and iteration latency went down slightly, to 29 cycles.
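
The worst-case latencies quoted above are consistent with a simple model: trip count times iteration latency, plus a fixed overhead. A rough C++ sketch of that arithmetic follows. The function name and the 12-cycle overhead are assumptions, inferred only from 7662 - 255*30 and 7407 - 255*29, not taken from the tool's documentation.

```cpp
#include <cassert>

// Worst-case latency of a non-pipelined loop: every iteration pays the full
// iteration latency, plus a fixed entry/exit overhead.
int worst_case_latency(int trip_count, int iteration_latency, int overhead) {
    assert(trip_count >= 0 && iteration_latency > 0 && overhead >= 0);
    return trip_count * iteration_latency + overhead;
}

// Assumption: the overhead is whatever is left after 255 full iterations.
// It comes out the same (12 cycles) for both the double and float reports.
const int kInferredOverhead = 12;
```

With these inputs, `worst_case_latency(255, 30, kInferredOverhead)` gives the 7662 cycles reported for the 'double' version, and 29 cycles per iteration gives the 7407 reported for 'float'.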

This is the top level interface for the 'float' version (the 'double' version just had wider 'cx' and 'cy'):
Code: [Select]
entity mandelbrot is
port (
    ap_clk : IN STD_LOGIC;
    ap_rst : IN STD_LOGIC;
    ap_start : IN STD_LOGIC;
    ap_done : OUT STD_LOGIC;
    ap_idle : OUT STD_LOGIC;
    ap_ready : OUT STD_LOGIC;
    cx : IN STD_LOGIC_VECTOR (31 downto 0);
    cy : IN STD_LOGIC_VECTOR (31 downto 0);
    ap_return : OUT STD_LOGIC_VECTOR (7 downto 0) );
end;

I didn't run place and route on the IP from within HLS. I guess it would have taken about 15 minutes, given that a design with 640 DSPs and 100k LUTs takes a couple of hours.

So I guess the question is this: is 300ns (30 cycles at 100MHz) for a double-precision complex MAC, implemented in hardware, any good?
« Last Edit: December 12, 2015, 05:30:06 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline autobot

  • Regular Contributor
  • *
  • Posts: 66
Re: Vivado HLS in action.
« Reply #1 on: December 16, 2015, 01:13:32 pm »
Interesting. What does it mean in ops/sec?

Also, why not pipelined? That should increase performance.

 

Offline hamster_nz (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2829
  • Country: nz
Re: Vivado HLS in action.
« Reply #2 on: December 16, 2015, 07:16:58 pm »
>> Interesting. What does it mean in ops/sec?
At 100MHz, it works out at about 33M ops per second: each pass takes 30 cycles (about 3.3M passes per second) and is about 10 simple operations.

>> Also, why not pipelined? That should increase performance.

So I turned on pipelining, using the required pragma statement:
Code: [Select]
unsigned char mandelbrot(float cx, float cy) {
  int i = 0;
  int j = 0;
  float x = cx, y = cy;
#pragma HLS PIPELINE
  for(j = 0; j < 256; j++) {
    if(x*x+y*y < 4.0) {
      double t = x;
      x = x*x-y*y+cx;
      y = 2*t*y+cy;
      i++;
    }
  }
  return i;
}

I can now drop an item into the pipeline every cycle (for about 256,000M ops per second). However, it uses 9,443 DSP slices, 622,247 FFs, and 612,810 LUTs, a huge amount of resources.

My hand-written implementation does about 192,000M ops per second, using 680 DSP blocks, 158,261 FFs, and 99,568 LUTs (with 20% of the DSP blocks implemented in LUTs).
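
The throughput figures in this thread can be reproduced with a little arithmetic. The sketch below is only a sanity check, with hypothetical function names. It assumes about 10 simple operations per pass (as stated earlier), 256 passes per pixel, and reads the pipelined HLS figure as 256,000M ops per second, i.e. one finished pixel per cycle at 100MHz.

```cpp
#include <cassert>

// Non-pipelined core: one pass of the loop body completes every
// `cycles_per_pass` clock cycles.
long long sequential_ops_per_sec(long long clock_hz, int cycles_per_pass,
                                 int ops_per_pass) {
    assert(clock_hz > 0 && cycles_per_pass > 0 && ops_per_pass > 0);
    return clock_hz * ops_per_pass / cycles_per_pass;
}

// Fully pipelined core: one finished pixel per pixel-clock tick, and each
// pixel has been through `passes_per_pixel` passes of the loop body.
long long pipelined_ops_per_sec(long long pixel_rate_hz, int passes_per_pixel,
                                int ops_per_pass) {
    assert(pixel_rate_hz > 0 && passes_per_pixel > 0 && ops_per_pass > 0);
    return pixel_rate_hz * passes_per_pixel * ops_per_pass;
}
```

For example, `sequential_ops_per_sec(100000000LL, 30, 10)` is roughly 33M, `pipelined_ops_per_sec(100000000LL, 256, 10)` is 256,000M, and a 75MHz pixel rate gives the 192,000M quoted for the hand-written design.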

The reasons for the improved efficiency of the hand-written design are quite varied, but mostly structural.

  • I 'triple-pump' the pipeline, reducing the pipeline length by two-thirds. It runs at 225MHz, generating pixels at 75MHz, with each pixel passing through the entire pipeline three times.
  • The HLS design uses 32-bit floating point on the external interfaces, but internally it appears to do the maths as doubles and then truncate the result. I use 36-bit fixed point (with a range of +8 to -8), which matches the problem perfectly.
  • I've pipelined to match the requirements of the underlying hardware, allowing the design to meet timing at 225MHz. The HLS design's pipelining looks to match the scheduling of the C code's operations.
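
On the doubles: one likely cause is ordinary C arithmetic conversion rather than the tool itself. In the pipelined code, 't' is declared 'double' and '4.0' is a double literal, so '2*t*y+cy' and the comparison are evaluated in double precision. Separately, with 'j < 256' a point that never escapes counts to 256, which wraps to 0 when returned as 'unsigned char'. Below is a minimal single-precision sketch with both points changed. The function names are hypothetical, and whether this changes the HLS resource figures is untested here.

```cpp
#include <cassert>

// Single-precision-only variant of the pipelined kernel from the thread.
unsigned char mandelbrot_f32(float cx, float cy) {
    // Under Vivado HLS the thread's "#pragma HLS PIPELINE" would go here;
    // it is left out so the sketch builds cleanly with an ordinary compiler.
    float x = cx, y = cy;
    int i = 0;
    for (int j = 0; j < 255; j++) {   // 255 passes, so the count fits in 8 bits
        float xx = x * x;
        float yy = y * y;
        if (xx + yy < 4.0f) {         // 4.0f, not 4.0: compare in float
            float t = x;              // float, not double: no promotion
            x = xx - yy + cx;
            y = 2.0f * t * y + cy;
            i++;
        }
    }
    return static_cast<unsigned char>(i);
}

// Host-side check, in the spirit of an HLS C simulation testbench.
void mandelbrot_f32_selfcheck() {
    assert(mandelbrot_f32(0.0f, 0.0f) == 255);  // never escapes
    assert(mandelbrot_f32(2.0f, 2.0f) == 0);    // already outside
}
```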

However, I have spent a very long time on my design, a lot longer than the hour I spent playing with the HLS tool.
« Last Edit: December 16, 2015, 10:59:18 pm by hamster_nz »
 

Offline autobot

  • Regular Contributor
  • *
  • Posts: 66
Re: Vivado HLS in action.
« Reply #3 on: December 16, 2015, 08:20:32 pm »
>> The HLS design uses 32-bit floating point on the external interfaces, but internally it appears to do the maths as doubles and then truncate it. I use a 36-bit fixed point (with a range of  +8 to -8) - perfectly matching the problem.

Vivado HLS has decent support for arbitrary precision, looks easy to use:

http://www.xilinx.com/support/documentation/application_notes/XAPP1173-carrier-loop.pdf  - pg 9.



 
« Last Edit: December 16, 2015, 08:23:32 pm by autobot »
 

Offline free_electron

  • Super Contributor
  • ***
  • Posts: 8549
  • Country: us
    • SiliconValleyGarage
Re: Vivado HLS in action.
« Reply #4 on: December 16, 2015, 09:01:09 pm »
Thought about restructuring the code?
Code: [Select]
unsigned char mandelbrot(float cx, float cy) {
  int i = 0;
  int j = 0;
  float x = cx, y = cy;
  float xx, yy;
#pragma HLS PIPELINE
  for(j = 0; j < 256; j++) {
    xx = x*x;
    yy = y*y;
    if((xx+yy) < 4.0) {
      double t = x;
      x = xx-yy+cx;
      y = 2*t*y+cy;
      i++;
    }
  }
  return i;
}
Professional Electron Wrangler.
Any comments, or points of view expressed, are my own and not endorsed , induced or compensated by my employer(s).
 

Offline hamster_nz (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2829
  • Country: nz
Re: Vivado HLS in action.
« Reply #5 on: December 17, 2015, 05:35:05 am »
>> Thought about restructuring the code?

Much to my surprise, it uses more resources!
Code: [Select]
      Original pipelined   Refactored pipelined
DSP                9,443                  9,443
LUT              622,247                767,457
FFs              612,810                667,491

If I get a chance tonight I'll play around with the fixed-point stuff and see how that works out (it should be much better...)
 

Offline hamster_nz (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2829
  • Country: nz
Re: Vivado HLS in action.
« Reply #6 on: December 18, 2015, 12:40:28 am »
I did make an attempt at using the fixed-point data types. Not sure if I did it correctly:

Code: [Select]
#include <ap_fixed.h>

typedef ap_fixed<35,4,AP_RND> fixed_4_31;

unsigned char mandelbrot(fixed_4_31 cx, fixed_4_31 cy) {
  int i = 0;
  int j = 0;
  fixed_4_31 x = cx, y = cy;
  fixed_4_31 xx, yy;
#pragma HLS PIPELINE
  for(j = 0; j < 256; j++) {
    xx = (float)x * (float)x;
    yy = (float)y * (float)y;
    if((xx+yy) < 4.0) {
      fixed_4_31 t = x;
      x = xx-yy+cx;
      y = 2*(float)t*(float)y+(float)cy;
      i++;
    }
  }
  return i;
}

Resources are now 3,576 DSP blocks, 587,441 FFs and 577,500 LUTs. This is pretty close to what it should be for fixed point: a single-cycle 35-bit multiplication requires 4 DSP blocks, so by that count three multiplications per pass over 256 passes would need 3,072.
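
For comparison, here is a plain C++ sketch of the same loop done entirely in integer fixed-point arithmetic, with no trips through 'float'. It is a host-side stand-in and does not use 'ap_fixed'. The names are hypothetical, and it assumes 28 fractional bits (not the 31 above) so that every product fits in a 64-bit integer. In the HLS version, 'ap_fixed' overloads the arithmetic operators, so the '(float)' casts are probably not needed there either, though that is untested here.

```cpp
#include <cassert>
#include <cstdint>

// Signed fixed point with 4 integer bits, held in a 64-bit integer.
// Assumption: 28 fractional bits, so raw products stay below 2^63.
constexpr int kFracBits = 28;
constexpr std::int64_t kOne = std::int64_t(1) << kFracBits;

// Convert a double in [-8, 8) to fixed point (truncating toward zero).
std::int64_t to_fixed(double v) {
    assert(v >= -8.0 && v < 8.0);
    return static_cast<std::int64_t>(v * static_cast<double>(kOne));
}

// The raw product carries 2*kFracBits fractional bits, so scale it back.
// Division (not a shift) keeps the result well defined for negatives.
std::int64_t fx_mul(std::int64_t a, std::int64_t b) {
    return (a * b) / kOne;
}

unsigned char mandelbrot_fixed(std::int64_t cx, std::int64_t cy) {
    std::int64_t x = cx, y = cy;
    int i = 0;
    for (int j = 0; j < 255; j++) {   // 255 passes, so the count fits in 8 bits
        // The squares live in 64-bit variables, so the escape test cannot
        // wrap the way a 4-integer-bit result could.
        std::int64_t xx = fx_mul(x, x);
        std::int64_t yy = fx_mul(y, y);
        if (xx + yy < 4 * kOne) {
            std::int64_t t = x;
            x = xx - yy + cx;
            y = 2 * fx_mul(t, y) + cy;
            i++;
        }
    }
    return static_cast<unsigned char>(i);
}
```

A call such as `mandelbrot_fixed(to_fixed(-1.0), to_fixed(0.0))` stays inside the set for all 255 passes, while `to_fixed(2.0), to_fixed(2.0)` is rejected on the first test.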

It starts becoming more and more divergent from standard C...
 

