Interesting. What does it mean in ops/sec?
At 100 MHz, it works out at about 33M ops per second, as each pass is about 10 simple operations.
Also, why isn't it pipelined? That should increase performance.
So I turned on pipelining, using the required pragma statement:
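unsigned char mandelbrot(float cx, float cy) {
    int i = 0;   /* iterations completed before the point escapes */
    int j = 0;
    float x = cx, y = cy;
#pragma HLS PIPELINE
    for(j = 0; j < 256; j++) {
        /* Fixed trip count: once the point escapes, later passes change nothing. */
        if(x*x+y*y < 4.0) {
            double t = x;   /* double here makes the y update a double computation */
            x = x*x-y*y+cx;
            y = 2*t*y+cy;
            i++;
        }
    }
    return i;   /* note: a count of 256 wraps to 0 in an 8-bit unsigned char */
}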
I can now drop an item into the pipeline every cycle (for about 256,000M ops per second). However, it uses 9,443 DSP slices, 622,247 FFs, and 612,810 LUTs, which is a huge amount of resources.
My hand-written implementation does about 192,000M ops per second, using 680 DSP blocks, 158,261 FFs, and 99,568 LUTs (with 20% of the DSP blocks implemented in LUTs).
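As a rough check on these throughput figures, here is a small sketch of the arithmetic. It assumes the earlier estimate of about 10 simple operations per pass, 256 passes per pixel, a 100 MHz clock for the pipelined HLS design, and a 75 MHz pixel rate for the hand-written one. The helper name mops_per_sec is made up for illustration.

```c
#include <stdint.h>

/* Throughput in millions of ops per second.
   pixel_rate_mhz:    pixels accepted per microsecond (assumed figure)
   passes_per_pixel:  loop iterations per pixel (256 in the code above)
   ops_per_pass:      rough count of simple operations per iteration (~10) */
static uint64_t mops_per_sec(uint64_t pixel_rate_mhz,
                             uint64_t passes_per_pixel,
                             uint64_t ops_per_pass)
{
    return pixel_rate_mhz * passes_per_pixel * ops_per_pass;
}
```

Under those assumptions, mops_per_sec(100, 256, 10) gives 256,000 and mops_per_sec(75, 256, 10) gives 192,000, so the two designs are in the same throughput class and the real difference is in the resources used.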
The reasons for the improved efficiency of the hand-written design are quite varied, but mostly structural.
- I 'triple-pump' the pipeline, reducing the pipeline length by two-thirds. It runs at 225 MHz, generating pixels at 75 MHz, with each pixel passing through the entire pipeline three times.
- The HLS design uses 32-bit floating point on the external interfaces, but internally it appears to do the maths as doubles and then truncate the result. I use 36-bit fixed point (with a range of -8 to +8), which matches the problem perfectly.
- I've pipelined to match the requirements of the underlying hardware, allowing the design to meet timing at 225 MHz. The HLS design's pipelining appears to follow the scheduling of the C code's operations.
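To illustrate the fixed-point point above, here is a minimal software sketch of the same loop using fixed-point arithmetic. It is not the hand-written hardware design: it assumes a 32-bit value with 28 fractional bits (range -8 to just under +8) instead of 36 bits, so that every product fits in a standard 64-bit integer. The names fix_t, fix_from_double and mandelbrot_fixed are hypothetical, and the inputs are assumed to lie roughly within -2 to +2.

```c
#include <stdint.h>

#define FRAC_BITS 28                            /* assumed: 4 integer bits + 28 fractional */
#define FIX_ONE   ((int64_t)1 << FRAC_BITS)     /* the value 1.0 in fixed point */

typedef int32_t fix_t;                          /* hypothetical fixed-point type */

/* Convert a double in the range [-8, 8) to fixed point (truncating). */
static fix_t fix_from_double(double v)
{
    return (fix_t)(v * (double)FIX_ONE);
}

/* Same control flow as the floating-point version: 256 passes, and the
   state stops changing once x*x + y*y reaches 4. Returns the pass count
   as an unsigned int so that 256 is representable. */
static unsigned mandelbrot_fixed(fix_t cx, fix_t cy)
{
    fix_t x = cx, y = cy;
    unsigned i = 0;
    for (int j = 0; j < 256; j++) {
        /* Products carry 2*FRAC_BITS fractional bits and are kept in 64 bits. */
        int64_t xx = (int64_t)x * x;
        int64_t yy = (int64_t)y * y;
        int64_t xy = (int64_t)x * y;
        if (xx + yy < 4 * FIX_ONE * FIX_ONE) {
            /* Rescale back to FRAC_BITS; division avoids shifting negatives. */
            x = (fix_t)((xx - yy) / FIX_ONE + cx);
            y = (fix_t)((2 * xy) / FIX_ONE + cy);
            i++;
        }
    }
    return i;
}
```

While the escape test holds, x and y stay below 2 in magnitude, so the updated values stay below about 6 for inputs in the assumed range. That is why a -8 to +8 range is enough here, and in hardware each multiply becomes a plain integer multiply rather than a floating-point unit.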
However, I have spent a very long time on my design, a lot longer than the hour I spent playing with the HLS tool.