Author Topic: FPGA to accelerate fractal calculations  (Read 2636 times)


Online NiHaoMikeTopic starter

  • Super Contributor
  • ***
  • Posts: 9014
  • Country: us
  • "Don't turn it on - Take it apart!"
    • Facebook Page
FPGA to accelerate fractal calculations
« on: September 18, 2022, 02:12:33 pm »
I recently became friends with someone who is really into fractal art. I used Fraqtive to make videos she really likes. The problem is that rendering takes a long time: many hours for a one-minute video.

Would it be even remotely practical to accelerate the fractal rendering process with a hobbyist FPGA like an Artix-7 200? Or would a faster CPU or GPU be the way to go? Keep in mind that energy efficiency is a big factor, and that's something FPGAs excel at for certain tasks.
Cryptocurrency has taught me to love math and at the same time be baffled by it.

Cryptocurrency lesson 0: Altcoins and Bitcoin are not the same thing.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14465
  • Country: fr
Re: FPGA to accelerate fractal calculations
« Reply #1 on: September 18, 2022, 05:58:36 pm »
I would probably go for GPU-assisted rendering here. But that's significant work involving knowing CUDA or similar.

As far as I've seen, Fraqtive does nothing like that. It only uses the CPU plus OpenGL for display; the GPU is not involved in computing the fractals themselves, so a faster GPU would do absolutely nothing here without rewriting the code entirely. At least that's what I've gathered from Fraqtive; I may have missed something.


 

Offline mon2

  • Frequent Contributor
  • **
  • Posts: 463
  • Country: ca
« Last Edit: September 18, 2022, 06:01:17 pm by mon2 »
 

Online NiHaoMikeTopic starter

  • Super Contributor
  • ***
  • Posts: 9014
  • Country: us
  • "Don't turn it on - Take it apart!"
    • Facebook Page
Re: FPGA to accelerate fractal calculations
« Reply #3 on: September 19, 2022, 03:42:38 am »
https://www.markbowers.org/fpga-mandelbrot
The fact that even a rather old FPGA was over an order of magnitude faster than a PC of its time shows a lot of promise for a much more modern FPGA like the Artix-7 200 I have. But I'm expecting that the programming would be quite a project.
Cryptocurrency has taught me to love math and at the same time be baffled by it.

Cryptocurrency lesson 0: Altcoins and Bitcoin are not the same thing.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: FPGA to accelerate fractal calculations
« Reply #4 on: September 19, 2022, 03:33:36 pm »
But since the source code is given for various boards, it should be quite easy to replicate the project.  In fact, the .bit file is provided for some boards; just program the device and you should be ready to go.  There would be the issue of replicating the Baseboard components like the ADC.  I'm not sure whether the FPGA board is still available, but the Baseboard might be; I found a price...

I'm not sure that the logic changes between boards but the constraints file defining the pinout certainly would.
« Last Edit: September 19, 2022, 04:01:26 pm by rstofer »
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: FPGA to accelerate fractal calculations
« Reply #5 on: September 19, 2022, 04:00:10 pm »
I'd go for using a GPU first and see where it goes. You can either use CUDA (Nvidia proprietary IIRC) or OpenCL which is a widely adopted standard. If the calculations match a GPU well, then it is likely to be more efficient compared to an FPGA.

I found this paper on the topic: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.8046&rep=rep1&type=pdf But be aware that it seems to originate from Altera, so it is likely biased or draws wrong conclusions from a lack of knowledge about using FPGAs or OpenCL effectively.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: FPGA to accelerate fractal calculations
« Reply #6 on: September 19, 2022, 04:13:12 pm »
V3 of the project linked above uses 35-bit multiplication with a 70-bit result.  There's probably more to the story, but this isn't likely to fit well on a CUDA unit, where word width needs to be a power of 2 (as I understand it).  So maybe it shouldn't be used as a reference point.

OTOH, there are quite a few hits on Google for 'cuda mandelbrot set'

I have an HP laptop with an NVIDIA chip that has in excess of 5800 CUDA cores.  Maybe I should look around for some code...

http://selkie.macalester.edu/csinparallel/modules/CUDAArchitecture/build/html/1-Mandelbrot/Mandelbrot.html

Looks helpful.  It does require the NVIDIA SDK but that's free.

 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: FPGA to accelerate fractal calculations
« Reply #7 on: September 19, 2022, 05:05:08 pm »
V3 of the project linked above uses 35 bit multiplication with a 70 bit result.  There's probably more to the story but this isn't likely to fit well on a CUDA unit where word width needs to be a power of 2 (as I understand it).  So, maybe it isn't used for reference.
DSP slices in FPGAs typically seem to have more than 32 bits to prevent rounding errors when using fixed-point math. GPUs, OTOH, use floating point with various word lengths.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3717
  • Country: us
Re: FPGA to accelerate fractal calculations
« Reply #8 on: September 19, 2022, 07:56:14 pm »
First thing to do is to get an idea of your problem.  How many GFLOPs are you getting out of the existing code?  How much is the theoretical GFLOPs of the GPU or FPGA you are considering?  That will give you some idea of how feasible this is.

Honestly, these days FPGAs are not that great at pure compute acceleration.  GPUs are considerably more powerful and power efficient than FPGAs (at least at any sort of affordability parity).  CPUs also have greatly expanded vector processing capability and now have many cores; they are still outclassed by GPUs for raw GFLOPS, but if your problem doesn't map well to a GPU they can still get quite good performance.  That doesn't leave a lot of space for FPGAs.  They have plenty of uses where nothing else can touch them, but as pure compute accelerators they are fairly limited these days, especially since the FPGAs that mere mortals have access to are built on process technology several generations old, while current-generation CPUs and GPUs are 7 nm and below.

In any case: Fraqtive is using double-precision floats.  Since it claims to be optimized and is using SSE, they could certainly have used single precision for a performance boost and presumably decided not to, probably for accuracy.  As I understand it, zooming into a fractal requires both high precision and dynamic range.  That argues against GPUs, as modern consumer GPUs don't really support more than single-precision floats; I think they can technically do double precision, but it is heavily de-rated or emulated and extremely slow.  Most of the effort these days is actually moving to things like bfloat16, because that is enough for many ML applications.

The Xilinx DSP48E1 slice in the 7 series has a 25x18-bit multiplier; the 48-bit width is only for the accumulator.  That isn't even enough for a full single-precision multiply (24x24), much less double.  You can chain slices together to make larger multipliers, but obviously that reduces the total number of FLOPS you can do.

On the other hand, it doesn't look like Fraqtive is optimized for modern CPUs.  It is using SSE2, which is an ancient vector standard.  A straightforward conversion to AVX2 with its 256-bit registers would probably double performance, and you can likely do better still.  For instance, with AVX you can compute 4 pixels in parallel; if you have enough registers and compute 8 in parallel, alternating between two groups of 4, you can often get higher throughput by reducing register dependencies.  This isn't easy, but Intel has tools to help you, and it's certainly easier than getting a good implementation on an FPGA.  And I think when you look at it, the peak GFLOPS of an 11th-generation Intel CPU or an AMD Zen 3 is going to be as high as or higher than your Artix-7's.
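The "compute several pixels in lock-step" idea can be sketched with NumPy, where each array element stands in for a SIMD lane. This is a rough illustration, not Fraqtive's actual code; the function name, double precision, and the max_iter of 100 are arbitrary choices of mine:

```python
import numpy as np

def mandelbrot_batch(c: np.ndarray, max_iter: int = 100) -> np.ndarray:
    """Iterate a whole batch of points in lock-step, like SIMD lanes.

    c: complex128 array of sample points; returns escape iteration counts
    (max_iter for points that never escaped).
    """
    z = np.zeros_like(c)
    counts = np.full(c.shape, max_iter, dtype=np.int32)
    active = np.ones(c.shape, dtype=bool)        # lanes still iterating
    for n in range(max_iter):
        z[active] = z[active] ** 2 + c[active]   # one step for all lanes
        escaped = active & (np.abs(z) > 2.0)     # escape test |z| > 2
        counts[escaped] = n
        active &= ~escaped
        if not active.any():
            break
    return counts
```

A real AVX2 version would pin the batch to the register width (4 doubles per 256-bit register) and unroll two batches to hide latency, as described above.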

 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4530
  • Country: au
    • send complaints here
Re: FPGA to accelerate fractal calculations
« Reply #9 on: September 19, 2022, 10:27:30 pm »
GPUs are considerably more powerful and power efficient than FPGAs (at least at any sort of affordability parity).
Citation required... they're pretty much $$$ comparable, and FPGAs win on power consumption most of the time.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3717
  • Country: us
Re: FPGA to accelerate fractal calculations
« Reply #10 on: September 19, 2022, 11:08:42 pm »
GPUs are considerably more powerful and power efficient than FPGAs (at least at any sort of affordability parity).
Citation required... they're pretty much $$$ comparable and FPGA's win on power consumption most of the time.

https://www.xilinx.com/content/dam/xilinx/support/documents/data_sheets/ds890-ultrascale-overview.pdf

Kintex UltraScale+: 6.3 TMAC/s for the largest, highest-speed-grade part.  I don't know what the FPGA itself costs, but the KCU116 dev board with a mid-sized part (KU5P) in the medium speed grade costs $3000.  I think those parts have 27x18-bit multipliers, so they are not doing even full single-precision floating point at that rate.  The top-end Kintex UltraScale hits 8.2 TMAC/s, but with 18x18-bit multipliers.

https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621

The NVidia RTX 3080 does 29.7 TFLOPS at full single precision (FP32) for < $750.  A multiply-accumulate still generally counts as two "floating point operations", so its peak throughput is probably only slightly more than double the Kintex's.  You have to go up to the Virtex or Versal ($$$) product lines to get something that is even competitive with a GPU whose successor is shipping this month.

Of course the peak performance on either is not necessarily what you can get.  It is a lot of work to keep them fed to maintain that throughput.  But that is probably easier and more common on the GPU than the FPGA.

NVidia has a huge presence in the top500 supercomputing list; Xilinx and Altera have none.  If FPGAs could deliver a performance/cost or performance/power boost for a significant fraction of supercomputing workloads, there would be FPGA-based supercomputers, and supercomputer users would actually be willing to put in the development effort to use them.  AWS has FPGA instances that people are using, so there is some market there, but it's really hard to tell how big it is or what people actually use it for.  The supercomputing GPUs are clocked lower, so they have lower max throughput for better energy efficiency, but they still beat price-comparable FPGAs.
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4530
  • Country: au
    • send complaints here
Re: FPGA to accelerate fractal calculations
« Reply #11 on: September 19, 2022, 11:29:00 pm »
GPUs are considerably more powerful and power efficient than FPGAs (at least at any sort of affordability parity).
Citation required... they're pretty much $$$ comparable and FPGA's win on power consumption most of the time.
Of course the peak performance on either is not necessarily what you can get.  It is a lot of work to keep them fed to maintain that throughput.  But that is probably easier and more common on the GPU than the FPGA.
Unsurprisingly the FPGA vendors disagree:
https://www.xilinx.com/content/dam/xilinx/publications/product-briefs/amd-xilinx-vck5000-product-brief.pdf
A 2x TCO advantage over GPUs for machine learning. You can rabbit on all day about FLOPS, but that's just one perspective on computing; very few applications actually need floats (largely a convenience for higher-level abstraction). It's the dominant methodology, but not the most power- or cost-efficient. If power or cost is the driving factor, then fixed point is very relevant, be that on an FPGA or a CPU; it's only GPUs and DSPs that have gone for float-centric designs.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: FPGA to accelerate fractal calculations
« Reply #12 on: September 20, 2022, 02:05:55 pm »
GPUs are considerably more powerful and power efficient than FPGAs (at least at any sort of affordability parity).
Citation required... they're pretty much $$$ comparable and FPGA's win on power consumption most of the time.

It would seem to me that if FPGAs were price and performance equivalent to GPUs, graphics cards would be using FPGAs.  Why design an ASIC if a programmable chip can compete?  Among other things, FPGA graphics cards would be upgradable in the future or vendors could put a switch on performance and sell the same card at different price points.  Mainframe manufacturers did this kind of thing and priced runtime hours based on 'switch' position.

My HP Omen laptop has an Nvidia RTX 3070 Ti Laptop chip with 5888 CUDA cores, a base clock of 1035 MHz and a boost clock of 1485 MHz.  I don't think I have seen FPGA clocks anywhere near that fast; maybe in the very expensive chips, but the Artix-7 sure won't get there.  FWIW, the chip is capable of around 12 TFLOPS.  Considering we went to the Moon with mainframes rated around 2 MFLOPS, that's quite an advance (6,000,000 times over).

I know very little about Machine Learning but most of what I have done with Deep Neural Networks tends to involve float16 or float32 values for weights and biases.  I'm using TensorFlow and Keras at the moment.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14465
  • Country: fr
Re: FPGA to accelerate fractal calculations
« Reply #13 on: September 20, 2022, 07:07:18 pm »
Unless you were implementing something very clever that wouldn't lend itself well to a GPU architecture (and these days it would have to be pretty clever indeed for this kind of calculation), a GPU will be much more power-efficient and cost-effective than any FPGA-based solution. If you think you can beat a decent GPU (not even talking about the high-end ones) with an Artix-7 (again, unless you have a very specific approach in mind; do not hesitate to provide a concrete example if so), well... that's delusion. :popcorn:
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3717
  • Country: us
Re: FPGA to accelerate fractal calculations
« Reply #14 on: September 20, 2022, 09:39:40 pm »
So first off, nothing about a Xilinx FPGA-based ML accelerator is relevant to accelerating fractal computation on an Artix-7.

Unsurprisingly the FPGA vendors disagree:
https://www.xilinx.com/content/dam/xilinx/publications/product-briefs/amd-xilinx-vck5000-product-brief.pdf

I think the Versal AI card is a great product at a great price.  In fact, it's such a great price that as I understand, Xilinx will not let you use it as a regular FPGA with Vivado, but only the Vitis AI toolkit stuff.  But it's an interesting product, and as it is produced on a 7nm node, it has a reasonable hope of competing with a modern GPU in perf/die area and perf/watt.  But the comparison in that document is basically complete BS.

They are comparing their $2500 AI accelerator card to an Nvidia A100 which is about 5x the price.  The NVidia still has better performance, but it's not surprising that the Xilinx part walks away with 2x better perf/cost.

But frankly their AI card is much more comparable to an $800 consumer GPU than to a $15,000 HPC accelerator.  The Versal card has 16 GB of DDR4 memory with 100 GB/s bandwidth; it doesn't even have GDDR memory like an ordinary desktop GPU.  The A100 has 80 GB of HBM with about 2 TB/s of bandwidth.  These are features that a lot of actual customers need and want, and they are willing to pay 15x the price compared to the consumer version of the same chip.

If you cherry pick an application that doesn't need the memory size or bandwidth that the A100 provides it's not surprising that the cheaper device can beat it in perf/cost.  But then you should probably compare it to a desktop GPU that is less than half the price. 

Again: I am clearly not going to deny that FPGAs have an important place in the world.  There are even compute applications where they are going to be the best solution.  But in high performance computing and machine learning they are definitely niche players compared to GPUs.
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4530
  • Country: au
    • send complaints here
Re: FPGA to accelerate fractal calculations
« Reply #15 on: September 20, 2022, 10:48:05 pm »
Citation required... they're pretty much $$$ comparable and FPGA's win on power consumption most of the time.

My HP Omen laptop has an Nvidia RTX 3070 TI Laptop chip with 5888 CUDA Cores, a base clock of 1035 MHz and a Boost Clock of 1485 MHz.  I don't think I have seen FPGA clocks anywhere near this fast.  Maybe the very expensive chips but the Artix 7 sure won't get there.
Width and clock speed together trace out a throughput curve: FPGAs clock at hundreds of MHz but stuff their wide pipelines with higher utilisation (as the linked advertorial specifically mentions). It's about compute per joule or per watt, since FPGAs target parallel operations and scale with the task.

The Versal card has 16 GB of DDR4 memory with 100 GB/s bandwidth.  It doesn't even have GDDR memory like an ordinary desktop GPU.  The A100 has 80 GB of HBM with 20 TB/s of bandwidth.  These are features that a lot of actual customers need and want and are willing to pay 15x the price compared to the consumer version of the same chip.
Again, you pick one characteristic that is linked to performance in one area and claim it's an essential pillar, when that is only one way to get there. Big FPGAs have globs of granular SRAM with lower latency and eye-watering bandwidth, available directly to the programmer (not just second-guessing an often-shared cache).

So first off, nothing about a Xilinx FPGA based ML accelerator is relevant to accelerating fractal computation on a Artix-7.
We could say the same about you insisting that TFLOPS, DDR/HBM bandwidth, and clock speeds are the important characteristics..... FPGAs are entirely different in the way they calculate and compute compared to GPUs or CPUs; they usually need different algorithms and implementations. In return, you toggle less silicon and end up with better power efficiency.

So yeah, for all your performance, performance, performance "comparisons" (which, as above, are not comparisons), none of it addresses power or energy efficiency. That is where, for any reasonably well optimized/fitted problem, the order falls out as ASIC > FPGA > GPU/DSP > CPU.

But hey this FPGA forum has gone... time to leave.
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: FPGA to accelerate fractal calculations
« Reply #16 on: September 20, 2022, 11:17:12 pm »
For Mandelbrot and Julia sets, the range -8.0 to +8.0 suffices for the complex number components, including intermediate values.  So you can trivially use e.g. Q3.60, i.e. 64-bit signed integers with 60 fractional bits, for the computation.  After all, these fractals just iterate $$z_{n+1} = z_n^2 + c, \quad z, c \in \mathbb{C}$$ to see whether the sequence diverges, which is detected whenever \$\lvert z_n \rvert \gt R\$ for some threshold \$R \in \mathbb{R}\$.  Points outside the set are typically colored based on the smallest \$n\$ for which \$\lvert z_n \rvert \gt R\$ holds.

Mandelbrot set uses \$z_0 = 0\$, with \$c\$ identifying the point on the complex plane, and \$R \ge 2\$.  Julia set uses \$z_0\$ for the point on the complex plane, \$c\$ as an arbitrary constant, and \$R \ge (1 + \sqrt{4 \lvert c \rvert + 1})/2\$.  You can use a larger \$R\$, since the point is to find if the sequence diverges or not, and the specified \$R\$ are just the minimum values.

When zooming very deep, floating point allows one to discern more details in the set.  For small \$n\$, the complex number components are very small and tend to either cycle (within the set) or grow more or less steadily (though not steadily enough to help much with the checks).

If we use \$z_n = r_n + \mathbb{i} i_n\$ and \$c = r_c + \mathbb{i} i_c\$, then
$$z_{n+1} = \left( r_n^2 - i_n^2 + r_c \right) + \mathbb{i} \left( 2 r_n i_n + i_c \right)$$
i.e.
$$\begin{aligned}
r_{n+1} &= r_n^2 - i_n^2 + r_c \\
i_{n+1} &= 2 r_n i_n + i_c \\
\end{aligned}$$
Each iteration involves three multiplications and three additions or subtractions, plus one doubling of a product; or, equivalently, two squarings, one double product, and three additions or subtractions.  If any of the operations overflow, the sequence diverges, and that point is outside the set.  All three multiplications can be done in parallel; you can obviously also do multiple points in parallel (which is easier when using vector extensions like SSE, AVX, or NEON).
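A minimal sketch of this escape-time iteration in Q3.60 fixed point, using Python ints as a stand-in for 64-bit registers. The function names and the max_iter of 256 are my choices for illustration; since Python ints never overflow, the explicit \$\lvert z \rvert^2 > 4\$ test (i.e. \$R = 2\$) replaces the overflow-means-divergence trick mentioned above:

```python
FRAC = 60                      # fractional bits: Q3.60
ONE = 1 << FRAC

def to_fixed(x: float) -> int:
    """Convert a float to Q3.60 fixed point."""
    return int(round(x * ONE))

def mandelbrot_iters(cr: float, ci: float, max_iter: int = 256) -> int:
    """Smallest n with |z_n| > 2, or max_iter if the point never escapes."""
    rc, ic = to_fixed(cr), to_fixed(ci)
    r = i = 0                  # z_0 = 0 for the Mandelbrot set
    for n in range(max_iter):
        r2 = (r * r) >> FRAC   # two squarings and one product per step,
        i2 = (i * i) >> FRAC   # all three independent of each other
        ri = (r * i) >> FRAC
        if r2 + i2 > (4 << FRAC):      # |z_n|^2 > R^2 with R = 2
            return n
        r = r2 - i2 + rc       # r_{n+1} = r_n^2 - i_n^2 + r_c
        i = 2 * ri + ic        # i_{n+1} = 2 r_n i_n + i_c
    return max_iter
```

On hardware, the three products would feed three parallel multipliers, and divergence could instead be flagged by overflow of the Q3.60 range.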
« Last Edit: September 20, 2022, 11:22:15 pm by Nominal Animal »
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: FPGA to accelerate fractal calculations
« Reply #17 on: September 20, 2022, 11:21:20 pm »
Here's a MATLAB video about Machine Learning on an FPGA with their tools:

https://youtu.be/ZQOfnRL7YyQ

I haven't played with this but as I understand it, their tools will generate the HDL.

 

Online NiHaoMikeTopic starter

  • Super Contributor
  • ***
  • Posts: 9014
  • Country: us
  • "Don't turn it on - Take it apart!"
    • Facebook Page
Re: FPGA to accelerate fractal calculations
« Reply #18 on: September 21, 2022, 03:54:07 am »
I'm under the impression that GPUs don't work very well with 64-bit numbers, since that's overkill for most graphics work, which is mostly 32-bit.
Roughly what sort of performance vs. bit width tradeoff should I expect for fixed point on an FPGA?
Cryptocurrency has taught me to love math and at the same time be baffled by it.

Cryptocurrency lesson 0: Altcoins and Bitcoin are not the same thing.
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4530
  • Country: au
    • send complaints here
Re: FPGA to accelerate fractal calculations
« Reply #19 on: September 21, 2022, 04:12:41 am »
I'm under the impression that GPUs don't work very well with 64 bit numbers since that's overkill for most graphics work which is mostly 32 bit. Roughly what sort of performance vs. bit width tradeoff should I expect for fixed point on a FPGA?
You'll probably be limited by the multipliers and their hard widths (just as you would be by a 32/64/128-bit register width on a CPU). Throughput scales roughly as 1/(bit_width^2), since multiplier usage grows with the square of the operand width, with steps at each multiple of the native multiplier width. The project linked above is a good reference point for a simple, direct, low-effort implementation.
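That quadratic growth, with steps at multiples of the native width, can be written down as a back-of-the-envelope resource estimate. This is only a sketch: 18 bits is assumed as the native multiplier width, and real DSP slices and synthesis tools will differ:

```python
import math

def dsp_multipliers(op_bits: int, native_bits: int = 18) -> int:
    """Schoolbook estimate of native-width multipliers needed for one
    op_bits x op_bits product: one multiplier per limb pair."""
    limbs = math.ceil(op_bits / native_bits)
    return limbs * limbs

# Throughput per multiplier then falls roughly as 1/limbs^2:
# 18-bit operands fit one multiplier, 35-bit need a 2x2 grid of four,
# 54-bit need nine, and so on.
```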
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: FPGA to accelerate fractal calculations
« Reply #20 on: September 21, 2022, 05:17:21 am »
Because two of the three operations are actually squarings, you can do some useful optimizations.

For example, consider the Karatsuba algorithm, for two-limb numbers \$x = x_0 + x_1 B\$ and \$y = y_0 + y_1 B\$, whose product is
$$\begin{aligned}
x y &= (x_0 + x_1 B)(y_0 + y_1 B) \\
~ &= x_0 y_0 + (x_0 y_1 + x_1 y_0) B + x_1 y_1 B^2 \\
x y &= z_0 + z_1 B + z_2 B^2 \\
z_0 &= x_0 y_0 \\
z_1 &= x_0 y_1 + x_1 y_0 = (x_0 + x_1)(y_0 + y_1) - z_2 - z_0 \\
z_2 &= x_1 y_1 \\
\end{aligned}$$
where \$B\$ is the radix for each limb, typically some power of two, and the Karatsuba algorithm being the right hand side for \$z_1\$, trading one multiplication for three additions or subtractions.  Note that \$z_k\$ is double the size of \$x_k\$ and \$y_k\$.

Applying to \$x^2\$, we obviously have
$$\begin{aligned}
x^2 &= X_0 + X_1 B + X_2 B^2 \\
X_0 &= x_0^2 \\
X_1 &= 2 x_0 x_1 \\
X_2 &= x_1^2 \\
\end{aligned}$$
which means we don't need to do the Karatsuba trade, and all three products can be calculated in parallel.

Karatsuba isn't useful when you have multiplications as fast as additions or subtractions, though, so it only becomes useful when you have more than two limbs per number.

Again, for both Mandelbrot and Julia fractals, for each iteration you need two squares and one product, plus some additions and subtractions, so there is quite a lot of mathematical optimizations possible.
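As a concrete sketch of the squaring case, here is a two-limb square in Python; the 32-bit limb radix and function name are arbitrary choices of mine:

```python
B_BITS = 32                   # limb width; radix B = 2**32
B = 1 << B_BITS

def square_two_limb(x: int) -> int:
    """Square x = x0 + x1*B via the three partial products X0, X1, X2.
    All three are independent, so on an FPGA they could run in parallel;
    no Karatsuba-style substitution is needed for a square."""
    assert 0 <= x < B * B     # exactly two limbs
    x0, x1 = x % B, x // B
    X0 = x0 * x0
    X1 = 2 * x0 * x1
    X2 = x1 * x1
    return X0 + X1 * B + X2 * B * B
```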
 
The following users thanked this post: Someone

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3717
  • Country: us
Re: FPGA to accelerate fractal calculations
« Reply #21 on: September 21, 2022, 04:36:47 pm »
I'm under the impression that GPUs don't work very well with 64 bit numbers since that's overkill for most graphics work which is mostly 32 bit.
Roughly what sort of performance vs. bit width tradeoff should I expect for fixed point on a FPGA?

Yes, most consumer GPUs target single-precision float (8-bit exponent, 24-bit significand).  GPUs targeting HPC/scientific computing have lots of double-precision MAC units, because that is often needed in those applications.

FPGAs have fixed-size multipliers whose width depends on the platform.  Older FPGAs usually had ~18-bit multipliers that are good for basic fixed-point work but too small for traditional floating-point formats by themselves.  Some newer FPGAs have bigger multipliers that can directly handle single-precision floating point, but still nothing like 53/64 bits.  For that you need to combine several multipliers to compute partial products and sum them.  This eats up your DSP resources fast, so generally you want to limit precision if at all possible.  Unfortunately, fractals are pretty much the definitive example of a problem with no easy approximations: if you want a high-resolution render of a fractal, you need high-precision arithmetic.
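Combining small multipliers via partial products, as described above, looks roughly like this. It's a sketch: the 18-bit limb width matches the older-FPGA multiplier size mentioned, and the helper names are mine:

```python
LIMB = 18                     # assumed hard-multiplier width
MASK = (1 << LIMB) - 1

def split(x: int) -> list[int]:
    """Split a non-negative integer into 18-bit limbs, low limb first."""
    limbs = []
    while x:
        limbs.append(x & MASK)
        x >>= LIMB
    return limbs or [0]

def wide_mul(a: int, b: int) -> int:
    """Sum of shifted partial products; each ai*bj would map to one
    hard multiplier, and the shifts/sums to fabric adders."""
    total = 0
    for i, ai in enumerate(split(a)):
        for j, bj in enumerate(split(b)):
            total += (ai * bj) << (LIMB * (i + j))
    return total
```

Note how the multiplier count grows with the product of the limb counts, which is exactly why wide precision is so expensive in DSP slices.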
 

Online NiHaoMikeTopic starter

  • Super Contributor
  • ***
  • Posts: 9014
  • Country: us
  • "Don't turn it on - Take it apart!"
    • Facebook Page
Re: FPGA to accelerate fractal calculations
« Reply #22 on: September 24, 2022, 05:23:35 am »
I tried the mandelbrot_gpu script, and my friend noticed that the details weren't as clear!  Most likely because the GPU render mode is 32-bit, since most GPUs are poor at 64-bit.

How well would fixed point do in comparison?
Cryptocurrency has taught me to love math and at the same time be baffled by it.

Cryptocurrency lesson 0: Altcoins and Bitcoin are not the same thing.
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: FPGA to accelerate fractal calculations
« Reply #23 on: September 24, 2022, 11:33:04 am »
I tried the mandelbrot_gpu script and my friend noticed that the details weren't as clear! Most likely because the GPU render mode is 32 bit due to most GPUs being poor at 64 bit.

How well would fixed point do in comparison?
How detailed an answer do you want?

Each iteration incurs rounding error, so unless you compare floating-point and fixed-point implementations with the exact same rounding mode (and precision), the iteration counts outside the fractal set will vary.  The variation is smooth, not random.

The deeper you zoom into the set boundary, the more iterations are needed to determine whether the point is within the set or not, and thus higher precision is needed (to avoid the cumulative rounding error to become visible).

Precision, or size in bits, seems to matter much more than floating point per se.  The iteration formula is such that there is very little benefit to having better precision near zero than elsewhere; it gets swamped by cancellation in the sums anyway.

As an example, consider the case where you want to look at 0.3971+0.199i . With 'float', your resolution around there is about 0.0000001 per pixel, so you can generate a 1600×1200 image where the diagonal is about 0.0001 in the complex plane.  Any deeper than that, and neighboring points are rounded to the same complex number.  32-bit fixed point (Q3.28, actually) lets you zoom in 16x deeper than float, and 64-bit fixed point (Q3.60) lets you zoom in 60x deeper than double.

So, in simple terms, Q3.N-4 fixed point is better for at least Mandelbrot and Julia set calculations than N-bit floating point.  (I've actually tested this.)
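The comparison above can be sanity-checked numerically; a small sketch using Python's math.ulp, with the point 0.3971 taken from the example above (the helper names are mine):

```python
import math

def float_ulp(x: float) -> float:
    """Spacing between adjacent double-precision values near x."""
    return math.ulp(x)

def fixed_step(frac_bits: int) -> float:
    """Uniform step size of a fixed-point format with frac_bits
    fractional bits (e.g. 60 for Q3.60)."""
    return 2.0 ** -frac_bits

x = 0.3971                    # real part of the example point above
# Doubles in [0.25, 0.5) step by 2**-54; Q3.60 steps uniformly at 2**-60,
# so fixed point resolves this neighbourhood 2**6 = 64 times finer.
print(float_ulp(x) / fixed_step(60))   # 64.0
```

The factor of 64 agrees with the roughly 60x-deeper-than-double figure quoted above for Q3.60.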
 
The following users thanked this post: NiHaoMike, Someone

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3717
  • Country: us
Re: FPGA to accelerate fractal calculations
« Reply #24 on: September 24, 2022, 05:01:55 pm »
Fraqtive also says it does some anti-aliasing by oversampling, which could be part of the visual difference you see.  Did you try turning that off in Fraqtive to see how it performs?  But fundamentally, calculating fractals accurately requires high-precision arithmetic.  Nvidia desktop GPUs don't do this particularly well; you need the Tesla line of data-center GPUs to get good double-precision performance.

How much faster is the gpu version than the CPU?
 

