Author Topic: Image processing accelerator  (Read 4814 times)


Offline OM222OTopic starter

  • Frequent Contributor
  • **
  • Posts: 768
  • Country: gb
Image processing accelerator
« on: May 20, 2021, 11:58:10 pm »
Hello everyone!
This will be my first project using an FPGA so please bear with me. I'm trying to implement an artificial retina on a Spartan 7 FPGA (Cmod S7). I know this won't cut it for the end system and that I'll need to step up to a Zynq 7000 in the end, but I want a cheap proof of concept before spending too much money.

I initially implemented the retina in software (using C and Python), which revealed the major bottleneck to be memory speed! The algorithm itself is extremely simple (just multiply-accumulate and addition), which as far as I know can be done in 1 cycle using the DSP slices in the FPGA. My idea to overcome the memory bottleneck is to use many QSPI NOR flash modules to create a wide bus (64 or maybe 128 bits wide). The only operation on these NOR flash modules would be reads, which should be fast enough for this application (I'm aware they are very slow to write to, so they will only be used as a ROM).
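
To put a rough number on what such a ganged bus could deliver, here's a quick back-of-the-envelope. The 100 MHz SCLK and quad-output reads are assumptions on my part (check the actual part's datasheet), and command/address/dummy overhead is ignored:

Quote
# Rough peak read throughput for N QSPI NOR chips ganged into one wide bus.
# Assumptions: all chips share SCLK/CS, quad-output reads (4 data bits per chip per clock),
# 100 MHz SCLK is a guess, and command/address/dummy overhead is ignored.
def ganged_qspi_gbps(n_chips, sclk_hz=100e6, bits_per_chip=4):
    return n_chips * bits_per_chip * sclk_hz / 1e9

print(ganged_qspi_gbps(16))   # 64-bit bus  -> ~6.4 Gb/s peak
print(ganged_qspi_gbps(32))   # 128-bit bus -> ~12.8 Gb/s peak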

The accelerator needs to receive an image as an input and return the result to the program. I'll be using the USB port for the proof of concept, which is not ideal; I think the AXI interface could be used if I go with the Zynq solution.

Please let me know if this idea is feasible or if I'm thinking about it the wrong way around. I will add more details about the algorithm, the software implementation and how I plan to modify it to work with an FPGA in the next post (I have to simplify things so they can fit here).

P.S.: I forgot to mention that the total memory usage is around 300 MB, so I can't rely on BRAMs. DMA with DDR memory would be the next best option, but that won't address the bandwidth issue, which is why I want to use separate modules to create a wide bus.
« Last Edit: May 21, 2021, 03:21:57 am by OM222O »
 

Offline Someone

  • Super Contributor
  • ***
  • Posts: 4959
  • Country: au
    • send complaints here
Re: Image processing accelerator
« Reply #1 on: May 21, 2021, 01:59:16 am »
6) If you want to test an HDL implementation, you can simulate the actual portable synthesizable HDL on a PC (Verilator, Icarus Verilog, Vivado, Quartus, ...), and if the algorithm is as simple as you say, it'll probably run the HDL simulation fast enough to be real time on a decent PC, or at least fast enough to prove the concept with good test-vector data sets of full-frame clips.
lol, not sure what you call "realtime" would match up to others' expectations. HDL simulation is many orders of magnitude slower than running it in hardware, which is why ASIC simulation platforms are a big pile o' FPGAs.
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2797
  • Country: ca
Re: Image processing accelerator
« Reply #2 on: May 21, 2021, 03:01:24 am »
Hello everyone!
This will be my first project using an FPGA so please bear with me. I'm trying to implement an artificial retina on a Spartan 7 FPGA (Cmod S7). I know this won't cut it for the end system and that I'll need to step up to a Zynq 7000 in the end, but I want a cheap proof of concept before spending too much money.
In the hardware world one would usually go the other way around: you pick a devboard with an oversized FPGA to make sure there is plenty of space for debug cores and such, and only once you have a design and know exactly how many resources it needs do you downsize the FPGA, if possible.

I initially implemented the retina in software (using C and Python), which revealed the major bottleneck to be memory speed! The algorithm itself is extremely simple (just multiply-accumulate and addition), which as far as I know can be done in 1 cycle using the DSP slices in the FPGA. My idea to overcome the memory bottleneck is to use many QSPI NOR flash modules to create a wide bus (64 or maybe 128 bits wide). The only operation on these NOR flash modules would be reads, which should be fast enough for this application (I'm aware they are very slow to write to, so they will only be used as a ROM).
I'm not quite sure what you are trying to achieve here...

The accelerator needs to receive an image as an input and return the result to the program. I'll be using the USB port for the proof of concept, which is not ideal; I think the AXI interface could be used if I go with the Zynq solution.
Again, you've got it backwards. Designing a USB controller core and software stack is many, many orders of magnitude more complicated than receiving data from the Zynq via AXI (or directly from DDR via DMA).

Please let me know if this idea is feasible or if I'm thinking about it the wrong way around. I will add more details about the algorithm, the software implementation and how I plan to modify it to work with an FPGA in the next post (I have to simplify things so they can fit here).
Yea, some details are definitely in order. For now it's hard to say anything certain.

P.S.: I forgot to mention that the total memory usage is around 300 MB, so I can't rely on BRAMs. DMA with DDR memory would be the next best option, but that won't address the bandwidth issue, which is why I want to use separate modules to create a wide bus.
Hmm, what? Zynq's hard DDR controller with 32-bit DDR3 provides 533M * 2 * 32 ≈ 34 Gb/s of bandwidth, which is quite a bit.
Also, I'm not sure which separate modules you intend to use and exactly how. Finally, it is very uncommon for an FPGA design to use so much memory, as an ideal FPGA design streams data in real time with minimal internal buffering, designed to amortize differences in data flow, typically at the input and output of the pipeline, because the external source and sink typically provide/consume data in bursts rather than a steady stream.
« Last Edit: May 21, 2021, 03:28:54 am by asmi »
 

Offline james_s

  • Super Contributor
  • ***
  • Posts: 21611
  • Country: us
Re: Image processing accelerator
« Reply #3 on: May 21, 2021, 03:10:45 am »
Granted I didn't mean real time in the sense that it'd be as fast as peak FPGA hardware for decent FPGAs.

Just for reference, a primitive circuit like the original Pong arcade game takes several seconds to render a single frame on a reasonably powerful modern PC simulating the circuit at the gate level. Normally, simulation of an FPGA design will take tens of seconds to tens of minutes to simulate a small fraction of a second. It's typically used to unit-test portions of the design, particularly looking at timing. There's nothing "real time" about any of this.
 
The following users thanked this post: Someone

Offline OM222OTopic starter

  • Frequent Contributor
  • **
  • Posts: 768
  • Country: gb
Re: Image processing accelerator
« Reply #4 on: May 21, 2021, 03:37:29 am »
The reason behind choosing the cheapest reasonable FPGA I could find is that I'm a student and have to pay out of my own pocket for these projects, so my time is actually pretty cheap. The main goal is to have the software retina working in robotics applications (small robots) or things similar to CubeSats, so powerful PC-grade GPUs (even old ones) are out of the question due to their large size and high power consumption. The Cmod S7 should be enough for me to simulate a small retina and to benchmark the critical aspects of the project. Simulations are good, but I'd rather have the real thing working.

I have attached my dissertation, where you can find the full details of the algorithm itself (chapter 4 is the design section and is relatively short, about 3 pages). From the results section you can see that multithreading adds no performance benefit, but increasing the memory bandwidth (and even reducing its latency) scales linearly. I have confirmed this via another method which I will not discuss here since it's outside the scope of this thread. I have zero hope for SBC boards like the Pi, Jetson Nano, etc. When a 6700HQ paired with 2133 MHz DDR4 @ CL15 is facing memory bottlenecks, the slower RAM on lower-end devices will cause significant problems. With an FPGA, data can be prefetched into small registers while simultaneously processing data; this means that if the memory bandwidth is high enough, the processor won't be sitting idle waiting for data to arrive from memory. If the memory is too fast, more DSP slices can be added so that a fine balance is reached. Another advantage would be being able to do all 3 operations mentioned in the dissertation at once, with a one-cycle latency between sampling and the other operations, whereas a conventional CPU has to carry out each operation sequentially.

I have also found a project that seems to implement a decent lightweight QSPI memory controller but I've not had enough time to read through it completely:
https://zipcpu.com/blog/2019/03/27/qflexpress.html

It promises very fast read speeds, which, again, is my main bottleneck. I hope that answers your questions about my "odd" choices and why I chose a rather difficult task as a first project.
I have used some tricks to reduce the memory footprint of the software retina by 2x, which resulted in an almost 2x performance increase (1.7x) on my laptop. I appreciate all of your suggestions about the alternatives, but let's please go back to the original proposal. I honestly can't think of a systematic flaw, but there might be better ways of dealing with the bandwidth problem (for example HBM memory for an FPGA? As far as I know that doesn't exist, but I'm also very new to this). My main inspiration for using an FPGA came from slow-motion cameras, which must use an FPGA to capture images at high frame rates (large data throughput); my application is pretty much the same but needs high read speeds as opposed to high write speeds.

Edit: the PDF is too large to post as an attachment here (15MB) so I have uploaded it on google drive : https://drive.google.com/file/d/1gRrMOzkGZNQLeC0UNWlvFhYIsdlaMoiy/view?usp=sharing

Again, regarding the memory bandwidth: my current system has 2133 MT/s * 64 bit ≈ 136.5 Gb/s (about 17 GB/s), which is a bottleneck for my application, so DMA with DDR3 memory will definitely be a bottleneck (pretty sure you calculated it wrong and don't need the *2, since that figure is the rated MT/s, not the clock frequency, which is half of that. For example, 2133 "MHz" is actually running at 1066 MHz. Also, having a look at many Zynq dev boards, they advertise "512MB DDR3 with 16-bit bus @ 1050Mbps"). To be clear, I don't expect the FPGA implementation to beat the laptop's performance, but it'll be a far cry better than the dismal throughput on something like a Raspberry Pi. Not to mention the fact that the limited throughput on those devices will be shared with many other processes and the OS, which is a recipe for disaster.
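
To make the units explicit, a quick sanity check of the figures being thrown around (ideal peak numbers with no refresh, turnaround or controller overhead; which Zynq figure applies depends on whether the 533M is the clock or the transfer rate):

Quote
# Peak memory bandwidth = transfer rate x bus width (ideal, ignoring all overhead).
def peak_gbps(mtps, bus_bits):
    return mtps * 1e6 * bus_bits / 1e9

print(peak_gbps(2133, 64))   # my laptop: DDR4-2133, one 64-bit channel -> ~136.5 Gb/s (~17 GB/s)
print(peak_gbps(1066, 32))   # Zynq DDR3, 32-bit, if 533 MHz is the clock (1066 MT/s) -> ~34 Gb/s
print(peak_gbps(533, 32))    # ... or only ~17 Gb/s if 533 MT/s is already the transfer rate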

Here is the core of the algorithm written in Cython (custom C extensions for python):

Quote
cimport cython

@cython.wraparound(False)   # skip negative-index handling in the hot loop
@cython.boundscheck(False)  # skip bounds checks in the hot loop
cpdef sample(unsigned char[::1] img_flat, unsigned short[::1] coeffs,
             unsigned int[::1] idx, unsigned int[::1] result_flat):
    # Scatter-add: weight each pixel and accumulate it into the retina cell chosen by idx.
    cdef unsigned int x
    with nogil:
        for x in range(img_flat.shape[0]):
            if coeffs[x] > 0:  # multiplying by zero adds nothing, so skip the MAC
                result_flat[idx[x]] += img_flat[x] * coeffs[x]

the "[::1]" means that the date is stored as a contiguous, flat (1D) array. As you can see the "idx" array is used as a look up table and the only real line of code is a multiply accumulate: "result_flat [ idx [ x ] ] += img_flat [ x ] * coeffs [ x ] ". The if statement, only skips unceneccaray operations (if you multiply by 0 and add the result to a value, the value doesn't change). This is the part where I have managed to use some tricks to remove all 0 elements from the array to reduce the memory overhead by about 2x.

The specific IC I'd like to use is: https://www.mouser.co.uk/ProductDetail/Winbond/W25Q256JVFIM?qs=qSfuJ%252Bfl%2Fd7mFTiWxnKquA%3D%3D
However, I may drop to a lower capacity if I end up going with, for example, a 128-bit bus (since that requires more chips to begin with, and I don't really need massive amounts of ROM).
« Last Edit: May 21, 2021, 03:56:00 am by OM222O »
 

Offline OM222OTopic starter

  • Frequent Contributor
  • **
  • Posts: 768
  • Country: gb
Re: Image processing accelerator
« Reply #5 on: May 21, 2021, 06:32:29 am »
Regarding number of pins used for the bus:
Only one clock pin and one CS pin will be used, so the only limit is the bus width; for example, a 128-bit bus only requires 130 pins. This approach definitely reduces flexibility (each read operation MUST BE 128 bits), but for my application that's not really an issue. Since I know the data sizes and the location (offset) of each data point, I can easily manipulate the corresponding pointers in the BRAM (again, the external flash is only used as ROM).

Regarding the Cmod S7, it was just the cheapest FPGA I could find that is similar to (same family as) the final product (minus the hard ARM cores), which lets me test and validate different parts of the design and implement a very small retina (1000 cells, maybe up to 8000 cells) using the SF3 PMOD module. That's a Micron QSPI flash as opposed to the Winbond chip I actually intend to use, but minor differences shouldn't matter. I also plan on using it for future projects (for example a hardware-based servo controller), so it's not a waste. I do realize it's underpowered and seems like an odd choice, but it should be sufficient for the proof-of-concept stage (I'm honestly fairly certain that even with a single QSPI chip, the Cmod S7 can beat a Pi).

You're probably right about the power consumption (watts/MIPS), so power may be the performance-limiting factor in the end, but for now it's a safe variable to ignore, since the design can be underclocked without an issue.

I also had a look at the FPGA + HBM that xilinx sells ... it's very interesting, however ... 9 grand is more than my housing fee for the year so it's not gonna happen any time soon. Maybe if I get funded for a PhD, it'd be a fun thing to work with.
If I understood correctly, there are no major flaws in my plan (ignoring the odd choices), so I'll be excited to work on the memory controller! It'll take a while to read through the entire article and to understand it, but I'm really looking forward to it. I also found a tutorial on youtube which seems very similar to my goal (doing convolution, which is basically MAC) so with minor modifications, the processing pipeline should also be easy to implement.


I will post some updates when I make some progress, but feel free to use the software library and provide some feedback. That project is still a work in progress, since it has serious potential for replacing the need for GPU rendering. Also feel free to share any further comments about other approaches to image processing on FPGAs, or any common tips and tricks.
« Last Edit: May 21, 2021, 06:39:28 am by OM222O »
 

Offline Berni

  • Super Contributor
  • ***
  • Posts: 5026
  • Country: si
Re: Image processing accelerator
« Reply #6 on: May 21, 2021, 06:45:31 am »
If power consumption is an issue then you can simply use a smaller GPU.

The big 200 W-eating beasts of GPUs used to run games and mine Ethereum coins have incredible amounts of processing power when compared to any CPU. An FPGA implementation of a computational machine with similar processing power would likely require a large number of $10k FPGAs on a board while burning >1 kW of power to run. You most likely do not need this level of power.

There are lots of smaller graphics cards (even from Nvidia) that have less processing power but also consume so little power that they don't even need a fan. You can also get a similar level of processing power from the integrated GPUs in Intel processors. The memory bandwidth of even a consumer socket 1151 CPU is still impressively high: the bus is 64 bits wide, you typically run dual channel so 128 bits, and it runs at rather high clock speeds, so you can get >25 GB/s of bandwidth out of a single DDR4 stick, or double that to >50 GB/s for dual channel. Or if you step up to socket 2011 you get quad-channel memory for >100 GB/s. The lower-power Intel CPUs also consume so little power that passive cooling is possible.

And if size is also a consideration then you can get a Nvidia SoC to get a pretty grunty GPU in the form factor of a RaspberryPi while also being optimized for even lower power.

FPGAs might seem low power because they consume so little (most of the time they don't even need heatsinks), but they have only a small fraction of the computational performance of a GPU. Getting high GB/s of external memory bandwidth out of an FPGA is also not trivial.

For example, there was one brief period where mining Bitcoin was most efficient on FPGAs because they were way faster than CPUs while using much less power. But once GPU shader cores became flexible enough to do the job, they made FPGA mining obsolete. These days even GPUs are obsolete there because of mining ASICs.

 

Offline Someone

  • Super Contributor
  • ***
  • Posts: 4959
  • Country: au
    • send complaints here
Re: Image processing accelerator
« Reply #7 on: May 21, 2021, 11:23:43 pm »
The big 200 W-eating beasts of GPUs used to run games and mine Ethereum coins have incredible amounts of processing power when compared to any CPU. An FPGA implementation of a computational machine with similar processing power would likely require a large number of $10k FPGAs on a board while burning >1 kW of power to run. You most likely do not need this level of power.
....
FPGAs might seem low power because they consume so little (most of the time they don't even need heatsinks), but they have only a small fraction of the computational performance of a GPU. Getting high GB/s of external memory bandwidth out of an FPGA is also not trivial.
Lol, so very misinformed. It's all down to the particular application, and convolution is one thing that FPGAs do incredibly efficiently in terms of power:
https://arxiv.org/pdf/1906.11879.pdf
It's easy enough to find references of people claiming 10x or 100x improvements in power or energy when switching from GPU to FPGA, when equally optimised for both platforms. Sure, you can make some floating-point task that fits very poorly on an FPGA and claim a "win", but in the real world very few tasks match that, and most of the work in image processing and neural networks is pushing short/compact fixed point.

Single FPGA chips are comparable in compute performance, memory bandwidth, and power consumption to GPUs. One of them is a consumer product and gains the economies of scale there, but it's not as large a gap as you claim; Xilinx Alveo cards would be a better comparison than the comparatively imaginary single-unit chip prices from resellers.

Good to know the FPGA forum is now overrun by hobbyists with no clue, and no real experience. Seems like another corner of this forum to ignore.
 
The following users thanked this post: Bassman59

Offline OM222OTopic starter

  • Frequent Contributor
  • **
  • Posts: 768
  • Country: gb
Re: Image processing accelerator
« Reply #8 on: May 22, 2021, 04:34:20 am »
I've been looking through the suggested range of options and HyperFlash seems very promising. Perhaps I'll try to find either a dev board or design a small evaluation PCB for a HyperFlash chip to test it against QSPI and see if it's actually that much faster (Cypress claims 5x faster reads compared to QSPI). The other options are certainly possible (NVMe drives or PCIe storage devices), but I think those are too big a bite to chew as a first project. Please let me know if there are breakout boards that use HyperFlash which could be used for the initial investigation.

Edit: after a bit of googling, I really can't find anything regarding HyperFlash besides Cypress's website, which is a bit worrying. Most results are related to generic SPI/QSPI flash modules. I wonder if there's a reason behind this lack of widespread adoption. Perhaps it's more of a niche product?
« Last Edit: May 22, 2021, 04:41:19 am by OM222O »
 

Offline FlyingDutch

  • Regular Contributor
  • *
  • Posts: 147
  • Country: pl
Re: Image processing accelerator
« Reply #9 on: May 23, 2021, 10:43:39 am »
Hello,

First of all, please describe what goal you want to achieve. If it's just detecting people in a camera image, you don't need an FPGA at all. I implemented people detection on camera images (simply whether a person is in the image or not) with 95.58% accuracy (confusion matrix computed on 2000 images the model had never seen before). I used the Keras framework (Python) to train a CNN (convolutional neural network). I took the "VGG16" CNN model, which won the object detection competition in 2014, and used "transfer learning", meaning I removed the last layers from the model and trained only one added layer (a dense layer with two neurons). After that, this CNN can detect people or body parts in images with almost 96% prediction accuracy. But the VGG16 model is big - about 533 MB - and cannot run on a very small computer. So I then used a second model called "MobileNet", also with transfer learning; that CNN detects people in camera images with 92% accuracy. This solution works fine with the "TensorFlow Lite"
framework on the "Raspberry Pi Compute Module 4" - see link:

https://www.raspberrypi.org/products/compute-module-4/?variant=raspberry-pi-cm4001000
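
For illustration, a minimal transfer-learning sketch in Keras along the lines of what I described (the input size, optimizer and epoch count are just placeholders, and train_ds/val_ds stand for your own labelled image datasets):

Quote
import tensorflow as tf
from tensorflow.keras import layers, models

# Pretrained MobileNet without its classification head, frozen ("transfer learning")
base = tf.keras.applications.MobileNet(weights="imagenet", include_top=False,
                                        input_shape=(224, 224, 3), pooling="avg")
base.trainable = False

# Add only one new layer: a dense layer with two neurons (person / no person)
model = models.Sequential([base, layers.Dense(2, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)   # train_ds/val_ds: your labelled images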

For training I used a free "Google Colaboratory" account - it is a Linux machine with 12 GB RAM, a powerful Nvidia GPU and a TPU (tensor processing unit).
One can use the GPU or TPU to accelerate training of artificial neural networks on a Google Colab account, so you don't need any expensive hardware for training the model. I suggest familiarizing yourself with these tutorials on "deep learning":

https://deeplizard.com/learn/playlist/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL

BTW: If you really want to run all phases of model training and inference on your own hardware, rather buy a "Google Coral" USB AI accelerator - see link:
https://coral.ai/products/accelerator/

or a small microcontroller board with this accelerator on-board - for example:

https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-8-processors/coral-dev-board-tpu:CORAL-EDGE-TPU

Best regards
« Last Edit: May 23, 2021, 10:57:36 am by FlyingDutch »
 

Offline FlyingDutch

  • Regular Contributor
  • *
  • Posts: 147
  • Country: pl
Re: Image processing accelerator
« Reply #10 on: May 23, 2021, 12:17:09 pm »
Hello again,

After looking into your master's dissertation, I noticed that what you call a "retina" is a "Convolutional Neural Network" (CNN) and its typical operations (kernels and filters). Such CNNs are implemented in the "Keras" deep learning framework; you can just use those network models. After you have trained such a network, you can visualize how it "sees" images and make predictions. Once the model is built and trained, you can run it for object detection using only the "TensorFlow" framework. It is possible to run TensorFlow and OpenCV on some Zynq-7000 boards (together with PetaLinux), but it is not easy and I have never tried it, so I am not able to give you useful hints on how to achieve that goal.

Best regards

 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3246
  • Country: ca
Re: Image processing accelerator
« Reply #11 on: May 23, 2021, 02:51:48 pm »
I intially implemented the retina in software (using C and Python) which revealed the major bottleneck to be the memory speed!

I think you're wrong in your assessment. A high-end PC may have a 256-bit memory bus running at 2 GT/s, which is 64 GB/s of bandwidth. I don't think this would be the bottleneck. At any rate, an FPGA with such memory bandwidth to external RAM will be very expensive.

To assess speed potential, calculate how many accesses to memory you need per second and how wide your accesses are, how many MAC (multiply and accumulate) you need per second, etc. This is enough to assess the speed and ball-park a suitable FPGA.

You don't need a real FPGA for a proof of concept. You can create the design without buying an FPGA and the tools will validate the speed for you. Then you can decide which board to buy, or whether you really want to buy one at all.
 
The following users thanked this post: Silenos

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9935
  • Country: us
Re: Image processing accelerator
« Reply #12 on: May 23, 2021, 03:25:10 pm »
There are some very useful links in the previous few threads.  Thanks!
Filling up my bookmarks...
 

Offline olkipukki

  • Frequent Contributor
  • **
  • Posts: 790
  • Country: 00
Re: Image processing accelerator
« Reply #13 on: May 23, 2021, 05:05:26 pm »
1) Use something like a KRIA dev board which is specifically made for easy prototyping of image / video processing?
https://www.xilinx.com/products/som/kria.html


I guess at $200 it would be the best bet for the cheapest solution.

and while awaiting the kit, familiarize yourself with how to use it:

https://github.com/Xilinx/Xilinx_Kria_KV260_Workshop

https://www.xilinx.com/support/documentation/white_papers/wp529-som-benchmarks.pdf
 

Offline OM222OTopic starter

  • Frequent Contributor
  • **
  • Posts: 768
  • Country: gb
Re: Image processing accelerator
« Reply #14 on: May 23, 2021, 06:26:18 pm »
I have clarified this several times in this thread, but people seem to want to skip over it or may not understand the end goal. I'm not trying to detect people  :-// The memory IS CONFIRMED to be the bottlenecking factor (I have received hundreds of replies regarding this). Also, an FPGA offers advantages like the ability to prefetch data into buffers, whereas a CPU has to buffer a small amount of data into cache, process that data, send a request to memory to fetch more, and sit there idle for many cycles (made even worse by memory latency). Some of this can be addressed with things like out-of-order execution, branch prediction, etc., but despite the on-paper specs, FPGAs do very well, sometimes even better than a desktop PC, for image processing tasks. Here is a quick demo of a Zynq FPGA beating a computer in a very similar task:


Even if the end product is slower than a top-of-the-line desktop PC, that's fine, since the target to beat is a Raspberry Pi, not an AMD 5950X  :popcorn: Please read the previous replies before posting an answer.
 

Offline olkipukki

  • Frequent Contributor
  • **
  • Posts: 790
  • Country: 00
Re: Image processing accelerator
« Reply #15 on: May 23, 2021, 10:23:44 pm »
.. The memory IS CONFIRMED to be the bottle necking factor ...
What throughput and latency are required for your application?
 

Offline olkipukki

  • Frequent Contributor
  • **
  • Posts: 790
  • Country: 00
Re: Image processing accelerator
« Reply #16 on: May 23, 2021, 11:14:56 pm »
You might also consider SRAMs since they're fast and wide and deep enough that they might be useful for your cache level processing if such won't fit in BRAM. 

Very unlikely you will find cheap QDR II+ chips, so the cost of an FPGA might look like a bargain in comparison  ::)

 

Offline Someone

  • Super Contributor
  • ***
  • Posts: 4959
  • Country: au
    • send complaints here
Re: Image processing accelerator
« Reply #17 on: May 23, 2021, 11:21:54 pm »
The memory IS CONFIRMED to be the bottle necking factor (I have received hundreds of replies regarding this).
Then justify this: a specific implementation on a CPU with a trend connected to memory performance is not conclusive. You may have wandered around a local minimum and missed the bigger picture. Given the sparse nature of the convolution output and the symmetry/scaling of the kernels, you've probably chosen a poor implementation method. Also, what works well on a CPU is not what works well on an FPGA, so an entirely different approach may be needed.

What are the fundamental parameters that are causing memory pressure? Rates for pixel input, kernel lookup, pixel output, their associated bit depths, these can be quantified.

I have attached my dissertation where you can find the full details of the algorithm itself (chapter 4 is the design section and is relatively short, about 3 pages). From the results section you can see that multi threading adds no performance benefits but increasing the memory bandwidth (and even reducing its latency) scales linearly.
Just to reiterate the "application" so we're talking about the same thing: Its a sparse convolution to make an irregularly sampled image from a regularly sampled one?

Usually that's only done for simulation of what an optical system could provide as the raw data to a further processing system (such as a neural network); doing it in real time is a layer of abstraction that is almost certainly wasteful, as the downstream processing system could be trained on a "cheaper" representation of the data and end up with a similar result. Or, given a power/processing/cost budget, put more resources into the inference/intelligence and less into making it "nice" from a bio-motivated or conceptual ideal. Ditch the perfectly sized/shaped Gaussian-type kernels: what happens when the inference is trained/tuned on simple decimation from rectangular averages?

Spending huge time and resources to change the data format from uniform to this arbitrary and complex format, without considering other ways to achieve the desired result of the system, is a big waste of time. I know, I've worked in this exact field and resolved this exact problem of resampling regular images for downstream processing. You may have been tasked with this little slice of a larger project by someone else, or misunderstand the significance of foveated image structures, but it's pushing to build a very complicated system that doesn't solve any real problem.

The cost of doing an ideal foveated data compression/reduction step is disproportionate to its possible improvements elsewhere.
 
The following users thanked this post: FlyingDutch

Offline OM222OTopic starter

  • Frequent Contributor
  • **
  • Posts: 768
  • Country: gb
Re: Image processing accelerator
« Reply #18 on: May 24, 2021, 01:30:48 am »
The memory IS CONFIRMED to be the bottle necking factor (I have received hundreds of replies regarding this).
Then justify this, a specific implementation on a CPU there with a trend connected to memory performance is not conclusive. You may have wandered around a local minima and missed the bigger picture. Given the sparse nature of the convolution output and the symmetry/scaling of the kernels, you've probably chosen a poor implementation method. Also, what works well on a CPU is not what works well on a FPGA, so an entirely different approach may be needed.

What are the fundamental parameters that are causing memory pressure? Rates for pixel input, kernel lookup, pixel output, their associated bit depths, these can be quantified.

I have attached my dissertation where you can find the full details of the algorithm itself (chapter 4 is the design section and is relatively short, about 3 pages). From the results section you can see that multi threading adds no performance benefits but increasing the memory bandwidth (and even reducing its latency) scales linearly.
Just to reiterate the "application" so we're talking about the same thing: Its a sparse convolution to make an irregularly sampled image from a regularly sampled one?

Usually thats only done for simulation of what an optical system could provide as the raw data to a further processing system (such as a neural network), doing it in realtime is a layer of abstraction that is almost certainly wasteful as the downstream processing system could be trained on a "cheaper" representation of the data and end up with a similar result. Or, given a power/processing/cost budget, putting more resources into the inference/intelligence and less into making it "nice" from a bio-motivated or conceptual ideal. Ditch the perfectly sized/shaped gaussian type kernels, what happens when the inference is trained/tuned on simple decimation from rectangular averages?

Spending huge time and resources to change the data format from uniform to this arbitrary and complex format without considering other ways to achieve the desired result of the system is a big waste of time. I know, I've worked on this exact field, and resolved this exact problem of resampling regular images for downstream processing. You may have been tasked with this little slice of a larger project by someone else, or misunderstand the significance of foveated image structures, but its pushing to build a very complicated system that doesn't solve any real problem.

The cost of doing an ideal foveated data compression/reduction step is disproportionate to its possible improvements elsewhere.

Regarding the data throughput problem: I have confirmed it via 4 separate methods, only one of which (comparison between two different speeds, or as you call it the trend) was mentioned in the paper. Other methods were using theoretical maximum speed calculations, memory throughput benchmarks and finally an alternative packing method which processes the exact same data, but gets rid of blank elements to reduce memory throughput, which scaled linearly. The results are very conclusive if you ask me.

I appreciate the comment regarding this "high speed in a small package" approach being a waste of time, but we already have projects where this process is still the main bottleneck of a CNN which uses fovea sampled inputs. I already have a few potential follow up projects which all can use a faster retina (using a PC) or one with a small footprint and low power consumption (small robots).

The idea of just averaging rectangular areas might also be worth investigating, but it's outside the scope of the current thread. I'll try to look into it, but I actually feel like it's going to perform worse (replacing MAC with accumulate-and-divide, which as far as I know is not a single instruction; also, division is about 3x slower than multiplication, at least on mainstream Intel platforms).

...
For such a simple 24-pin IC with 12-data lines or such  it wouldn't be hard to just make a plug in module or PCB yourself as needed holding from 1-4 devices.

There's one example of a RAM adapter board here:
https://docs.icebreaker-fpga.org/hardware/pmod/hyperram/

It seems like building my own modules is the way to go since I can't find anything regarding HyperFlash. HyperRam seems to be the more common of the two.

1) Use something like a KRIA dev board which is specifically made for easy prototyping of image / video processing?
https://www.xilinx.com/products/som/kria.html


I guess for $200 it would be best bet for a cheapest solution.

and while awaiting a kit, familiarize how to

https://github.com/Xilinx/Xilinx_Kria_KV260_Workshop

https://www.xilinx.com/support/documentation/white_papers/wp529-som-benchmarks.pdf

They certainly look like a decent option. I have to familiarize myself with the different families xilinx provides to make a final decision, but having a small form factor module is definitely a plus.

.. The memory IS CONFIRMED to be the bottle necking factor ...
What throughput and latency are required for your application?

There are no strict requirements defined so far, as this will be a first implementation as far as I'm aware. Throughput is the more important factor here, since latency can be dealt with by prefetching and buffer multiplexing. As for throughput, some very basic calculations (total number of bits fetched divided by frame time) show that a minimum of about 5 Gb/s is required, but it's nice to err on the side of caution and go with something around 8 to 10 Gb/s.
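
For reference, this is the kind of back-of-the-envelope I mean (the cell count, coefficients per cell and bit widths below are placeholders, not my real retina parameters):

Quote
cells      = 50_000          # retina cells (placeholder)
per_cell   = 64              # average nonzero coefficients per cell (placeholder)
bits_fetch = 8 + 16 + 32     # pixel + coefficient + index bits fetched per MAC
fps        = 30

required_gbps = cells * per_cell * bits_fetch * fps / 1e9
print(required_gbps)         # ~5.4 Gb/s with these placeholder numbers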

I will begin experimenting with a few of the suggested methods (HyperFlash, QSPI and possibly PCIE storage) for now. Thanks everyone for leaving suggestions.

 

Offline Someone

  • Super Contributor
  • ***
  • Posts: 4959
  • Country: au
    • send complaints here
Re: Image processing accelerator
« Reply #19 on: May 24, 2021, 03:33:33 am »
this process is still the main bottleneck of a CNN which uses fovea sampled inputs
That's the problem: creating synthetic foveated data with this method is expensive and without any justification for its accuracy/quality. The CNN likely doesn't care and can be retrained on cheaper data sources. Trying to make synthetic foveation faster/cheaper is almost certainly an inefficient use of time/effort. Give the CNN an equally arbitrary layer there and it will find a better solution; that's the whole movement to ML and CNNs. Computers find better convolutional structures than human-designed/imagined ones, so put as much inside the ML as possible.

Image -> foveation -> CNN2 -> result
vs
Image -> CNN1 -> result

OMG! smaller CNN1! Publish publish publish.

But when:
foveation cost > cost(CNN1) - cost(CNN2)
Lols

These costs are easily obtained/presented, but aren't.

The idea of just averaging rectangular areas might also be worth investigating but outside the scope of the current thread. I'll try to look into it, but I actually feel like it's going to perform worse (changing MAC with accumulate and divide, which as far as I know is not an instruction. Also division is about 3x slower than multiplication (at least on mainstream intel platforms)).
Right there you completely miss the fundamentals of convolution, signal processing, and FPGAs. This really isn't the project for your skill/experience set (besides the fact that the task is already questionable).

Regarding the data throughput problem: I have confirmed it via 4 separate methods, only one of which (comparison between two different speeds, or as you call it the trend) was mentioned in the paper. Other methods were using theoretical maximum speed calculations, memory throughput benchmarks and finally an alternative packing method which processes the exact same data, but gets rid of blank elements to reduce memory throughput, which scaled linearly. The results are very conclusive if you ask me.
No, they aren't conclusive when applied to FPGAs, which have entirely different data and memory handling characteristics. "Gets rid of blank elements" does not describe how you did it (previous mentions in this thread suggested very inefficient methods). Entirely different memory structures and handling should be considered when moving an algorithm to an FPGA.

Assuming the foveation is entirely done with radially symmetric gaussians (or at least small shapes that can be repeated with reflection) then your primary claim is incorrect:
Quote
A better approach would be creating a larger image that contains multiple kernels, which can be thought of as a larger kernel; This approach only needs one calculation to find the corner coordinates and allows for a cache friendly way to multiply the original image with this new coefficient image.
That might be simpler from a high level language and may well be more efficient if the abstraction overheads are quite high in your test. But in FPGAs that no longer applies and there are huge gains from doing things with smaller localities of memory/data. Your tests are only valid for the assumptions/platforms/structure that you haven't fully defined, and are therefore likely to be misleading or entirely incorrect when applied to a different computational architecture. "one calculation" is nonsense when compute resources are counted in add/mult operations, memory bandwidth is counted in bits/s, but you won't tell us what the actual compute task is.

A 5x5 Gaussian is not the naive 25 mults + 24 two-way adds + 50 32-bit memory lookups. It might be in a very poor implementation, but even a trivial FPGA architecture is 6 mults (some can be simplified out with binary shifts), 24 two-way adds and streaming data (zero memory lookups). This can be generalised to support arbitrary sizes and sub-pixel positioning, still at lower computational cost than brute-force convolution. This is all routine work that has been done many times before by many different people.
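
A minimal NumPy sketch of what that folded, separable structure looks like (binomial 1-4-6-4-1 coefficients as a stand-in for a real Gaussian, wrap-around borders purely to keep it short):

Quote
import numpy as np

def gauss5_separable(img):
    # One horizontal pass then one vertical pass; symmetry folds each pass to 3 multiplies.
    a, b, c = 1, 4, 6                      # 1 4 6 4 1 taps, normalise by 16 per pass
    def pass1d(x, axis):
        r = np.roll                        # wrap-around borders, for brevity only
        return (a * (r(x, 2, axis) + r(x, -2, axis))
                + b * (r(x, 1, axis) + r(x, -1, axis))
                + c * x) // 16
    return pass1d(pass1d(img.astype(np.int32), axis=1), axis=0)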
 

Offline OM222OTopic starter

  • Frequent Contributor
  • **
  • Posts: 768
  • Country: gb
Re: Image processing accelerator
« Reply #20 on: May 24, 2021, 05:24:37 am »
Thats the problem, creating synthetic foveated data with this method is expensive and without any justification for its accuracy/quality. The CNN likely doesn't care and can be retrained on cheaper data sources. Trying to make synthetic foveation faster/cheaper is almost certainly an inefficient use of time/effort. Give the CNN an equally arbitrary layer there and it will find a better solution, thats the whole movement to ML and CNNs. Computers find better convolutional structures than human designed/imagined ones, put as much inside the ML as possible.

I feel like you're entirely missing the point of "foveated vision". To process large images (1080P for example), a CNN will be nowhere near fast enough due to the massive amount of data. A retina samples this down to a much lower number of elements (for example 50K), which is smaller than even a 640x480P image (around 307k). You may spend more time on the foveation step compared to a simple convolutional layer, but you massively decrease the amount of data that needs to be processed. It's essentially a lossy compression algorithm. Cropping or resizing have their own downsides which are mentioned in the dissertation.

Assuming the foveation is entirely done with radially symmetric gaussians (or at least small shapes that can be repeated with reflection) then your primary claim is incorrect:
Quote
A better approach would be creating a larger image that contains multiple kernels, which can be thought of as a larger kernel; This approach only needs one calculation to find the corner coordinates and allows for a cache friendly way to multiply the original image with this new coefficient image.
That might be simpler from a high level language and may well be more efficient if the abstraction overheads are quite high in your test. But in FPGAs that no longer applies and there are huge gains from doing things with smaller localities of memory/data. Your tests are only valid for the assumptions/platforms/structure that you haven't fully defined, and are therefore likely to be misleading or entirely incorrect when applied to a different computational architecture. "one calculation" is nonsense when compute resources are counted in add/mult operations, memory bandwidth is counted in bits/s, but you won't tell us what the actual compute task is.

A gaussian 5x5 is not the naive 25x mults + 24x two way adds + 50x 32bit memory lookups. It might be in a very poor implementation, but even a trivial FPGA architecture is 6x mults (some can be simplified out with binary shifts) 24x two way adds and streaming data (zero memory lookups). This can be generalised to support arbitrary sizes and sub pixel positioning, still at lower computational cost than brute force convolution. This is all routine work that has been done many times before by many different people.

The kernels are not symmetrical, the gaussians themselves are sub pixel shifted. If you want some examples, I can provide visualizations in a later post.

The methods I used do seem weird or even inefficient, but have been the result of about 1.5 years of experimenting with different approaches and are proven to be the fastest (CPU rendering that beats GPU performance, just think about that). I'm actually in the process of creating a CUDA version of the Cython extensions to test if applying the same methods can increase the GPU performance as well (GPU has a ton of bandwidth, but is pretty crap at doing lots of small random reads, which is the case with the retina, so my implementation theoretically improves that massively). The new approach that "gets rid of the blank elements" (coefficients that are zero) looks even more absurd and is hard to explain so I will make some graphics to explain it in my next post.

Lastly: I'm well aware that this project will be completely new (I have essentially 0 experience with FPGAs), but the entire purpose of these projects is research, learning project management skills AND learning about new things. Before last year, I had 0 experience with image processing or foveated vision as well, but I delivered pretty decent results. The wrong way of approaching any challenge is: "Well I might as well not try since I have no clue what I'll be doing"; that way you'll be stuck forever where you are. Even the most skilled programmers or god like engineers, started from not knowing anything and learned their way out of the problems they faced. I'm certainly up for the challenge anyways.  ;)
« Last Edit: May 24, 2021, 05:26:22 am by OM222O »
 

Offline Someone

  • Super Contributor
  • ***
  • Posts: 4959
  • Country: au
    • send complaints here
Re: Image processing accelerator
« Reply #21 on: May 24, 2021, 06:27:55 am »
Thats the problem, creating synthetic foveated data with this method is expensive and without any justification for its accuracy/quality. The CNN likely doesn't care and can be retrained on cheaper data sources. Trying to make synthetic foveation faster/cheaper is almost certainly an inefficient use of time/effort. Give the CNN an equally arbitrary layer there and it will find a better solution, thats the whole movement to ML and CNNs. Computers find better convolutional structures than human designed/imagined ones, put as much inside the ML as possible.
I feel like you're entirely missing the point of "foveated vision". To process large images (1080P for example), a CNN will be nowhere near fast enough due to the massive amount of data. A retina samples this down to a much lower number of elements (for example 50K), which is smaller than even a 640x480P image (around 307k). You may spend more time on the foveation step compared to a simple convolutional layer, but you massively decrease the amount of data that needs to be processed. It's essentially a lossy compression algorithm.
I fully understand compressed sensing and foveated images, having done research and work in the field. You're not listening and are focused on your "task", which, as I have explained, is stupid.

If the cost of the specific decimating/reducing/filtering/foveating of the image is larger than the cost gains in later stages, then it is a complete waste of time. There are innumerable ways to do data reduction; CNNs have such methods through strides etc., and foveation could be considered a layer with high (but not full) connectivity and adaptive/variable strides. Constrain the learning to find a more efficient use of that computational cost and I'm sure it would. There is no evidence that the foveated pattern you are pursuing is the optimal solution; it's a designed pattern, which is the opposite of ML. If it were a cheap operation it might just get hand-waved away as a bit of polish, but it's a huge sink of computational resources which is unjustified.

There is nothing in your or your colleagues' work that I have seen to justify the use of the foveated reduction. Sure, it reduced the cost of a CNN classifier, but foveation wasn't compared to any other data reduction step or method, and its computational cost was not accounted for in the comparison. It would justify a fancy lens that could do that processing for "free", but not what you are erroneously extending it to.

Cropping or resizing have their own downsides which are mentioned in the dissertation.
Sure, so where is the comparison to show how they are less efficient in computational cost when coupled to a CNN? They were never compared in the publications I can see.

Quote
The methods I used do seem weird or even inefficient, but have been the result of about 1.5 years of experimenting with different approaches and are proven to be the fastest (CPU rendering that beats GPU performance, just think about that).
Again, that comes from your enforcement of a very particular approach and method, which probably isn't the best method to use on a GPU or FPGA. It's very easy to produce inefficient examples without even realising it, unless you understand the underlying architectures. You can keep pointing back to how great an improvement you made to the CPU case, and that's a useful thing, but it's not any indication of that method being applicable, appropriate, or sensible to use on a GPU or FPGA.

The wrong way of approaching any challenge is: "Well I might as well not try since I have no clue what I'll be doing"; that way you'll be stuck forever where you are. Even the most skilled programmers or god like engineers, started from not knowing anything and learned their way out of the problems they faced. I'm certainly up for the challenge anyways.
You've already jumped to the FPGA algorithm and are trying to buy hardware to implement it; that's putting the cart before the horse. Go back and learn how to do image manipulation with a specific focus on implementations for FPGA architectures, because they're a world apart from what you are proposing to shoe-horn into an FPGA. It'll be yet another case of "I approached an FPGA with no guidance or clue about how to use it" that ends in an inefficient implementation. All your posts keep reinforcing that this is a classic square-peg-in-round-hole approach. But hey, we might end up with some more rubbish publications.
 
The following users thanked this post: FlyingDutch

Offline FlyingDutch

  • Regular Contributor
  • *
  • Posts: 147
  • Country: pl
Re: Image processing accelerator
« Reply #22 on: May 24, 2021, 07:23:25 am »
I have clarified this several times on this thread but people seem to want to skip over it or may not understand the end goal. I'm not trying to detect people  :-// The memory IS CONFIRMED to be the bottle necking factor (I have received hundreds of replies regarding this).

Even if the end product is slower than a top of the line desktop PC, it's fine since the target to beat is a raspberry pi, not an AMD 5950x  :popcorn: Please read the previous replies before posting an answer.
Hello,

Yes, I don't understand what you want to achieve or what your goal is. After reading all the answers (posts) in this thread, my understanding is that you just want to implement a convolutional neural network, and you have strange issues because you chose a bad method of implementing it. Such neural networks are very easy to implement without similar issues, but you have to choose the right tools. Just read the tutorials from "Deep Lizard" and you will be able to implement your "retina" without strange issues. Of course an FPGA can help with computer vision and accelerate operations on convolutional neural networks, but why reinvent the wheel?

Best Regards
 

Offline Silenos

  • Regular Contributor
  • *
  • Posts: 63
  • Country: pl
  • Fumbling in ignorance
Re: Image processing accelerator
« Reply #23 on: May 24, 2021, 07:56:50 am »
Apart from everything else, it is hard to believe any current PC DDR RAM would choke; I mean, having 4x DDR4-3000+ at CL<15/16 in dual channel isn't expensive today. Having it outdone by QSPIs on an FPGA... well, there's a high chance of running out of pins on any FPGA. IMO any FPGA board with DDR3 would do.
And finally you give the 10 Gbit/s figure - there's little chance you have bottlenecked the DDR RAM, and no chance on a multicore CPU with multithreaded software. And it all seems irrelevant, as the PC isn't supposed to be the target platform of your application.
 

