The reason behind choosing the cheapest reasonable FPGA I could find is that I'm a student paying out of my own pocket for these projects, so my time is actually pretty cheap. The main goal is to have the software retina working in robotics applications (small robots) or things like CubeSats, so powerful, PC-grade GPUs (even old ones) are out of the question due to their large size and high power consumption. The Cmod S7 should be enough for me to simulate a small retina and to benchmark the critical aspects of the project. Simulations are good, but I'd rather have the real thing working.
I have attached my dissertation, where you can find the full details of the algorithm itself (chapter 4 is the design section and is relatively short, about 3 pages). From the results section you can see that multithreading adds no performance benefit, but increasing the memory bandwidth (and even reducing its latency) scales performance linearly. I have confirmed this via another method which I won't discuss here, since it's outside the scope of this thread. I have zero hope for SBCs like the Pi, Jetson Nano, etc. When a 6700HQ paired with 2133 MHz DDR4 @ CL15 is facing memory bottlenecks, the slower RAM on lower-end devices will cause significant problems. With an FPGA, data can be prefetched into small registers while the previous data is still being processed; this means that if the memory bandwidth is high enough, the processing logic won't sit idle waiting for data to arrive from memory. If the memory is too fast, more DSP slices can be added so that a fine balance is reached. Another advantage would be being able to do all three operations mentioned in the dissertation at once, given a one-cycle latency between sampling and the other operations, whereas a conventional CPU has to carry out each operation sequentially.
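Put differently: once the working set no longer fits in cache, throughput is capped at roughly (memory bandwidth) ÷ (bytes touched per sample), and adding more compute, whether CPU threads or DSP slices, only helps up to the point where that bandwidth is saturated. That's consistent with both the flat multithreading results and the linear scaling with bandwidth in the dissertation.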
I have also found a project that seems to implement a decent, lightweight QSPI memory controller, but I've not had enough time to read through it completely:
https://zipcpu.com/blog/2019/03/27/qflexpress.html
It promises very fast read speeds, which, again, is my main bottleneck. I hope that answers your questions about my "odd" choices and why I chose a rather difficult task as a first project.
I have used some tricks to reduce the memory footprint of the software retina by 2x, which resulted in almost a 2x performance increase (1.7x) on my laptop. I appreciate all of your suggestions about the alternatives, but please let's go back to the original proposal. I honestly can't think of a systematic flaw, but there might be better ways of dealing with the bandwidth problem (for example, HBM for FPGAs? As far as I know that doesn't exist, but I'm also very new to this). My main inspiration for using an FPGA came from slow-motion cameras, which must use an FPGA to capture images at high frame rates (large data throughput); my application is pretty much the same, but needs high read speeds as opposed to high write speeds.
Edit: the PDF is too large to post as an attachment here (15 MB), so I have uploaded it to Google Drive:
https://drive.google.com/file/d/1gRrMOzkGZNQLeC0UNWlvFhYIsdlaMoiy/view?usp=sharing
Again, regarding the memory bandwidth: my current system has 2133 MT/s × 64 bits = 136,512 Mb/s ≈ 17 GB/s, which is a bottleneck for my application, so DMA with DDR3 memory will definitely be a bottleneck (I'm pretty sure you calculated it wrong: you don't need the ×2, since 2133 is the rated MT/s, not the clock frequency, which is half of that. For example, 2133 "MHz" memory is actually running at 1066 MHz. Also, looking at many Zynq dev boards, they advertise "512MB DDR3 with 16-bit bus @ 1050Mbps".) To be clear, I don't expect the FPGA implementation to beat the laptop's performance, but it'll be a far cry better than the dismal throughput of something like a Raspberry Pi. Not to mention that the limited throughput on those devices will be shared with many other processes and the OS, which is a recipe for disaster.
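To put a number on that Zynq figure (reading "1050Mbps" as 1050 Mb/s per data pin, which is how those boards are usually specified): 16 bits × 1050 Mb/s ≈ 16.8 Gb/s ≈ 2.1 GB/s of theoretical peak, i.e. roughly an eighth of the laptop's single-channel figure above, and that's before the OS and any other bus masters take their share.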
Here is the core of the algorithm, written in Cython (custom C extensions for Python):
cimport cython

@cython.wraparound(False)
@cython.boundscheck(False)
cpdef sample(unsigned char[::1] img_flat, unsigned short[::1] coeffs, unsigned int[::1] idx, unsigned int[::1] result_flat):
    cdef unsigned int x
    with nogil:  # release the GIL; the loop below compiles to plain C
        for x in range(img_flat.shape[0]):
            if coeffs[x] > 0:  # zero coefficients contribute nothing, so skip them
                result_flat[idx[x]] += img_flat[x] * coeffs[x]
the "[::1]" means that the date is stored as a contiguous, flat (1D) array. As you can see the "idx" array is used as a look up table and the only real line of code is a multiply accumulate: "result_flat [ idx [ x ] ] += img_flat [ x ] * coeffs [ x ] ". The if statement, only skips unceneccaray operations (if you multiply by 0 and add the result to a value, the value doesn't change). This is the part where I have managed to use some tricks to remove all 0 elements from the array to reduce the memory overhead by about 2x.
The specific IC I'd like to use is:
https://www.mouser.co.uk/ProductDetail/Winbond/W25Q256JVFIM?qs=qSfuJ%252Bfl%2Fd7mFTiWxnKquA%3D%3D
However, I may drop to a lower capacity if I end up going with, for example, a 128-bit bus (since that requires more chips to begin with, and I don't really need massive amounts of ROM).