Author Topic: FPGA to softcore or hardcore data stream bus  (Read 2222 times)


Offline lk.dgironi (Topic starter)

  • Regular Contributor
  • *
  • Posts: 76
  • Country: it
FPGA to softcore or hardcore data stream bus
« on: February 19, 2024, 09:45:40 am »
Hi all,

My question is about a data stream bus between an FPGA and a softcore or hard microprocessor (let's call it the SoC), to transfer a data stream in both directions.

I'm working on a project that has the following architecture:
ADC -> FPGA -> SoC -> FPGA -> DAC
ADC to FPGA and FPGA to DAC currently work at 100kHz.
FPGA to SoC and SoC to FPGA should happen at 100kHz, but even 10kHz could be enough.
Bandwidth used:
  • FPGA -> SoC: 32 channels @ 24 bit from ADC, so at least 768 bit ( + some protocol header)
  • SoC -> FPGA: 32 channels @ 16 bit to DAC, so at least 512 bit ( + some protocol header)
The SoC does some simple math. It may even work without floating point, though I still have to check; of course it would be better to have the resources for an FPU.

My current hardware is a 25k FPGA from GoWin (GW5A-LV25MG121 on the Tang Primer 25K). I'm using it just for tests, because synth+place+route is really fast. It will not have enough LUTs for the whole project. I think the final project will run either on a Xilinx part, so I can use an ARM core, or better on the GoWin GW5AST-LV138FPG676A found on the Tang Mega 138K, so I can run it on the RISCV hard core. I prefer this last option, to stay on the RISCV architecture, but I have no problem going to ARM.
Using Xilinx for this test could be an option, but I'm talking about 1 minute for synth+place+route on GoWin vs 15 minutes on Xilinx on the same i7 PC. Debugging would become a pain. Anyway, if I have no alternative I can move to Xilinx.

The ADC -> FPGA and FPGA -> DAC stream is already working (at 100kHz).
I'm experimenting with VexRiscv (I must admit the learning curve of SpinalHDL on top of Scala is a bit steep).
I've implemented an AXI stream that talks to the softcore through APB3, starting from the Briey SoC (see my post on this topic here: https://github.com/SpinalHDL/VexRiscv/issues/391).
What I've noticed is that it takes almost 50 clock cycles per channel to transfer FPGA -> SoC, and a little less (40) to transfer SoC -> FPGA; this roughly means 90*32 ≈ 2900 cycles just to transfer data. At 72MHz (the speed I'm running my FPGA and SoC at), if I want to reach 10kHz this leaves me about 4300 cycles (72000000/10000 - 2900) for everything else, which is not that much.
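The cycle budget above can be re-checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the figures quoted in the post (72MHz clock, roughly 50 and 40 cycles per channel, 32 channels), and the function and constant names are made up for illustration.

```python
# Cycle budget per control tick, using the figures quoted in the post.
CLOCK_HZ = 72_000_000      # FPGA and SoC clock
CHANNELS = 32
CYCLES_IN_PER_CH = 50      # FPGA -> SoC, measured ("almost 50")
CYCLES_OUT_PER_CH = 40     # SoC -> FPGA, measured ("a little less")


def cycle_budget(tick_hz: int) -> dict:
    """Return total, transfer and remaining cycles for one control tick."""
    total = CLOCK_HZ // tick_hz
    transfer = (CYCLES_IN_PER_CH + CYCLES_OUT_PER_CH) * CHANNELS
    return {"total": total, "transfer": transfer, "left": total - transfer}


for rate in (10_000, 100_000):
    print(rate, cycle_budget(rate))
```

At 10kHz this gives 7200 cycles per tick, 2880 of them spent on transfers and 4320 left for the math. At 100kHz there are only 720 cycles per tick, so word-by-word transfers at this cost cannot fit at all, whatever the CPU does with the data.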

So, maybe 10kHz is too high? Or maybe it is something related to VexRiscv?
I was also thinking about some kind of DMA access, but I'm not that expert in this architecture.

Thanks!
 

Offline glenenglish

  • Frequent Contributor
  • **
  • Posts: 265
  • Country: au
  • RF engineer. AI6UM / VK1XX . Aviation pilot. MTBr.
Re: FPGA to softcore or hardcore data stream bus
« Reply #1 on: February 19, 2024, 10:41:53 pm »
So at 100kHz you have approx. 9.6 Mbytes/sec in each direction (6.4 Mbytes/sec on the 16-bit return path).
Low speed... but you need to think about either a DMA interface or a shared dual-port memory.
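For reference, the data rate follows directly from the frame sizes in the first post. A minimal sketch, ignoring any protocol header (the function name is just for illustration):

```python
def stream_rate_bytes_per_sec(channels: int, bits_per_sample: int, frame_hz: int) -> int:
    """Raw payload rate of a multichannel sample stream, in bytes per second."""
    return channels * bits_per_sample * frame_hz // 8


adc_rate = stream_rate_bytes_per_sec(32, 24, 100_000)   # FPGA -> SoC
dac_rate = stream_rate_bytes_per_sec(32, 16, 100_000)   # SoC -> FPGA
print(adc_rate / 1e6, "Mbytes/sec from the ADC side")
print(dac_rate / 1e6, "Mbytes/sec to the DAC side")
```

At a 10kHz frame rate both figures drop by a factor of ten.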

As for the FPGA interface and the tools: by the time you have a big design, I bet the Xilinx tools will be faster.

If the Xilinx tools are taking longer for the same design, then you probably have not constrained the timing sufficiently.

Xilinx Vivado will by default try to time everything, and if you do not constrain smartly you can end up with 10x the place and route time. (ISE was the opposite: nothing was timed unless explicitly timed.)

If you are happy with the Gowin tools, I would suggest staying with Gowin. No shame in that, and I am quite impressed with Gowin.

For your interface, you need to build a multiplexer from your multiple parallel ADC streams into a single multichannel AXI4 stream, then work on getting that into the RISCV hardcore (or a softcore like Microblaze).
Many cores have native AXI4 stream interface components for DMA or instruction-level access.

Alternatively, first multiplex the stream, then write it into a dual-port RAM that is shared memory in the memory map of the processor system. That's what I do most of the time.
For high-speed stuff, like say 400 Mbytes per second, my multiplexed AXI4 stream writes directly into DDR via DMA...
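One common way to use such a shared dual-port RAM is a ping-pong (double buffer) layout: the fabric fills one half while the CPU reads the other. The sketch below is only a software model of that idea, not a description of any particular design in this thread; the class, its methods and the two-half layout are assumptions for illustration.

```python
class PingPongBuffer:
    """Software model of a dual-port RAM split into two frame-sized halves.

    The fabric port fills one half while the CPU port reads the other;
    the halves swap at each frame boundary. Hypothetical layout.
    """

    def __init__(self, frame_words: int):
        self.frame_words = frame_words
        self.mem = [0] * (2 * frame_words)   # the shared RAM
        self.write_half = 0                  # half currently owned by the fabric

    def fabric_write_frame(self, samples):
        """Fabric port: store one multiplexed frame, then hand it over."""
        if len(samples) != self.frame_words:
            raise ValueError("frame length mismatch")
        base = self.write_half * self.frame_words
        self.mem[base:base + self.frame_words] = samples
        self.write_half ^= 1                 # swap halves at the frame boundary

    def cpu_read_frame(self):
        """CPU port: read the most recently completed frame."""
        base = (self.write_half ^ 1) * self.frame_words
        return self.mem[base:base + self.frame_words]


buf = PingPongBuffer(frame_words=32)
buf.fabric_write_frame(list(range(32)))
print(buf.cpu_read_frame()[:4])
```

In real hardware the swap would typically be signalled to the CPU by an interrupt or a status register, so the CPU reads a plain memory block once per frame instead of paying a bus transaction per sample.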

-glen
 

Offline lk.dgironi (Topic starter)

  • Regular Contributor
  • *
  • Posts: 76
  • Country: it
Re: FPGA to softcore or hardcore data stream bus
« Reply #2 on: February 20, 2024, 08:46:10 am »
Thanks @glenenglish

I'm trying to stay with GoWin just for speed reasons. Just consider a simple blinky Verilog design: synth+place+route in the GoWin IDE takes 30 sec; the same thing on the same PC in Vivado takes 7 min. I tried the same thing on a Linux laptop, with almost the same result.
That said, I'm pretty sure I'll have to move to Xilinx for the final project, but I want to stay on GoWin for development. This is part of the reason I would like to stay on a portable softcore like VexRiscv, but again I think I'll have to move away for performance reasons; I still have to work on this.

I've already built a mux that takes all the ADC channels to a single AXI4 stream, and a demux that takes the AXI4 stream to the DAC. But it takes almost 3000 cycles for mux (FPGA -> SoC) + demux (SoC -> FPGA).

Do you think a dual port BRAM could do the job?

Are you using a hardcore MCU (like the Zynq ARM) or Microblaze for your high speed stuff (the 400Mbps one I mean)?
 

Offline glenenglish

  • Frequent Contributor
  • **
  • Posts: 265
  • Country: au
  • RF engineer. AI6UM / VK1XX . Aviation pilot. MTBr.
Re: FPGA to softcore or hardcore data stream bus
« Reply #3 on: February 21, 2024, 02:17:38 am »
Hi
I use Microblaze and Zynq and Zynq Ultrascale.....
Which one you use might depend on how much processing you want to do in software.
As you know, processing in software is fast to write and debug. But there is a limit you hit...
If there is lots of floating point and vector work, and large network bandwidth is required, Zynq and Zynq UltraScale are good choices. The DMA facilities are very powerful and wide.
But the complexity of the system is high...
Using Microblaze is easy, no 7000 page datasheet to know. Use the Microblaze in PERFORMANCE setting (deep pipeline) and the FPU will generate 1 result per clock... fast!

The dual-port RAM is a good way to do interfacing. In Microblaze it's very easy to add a dual-port RAM to the memory map. Go for a cacheless system, using block RAMs for all memory, and it's quite fast.

DDR-based Zynq systems require cache miss considerations. Staying away from DDR can simplify your hardware design a huge amount...

You'll need to determine how many cycles are required for your sigproc. You can estimate this pretty well from the number of C lines. If you get stuck, hardware accelerators in Microblaze can be interfaced with a single instruction if you like.

Also, it's worthwhile trying Efinix. Excellent devices, and they have a new quad RISCV hardcore part. I have consumed a tray of Titanium family devices. The DSP blocks and block RAMs will run at 1000 MHz, no kidding. Tools are OK. Routing methods are different, so designs may be bigger than on Xilinx-style devices; expect a 2:1 ratio on average.

Oh, and that's 400 MBps, not 400 Mbps.
400 Mbps is easy. 400 MBps is harder, but for a 150 MHz Microblaze 32-bit AXI interface, theoretical max perf is still 4800 Mbps... so bus transfer is not the issue; what you might do with the data is the issue.
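The bus figure is just clock times width. A quick check, assuming one 32-bit beat per clock with no stalls (an idealised upper bound, not a measurement):

```python
def bus_peak_mbps(clock_mhz: int, width_bits: int) -> int:
    """Theoretical peak of a bus moving one full-width beat per clock, in Mbit/s."""
    return clock_mhz * width_bits


peak_mbps = bus_peak_mbps(150, 32)        # 150 MHz, 32-bit AXI
peak_mbytes = peak_mbps // 8              # same figure in Mbytes/s
print(peak_mbps, "Mbps peak,", peak_mbytes, "MBps peak")
print("400 MBps uses", round(400 / peak_mbytes * 100), "% of that peak")
```

So a 400 MBps stream would take about two thirds of the idealised peak, while the roughly 9.6 MBps discussed in this thread is under 2% of it.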
-glen
« Last Edit: February 21, 2024, 02:19:19 am by glenenglish »
 

Offline lk.dgironi (Topic starter)

  • Regular Contributor
  • *
  • Posts: 76
  • Country: it
Re: FPGA to softcore or hardcore data stream bus
« Reply #4 on: February 21, 2024, 04:04:59 pm »
Thanks for your help!

My mistake: going from "b" to "B" makes things 8 times harder :)

Well, my application has some floating point and a fair amount of code. I don't yet know how much, because I first have to write the container for the application, i.e. decide the architecture; then I can write the C code.
The code will implement a "controllable" PID system that does some math on the 32 channels @ 24 bit from the ADC and outputs 32 channels @ 16 bit to the DAC. The application also has to compute the target (setpoint) signal, compare it with the feedback signal coming from the ADC, and output the PID-corrected signal to the DAC.
There is also some logic around this (for example, the feedback for a PID may come from ADC signal 1 + ADC signal 2, or the PID settings can change after X seconds of execution, and so on).
It's not that simple.
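To make the per-channel work concrete, here is a minimal floating-point sketch of one PID step whose feedback is the sum of two ADC channels, as in the example above. Everything in it is an assumption for illustration: the textbook PID form, the gains, the 1/10kHz time step and the signed 16-bit DAC range are not taken from the actual application.

```python
from dataclasses import dataclass


@dataclass
class Pid:
    """Textbook PID controller with a fixed time step (illustrative only)."""
    kp: float
    ki: float
    kd: float
    dt: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, feedback: float) -> float:
        error = setpoint - feedback
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def to_dac16(value: float) -> int:
    """Round and clamp to a signed 16-bit code (assumed DAC format)."""
    return max(-32768, min(32767, int(round(value))))


adc = [1000, 250] + [0] * 30                  # one frame of 32 ADC samples
pid = Pid(kp=1.44, ki=0.0, kd=0.0, dt=1e-4)   # P-only, 10kHz tick
feedback = adc[0] + adc[1]                    # feedback = ADC signal 1 + ADC signal 2
print(to_dac16(pid.step(setpoint=2000, feedback=feedback)))
```

One such step is a handful of multiplies and adds per channel, which is the number to multiply by 32 channels and compare against the cycles left in each tick.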
It also has 64 bits of digital input and output to be managed by the software.
In addition to the ADC / DAC / GPIO signals there is also another slow async stream: the command stream. Commands are received from the Gigabit interface, which is driven in Verilog. They pass through to the SoC and set up the way the application works (for example, command #10 with payload 144 can mean "set the P for line 1 of the application to 1.44"). Commands have a 128-bit payload, and can also go out from the SoC (for example, command #14 with payload 55 can mean "error number 55 in the application").
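A possible shape for such a command, sketched in Python. Only the 128-bit payload and the "144 means 1.44" example come from the post; the one-byte command ID, the big-endian byte order and the scale factor of 100 are hypothetical choices made for this illustration.

```python
PAYLOAD_BYTES = 16   # 128-bit payload, as stated in the post


def encode_command(cmd_id: int, payload: int) -> bytes:
    """Pack a command as 1 ID byte + 16 payload bytes (hypothetical layout)."""
    return bytes([cmd_id]) + payload.to_bytes(PAYLOAD_BYTES, "big")


def decode_command(frame: bytes) -> tuple:
    """Unpack a frame produced by encode_command into (cmd_id, payload)."""
    if len(frame) != 1 + PAYLOAD_BYTES:
        raise ValueError("unexpected command length")
    return frame[0], int.from_bytes(frame[1:], "big")


def payload_to_gain(payload: int) -> float:
    """Interpret the payload as a gain scaled by 100 (assumed), so 144 -> 1.44."""
    return payload / 100


cmd_id, payload = decode_command(encode_command(10, 144))
print(cmd_id, payload_to_gain(payload))
```

Because this stream is slow and asynchronous, it can use a simple mailbox register or small FIFO on the SoC side, separate from the sample path.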

One last requirement: the application should be loaded and executed in place (XIP). I mean it would be better if one could load the application through the Verilog gigabit interface into some ROM, and then the SoC could execute the application from that ROM. If this requirement is a blocker, I'll have to think about it.

The doubts I have, and the suggestions I need for this application, are about:
- bandwidth requirements and system frequency: as I told you, 10kHz is a good target, 100kHz (so almost 9.6 Mbytes/sec) is the dream
- floating point and speed of computation: consider that each input ADC signal has to be processed in time with the system frequency (10kHz for example), so that the signal to the DAC is output before the next 10kHz tick
- XIP executable application mode, is it feasible?

I hope I've explained my use case. I think I need a hardcore SoC, but I don't know.
 

Offline glenenglish

  • Frequent Contributor
  • **
  • Posts: 265
  • Country: au
  • RF engineer. AI6UM / VK1XX . Aviation pilot. MTBr.
Re: FPGA to softcore or hardcore data stream bus
« Reply #5 on: February 21, 2024, 08:30:00 pm »
I think your project lends itself to an all-FPGA design, as you have many copies of the same task.

For a pure FPGA design, for bootloading, the FPGA will need a startup image. It can boot from the gigabit interface and then perform reconfiguration.
However, because your project will not be cost sensitive with all those interfaces, I would suggest a SoC to manage the external network interfaces and boot; then you can load the FPGA image completely independently of the SoC operation, but at extra complexity. Otherwise, Microblaze for network and control, and everything else in the fabric.
Then, perform all computations in FPGA fabric.
You will need to calculate the number of ops per second required.
I would try to stay in fixed point if possible; no problem with wide bit widths. Wide bit widths are economical and effective, and operation in fraction mode is what you want... then you can do it all in the FPGA fabric.
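"Fraction mode" here means treating each integer as a signed fraction (a Q format). A small Python model of the idea, assuming a Q1.23 format so that a 24-bit ADC sample maps onto [-1, 1); the format choice and helper names are assumptions, and in the fabric the multiply would map onto a DSP block.

```python
FRAC_BITS = 23   # Q1.23: a signed 24-bit value read as a fraction in [-1, 1)


def to_q(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Convert a real number to its fixed-point integer representation."""
    return int(round(x * (1 << frac_bits)))


def from_q(q: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert a fixed-point integer back to a real number."""
    return q / (1 << frac_bits)


def q_mul(a: int, b: int, frac_bits: int = FRAC_BITS) -> int:
    """Multiply two Q values: keep the full-width product, then round and shift."""
    product = a * b                                   # 2 * frac_bits fractional bits
    return (product + (1 << (frac_bits - 1))) >> frac_bits


def saturate(q: int, bits: int = 24) -> int:
    """Clamp to the signed range of the given width instead of wrapping."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, q))


gain = to_q(0.5)
sample = to_q(0.25)
print(from_q(q_mul(gain, sample)))
```

Keeping the product at full width before the final shift is the "wide bit widths are economical" point: precision is only dropped once, at the output.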
I do not see any issue with achieving the 100kHz, none at all. You can time share the DSP blocks, which will run at 400 MHz, if you need to... but I do not think you will need to share much.
It can be useful to share the DSP blocks / computation and have a small number of AXI multichannel stream inputs, as a single DSP block (they are all fully pipelined) can perform 12.5 MOPS for each of the 32 channels. So at 100kHz, that is 125 operations per sample per channel from a single DSP block, and you might have 50-500 DSP blocks available!
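The time-sharing arithmetic can be checked quickly, assuming a fully pipelined DSP block that starts one operation per clock (function name made up for illustration):

```python
def ops_per_sample(dsp_clock_hz: int, channels: int, sample_rate_hz: int) -> int:
    """Operations one time-shared DSP block can spend on each channel per sample."""
    ops_per_channel_per_sec = dsp_clock_hz // channels
    return ops_per_channel_per_sec // sample_rate_hz


print(400_000_000 // 32, "ops/s per channel")                    # 12.5 MOPS
print(ops_per_sample(400_000_000, 32, 100_000), "ops per sample at 100kHz")
print(ops_per_sample(400_000_000, 32, 10_000), "ops per sample at 10kHz")
```

That is 125 operations per channel per sample at 100kHz from a single block, and ten times that at 10kHz.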

So, no problem even at 10x the channels or 10x the sample rate...
You can use Xilinx HLS for your algorithms, but if they are only PID type, you don't need C code for that...
Remember, there is also SpinalHDL, and various Python-to-HDL tools...

 

