Author Topic: Maximum utilisation  (Read 1294 times)

Offline tom66 (Topic starter)

  • Super Contributor
  • ***
  • Posts: 7014
  • Country: gb
  • Electronics Hobbyist & FPGA/Embedded Systems EE
Maximum utilisation
« on: September 16, 2024, 01:20:46 pm »
How far have you pushed an FPGA design?

I was amazed I was able to get our latest product to build.

89% LUT utilisation on Zynq 7010, passing timing at 96MHz main clock with 22ps WNS.

This is quite a small device as FPGAs go, and I have pretty much got all I can get out of it!  I can no longer add any more features without risking the design failing timing.  Moving to a 7020, I have 311ps slack instead, because the tools have a lot more freedom and can place blocks better.

The current design takes about 30 minutes to build on a Ryzen 5800H, whereas the same design targeting the 7020 takes only 8 minutes.
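To put the two slack figures in perspective, they can be converted into an estimated maximum clock frequency. This is a minimal Python sketch, assuming the common approximation that the critical path delay equals the clock period minus the setup WNS; the function name `fmax_mhz` is illustrative and not part of any tool.

```python
def fmax_mhz(clock_mhz: float, wns_ps: float) -> float:
    """Estimate the achievable clock frequency from target clock and setup WNS.

    Assumption: critical path delay = period - WNS. This ignores
    multi-clock paths and any change in clock uncertainty.
    """
    period_ps = 1e6 / clock_mhz            # a 1 MHz clock has a 1e6 ps period
    critical_path_ps = period_ps - wns_ps  # positive WNS shortens the path
    if critical_path_ps <= 0:
        raise ValueError("WNS must be smaller than the clock period")
    return 1e6 / critical_path_ps


if __name__ == "__main__":
    # Figures quoted in the post: 96 MHz target, 22 ps and 311 ps WNS.
    for device, wns in (("7010", 22.0), ("7020", 311.0)):
        print(f"{device}: ~{fmax_mhz(96.0, wns):.2f} MHz")
```

Under that assumption the 7010 build has roughly 0.2% frequency headroom and the 7020 build roughly 3%.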
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2798
  • Country: ca
Re: Maximum utilisation
« Reply #1 on: September 16, 2024, 02:28:41 pm »
I try to avoid pushing utilization past 70% on 7 series, as that's about the point where P&R starts to work much harder, P&R time really takes off into space, and you have a real chance of running into routing congestion. That is during development, of course; "production" runs sometimes go tighter than that, but since such runs are performed very rarely, P&R time is not that important.

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 8094
  • Country: ca
Re: Maximum utilisation
« Reply #2 on: September 16, 2024, 07:29:18 pm »
95% LC on an Altera EP3C55F484I7N.  (55k LC)
100% blockram.  (Not strictly true: some of the blocks were only partially filled, so the compiler report showed only 85% of the blockram bits in use, but if you inspect the number of actual blocks used, they were all assigned.  I.e. I could not add a single additional 1kb block of RAM to the design; the fitter would say all blockram elements were used.)
98% IOs.
I think I approached 70% DSP.

The product was a studio grade 30bit video scaler with 2 video in and 2 video out.
PIP, any size source to any size destination picture and video modes up to 200MHz pixel clock on each port.
Built in test patterns, external genlock sync and ethernet controls.

Video processing included enhancement, noise removal and edge correction for low-bitrate MPEG sources, plus full color processing for each video source.
« Last Edit: September 16, 2024, 10:14:11 pm by BrianHG »
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 27935
  • Country: nl
    • NCT Developments
Re: Maximum utilisation
« Reply #3 on: September 16, 2024, 09:04:50 pm »
Old project where I tried to cram as much logic into an FPGA as possible:
Spartan 3 (200 IIRC)

   Number of BUFGMUXs                  6 out of 8      75%
   Number of DCMs                      2 out of 4      50%
   Number of External IOBs            62 out of 141    43%
      Number of LOCed IOBs            62 out of 62    100%

   Number of MULT18X18s                1 out of 12      8%
   Number of RAMB16s                  11 out of 12     91%
   Number of Slices                 1847 out of 1920   96%
      Number of SLICEMs              351 out of 960    36%

Placed & routed within a few minutes though. Fixing the locations of some key components like DCMs or blockrams can help the P&R process enormously.


I'm not sure that using a bigger FPGA is always a solution, though. A bigger FPGA has longer internal delays, so you might not be able to make timing just because of those delays. For one of the designs I'm maintaining, the Xilinx Virtex 6 FPGA is so big (oversized) that I have to turn OFF optimisations to place logic close together. Otherwise it won't be able to meet timing towards the pads.
« Last Edit: September 16, 2024, 09:31:03 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline radar_macgyver

  • Frequent Contributor
  • **
  • Posts: 729
  • Country: us
Re: Maximum utilisation
« Reply #4 on: September 17, 2024, 05:56:16 am »
Not so much with utilization (about 60% iirc), but I did push a Virtex-6 (LXT240) design to run at 400 MHz. Took about 5-6 hours to finish PAR, and it was not guaranteed to meet timing.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 27935
  • Country: nl
    • NCT Developments
Re: Maximum utilisation
« Reply #5 on: September 17, 2024, 07:43:50 am »
Quote from: radar_macgyver on September 17, 2024, 05:56:16 am
Not so much with utilization (about 60% iirc), but I did push a Virtex-6 (LXT240) design to run at 400 MHz. Took about 5-6 hours to finish PAR, and it was not guaranteed to meet timing.

The design I'm maintaining also needs about that time. But I've found that tweaking the 'starting placer cost table' (ISE14) has a massive influence on the amount of time needed and on whether timing is met. Since this modifies the random seeding of the placement process, there is no way to predict the effect of a given value, though. With some values it can run for 24 hours and fail; with others it will run for 20 minutes and pass timing as well.
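Because the effect of a given cost table value cannot be predicted, the practical approach is to sweep several values and keep the best result. This is a minimal Python sketch of such a sweep; `run_par` is a hypothetical callable that would wrap the real tool invocation (ideally with a timeout) and is replaced here by a fake lookup so the example runs on its own.

```python
from typing import Callable, Iterable, Optional, Tuple

# (met_timing, wns_ps, runtime_s) as reported by one place-and-route run
RunResult = Tuple[bool, float, float]
# (cost_table, met_timing, wns_ps, runtime_s)
SweepResult = Tuple[int, bool, float, float]


def sweep_cost_tables(run_par: Callable[[int], RunResult],
                      tables: Iterable[int],
                      stop_on_first_pass: bool = True) -> Optional[SweepResult]:
    """Try each cost table value and return the run with the best WNS.

    run_par is an assumed wrapper around the actual P&R run.
    """
    best: Optional[SweepResult] = None
    for table in tables:
        met, wns_ps, runtime_s = run_par(table)
        if best is None or wns_ps > best[2]:
            best = (table, met, wns_ps, runtime_s)
        if met and stop_on_first_pass:
            break  # good enough: stop burning build time
    return best


if __name__ == "__main__":
    # Fake results standing in for real builds (illustrative numbers only).
    fake_runs = {
        1: (False, -350.0, 86400.0),
        2: (False, -120.0, 7200.0),
        3: (True, 45.0, 1200.0),
        4: (True, 80.0, 1500.0),
    }
    print(sweep_cost_tables(fake_runs.__getitem__, [1, 2, 3, 4]))
```

With several build machines the same loop can be split across hosts, each taking a different subset of values.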
« Last Edit: September 17, 2024, 07:46:45 am by nctnico »
 
The following users thanked this post: radar_macgyver

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15328
  • Country: fr
Re: Maximum utilisation
« Reply #6 on: September 17, 2024, 08:04:13 am »
I suppose you're mainly talking about utilization in terms of LUTs, as 100% utilization of BRAM, or PLLs, for instance, usually doesn't cause any problem. :-/O

Then that depends on the FPGA's architecture to some extent, but as a rule of thumb, I try to avoid exceeding 80% of LUTs - beyond that, it often becomes hard to meet timings, P&R times become prohibitively long, etc.
Of course, there are always exceptions to the rule.
 

Offline glenenglish

  • Frequent Contributor
  • **
  • Posts: 451
  • Country: au
  • RF engineer. AI6UM / VK1XX . Aviation pilot. MTBr
Re: Maximum utilisation
« Reply #7 on: September 17, 2024, 10:51:38 am »
Nice one Tom66.
It probably means your HDL is conservatively structured (and 96 MHz is not that fast).
But 30 minutes to build for a 7010 is quite long; the tool is really having to work. Set up multiple build machines and try different P&R strategies.

I find fmax dies above about 75% utilization with 7 series. With a 96 MHz clock, that should route at close to 100% if you give the router time, assuming there are no crazy deep logic levels and you set the timing constraints sensibly. You might try some different strategies (router options).
UltraScale seems to be much better at high utilization; making timing for my designs seems no different at 50% or 90% utilization. But then again, my VHDL is full of registers, pipelining etc.
I'm doing a 7010 (-1 speed grade) design right now at 393 MHz. That's about the limit of the -1 fabric (block RAMs, multipliers etc.) and it burns up registers! I will need to use DFX (partial reconfiguration) to fit the design in.
Note that if you are buying 7020s and have good volume, you can buy a small-package ZU1 for similar money if you push, but ZU is quite complex compared to 7Z.

« Last Edit: September 17, 2024, 10:59:20 am by glenenglish »
 

Offline tszaboo

  • Super Contributor
  • ***
  • Posts: 7912
  • Country: nl
  • Current job: ATEX product design
Re: Maximum utilisation
« Reply #8 on: September 17, 2024, 01:28:55 pm »
130% before the implementer gave up, after about 3 hours of trying. I tried to brute-force a large LUT into a Spartan 3.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 8094
  • Country: ca
Re: Maximum utilisation
« Reply #9 on: September 17, 2024, 01:56:26 pm »
95% LC on an Altera EP3C55F484I7N.  (55k LC)
4.5 hours compile time to meet timing for the 128bit DDR2 onboard controller.
 

Offline glenenglish

  • Frequent Contributor
  • **
  • Posts: 451
  • Country: au
  • RF engineer. AI6UM / VK1XX . Aviation pilot. MTBr
Re: Maximum utilisation
« Reply #10 on: September 17, 2024, 08:55:05 pm »
Older tools tried essentially best effort for every signal,
whereas modern tools accept a more complex and comprehensive set of timing constraints.
If I get long build times on moderate 50k-120k LE designs, that's a red flag about whatever I've done.

In the bad old days (2005, Virtex2) build times could be 2-4 hours.

Tom66, I also use AMD. My development Vivado runs in Linux under VMware on a Win10 host, and I run multiple native Linux machines purely for what-if builds to explore P&R tactics.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 8094
  • Country: ca
Re: Maximum utilisation
« Reply #11 on: September 17, 2024, 09:51:24 pm »
Quote from: glenenglish on September 17, 2024, 08:55:05 pm
Older tools tried essentially best effort for every signal,
whereas modern tools accept a more complex and comprehensive set of timing constraints.
If I get long build times on moderate 50k-120k LE designs, that's a red flag about whatever I've done.

In the bad old days (2005, Virtex2) build times could be 2-4 hours.

Tom66, I also use AMD. My development Vivado runs in Linux under VMware on a Win10 host, and I run multiple native Linux machines purely for what-if builds to explore P&R tactics.

Yes, my 4.5 hour design could have compiled (fitter) faster if at the time I had been better at filling out a complete .sdc file, as I do today.  The compiler would spin on trying to criss-cross 5 clock domains in my design which actually didn't need to be associated with each other, as the different domains were bridged by 16kb dual-port, dual-clock RAMs whose read and write sides were separated in time, with a huge allowance down in the sub-MHz range.

Though, back in ~2009 when I did this design, I was still a beginner when it came to such complex designs, which had everything running at 216 MHz and 432 MHz.
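One common way to tell the timing analyser that clock domains are unrelated is the SDC `set_clock_groups -asynchronous` command. This is a small Python sketch that generates such a line, assuming the crossings really are handled safely (for example through dual-clock RAMs); the clock names are hypothetical and not taken from the actual design.

```python
def async_clock_groups_sdc(clock_names):
    """Build one SDC line declaring each named clock its own async group.

    Paths between clocks in different groups are then not timed, so this
    is only appropriate when every crossing is safe by design.
    """
    names = list(clock_names)
    if len(names) < 2:
        raise ValueError("need at least two clocks to form groups")
    groups = " ".join(f"-group {{{name}}}" for name in names)
    return f"set_clock_groups -asynchronous {groups}"


if __name__ == "__main__":
    # Hypothetical clock names for illustration only.
    print(async_clock_groups_sdc(["clk_video_in", "clk_video_out", "clk_ddr"]))
```

The generated line would go into the .sdc file alongside the clock definitions; clocks that genuinely exchange data synchronously should stay in the same group instead.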
« Last Edit: September 17, 2024, 09:53:31 pm by BrianHG »
 
The following users thanked this post: glenenglish

Offline glenenglish

  • Frequent Contributor
  • **
  • Posts: 451
  • Country: au
  • RF engineer. AI6UM / VK1XX . Aviation pilot. MTBr
Re: Maximum utilisation
« Reply #12 on: September 17, 2024, 10:01:31 pm »
Hi Brian
Yeah, that's the big change, eh? Prevent the tool from trying to time everything. Nowadays the tools are pretty good at ignoring what they can identify as unrelated paths. Together with a few ASYNC etc. directives and no more global sync resets, it comes down to providing sensible timing constraints (only what is required), I/O ports included of course. There is no need for the tool to time to a picosecond if 10 ns is fine.
 

