Topic: Learning FPGAs: wrong approach?

hamster_nz · « **Reply #200 on:** June 28, 2017, 10:25:46 am »

Quote from: nctnico on June 28, 2017, 06:01:04 am

Quote from: hamster_nz on June 28, 2017, 12:01:04 am
So the "designed with hardware in mind" can be almost 3x smaller, and > 2x faster than the "simple bubble sort" version. It also delivers consistent performance and usage no matter how it is used.
That is the wrong conclusion. By adding the registers you add more logic to the design. Also: did you P&R the design? There is an extra logic optimisation stage in there as well.

Nah, that conclusion is bang on; the two designs are nothing alike even when they use the same quantity of resources. The inferred design is complete crap. Just to show how different they are, here is the slowest path in each design:

Slowest path in the "outputs registered only, inferred design":

Code:

Slack (setup path):     -13.355ns (requirement - (data path - clock path - clock arrival + uncertainty))
  Source:               b_in<1> (PAD)
  Destination:          uut/sorted_array_out_1_6 (FF)
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          10.000ns
  Data Path Delay:      25.822ns (Levels of Logic = 25)
  Clock Path Delay:     2.492ns (Levels of Logic = 2)
  Clock Uncertainty:    0.025ns

  Clock Uncertainty:          0.025ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.050ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.000ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: b_in<1> to uut/sorted_array_out_1_6
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    P139.I               Tiopi                 0.790   b_in<1>
                                                       b_in<1>
                                                       b_in_1_IBUF
                                                       ProtoComp29.IMUX.10
    SLICE_X1Y59.C1       net (fanout=4)        1.933   b_in_1_IBUF
    SLICE_X1Y59.C        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o1
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o3
    SLICE_X2Y59.B5       net (fanout=1)        0.764   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o1
    SLICE_X2Y59.B        Tilo                  0.203   N30
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o1_SW2
    SLICE_X2Y59.A5       net (fanout=1)        0.222   N30
    SLICE_X2Y59.A        Tilo                  0.203   N30
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o1
    SLICE_X6Y54.C5       net (fanout=1)        1.238   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o2
    SLICE_X6Y54.C        Tilo                  0.204   uut/in_array_in[0][7]_in_array_in[1][7]_mux_1_OUT<4>
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o21
    SLICE_X4Y59.CX       net (fanout=18)       1.004   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o
    SLICE_X4Y59.CMUX     Tcxc                  0.164   N28
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o5
    SLICE_X4Y59.A2       net (fanout=1)        0.624   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o1
    SLICE_X4Y59.A        Tilo                  0.203   N28
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o1_SW2
    SLICE_X7Y50.A6       net (fanout=1)        1.058   N28
    SLICE_X7Y50.A        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[2][7]_mux_4_OUT<0>
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o1
    SLICE_X7Y50.B6       net (fanout=1)        0.118   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o2
    SLICE_X7Y50.B        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[2][7]_mux_4_OUT<0>
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o2
    SLICE_X5Y52.D4       net (fanout=12)       0.987   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o
    SLICE_X5Y52.D        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[1][7]_mux_5_OUT<3>
                                                       uut/Mmux_in_array_in[1][7]_in_array_in[2][7]_mux_4_OUT141
    SLICE_X6Y51.D6       net (fanout=4)        0.668   uut/in_array_in[2][7]_in_array_in[1][7]_mux_5_OUT<3>
    SLICE_X6Y51.CMUX     Topdc                 0.368   uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o1
                                                       uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o1_F
                                                       uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o1
    SLICE_X7Y49.C4       net (fanout=1)        0.513   uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o2
    SLICE_X7Y49.C        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[3][7]_mux_7_OUT<3>
                                                       uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o2
    SLICE_X7Y50.C5       net (fanout=10)       0.372   uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o
    SLICE_X7Y50.C        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[2][7]_mux_4_OUT<0>
                                                       uut/Mmux_in_array_in[2][7]_in_array_in[3][7]_mux_7_OUT18
    SLICE_X7Y51.D2       net (fanout=3)        0.602   uut/in_array_in[2][7]_in_array_in[3][7]_mux_7_OUT<0>
    SLICE_X7Y51.D        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o3
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o2
    SLICE_X6Y48.B6       net (fanout=1)        0.468   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o3
    SLICE_X6Y48.B        Tilo                  0.203   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o4
    SLICE_X6Y48.D1       net (fanout=2)        0.482   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1
    SLICE_X6Y48.CMUX     Topdc                 0.368   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1_F
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1
    SLICE_X5Y48.B6       net (fanout=1)        0.607   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o2
    SLICE_X5Y48.B        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[1][7]_mux_14_OUT<3>
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o24_SW0
    SLICE_X5Y48.A5       net (fanout=1)        0.187   N24
    SLICE_X5Y48.A        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[1][7]_mux_14_OUT<3>
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o24
    SLICE_X5Y47.C3       net (fanout=14)       0.507   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o
    SLICE_X5Y47.C        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[2][7]_mux_13_OUT<0>
                                                       uut/Mmux_in_array_in[1][7]_in_array_in[2][7]_mux_13_OUT12
    SLICE_X4Y49.C6       net (fanout=3)        0.489   uut/in_array_in[1][7]_in_array_in[2][7]_mux_13_OUT<0>
    SLICE_X4Y49.CMUX     Tilo                  0.361   uut/in_array_in[0][7]_in_array_in[1][7]_mux_1_OUT<0>
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o4_G
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o4
    SLICE_X5Y51.B1       net (fanout=2)        0.643   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o1
    SLICE_X5Y51.B        Tilo                  0.259   uut/in_array_in[0][7]_in_array_in[1][7]_mux_1_OUT<3>
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o1_SW1
    SLICE_X4Y50.C5       net (fanout=2)        0.352   N19
    SLICE_X4Y50.CMUX     Tilo                  0.361   N18
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o1_G
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o1
    SLICE_X5Y49.D4       net (fanout=1)        0.424   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o2
    SLICE_X5Y49.D        Tilo                  0.259   N26
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o24_SW0
    SLICE_X7Y45.A3       net (fanout=1)        0.838   N26
    SLICE_X7Y45.AMUX     Tilo                  0.313   uut/in_array_in[0][7]_in_array_in[1][7]_mux_16_OUT<2>
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o24
    SLICE_X7Y46.D3       net (fanout=7)        0.543   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o
    SLICE_X7Y46.DMUX     Tilo                  0.313   uut/in_array_in[0][7]_in_array_in[1][7]_mux_16_OUT<6>
                                                       uut/Mmux_in_array_in[0][7]_in_array_in[1][7]_mux_16_OUT171
    OLOGIC_X12Y23.D1     net (fanout=1)        2.214   uut/in_array_in[1][7]_in_array_in[0][7]_mux_17_OUT<6>
    OLOGIC_X12Y23.CLK0   Todck                 0.803   uut/sorted_array_out_1<6>
                                                       uut/sorted_array_out_1_6
    -------------------------------------------------  ---------------------------
    Total                                     25.822ns (7.965ns logic, 17.857ns route)
                                                       (30.8% logic, 69.2% route)

Slowest path in the "outputs registered only, coded for H/W design":

Code:

Paths for end point c_out_2 (OLOGIC_X11Y2.D1), 268 paths
--------------------------------------------------------------------------------
Slack (setup path):     0.064ns (requirement - (data path - clock path - clock arrival + uncertainty))
  Source:               a_in<1> (PAD)
  Destination:          c_out_2 (FF)
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          10.000ns
  Data Path Delay:      12.477ns (Levels of Logic = 6)
  Clock Path Delay:     2.566ns (Levels of Logic = 2)
  Clock Uncertainty:    0.025ns

  Clock Uncertainty:          0.025ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.050ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.000ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: a_in<1> to c_out_2
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    P111.I               Tiopi                 0.790   a_in<1>
                                                       a_in<1>
                                                       a_in_1_IBUF
                                                       ProtoComp13.IMUX.25
    SLICE_X4Y58.D4       net (fanout=7)        2.715   a_in_1_IBUF
    SLICE_X4Y58.D        Tilo                  0.203   a[7]_c[7]_LessThan_3_o22
                                                       a[7]_c[7]_LessThan_3_o23
    SLICE_X5Y40.C3       net (fanout=1)        1.505   a[7]_c[7]_LessThan_3_o22
    SLICE_X5Y40.C        Tilo                  0.259   a[7]_c[7]_LessThan_3_o23
                                                       a[7]_c[7]_LessThan_3_o24
    SLICE_X5Y40.B6       net (fanout=1)        0.285   a[7]_c[7]_LessThan_3_o23
    SLICE_X5Y40.B        Tilo                  0.259   a[7]_c[7]_LessThan_3_o23
                                                       a[7]_c[7]_LessThan_3_o25
    SLICE_X7Y36.C2       net (fanout=8)        1.240   a[7]_c[7]_LessThan_3_o
    SLICE_X7Y36.C        Tilo                  0.259   BUS_0001_d[7]_wide_mux_11_OUT<5>
                                                       Mram_table41
    SLICE_X7Y14.A6       net (fanout=8)        2.389   _n0044<4>
    SLICE_X7Y14.A        Tilo                  0.259   BUS_0003_d[7]_wide_mux_10_OUT<2>
                                                       Mmux_BUS_0003_d[7]_wide_mux_10_OUT31
    OLOGIC_X11Y2.D1      net (fanout=1)        1.511   BUS_0003_d[7]_wide_mux_10_OUT<2>
    OLOGIC_X11Y2.CLK0    Todck                 0.803   c_out_2
                                                       c_out_2
    -------------------------------------------------  ---------------------------
    Total                                     12.477ns (2.832ns logic, 9.645ns route)
                                                       (22.7% logic, 77.3% route)

The code designed for the H/W beats the pants off that fully inferred design: 6 levels of logic vs 25. And it meets the timing requirement rather than missing it by 13.355ns, more than 130% of the 10ns requirement.

Can you supply any data to support your conclusion?
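As a sanity check, the slack figures in the two reports above can be reproduced from the formulas printed in the report headers. This is a minimal Python sketch: the function names are my own (not anything from the Xilinx tools) and the numbers are copied straight from the reports.

```python
import math

def clock_uncertainty(tsj, tij, dj, pe):
    """((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE, as printed in the report header."""
    return (math.sqrt(tsj ** 2 + tij ** 2) + dj) / 2 + pe

def setup_slack(requirement, data_path, clock_path, clock_arrival, uncertainty):
    """requirement - (data path - clock path - clock arrival + uncertainty)"""
    return requirement - (data_path - clock_path - clock_arrival + uncertainty)

# All values in ns, taken from the two timing reports.
unc = clock_uncertainty(tsj=0.050, tij=0.0, dj=0.0, pe=0.0)
inferred = setup_slack(10.0, 25.822, 2.492, 0.0, unc)
hw_coded = setup_slack(10.0, 12.477, 2.566, 0.0, unc)

print(f"clock uncertainty : {unc:.3f} ns")       # 0.025
print(f"inferred design   : {inferred:.3f} ns")  # -13.355
print(f"H/W-coded design  : {hw_coded:.3f} ns")  # 0.064
```

Both results match the "Slack (setup path)" lines, so the gap between the designs comes from the data path delay (25.822ns vs 12.477ns), not from differences in clocking.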

NorthGuy · « **Reply #201 on:** June 28, 2017, 01:20:05 pm »

Quote from: hamster_nz on June 28, 2017, 10:25:46 am

... 6 levels of logic ...

It did the comparisons in 3 levels, but it certainly can be done in 2.
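One way to see how an 8-bit compare could fit in two levels, assuming 6-input LUTs as found on Spartan-6: split the operands into 3+3+2 bit chunks, compute "greater" and "equal" per chunk in the first level (each function reads at most 6 bits), then combine the five results in a single second-level function. This is only my illustration of the idea, checked exhaustively in Python; it is not the mapping the tools produced, and the helper names are made up.

```python
def chunk(value, hi, lo):
    """Extract bits hi..lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def greater_than_two_level(a, b):
    """Model of a > b for 8-bit operands using two levels of <=6-input functions."""
    # Level 1: five functions, each reading at most 3 bits of a and 3 bits of b.
    gt_hi = chunk(a, 7, 5) > chunk(b, 7, 5)
    eq_hi = chunk(a, 7, 5) == chunk(b, 7, 5)
    gt_mid = chunk(a, 4, 2) > chunk(b, 4, 2)
    eq_mid = chunk(a, 4, 2) == chunk(b, 4, 2)
    gt_lo = chunk(a, 1, 0) > chunk(b, 1, 0)  # only 4 input bits needed here
    # Level 2: a single function of the five level-1 outputs.
    return gt_hi or (eq_hi and (gt_mid or (eq_mid and gt_lo)))

# Exhaustive check over all 65536 operand pairs.
assert all(greater_than_two_level(a, b) == (a > b)
           for a in range(256) for b in range(256))
```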

mrflibble · « **Reply #202 on:** July 02, 2017, 04:52:04 am »

Hey, fun exercise!

Quote from: hamster_nz on June 28, 2017, 10:25:46 am

... The inferred design is complete crap. Just to show how different they are, here is the slowest path in each design: ...

No kidding. I didn't even try to do an inferred design for this problem. Last time I did that it gave me a headache and made my hex vision act up for days.

Anyway, below are the timings of my attempt at a hardware-targeted design. Worst path:

Code:

 ================================================================================ 
 Timing constraint: TS_clk_400 = PERIOD TIMEGRP "clk_400" TS_GCLK / 4 HIGH 50% INPUT_JITTER 0.2 ns; 
 For more information, see Period Analysis in the Timing Closure User Guide (UG612). 
  416 paths analyzed, 278 endpoints analyzed, 0 failing endpoints 
  0 timing errors detected. (0 setup errors, 0 hold errors, 0 component switching limit errors) 
  Minimum period is   2.454ns. 
 -------------------------------------------------------------------------------- 
  
 Paths for end point sort_four/select_sort_order/mux_this_2/out_7 (SLICE_X47Y81.D5), 1 path 
 -------------------------------------------------------------------------------- 
 Slack (setup path):     0.046ns (requirement - (data path - clock path skew + uncertainty)) 
   Source:               sort_four/packed_evals_to_sels/sel_2_0 (FF) 
   Destination:          sort_four/select_sort_order/mux_this_2/out_7 (FF) 
   Requirement:          2.500ns 
   Data Path Delay:      2.357ns (Levels of Logic = 1) 
   Clock Path Skew:      -0.015ns (0.297 - 0.312) 
   Source Clock:         clk_400 rising at 0.000ns 
   Destination Clock:    clk_400 rising at 2.500ns 
   Clock Uncertainty:    0.082ns 
  
   Clock Uncertainty:          0.082ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE 
     Total System Jitter (TSJ):  0.070ns 
     Discrete Jitter (DJ):       0.147ns 
     Phase Error (PE):           0.000ns 
  
   Maximum Data Path at Slow Process Corner: sort_four/packed_evals_to_sels/sel_2_0 to sort_four/select_sort_order/mux_this_2/out_7 
     Location             Delay type         Delay(ns)  Physical Resource 
                                                        Logical Resource(s) 
     -------------------------------------------------  ------------------- 
     SLICE_X37Y83.CQ      Tcko                  0.430   sort_four/packed_evals_to_sels/sel_2<0> 
                                                        sort_four/packed_evals_to_sels/sel_2_0 
     SLICE_X47Y81.D5      net (fanout=8)        1.554   sort_four/packed_evals_to_sels/sel_2<0> 
     SLICE_X47Y81.CLK     Tas                   0.373   sort_four/select_sort_order/mux_this_2/out<7> 
                                                        sort_four/select_sort_order/mux_this_2/Mmux_sel[1]_d[7]_wide_mux_1_OUT81 
                                                        sort_four/select_sort_order/mux_this_2/out_7 
     -------------------------------------------------  --------------------------- 
     Total                                      2.357ns (0.803ns logic, 1.554ns route) 
                                                        (34.1% logic, 65.9% route) 
  
 --------------------------------------------------------------------------------

I constrained it conservatively at 400 MHz, with a decent amount of clock uncertainty. Did several runs, and it easily meets timing. Based on some things I noticed (el stupido routing decisions by PAR), I'm guessing that with some extra constraints it would probably do around 425 MHz. Still have margin left on the clock uncertainty as well....
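For anyone who wants to turn the report into a frequency, here is a small Python sketch using the slack formula from the report header. The variable names are mine; the numbers come from the report above.

```python
def setup_slack(requirement, data_path, clock_skew, uncertainty):
    """requirement - (data path - clock path skew + uncertainty)"""
    return requirement - (data_path - clock_skew + uncertainty)

requirement = 2.500  # ns, i.e. the 400 MHz constraint
slack = setup_slack(requirement, data_path=2.357, clock_skew=-0.015, uncertainty=0.082)

min_period = requirement - slack   # ns, should match "Minimum period" in the report
fmax_mhz = 1000.0 / min_period     # MHz
target_period = 1000.0 / 425.0     # ns, period for the guessed 425 MHz

print(f"slack                : {slack:.3f} ns")      # 0.046
print(f"minimum period       : {min_period:.3f} ns")  # 2.454
print(f"Fmax                 : {fmax_mhz:.1f} MHz")   # 407.5
print(f"to shave for 425 MHz : {min_period - target_period:.3f} ns")  # 0.101
```

So the reported worst path corresponds to roughly 407 MHz, and reaching 425 MHz would mean recovering about 0.1ns, which is small next to the 1.554ns of routing delay on that path.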

This is using ISE 14.7 and targeting a spartan-6: xc6slx45-2csg324. The design is pipelined, 3 stages.

Incidentally, do you have the project settings? Either .xise file or empty project .zip will work. Just to make sure that I am not using different settings that will give skewed results.

legacy · « **Reply #203 on:** July 02, 2017, 05:42:04 pm »

I am playing with an old CPLD from the XC9500 series. The above PCB is a matrix-keyboard controller, nothing special, but it makes me appreciate what comes for free with CoolRunner: built-in pull-ups.

I recycled what I happened to find at home, a few big CPLD chips, good because they are 5V tolerant, but the constraints don't allow pull-up/pull-down since the physical XC9500 hardware doesn't have them.

So, that's the reason why I added a big-and-long SIL pack on the PCB.

Yansi · « **Reply #204 on:** July 02, 2017, 05:51:15 pm »

I also have a ton of such old devices lying around, including some crazy old FPGAs. However, why bother with those non-FLASH based devices? Get yourself at least either an Altera MAX II device or a Xilinx XC9500XL. Both are FLASH based, and the latter is also 5V compatible. Both are cheap too.

legacy · « **Reply #205 on:** July 02, 2017, 07:51:05 pm »

Yup, I also have a few XC9572 chips in PLCC84 packages as well as a couple of XC2C64A in SMD packages. Maybe I will build a second board.

What I really miss is ... a couple of Spartan2 FPGA chips. They are 5V tolerant, and on some designs that is easier than using a 5V <-> 3.3V level shifter. I have plenty of Spartan3 and Spartan6 chips, whose IO-core is 3.3V max, but the last Spartan2 chip that I had ... was soldered onto a Nintendo ADV adapter (IO-core 5V), which I built several years ago when Spartan2 was available everywhere.

Regret ... I didn't buy more chips.

Someone · « **Reply #206 on:** July 03, 2017, 12:22:02 am »

Quote from: mrflibble on July 02, 2017, 04:52:04 am

Hey, fun exercise!
...
This is using ISE 14.7 and targeting a spartan-6: xc6slx45-2csg324. The design is pipelined, 3 stages.

Fun indeed, I've access to fully licensed tools so might have a slight edge here (possibly some extra options/strategies unlocked) but I'm not running smart explorer to get the last few % out of the design and yet there appears to be a lot of slack available from the attempts so far.

ISE 14.7 xc6slx45-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 106 LUTs >440 MHz
Fully pipelined with 4 stages. 118 LUTs >540 MHz
(requires using both edges of clock)

ISE 14.7 xc7a100t-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 132 LUTs >580 MHz
Fully pipelined with 4 stages. 141 LUTs >580 MHz
(both switching limited)

Vivado X.X xc7a35t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >400 MHz

Vivado X.X xc7a100t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >410 MHz

It's known that ISE can do a better synthesis job on many designs, but it's orphaned for device support now and harder to use going forward. But 7 series parts are easily 50-100% faster than Spartan 6, so many designs need to be reassessed for area/speed tradeoff and can be adapted to the new Vivado synthesis at the same time. These results above are using a sort algorithm better suited for FPGA implementation but still written with a high level functional description in VHDL, so it's not necessary to get down to gate-level descriptions; rather, knowing how to map to resources allows you to design for minimum area while still using high level constructs.
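The thread does not show the source for any of these designs, so the following is only a guess at what "better suited for FPGA" could look like: a rank-based sort of four values, where every pair is compared in parallel, each input's rank is derived from the comparison bits, and each output is a mux selected by rank. It is modelled in Python rather than VHDL so it can be checked exhaustively; `sort_four` here is my own illustrative function, not anyone's posted code.

```python
from itertools import product

def sort_four(values):
    """Rank-based sort: compare every pair, derive ranks, then select."""
    n = len(values)

    # Stage 1: pairwise comparisons, all independent of each other.
    # Ties are broken by index so every input gets a unique rank.
    # before(j, i) is the complement of before(i, j), so hardware would
    # need only 6 comparators for 4 inputs.
    def before(i, j):
        return values[i] < values[j] or (values[i] == values[j] and i < j)

    # Stage 2: rank of input i = number of other inputs that sort before it.
    rank = [sum(before(j, i) for j in range(n) if j != i) for i in range(n)]

    # Stage 3: output k is a mux that picks the input whose rank is k.
    out = [None] * n
    for i, r in enumerate(rank):
        out[r] = values[i]
    return out

# Exhaustive check on a small value range, duplicates included.
assert all(sort_four(list(v)) == sorted(v) for v in product(range(4), repeat=4))
```

Unlike a bubble sort, none of the comparisons depends on the result of another, which is the property that keeps the logic depth low and makes the stages easy to register.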

hamster_nz · « **Reply #207 on:** July 03, 2017, 02:06:01 am »

Quote from: Someone on July 03, 2017, 12:22:02 am

Quote from: mrflibble on July 02, 2017, 04:52:04 am
Hey, fun exercise!
...
This is using ISE 14.7 and targeting a spartan-6: xc6slx45-2csg324. The design is pipelined, 3 stages.
Fun indeed, I've access to fully licensed tools so might have a slight edge here (possibly some extra options/strategies unlocked) but I'm not running smart explorer to get the last few % out of the design and yet there appears to be a lot of slack available from the attempts so far.

ISE 14.7 xc6slx45-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 106 LUTs >440 MHz
Fully pipelined with 4 stages. 118 LUTs >540 MHz
(requires using both edges of clock)

ISE 14.7 xc7a100t-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 132 LUTs >580 MHz
Fully pipelined with 4 stages. 141 LUTs >580 MHz
(both switching limited)

Vivado X.X xc7a35t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >400 MHz

Vivado X.X xc7a100t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >410 MHz

It's known that ISE can do a better synthesis job on many designs, but it's orphaned for device support now and harder to use going forward. But 7 series parts are easily 50-100% faster than Spartan 6, so many designs need to be reassessed for area/speed tradeoff and can be adapted to the new Vivado synthesis at the same time. These results above are using a sort algorithm better suited for FPGA implementation but still written with a high level functional description in VHDL, so it's not necessary to get down to gate-level descriptions; rather, knowing how to map to resources allows you to design for minimum area while still using high level constructs.

Wow - these are quite significant differences. I wonder what ISE knows that Vivado doesn't?

Maybe they are using different underlying timing models... When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routes anything.

NorthGuy · « **Reply #208 on:** July 03, 2017, 02:55:35 am »

Quote from: hamster_nz on July 03, 2017, 02:06:01 am

When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routes anything.

If that was the case, then Vivado would work faster with smaller parts (e.g. the synthesis/implementation for XC7A50T would be faster than for XC7A200T), but this doesn't seem to be the case. I'd rather suspect the usual - poor design, overbloat. Vivado is just generally terribly slow.

Someone · « **Reply #209 on:** July 03, 2017, 03:04:29 am »

Quote from: hamster_nz on July 03, 2017, 02:06:01 am

Wow - these are quite significant differences. I wonder what ISE knows that Vivado doesn't?

Maybe they are using different underlying timing models... When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routes anything.

My understanding is that from ISE to Vivado was a radical change of design, importantly so that the tools could continue scaling out to larger designs. ISE scales poorly when you use the larger devices at high utilisation while Vivado could run on much less memory and route with higher utilisation (the iterative routing seems to do very well).

Quote from: NorthGuy on July 03, 2017, 02:55:35 am

Quote from: hamster_nz on July 03, 2017, 02:06:01 am
When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routes anything.
If that was the case, then Vivado would work faster with smaller parts (e.g. the synthesis/implementation for XC7A50T would be faster than for XC7A200T), but this doesn't seem to be the case. I'd rather suspect the usual - poor design, overbloat. Vivado is just generally terribly slow.

Slower, but with less memory, and it's able to close timing on designs that ISE couldn't.

legacy · « **Reply #210 on:** July 03, 2017, 09:19:51 am »

How much memory does Vivado usually eat during the synthesis?

p.s. about computing horsepower, i9 has already been released by Intel, which means .... i7 is going to have a price-drop

!!!

Someone · « **Reply #211 on:** July 03, 2017, 09:33:56 am »

Quote from: legacy on July 03, 2017, 09:19:51 am

How much memory does Vivado usually eat during the synthesis?

The peak memory use is typically during routing and Xilinx only suggest memory for the overall process rather than each stage as it would be unusual to run the stages on different machines:
https://www.xilinx.com/products/design-tools/vivado/memory.html
You can hunt down the ISE version with the Wayback Machine, but they're both a little optimistic, and real-world use is higher than their tables once you add in the other things running during a build.

nctnico · « **Reply #212 on:** July 03, 2017, 10:39:35 am »

Quote from: Someone on July 03, 2017, 03:04:29 am

Quote from: NorthGuy on July 03, 2017, 02:55:35 am
Quote from: hamster_nz on July 03, 2017, 02:06:01 am
When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routes anything.
If that was the case, then Vivado would work faster with smaller parts (e.g. the synthesis/implementation for XC7A50T would be faster than for XC7A200T), but this doesn't seem to be the case. I'd rather suspect the usual - poor design, overbloat. Vivado is just generally terribly slow.
Slower, but with less memory, and it's able to close timing on designs that ISE couldn't.

When it comes to ISE getting good results, whether it can meet timing depends a lot on the placing cost table settings. With a poor setting the P&R can run for 24 hours without meeting timing, while with other settings the design goes through the P&R stage in less than 10 minutes and meets all timing constraints. Unfortunately it takes trial & error to get the right placing cost table settings.

NorthGuy · « **Reply #213 on:** July 03, 2017, 01:42:42 pm »

Quote from: Someone on July 03, 2017, 03:04:29 am

Slower, but with less memory ...

That's one of the poor decisions. Since the world migrated to 64-bit, you can have huge memory. The speed, however, doesn't progress much - my 6-year old i5 processor is only 30% slower than the best modern mass-produced Intel CPU. How stupid is it to sacrifice speed in order to get less memory usage?

I'm sure there were hundreds of bad decisions like that on different levels which made Vivado as slow as it is. It's funny that it's being marketed as Ultra-Fast.

Quote from: Someone on July 03, 2017, 03:04:29 am

... and it's able to close timing on designs that ISE couldn't.

Maybe. I don't know.

Someone · « **Reply #214 on:** July 04, 2017, 01:24:07 am »

Quote from: nctnico on July 03, 2017, 10:39:35 am

Quote from: Someone on July 03, 2017, 03:04:29 am
Quote from: NorthGuy on July 03, 2017, 02:55:35 am
Quote from: hamster_nz on July 03, 2017, 02:06:01 am
When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routes anything.
If that was the case, then Vivado would work faster with smaller parts (e.g. the synthesis/implementation for XC7A50T would be faster than for XC7A200T), but this doesn't seem to be the case. I'd rather suspect the usual - poor design, overbloat. Vivado is just generally terribly slow.
Slower, but with less memory, and it's able to close timing on designs that ISE couldn't.
When it comes to ISE getting good results, whether it can meet timing depends a lot on the placing cost table settings. With a poor setting the P&R can run for 24 hours without meeting timing, while with other settings the design goes through the P&R stage in less than 10 minutes and meets all timing constraints. Unfortunately it takes trial & error to get the right placing cost table settings.

That's not unique to ISE; Vivado suffers the same wildly variable results from the initial seeds.

hamster_nz · « **Reply #215 on:** July 04, 2017, 02:07:00 am »

Quote from: Someone on July 04, 2017, 01:24:07 am

Quote from: nctnico on July 03, 2017, 10:39:35 am
When it comes to ISE getting good results, whether it can meet timing depends a lot on the placing cost table settings. With a poor setting the P&R can run for 24 hours without meeting timing, while with other settings the design goes through the P&R stage in less than 10 minutes and meets all timing constraints. Unfortunately it takes trial & error to get the right placing cost table settings.
That's not unique to ISE; Vivado suffers the same wildly variable results from the initial seeds.

It is most likely unavoidable - you have to add enough randomness to prevent the P+R process from falling into the same local minima all the time (a.k.a. "getting stuck in a rut"). How often this happens is most likely design dependent, and has something to do with the bisection width of the design. Highly connected designs will be more likely to suffer bad placement decisions, but flowing pipelines will usually play nice.
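To make the seed dependence concrete, here is a toy Python sketch: a 1-D "placer" that uses simulated annealing to minimise total wire length, run with several seeds on a pipeline-like netlist and on a more tangled one. Everything in it (the cost function, the cooling schedule, the two netlists) is invented for illustration; I am not claiming this is how ISE or Vivado place designs.

```python
import math
import random

def wirelength(order, nets):
    """Total distance between connected cells placed on a 1-D row of slots."""
    pos = {cell: slot for slot, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

def anneal(cells, nets, seed, steps=2000, t_start=2.0, t_end=0.05):
    """Toy annealing placer: random swaps, sometimes accepting a worse cost."""
    rng = random.Random(seed)
    order = list(cells)
    rng.shuffle(order)  # the seed decides the starting placement
    cost = wirelength(order, nets)
    best_cost = cost
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = wirelength(order, nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            best_cost = min(best_cost, cost)
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return best_cost

cells = list(range(12))
# A flowing pipeline: each cell talks only to the next one.
pipeline = [(i, i + 1) for i in range(11)]
# A more tangled netlist: an arbitrary scattering of long-range connections.
tangled = [(i, j) for i in range(12) for j in range(i + 1, 12)
           if (i * 7 + j * 5) % 4 == 0]

for name, nets in (("pipeline", pipeline), ("tangled", tangled)):
    print(name, [anneal(cells, nets, seed) for seed in range(5)])
```

Printing the best cost per seed for each netlist shows how much the result can depend on the starting point; the exact numbers will vary with the parameters chosen here.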

Someone · « **Reply #216 on:** July 04, 2017, 03:25:22 am »

Quote from: hamster_nz on July 04, 2017, 02:07:00 am

Quote from: Someone on July 04, 2017, 01:24:07 am
Quote from: nctnico on July 03, 2017, 10:39:35 am
When it comes to ISE getting good results, whether it can meet timing depends a lot on the placing cost table settings. With a poor setting the P&R can run for 24 hours without meeting timing, while with other settings the design goes through the P&R stage in less than 10 minutes and meets all timing constraints. Unfortunately it takes trial & error to get the right placing cost table settings.
That's not unique to ISE; Vivado suffers the same wildly variable results from the initial seeds.
It is most likely unavoidable - you have to add enough randomness to prevent the P+R process from falling into the same local minima all the time (a.k.a. "getting stuck in a rut"). How often this happens is most likely design dependent, and has something to do with the bisection width of the design. Highly connected designs will be more likely to suffer bad placement decisions, but flowing pipelines will usually play nice.

I find it easier with Vivado, as there is a diverse group of directives ("strategies") which can be applied individually (usually iteratively) at each stage. That gives much more flexibility, some feeling of control, and less reliance on the initial seed being lucky.

Cerebus · « **Reply #217 on:** July 04, 2017, 01:23:44 pm »

Quote from: hamster_nz on July 04, 2017, 02:07:00 am

It is most likely unavoidable - you have to add enough randomness to prevent the P+R process from falling into the same local minima all the time (a.k.a. "getting stuck in a rut"). How often this happens is most likely design dependent, and something to do with the bisection width of the design. Highly connected designs will be more likely to suffer bad placement decisions, but flowing pipelines will usually play nice.

That's a direct consequence of the underlying graph layout algorithms (graph as in vertices and edges, not squiggly lines on paper). I saw exactly the same phenomenon some years back when I was working on a network management tool that tried to draw a decent network diagram from the connectivity graph of the network. It was surprising how big a change in layout one would see from little tweaks to weightings and other parameters.
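
To make the seed sensitivity concrete, here is a toy sketch in Python. It is not how ISE or Vivado place a design; the netlist, the 1-D row "fabric" and the greedy swap loop are all made up for illustration. It only shows that a randomised optimiser started from different seeds tends to settle in different local minima.

```python
import random

def wirelength(order, nets):
    """Total distance between connected cells in a 1-D row placement."""
    pos = {cell: slot for slot, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

def place(n_cells, nets, seed, steps=2000):
    """Greedy pairwise-swap improvement from a seeded random start.

    Hypothetical stand-in for a placer: it never accepts a worse
    placement, so it stops in whatever local minimum it reaches.
    """
    rng = random.Random(seed)
    order = list(range(n_cells))
    rng.shuffle(order)
    cost = wirelength(order, nets)
    for _ in range(steps):
        i, j = rng.randrange(n_cells), rng.randrange(n_cells)
        order[i], order[j] = order[j], order[i]
        new_cost = wirelength(order, nets)
        if new_cost <= cost:
            cost = new_cost
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return cost

# Made-up netlist: 24 cells with random two-pin nets (assumption).
N = 24
net_rng = random.Random(0)
nets = [(net_rng.randrange(N), net_rng.randrange(N)) for _ in range(60)]
nets = [(a, b) for a, b in nets if a != b]

# Same design, same algorithm, only the seed changes.
results = {seed: place(N, nets, seed) for seed in range(8)}
for seed, cost in results.items():
    print(f"seed {seed}: final wirelength {cost}")
```

Running it prints one final cost per seed. The spread between them is the toy equivalent of one cost table meeting timing and another not.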

mrflibble · « **Reply #218 on:** January 31, 2018, 01:51:13 pm »

Quote from: Someone on July 03, 2017, 12:22:02 am

Fun indeed, I've access to fully licensed tools so might have a slight edge here (possibly some extra options/strategies unlocked) but I'm not running smart explorer to get the last few % out of the design and yet there appears to be a lot of slack available from the attempts so far.

ISE 14.7 xc6slx45-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 106 LUTs >440 MHz
Fully pipelined with 4 stages. 118 LUTs >540 MHz
(requires using both edges of clock)

ISE 14.7 xc7a100t-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 132 LUTs >580 MHz
Fully pipelined with 4 stages. 141 LUTs >580 MHz
(both switching limited)

Vivado X.X xc7a35t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >400 MHz

Vivado X.X xc7a100t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >410 MHz

Wow, that's pretty damn impressive. Especially the ISE result is ... well, fast!

Quote from: Someone

It's known that ISE can do a better synthesis job on many designs, but it's orphaned for device support now and harder to use going forward. But 7 series parts are easily 50-100% faster than Spartan 6, so many designs need to be reassessed for area/speed tradeoff and can be adapted to the new Vivado synthesis at the same time. These results above are using a sort algorithm better suited for FPGA implementation but still written with a high level functional description in VHDL, so it's not necessary to get down to gate level descriptions; rather, knowing how to map to resources allows you to design for minimum area while still using high level constructs.

Does this sort algorithm have a name? I'd guess maybe a bitonic sort network, but even then 580 MHz on a Spartan-6 is a neat result.

A somewhat related question, do you know of any good books or other forms of reference material where one can go and read up on the various parallel algorithms? Specifically with an eye to FPGA implementation, but if there's a good compendium of handy circuits for VLSI then that's certainly better than what I have now. I find that, as with any problem really, a large part of the job is "Pick the right data structure and algorithm / representation and possible operators". I don't mind reinventing the wheel every now and then, provided the payoff is some extra insight that can be used in future projects. But every once in a while it would be nice just to be able to browse the catalog as it were, read up on several ways to get the computation of the day done, and then pick one. Then "all" you will have to do is not fsck up the implementation. Which can be enough of a challenge already. Especially without coffee.

Someone · « **Reply #219 on:** February 01, 2018, 07:36:01 am »

Quote from: mrflibble on January 31, 2018, 01:51:13 pm

Does this sort algorithm have a name? I'd guess maybe a bitonic sort network, but even then 580 MHz on a Spartan-6 is a neat result. A somewhat related question, do you know of any good books or other forms of reference material where one can go and read up on the various parallel algorithms? Specifically with an eye to FPGA implementation, but if there's a good compendium of handy circuits for VLSI then that's certainly better than what I have now.

Even working to a specific sort algorithm it still needs a lot of experience to map that efficiently to primitives, and several networks can achieve the same result:
https://en.wikipedia.org/wiki/Bitonic_sorter
https://en.wikipedia.org/wiki/Batcher_odd–even_mergesort
https://en.wikipedia.org/wiki/Pairwise_sorting_network
(https://en.wikipedia.org/wiki/Sorting_network)
Any one might be optimal for the particular network/data size or platform. For algorithm design there aren't canned examples like with analog circuits, as the assumptions/constraints which can be used to optimise any given problem are tightly intertwined with the implementation. It's always good to spend some time looking at possible ways to solve the problem before committing too much effort into any single one.
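
As a rough illustration of what the linked pages describe, here is a minimal Python sketch that builds the comparator list of a bitonic sorting network and checks it with the zero-one principle. This is not the design benchmarked earlier in the thread; the function names and the `(lo, hi)` comparator convention are mine. Each stage is a set of independent compare-and-swap units, which is where pipeline registers would naturally go in hardware.

```python
from itertools import product

def bitonic_network(n):
    """Return the stages of a bitonic sorting network for n inputs.

    n must be a power of two. Each stage is a list of (lo, hi) pairs:
    the comparator puts the smaller value on wire `lo` and the larger
    on wire `hi`. Comparators within one stage touch disjoint wires.
    """
    stages = []
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    if i & k == 0:
                        stage.append((i, partner))   # ascending block
                    else:
                        stage.append((partner, i))   # descending block
            stages.append(stage)
            j //= 2
        k *= 2
    return stages

def apply_network(stages, data):
    """Run a list of values through the comparator stages."""
    out = list(data)
    for stage in stages:
        for lo, hi in stage:
            if out[lo] > out[hi]:
                out[lo], out[hi] = out[hi], out[lo]
    return out

def sorts_everything(stages, n):
    """Zero-one principle: a comparator network that sorts every
    0/1 input of length n sorts every input of length n."""
    return all(apply_network(stages, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))

net8 = bitonic_network(8)
print(len(net8), "stages,", sum(len(s) for s in net8), "comparators")
print(sorts_everything(net8, 8))
print(apply_network(net8, [5, 3, 8, 1, 9, 2, 7, 4]))
```

Swapping in a different generator (odd-even mergesort, pairwise) and counting stages and comparators is a cheap way to compare candidates before writing any HDL.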

mrflibble · « **Reply #220 on:** February 01, 2018, 08:15:24 pm »

Quote from: Someone on February 01, 2018, 07:36:01 am

For algorithm design there aren't canned examples like with analog circuits, as the assumptions/constraints which can be used to optimise any given problem are tightly intertwined with the implementation. It's always good to spend some time looking at possible ways to solve the problem before committing too much effort into any single one.

Oh I don't expect any canned examples. Besides, where would be the fun in that? Fully agreed on spending some time on multiple different ways to solve it. I guess the point I was trying to make, is that you can only spend time on those multiple different ways if you actually know about the existence of those different ways. Basically I would be happy already with a dictionary of algorithms usable on programmable logic, with a one line description. At least then I have a term I can google and hunt for papers to read. Right now it's you don't know what you don't know... For example a fat tree encoder is damn handy, but I don't think I'd ever come across that on the software side of algorithms. And the hardware side is definitely less accessible. Well, that or I need glasses + a google refresher course or something...

mrflibble · « **Reply #221 on:** February 04, 2018, 04:26:51 pm »

While working out the logic bits for another project, I just realized that I totally missed something with the sorting circuit.

At the time I was feeling all clever and stuff, because I had just optimized the way of doing a comparison. Before that I actually described the hardware as per the hardware-description-language mantra, so (a < b). Not totally unexpected, that gave crap timing. So I worked out what a comparison actually is in arithmetic, and implemented that. Yup, definitely better. Hence the feeling all clever and stuff. Only to realize now that yes, that may have been better, but I could have done in 1 slice what I did there using 2 whole frigging slices. Doh! Well great, now I have to try that as well. Curse you curiosity!
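
For anyone wondering what "a comparison in arithmetic" can look like, here is one common formulation as a Python sketch. It is an assumption on my part, not necessarily the circuit described above: for unsigned values, a < b exactly when the subtraction a - b borrows, i.e. when the carry out of a + (~b) + 1 is zero. The second function is the same thing written bit by bit, which is the shape that tends to map onto a carry chain. Function names and the default width are made up.

```python
def less_than_via_carry(a, b, width=8):
    """Unsigned a < b from the carry out of a + (~b) + 1."""
    mask = (1 << width) - 1
    total = (a & mask) + (~b & mask) + 1
    carry_out = (total >> width) & 1
    return carry_out == 0          # no carry out means a borrow happened

def less_than_ripple(a, b, width=8):
    """Same comparison, rippled from LSB to MSB.

    At each bit: if the bits are equal, keep the result from the lower
    bits; otherwise this bit decides (a_i = 0, b_i = 1 means less).
    """
    lt = False
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        if a_i != b_i:
            lt = (b_i == 1)
    return lt

# Exhaustive cross-check against the built-in operator for 4-bit values.
ok = all(
    less_than_via_carry(a, b, 4) == less_than_ripple(a, b, 4) == (a < b)
    for a in range(16)
    for b in range(16)
)
print(ok)
```

The per-bit form needs only "bits equal?" as a select and one of the operand bits as data, which is why a comparator can be packed much tighter than a generic (a < b) sometimes synthesises to.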

Incidentally, I was just checking the timing report of the old circuit. What kind of clock uncertainty to use? Would be good to compare apples with apples. Or just reduce every inconvenience to zero and see how high the numbers get. Because if benchmarks have taught me anything, it is that higher numbers moar better.
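
To see how much the uncertainty setting actually moves the number, the two formulas printed in the ISE timing report quoted earlier in this thread can be evaluated directly. A small Python sketch follows; the function names are mine, and the values are the ones from that quoted report (the "outputs registered only" design), not from the circuit discussed here.

```python
import math

def clock_uncertainty(tsj, tij, dj, pe):
    """((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE, as printed in the report."""
    return (math.sqrt(tsj ** 2 + tij ** 2) + dj) / 2 + pe

def setup_slack(requirement, data_path, clock_path, clock_arrival, uncertainty):
    """requirement - (data path - clock path - clock arrival + uncertainty)."""
    return requirement - (data_path - clock_path - clock_arrival + uncertainty)

# Values in ns, copied from the timing report quoted earlier in the thread.
unc = clock_uncertainty(tsj=0.050, tij=0.000, dj=0.000, pe=0.000)
slack = setup_slack(10.000, 25.822, 2.492, 0.000, unc)
slack_no_unc = setup_slack(10.000, 25.822, 2.492, 0.000, 0.0)

print(f"uncertainty              {unc:.3f} ns")
print(f"slack                    {slack:.3f} ns")
print(f"slack, zero uncertainty  {slack_no_unc:.3f} ns")
```

Uncertainty is subtracted one-for-one from slack, so zeroing it only buys back that same amount (0.025 ns in the quoted report). For apples with apples, the main thing is to use the same value in every run being compared.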

