Author Topic: Learning FPGAs: wrong approach? (Read 55692 times)

AndyC_772 · « **Reply #150 on:** June 26, 2017, 09:30:46 am »

Quote from: BrianHG on June 26, 2017, 08:39:31 am

To pipeline speed this one up, here is how I would do it:

(a)
Compare all inputs with each other and generate 4 sets of 2 bit selection flags/words.
--- and --- Store all 4 inputs in D-flipflop registers.

(b) The output of all 4 D-flipflop registers would feed 4 x 4:1 mux selection units, each unit receiving the 2 bit selection flags generating 4 sorted outputs.

I think that's a better algorithm, thanks.

My method requires three comparators on the first cycle, three more on the second cycle, and the inputs to some depend on the outputs of others, so there's an extra propagation delay to consider, which might limit fmax.

Your method also requires six comparators, but all their inputs are known at the start of the first cycle, so they can operate faster.

You also require multiplexers, but I'm willing to bet they're faster than logical comparators.
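
A rough Verilog sketch of the two-stage scheme quoted above might look like this. It is one reading of the description, not BrianHG's actual code: the module name `sorter_2stage`, the helper functions `pick` and `mux4`, the tie-break by input order and the unregistered outputs are all assumptions, and it has not been run through a synthesizer.

```verilog
// Sketch only: names and the tie-break rule are assumptions, not from the thread.
module sorter_2stage (
    input  wire       clk,
    input  wire [7:0] A, B, C, D,
    output wire [7:0] E, F, G, H      // E = largest ... H = smallest
);
    // (a) Six comparisons, all fed straight from the inputs.
    wire AgtB = (A > B), AgtC = (A > C), AgtD = (A > D);
    wire BgtC = (B > C), BgtD = (B > D), CgtD = (C > D);

    // Rank = how many other inputs this one beats. Ties go to the later
    // input, so the four ranks are always a permutation of 0..3 and D's
    // rank is implied by the other three.
    // '!' (not '~') keeps each flag a 1-bit value before it is widened.
    wire [1:0] rankA =  AgtB +  AgtC +  AgtD;
    wire [1:0] rankB = !AgtB +  BgtC +  BgtD;
    wire [1:0] rankC = !AgtC + !BgtC +  CgtD;

    // Index (0 = A .. 3 = D) of the input whose rank equals k.
    function [1:0] pick (input [1:0] k, input [1:0] rA, rB, rC);
        pick = (rA == k) ? 2'd0 :
               (rB == k) ? 2'd1 :
               (rC == k) ? 2'd2 : 2'd3;
    endfunction

    // Registers: the four inputs plus four 2-bit selection words.
    reg [7:0] a_q, b_q, c_q, d_q;
    reg [1:0] selE, selF, selG, selH;

    always @(posedge clk) begin
        a_q  <= A;  b_q <= B;  c_q <= C;  d_q <= D;
        selE <= pick(2'd3, rankA, rankB, rankC);
        selF <= pick(2'd2, rankA, rankB, rankC);
        selG <= pick(2'd1, rankA, rankB, rankC);
        selH <= pick(2'd0, rankA, rankB, rankC);
    end

    // (b) Four 4:1 muxes driven by the registered selection words.
    function [7:0] mux4 (input [1:0] s, input [7:0] a, b, c, d);
        case (s)
            2'd0:    mux4 = a;
            2'd1:    mux4 = b;
            2'd2:    mux4 = c;
            default: mux4 = d;
        endcase
    endfunction

    assign E = mux4(selE, a_q, b_q, c_q, d_q);
    assign F = mux4(selF, a_q, b_q, c_q, d_q);
    assign G = mux4(selG, a_q, b_q, c_q, d_q);
    assign H = mux4(selH, a_q, b_q, c_q, d_q);
endmodule
```

All six comparators see their inputs at the start of the cycle, which is the property that matters here; only the rank sum and the select decode sit between them and the registers.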

nctnico · « **Reply #151 on:** June 26, 2017, 10:02:10 am »

Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Cerebus · « **Reply #152 on:** June 26, 2017, 11:32:10 am »

A purely combinatorial version of the sorting problem in Verilog. Code actually simulated against random numbers and it works.

Because it's purely combinatorial one hopes that a decent synthesizer would mush this down to the minimum possible number of gates. If you want to count up discrete circuit elements it's 12 8-bit comparators, 12 1-bit adders, 16 2-bit comparators, 128 2-input AND gates and 8 4-bit OR gates.

Code:

module sorter (input wire [7:0] A, B, C, D, output wire [7:0] E, F, G, H);

// Each position is a population count of how many other inputs this input beats.
// Ties are broken by input order ('>' against later inputs, '>=' against earlier
// ones) so equal inputs still get distinct positions; with plain '>' everywhere,
// two equal inputs share a position and one output is left at zero.
wire AgtB = (A > B);
wire AgtC = (A > C);
wire AgtD = (A > D);
wire [1:0] Apos = (AgtB + AgtC + AgtD);

wire BgeA = (B >= A);
wire BgtC = (B > C);
wire BgtD = (B > D);
wire [1:0] Bpos = (BgeA + BgtC + BgtD);

wire CgeA = (C >= A);
wire CgeB = (C >= B);
wire CgtD = (C > D);
wire [1:0] Cpos = (CgeA + CgeB + CgtD);

wire DgeA = (D >= A);
wire DgeB = (D >= B);
wire DgeC = (D >= C);
wire [1:0] Dpos = (DgeA + DgeB + DgeC);

// For all you VHDL-only crowd the {8{aBit}} 'widens' the single bit to 8 bits
assign E = A & {8{Apos==3}} | B & {8{Bpos==3}} | C & {8{Cpos==3}} | D & {8{Dpos==3}};
assign F = A & {8{Apos==2}} | B & {8{Bpos==2}} | C & {8{Cpos==2}} | D & {8{Dpos==2}};
assign G = A & {8{Apos==1}} | B & {8{Bpos==1}} | C & {8{Cpos==1}} | D & {8{Dpos==1}};
assign H = A & {8{Apos==0}} | B & {8{Bpos==0}} | C & {8{Cpos==0}} | D & {8{Dpos==0}};

endmodule

Cerebus · « **Reply #153 on:** June 26, 2017, 11:35:51 am »

Quote from: nctnico on June 26, 2017, 10:02:10 am

Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?

hamster_nz · « **Reply #154 on:** June 26, 2017, 11:41:41 am »

Spent a night watching Doctor Who and fiddling with code.

All solutions are single-cycle, and outputs are registered, constrained for 200MHz. Results:

1) Bubble sort
96.833 MHz
10.327 ns
124 LUTs

2) A bit like a shell sort -
148.65 MHz
6.727 ns
105 LUTs

3) H/W optimized design (six tests to index a lookup table, that is then used to MUX the outputs), as per NorthGuy -
234.19 MHz
4.270 ns
61 LUTs

So the last design is more than twice as fast and just under half the size, but took 4x longer to write :-)

nctnico · « **Reply #155 on:** June 26, 2017, 11:49:02 am »

Quote from: Cerebus on June 26, 2017, 11:35:51 am

Quote from: nctnico on June 26, 2017, 10:02:10 am
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?

No, it lets the synthesizer deal with the problem. You might be surprised by the results.

BrianHG · « **Reply #156 on:** June 26, 2017, 12:06:06 pm »

Quote from: hamster_nz on June 26, 2017, 11:41:41 am

Spent a night watching Doctor Who and fiddling with code.

All solutions are single-cycle, and outputs are registered, constrained for 200MHz. Results:

1) Bubble sort
96.833 MHz
10.327 ns
124 LUTs

2) A bit like a shell sort -
148.65 MHz
6.727 ns
105 LUTs

3) H/W optimized design (six tests to index a lookup table, that is then used to MUX the outputs), as per NorthGuy -
234.19 MHz
4.270 ns
61 LUTs

So the last design is more than twice as fast and just under half the size, but took 4x longer to write :-)

The 2-stage HW optimized recommendation was not NorthGuy's, it was mine (BrianHG)...
As for my optimized designs taking longer to write: a decade ago I built 1080p video mixers and filters on really old, slow Cyclone 1 devices, with a buggy, crashing Quartus and slow compiles, so you can imagine my frustration. But to get such old FPGAs running 2-channel 30-bit color at 148.5MHz with simple DDR RAM, you'd better believe the chunks of Verilog I created were as compact and as fast as can be, without resorting to AHDL and with no special Altera functions other than the PLL clock function block and their pipelined multiply/add and dual-port RAM mega-functions.

Cerebus · « **Reply #157 on:** June 26, 2017, 12:09:45 pm »

Quote from: nctnico on June 26, 2017, 11:49:02 am

Quote from: Cerebus on June 26, 2017, 11:35:51 am
Quote from: nctnico on June 26, 2017, 10:02:10 am
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?
No, it lets the synthesizer deal with the problem. You might be surprised by the results.

I doubt the synthesizer is going to "write a VHDL function" for you. Sounds like the Montgomery Scott solution - [Fx: pick up mouse, use as microphone] "Computer: write me a VHDL function that sorts a variable number of 8 bit numbers, and pour me a nice single malt in the replicator."

NorthGuy · « **Reply #158 on:** June 26, 2017, 02:13:10 pm »

Quote from: BrianHG on June 26, 2017, 08:39:31 am

To pipeline speed this one up, here is how I would do it:

(a)
Compare all inputs with each other and generate 4 sets of 2 bit selection flags/words.
--- and --- Store all 4 inputs in D-flipflop registers.

(b) The output of all 4 D-flipflop registers would feed 4 x 4:1 mux selection units, each unit receiving the 2 bit selection flags generating 4 sorted outputs.

About pipelining.

The speed of combinatorial logic depends on the number of layers. Simple design has only one layer:

IN->LUT->OUT

Of course, there may be many parallel paths like that, but all LUTs are fed directly from the input. This makes it the fastest.

Then you introduce LUTs which depend on the values produced by other LUTs, like this:

IN->LUT->LUT->OUT

Here you have two layers of LUTs. You need to wait until the LUTs of the first layer settle and provide stable outputs to the LUTs of the second layer. Then you must wait for the LUTs of the second layer. Therefore, it takes longer. Each layer adds roughly 0.7ns on Xilinx.

The design we're discussing has 4 layers:

IN->LUT->LUT->LUT->LUT->OUT

Only the longest path affects the overall speed. For example, in this design there's a shorter path which goes from input to the final MUX. It only has one LUT. It could be done faster, but the presence of longer paths doesn't let the design run faster. The speed is roughly determined by the number of layers on the longest path.

Any combinatorial design can be pipelined.

You don't do it as AndyC suggested by splitting things which already can run in parallel. You do it by inserting flip-flops between combinatorial layers:

IN->LUT->LUT->FF->LUT->LUT->OUT

Now the clock doesn't need to wait for all four layers to complete. Once two layers are done, the flip-flop can clock and remember the intermediary result. On the next clock, the next two layers of LUTs will finish the job. You turned 4-layer design into 2-layer design, but now there's one clock delay.

One flip-flop must be inserted in every path, be it a simple wire or a LUT.

To maximize the clock speed, you need to minimize the number of layers. This can be done by inserting flip-flops exactly in the middle of LUT chain. In the example above, two layers go before the flip-flop and two layers go after it.

You don't do it as BrianHG did:

IN->LUT->LUT->LUT->FF->LUT->OUT

In his design, he put 3 layers (2 layers of comparison and one layer to generate MUX inputs) before the flip-flops, and only one layer (MUX) after the flip-flop. If you do this, the first stage will be 3-layer design and the second stage will be 1-layer design. Since they're clocked by the same clock, the overall design is still 3-layer. It is faster than 4-layer design, but it is slower than 2-layer design.

To get 2 layer design you need this:

IN->LUT->LUT->FF->LUT->LUT->OUT

Which means the 2 layers of comparisons go before the flip-flop, and everything else goes after, as this:

Stage 1. 6 bits of comparison results are saved using 6 flip-flops. Since flip-flops must go into every path, we also need 32 flip-flops to save the original inputs.

Stage 2. MUX input is generated from comparison results (one layer) and MUX selects the appropriate input (second layer).

This produces fast 2-layer design.

We can pipeline even further:

IN->LUT->FF->LUT->FF->LUT->FF->LUT->OUT

Now we've got one-layer design, which is as fast as it gets, but you need to wait 3 extra clocks to get the result. Also, this will be tedious to program - you'll have to pipeline comparison operations.
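
To make the balanced two-stage split concrete, here is a minimal Verilog sketch of "Stage 1" and "Stage 2" as described above. Treat it as an illustration under assumptions: the module name `sorter_balanced`, the signal names, and the tie-break by input order are all hypothetical, and the actual number of LUT layers depends on how the tools map the comparators.

```verilog
// Sketch only: names and the tie-break rule are assumptions, not from the thread.
module sorter_balanced (
    input  wire       clk,
    input  wire [7:0] A, B, C, D,
    output wire [7:0] E, F, G, H      // E = largest ... H = smallest
);
    // Stage 1: 32 flip-flops hold the inputs, 6 hold the comparison results.
    reg [7:0] a_q, b_q, c_q, d_q;
    reg       ab, ac, ad, bc, bd, cd; // e.g. ab = (A > B)

    always @(posedge clk) begin
        a_q <= A;  b_q <= B;  c_q <= C;  d_q <= D;
        ab  <= (A > B);  ac <= (A > C);  ad <= (A > D);
        bc  <= (B > C);  bd <= (B > D);  cd <= (C > D);
    end

    // Stage 2, first layer: turn the six stored bits into a rank per input.
    // Ties go to the later input, so the ranks are a permutation of 0..3
    // and D's rank is implied. '!' keeps each flag 1 bit wide before the add.
    wire [1:0] rankA =  ab +  ac +  ad;
    wire [1:0] rankB = !ab +  bc +  bd;
    wire [1:0] rankC = !ac + !bc +  cd;

    // Stage 2, second layer: each output selects the input holding its rank.
    assign E = (rankA == 2'd3) ? a_q : (rankB == 2'd3) ? b_q :
               (rankC == 2'd3) ? c_q : d_q;
    assign F = (rankA == 2'd2) ? a_q : (rankB == 2'd2) ? b_q :
               (rankC == 2'd2) ? c_q : d_q;
    assign G = (rankA == 2'd1) ? a_q : (rankB == 2'd1) ? b_q :
               (rankC == 2'd1) ? c_q : d_q;
    assign H = (rankA == 2'd0) ? a_q : (rankB == 2'd0) ? b_q :
               (rankC == 2'd0) ? c_q : d_q;
endmodule
```

The only difference from registering the MUX selects is where the flip-flops sit: here the raw comparison bits are stored and the select decode moves to the second stage.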

AndyC_772 · « **Reply #159 on:** June 26, 2017, 02:24:26 pm »

Quote from: NorthGuy on June 26, 2017, 02:13:10 pm

You don't do it as AndyC suggested by splitting things which already can run in parallel.

Just for the sake of clarity, what I had in mind was an implementation of the 'bubble sort' method, not the 'rank-then-multiplex method':

- on the first clock, perform the first three compare/swap operations (a-b, b-c, c-d). The outcome of each of these depends on the previous operation, so it takes 3 levels' worth of delay time

- on the second clock, perform the second set of three compare/swaps (a-b, b-c, a-b) on the intermediate results which were stored after the first clock.

The overall effect is to split a logical operation that would have taken 6 levels' worth of delay into two operations, each of which takes only 3. It would, of course, be possible to split this into a 6-stage pipe, each stage of which does just one compare/swap, and that might not be a bad implementation at all if you don't mind the latency or storage requirement.
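
As a sketch of what I mean, assuming a hypothetical `cswap` helper that returns the smaller value in the upper byte and the larger in the lower byte, and assuming the final outputs are registered as well:

```verilog
// Sketch only: module, port and helper names are hypothetical.
module bubble_2stage (
    input  wire       clk,
    input  wire [7:0] a, b, c, d,
    output reg  [7:0] s0, s1, s2, s3  // ascending: s0 = smallest
);
    // One compare/swap: returns {min, max}.
    function [15:0] cswap (input [7:0] x, input [7:0] y);
        cswap = (x > y) ? {y, x} : {x, y};
    endfunction

    // First clock: a-b, b-c, c-d. Each step depends on the previous one,
    // so this is three levels of compare/swap delay.
    wire [7:0] x0, x1, y1, y2, z2, z3;
    assign {x0, x1} = cswap(a,  b);
    assign {y1, y2} = cswap(x1, c);
    assign {z2, z3} = cswap(y2, d);   // z3 is now the largest value

    reg [7:0] q0, q1, q2, q3;         // intermediate results
    always @(posedge clk) begin
        q0 <= x0;  q1 <= y1;  q2 <= z2;  q3 <= z3;
    end

    // Second clock: a-b, b-c, a-b on the stored values, again three levels.
    wire [7:0] u0, u1, v1, v2, w0, w1;
    assign {u0, u1} = cswap(q0, q1);
    assign {v1, v2} = cswap(u1, q2);  // v2 is the second largest
    assign {w0, w1} = cswap(u0, v1);

    always @(posedge clk) begin
        s0 <= w0;  s1 <= w1;  s2 <= v2;  s3 <= q3;
    end
endmodule
```

Note the output order here is ascending, the opposite of the E..H ordering in the rank-then-multiplex code earlier in the thread.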

That's not splitting things that can run in parallel... is it?

nctnico · « **Reply #160 on:** June 26, 2017, 02:42:33 pm »

Quote from: Cerebus on June 26, 2017, 12:09:45 pm

Quote from: nctnico on June 26, 2017, 11:49:02 am
Quote from: Cerebus on June 26, 2017, 11:35:51 am
Quote from: nctnico on June 26, 2017, 10:02:10 am
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?
No, it lets the synthesizer deal with the problem. You might be surprised by the results.

I doubt the synthesizer is going to "write a VHDL function" for you. Sounds like the Montgomery Scott solution - [Fx: pick up mouse, use as microphone] "Computer: write me a VHDL function that sorts a variable number of 8 bit numbers, and pour me a nice single malt in the replicator."

Duhhu. You are supposed to write the VHDL function yourself but let the synthesizer deal with the actual implementation.

Cerebus · « **Reply #161 on:** June 26, 2017, 02:54:16 pm »

Quote from: nctnico on June 26, 2017, 02:42:33 pm

Quote from: Cerebus on June 26, 2017, 12:09:45 pm
Quote from: nctnico on June 26, 2017, 11:49:02 am
Quote from: Cerebus on June 26, 2017, 11:35:51 am
Quote from: nctnico on June 26, 2017, 10:02:10 am
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?
No, it lets the synthesizer deal with the problem. You might be surprised by the results.

I doubt the synthesizer is going to "write a VHDL function" for you. Sounds like the Montgomery Scott solution - [Fx: pick up mouse, use as microphone] "Computer: write me a VHDL function that sorts a variable number of 8 bit numbers, and pour me a nice single malt in the replicator."
Duhhu . You are supposed to write the VHDL function yourself but let the synthesizer deal with the actual implementation.

Indeed one is, but you just waved your hand and regally said 'Let it be done', that's what I'm poking fun at.

nctnico · « **Reply #162 on:** June 26, 2017, 03:19:11 pm »

Quote from: hamster_nz on June 26, 2017, 11:41:41 am

Spent a night watching Doctor Who and fiddling with code.

All solutions are single-cycle, and outputs are registered, constrained for 200MHz. Results:

1) Bubble sort
96.833 MHz
10.327 ns
124 LUTs

2) A bit like a shell sort -
148.65 MHz
6.727 ns
105 LUTs

3) H/W optimized design (six tests to index a lookup table, that is then used to MUX the outputs), as per NorthGuy -
234.19 MHz
4.270 ns
61 LUTs

So the last design is more than twice as fast and just under half the size, but took 4x longer to write :-)

I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl
Result: 73 LUTs when optimised for speed and 70 LUTs when optimised for area (Spartan6)

The result speaks for itself. The synthesizer does a way better job than off-the-cuff hardware-like implementations in HDL, so just describe the problem and let the synthesizer deal with it. These discussions remind me of the endless C versus assembly arguments.
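
For those who prefer Verilog, the same 'software approach' can be sketched as below. This is a loose translation of the idea, not the code from the Stack Overflow link: the module name `bubble_loop`, its parameters `N` and `W`, and the packed-vector ports are hypothetical, and I have not run this version through the tools.

```verilog
// Sketch only: a for-loop bubble sort left to the synthesizer to unroll.
module bubble_loop #(
    parameter N = 4,                  // number of values
    parameter W = 8                   // bits per value
) (
    input  wire           clk,
    input  wire [N*W-1:0] din,        // value i sits at bits [i*W +: W]
    output reg  [N*W-1:0] dout        // sorted ascending, value 0 smallest
);
    // Blocking-assigned temporaries, playing the role of VHDL variables.
    // They are written before they are read on every clock, so they
    // describe combinational logic, not extra storage.
    reg [W-1:0] v [0:N-1];
    reg [W-1:0] tmp;
    integer i, j;

    always @(posedge clk) begin
        // Unpack the input vector.
        for (i = 0; i < N; i = i + 1)
            v[i] = din[i*W +: W];

        // Plain bubble sort; the loop bounds are constants, so the tools
        // unroll this into a fixed chain of compare/swap stages.
        for (j = 0; j < N - 1; j = j + 1)
            for (i = 0; i < N - 1 - j; i = i + 1)
                if (v[i] > v[i+1]) begin
                    tmp    = v[i];
                    v[i]   = v[i+1];
                    v[i+1] = tmp;
                end

        // Register the sorted result.
        for (i = 0; i < N; i = i + 1)
            dout[i*W +: W] <= v[i];
    end
endmodule
```

With N = 4 the loops unroll into the familiar six compare/swap stages; what happens after that is up to logic minimization.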

NorthGuy · « **Reply #163 on:** June 26, 2017, 04:05:10 pm »

Quote from: nctnico on June 26, 2017, 03:19:11 pm

I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl
Result: 73 LUTs when optimised for speed and 70 LUTs when optimised for area (Spartan6)

What clock speed are you getting with this?

nctnico · « **Reply #164 on:** June 26, 2017, 04:08:56 pm »

Quote from: NorthGuy on June 26, 2017, 04:05:10 pm

Quote from: nctnico on June 26, 2017, 03:19:11 pm
I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl
Result: 73 LUTs when optimised for speed and 70 LUTs when optimised for area (Spartan6)
What clock speed are you getting with this?

That depends entirely on the FPGA so I didn't include that.

NorthGuy · « **Reply #165 on:** June 26, 2017, 04:28:26 pm »

Quote from: nctnico on June 26, 2017, 04:08:56 pm

That depends entirely on the FPGA so I didn't include that.

Please tell us.

LUTs also depend on the FPGA. Spartan-6 has 6-input LUTs. Others have 4-input LUTs, so you would need a lot more of them for the same design.

nctnico · « **Reply #166 on:** June 26, 2017, 05:00:56 pm »

Quote from: NorthGuy on June 26, 2017, 04:28:26 pm

Quote from: nctnico on June 26, 2017, 04:08:56 pm
That depends entirely on the FPGA so I didn't include that.
Please tell us.

LUTs also depend on the FPGA. Spartan-6 has 6-input LUTs. Others have 4-input LUTs, so you would need a lot more of them for the same design.

On a Spartan6 speed grade 2 device I can go slightly over 100MHz (while making sure all paths are constrained by adding extra input and output registers). If I enable 'register balancing' (more or less automatic pipelining) I can get it to run at over 400MHz. Both frequencies come from place&routed designs.

Yansi · « **Reply #167 on:** June 26, 2017, 05:39:06 pm »

Quote from: Bruce Abbott on June 24, 2017, 08:07:33 pm

I am currently reading this tutorial, which is making a lot more sense to me so far...

Thank you very much for that book. Might be really helpful for me, a total dumb CPLD/FPGA beginner that spent all of his previous life with sequential MCUs.

NorthGuy · « **Reply #168 on:** June 26, 2017, 05:40:02 pm »

Quote from: nctnico on June 26, 2017, 05:00:56 pm

On a Spartan6 speed grade 2 device I can go slightly over 100MHz (while making sure all paths are constrained by adding extra input and output registers). If I enable 'register balancing' (more or less automatic pipelining) I can get it to run at over 400MHz. Both frequencies come from place&routed designs.

This is a similar result to what hamster_nz has posted. The "hardware mindset" produces about 2x speed for combinatorial logic compared to the "software mindset" optimized with tools. This is about the same speed difference as the difference between Xilinx UltraScale+ and Spartan-6.

I'm surprised that the tools didn't do a better job. They're taking so much time from the code to bitstream. What the hell are they doing all this time? I expected their optimizations to be nearly perfect.

nctnico · « **Reply #169 on:** June 26, 2017, 05:49:31 pm »

Quote from: NorthGuy on June 26, 2017, 05:40:02 pm

Quote from: nctnico on June 26, 2017, 05:00:56 pm
On a Spartan6 speed grade 2 device I can go slightly over 100MHz (while making sure all paths are constrained by adding extra input and output registers). If I enable 'register balancing' (more or less automatic pipelining) I can get it to run at over 400MHz. Both frequencies come from place&routed designs.
This is a similar result to what hamster_nz has posted. The "hardware mindset" produces about 2x speed for combinatorial logic compared to the "software mindset" optimized with tools. This is about the same speed difference as the difference between Xilinx UltraScale+ and Spartan-6.

Without knowing which FPGA Hamster_nz targeted and what synthesis settings he used you can't make this comparison. So where do you get a 2x speed improvement from? Also 400MHz is more than 234MHz so I'd say the 'software approach' is ahead for now.

NorthGuy · « **Reply #170 on:** June 26, 2017, 06:36:23 pm »

Quote from: nctnico on June 26, 2017, 05:49:31 pm

Without knowing which FPGA Hamster_nz targeted and what synthesis settings he used you can't make this comparison. So where do you get a 2x speed improvement from?

Whatever he used was the same FPGA and he's got roughly 2x difference. Your numbers are similar to his, and why wouldn't they be - you did the same thing.

Quote from: nctnico on June 26, 2017, 05:49:31 pm

Also 400MHz is more than 234MHz so I'd say the 'software approach' is ahead for now.

As I explained a few posts ago, you can pipeline any pure combinatorial design.

The speed of the design depends on the number of combinatorial layers. You can either run all layers in a single clock, in which case your clock speed gets limited, or you can pipeline the layers (by inserting flip-flops between them). If completely pipelined, the clock speed will be roughly the same for any design, but there will be one extra clock of delay for every combinatorial layer you remove by pipelining.

It is meaningless to compare pipelined design with purely combinatorial design in terms of clock speed (or in terms of clock cycles for that matter).

Cerebus · « **Reply #171 on:** June 26, 2017, 07:06:17 pm »

Quote from: NorthGuy on June 26, 2017, 06:36:23 pm

The speed of the design depends on the number of combinatorial layers. You can either run all layers in a single clock, in which case your clock speed gets limited, or you can pipeline the layers (by inserting flip-flops between them). If completely pipelined, the clock speed will be roughly the same for any design, but there will be one extra clock of delay for every combinatorial layer you remove by pipelining.

It is meaningless to compare pipelined design with purely combinatorial design in terms of clock speed (or in terms of clock cycles for that matter).

It would be helpful if you didn't use 'speed' for both 'latency' and 'throughput'; what you're trying to say would be much clearer if you used the two separate terms.

hamster_nz · « **Reply #172 on:** June 26, 2017, 07:09:27 pm »

Quote from: nctnico on June 26, 2017, 03:19:11 pm

Quote from: hamster_nz on June 26, 2017, 11:41:41 am
Spent a night watching Doctor Who and fiddling with code.

All solutions are single-cycle, and outputs are registered, constrained for 200MHz. Results:

1) Bubble sort
96.833 MHz
10.327 ns
124 LUTs

2) A bit like a shell sort -
148.65 MHz
6.727 ns
105 LUTs

3) H/W optimized design (six tests to index a lookup table, that is then used to MUX the outputs), as per NorthGuy -
234.19 MHz
4.270 ns
61 LUTs

So the last design is more than twice as fast and just under half the size, but took 4x longer to write :-)
I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl
Result: 73 LUTs when optimised for speed and 70 LUTs when optimised for area (Spartan6)

The result speaks for itself. The synthesizer does a way better job than off-the-cuff hardware-like implementations in HDL, so just describe the problem and let the synthesizer deal with it. These discussions remind me of the endless C versus assembly arguments.

TLDR: Can you check that the result is actually a LUT count, and not an occupied slice count?

Really interesting! Your results literally kept me awake at night....

A 4-element bubble sort is six identical compare-then-maybe-swap stages. Each stage requires an 8-bit comparison and two 2:1 8-bit MUXes, around 4+2*8 = 20 LUTs. That checks out with my numbers, as 124 / 6 is about 20.7. The second method only uses five of these stages, hence it uses 5/6th the resources.

Performance-wise the critical path of the bubble sort is through all six compare-then-maybe-swap stages, and in my second method it is only four stages, hence the second method clocking around 50% faster.

The final method uses six 8-bit compares, a 32x8-bit memory, and four 8-bit 4:1 MUXes, so should use around 6*4+8+32 = 64 LUTs. It gets its efficiency by having the pre-computed (and somewhat error prone) values in the 32x8 memory. It removes some of the work required and everything fits nicely with a LUT-6 architecture. As the critical path is only through a comparison and two LUTs it should be about 3x faster than the bubble sort (as it may well be, if I constrain it harder).

If your method is a bubble sort (and I don't doubt it is), and does use 73 LUTs (which I slightly doubt), then it has taken about 12 LUTs per stage (73 / 6) to do what should take at least 20, and I want to know why!

If it is a slice count, then the LUT count is most likely around the 120 number that I would expect, and my universe is back in balance, and I will sleep well.

The performance is also pretty good for what is a generation older silicon, but not so good that I suspect a bug.

nctnico · « **Reply #173 on:** June 26, 2017, 07:33:20 pm »

Actually my earlier LUT number is for an Artix7. Somehow ISE didn't catch I wanted to use a Spartan 6! The other numbers (speed) are for the Spartan6 design. The Spartan 6 design uses 79 Slice LUTs and occupies 33 slices (optimised for speed). I think your reasoning goes off the trail because the synthesizer turns the problem into logic equations which are then minimized keeping the architecture of the FPGA in mind. This means that some of the hardware you describe is probably combined in a way you can't see when designing 'in hardware'. I think it is very similar to a C compiler optimising for pre-fetching and caching.

Bruce Abbott · « **Reply #174 on:** June 26, 2017, 07:48:22 pm »

Quote from: nctnico on June 26, 2017, 03:19:11 pm

I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl

I have a question about that code.

This:-

Code:

        if rising_edge(clk) then
            for j in bubble'LEFT to bubble'RIGHT - 1 loop 
                for i in bubble'LEFT to bubble'RIGHT - 1 - j loop 
                    if unsigned(var_array(i)) > unsigned(var_array(i + 1)) then
                        temp := var_array(i);
                        var_array(i) := var_array(i + 1);
                        var_array(i + 1) := temp;
                    end if;
                end loop;
            end loop;
            sorted_array <= var_array;
        end if;

unfolds into multiple iterations (with different array indexes) of this, right?

Code:

if unsigned(var_array(0)) > unsigned(var_array(1)) then
    temp := var_array(0);
    var_array(0) := var_array(1);
    var_array(1) := temp;
end if;

So we have a comparator whose output determines whether the two array entries are either 1. swapped, or 2. left alone. This is all happening during one clock cycle, and the ':=' means that the operation occurs immediately, i.e. the logic is not clocked but simply runs as fast as it can, right? What stops the values in temp, var_array(0) and var_array(1) from continuously cycling around until the comparator changes state?

