Learning FPGAs: wrong approach?

#125 Reply
Posted by Amazing on 25 Jun, 2017 03:41
Quote from: BrianHG on 24 Jun, 2017 01:35
How many of you feel this way?
Or, do you go to FPGA just because you have no other choice and would prefer a simple MCU only solution?

I got dragged into FPGA work kicking and screaming due to a contractor that bailed on the project after creating the hardware but before writing the VHDL. So we were stuck with an FPGA-based board and no one to program it.

I got lucky in that I found another contractor who was a wiz at VHDL and he got our board going. He also taught me a ton and now I really enjoy being able to harness the power of FPGAs in my designs.

One thing that I think is really fun is getting deeply involved in breaking down a problem, designing cores (e.g. ALUs) specifically for that problem, and pipelining out the wazoo to increase efficiency.

Sadly though I tend not to have time for that sort of thing on a paying gig -- then I just buy the next size up, describe the logic in state machines, and let the synthesizer do its thing. Much more cost efficient for low volume production that way.

What I learned about writing VHDL is that it's all about the mindset. EEs love to remind us software folks to "remember, you're creating hardware, not writing a program". But it's really much deeper and less obvious than that.

To everyone learning for the first time, I'd say, persevere, take small steps, don't worry about simulation or test benches at first, and read as much as you can on different styles of programming VHDL. Eventually it will soak in and you will "get it".

Quote from: BrianHG on 25 Jun, 2017 01:48
I use Verilog to make my life soooo much easier. Especially if you use a simple single synchronous clock for everything, nothing asynchronous. Coding this way makes very portable designs across all FPGAs and PLDs. As for the above VHDL example, it twists my head and I avoid it at all cost and won't ever use it.

That's funny, I learned VHDL first and I think that Verilog is incomprehensible.

#126 Reply
Posted by mikeselectricstuff on 25 Jun, 2017 08:53
You must never forget that you are describing hardware.
<= means 'is connected to', not 'becomes equal to'.

#127 Reply
Posted by hamster_nz on 25 Jun, 2017 09:55
Quote from: mikeselectricstuff on 25 Jun, 2017 08:53
You must never forget that you are describing hardware.
<= means 'is connected to', not 'becomes equal to'.

"is connected to" doesn't work in clocked processes. e.g:

Code:
if rising_edge(clk) then
  a <= b;
end if;
I don't think of it as "on the rising edge of 'clk', 'a' is connected to 'b'" - if asked to describe it, I would say "on each 'clk' tick, store the value of 'b' in 'a'".

I can't actually find the words that match what "<=" does in each of the different contexts in which it is used. I more think of "=>" as 'connected to', for example

Code:
i_counter: counter port map (
  clk   => sys_clk,
  count => cycle_count
);
I think of that as "a counter, with 'clk' connected to 'sys_clk' and 'count' connected to 'cycle_count'"...

#128 Reply
Posted by AndyC_772 on 25 Jun, 2017 10:08
Quote from: mikeselectricstuff on 25 Jun, 2017 08:53
<= means 'is connected to', not 'becomes equal to'.

Sorry, Mike, I don't agree with you there. The only symbol which means "is connected to" is "=>", when used to map the ports of a component to signals at a higher level of hierarchy:

Code:
my_logic_gate: d_type PORT MAP (
  d_in  => my_data,
  q_out => my_output,
  clk   => master_clock
);
I think of "<=" as meaning "takes its new value from", or indeed, "becomes equal to" (at a point in the future one time quantum from now, but not actually now).

In a clocked process:
Code:
PROCESS (clk) -- exchange the values of a and b on every clock edge
BEGIN
  IF clk'event AND clk = '1' THEN
    a <= b;
    b <= a;
  END IF;
END PROCESS;
...or in asynchronous logic...

Code:
PROCESS (a)
BEGIN
  b <= NOT a;
END PROCESS;

#129 Reply
Posted by nctnico on 25 Jun, 2017 10:22
Quote from: mikeselectricstuff on 25 Jun, 2017 08:53
You must never forget that you are describing hardware.
Actually you must forget about the hardware, otherwise you'll be writing way too much code. When programming in C you are also not going to bother whether a variable is stored in register r1 or r2, or where exactly it is in RAM. VHDL is the same. For example: you can write a<=a*(b+d) +c; in VHDL and the synthesizer will figure out that it needs a multiplier and how it needs to be connected. No need to instantiate it yourself and deal with how it is actually connected.

#130 Reply
Posted by AndyC_772 on 25 Jun, 2017 10:27
Quote from: Bruce Abbott on 24 Jun, 2017 20:07
What does 'signal' mean in this context? Why is there no mode? What does ':=' mean? Why do we need 'begin'? nq is less than or equal to nq0?

I hate this, when tutorials are written by people a little too familiar with the subject matter, and they begin with material that should have been on about page 5, leaving out the important introduction to the subject (definitions, context, general explanation of what the heck is going on) which should have filled pages 1 to 4.

A "signal" is any value which needs to be stored, or output from the device. Almost every piece of data which your FPGA handles will be a "signal". The values of signals are generally retained in the D-type latches which form part of the FPGA fabric.

I don't know what you mean by "mode" in this context.

":=" is a symbol used, in this context, to assign a default value to a signal, which it will have at the point when the FPGA has just been powered up and configured. It's a method often used to ensure that counters start at zero, state machines initialise to a valid 'idle' state, and so on.

"<=" does indeed mean "less than or equal" when used in the context of a comparison, but here, it means assignment (see long rambling posts above).

"Begin" just means "by this point, we've declared all the signals we're going to use... now here's the logic which defines their behaviour". It's just semantics. Some things must go before the 'begin', and some after. Don't read too much into it, just copy an example and structure your code the same way.

#131 Reply
Posted by Cerebus on 25 Jun, 2017 11:15
Quote from: AndyC_772 on 25 Jun, 2017 10:27
Quote from: Bruce Abbott on 24 Jun, 2017 20:07
What does 'signal' mean in this context? Why is there no mode? What does ':=' mean? Why do we need 'begin'? nq is less than or equal to nq0?

I hate this, when tutorials are written by people a little too familiar with the subject matter, and they begin with material that should have been on about page 5, leaving out the important introduction to the subject (definitions, context, general explanation of what the heck is going on) which should have filled pages 1 to 4.

A "signal" is ...

I don't know what you mean by "mode" in this context.

":=" is a ...

"<=" does indeed mean ....

"Begin" just means "by this point, we've declared all the signals we're going to use... now here's the logic which defines their behaviour". It's just semantics. Some things must go before the 'begin', and some after. Don't read too much into it, just copy an example and structure your code the same way.

I think Bruce's questions were meant to be rhetorical. And I think you mean 'syntax' not "semantics".

#132 Reply
Posted by NorthGuy on 25 Jun, 2017 13:02
"<=" doesn't mean "connect", but it infers connection(s).

The only way to make things work on a breadboard is to place ICs and connect them with wires.

An FPGA is a huge collection of elements (LUTs, FFs, RAM etc.). They're connected through configuration switches. The bitstream is simply a collection of bits. Each bit controls a switch (or switches), thus making or breaking a connection.

The VHDL code is simply a mechanism to convey which connections are needed.

Code:
PROCESS (clk) -- exchange the values of a and b on every clock edge
BEGIN
  IF clk'event AND clk = '1' THEN
    -- A signal which changes in this block is going to be a flip-flop clocked by clk
    a <= b; -- connect the output of flip-flop b to the input of flip-flop a
    b <= a; -- connect the output of flip-flop a to the input of flip-flop b
  END IF;
END PROCESS;
Code:
PROCESS (a)
BEGIN
  b <= NOT a; -- build an inverter. Connect its input to a and output to b.
END PROCESS;

#133 Reply
Posted by nctnico on 25 Jun, 2017 13:38
What is wrong with seeing <= and := as assignment operators? Just like in C the = assigns the value from what is on the right to what is on the left. In VHDL <= and := assign what is on the right to what is on the left so there really isn't any difference.

#134 Reply
Posted by rstofer on 25 Jun, 2017 15:32
Quote from: BrianHG on 25 Jun, 2017 01:48
Quote from: Mattjd on 24 Jun, 2017 23:59
Now i feel like everything i've learned about HDL is wrong.
I use Verilog to make my life soooo much easier. Especially if you use a simple single synchronous clock for everything, nothing asynchronous. Coding this way makes very portable designs across all FPGAs and PLDs. As for the above VHDL example, it twists my head and I avoid it at all cost and won't ever use it.

It's odd how the language you start with becomes your language of choice. I started with VHDL and, for the life of me, I can't figure out Verilog. VHDL tends to be more Pascal like in that it is quite verbose. Verilog, in my view, is C like in that it can be quite terse.

I have made several half-hearted attempts to understand Verilog and I can't get there. What I need to do is design an entire project using only Verilog and force myself to work with it. But, no, I will get to the point where all I want is the finished project and it will be coded in VHDL.

I have NEVER understood the difference between blocking and non-blocking assignments in an 'always' block and whether it matters if the block is clocked. I read this and get completely confused...

https://electronics.stackexchange.com/questions/91688/difference-between-blocking-and-nonblocking-assignment-verilog

In VHDL, it's a simple concept: If the block is clocked, all assignments in the block are registered. If the block isn't clocked, all assignments are combinatorial. THIS I can understand!

Verilog has the '=' symbol for 'blocking' assignment and '<=' for 'non-blocking' assignments (whatever that may mean). But the idea that one creates sequential logic and the other creates parallel logic within the 'always' block escapes me. It's ALL parallel inside the chip!

I think I'm just too old to catch on...

#135 Reply
Posted by mikeselectricstuff on 25 Jun, 2017 16:16
Quote from: nctnico on 25 Jun, 2017 13:38
What is wrong with seeing <= and := as assignment operators? Just like in C the = assigns the value from what is on the right to what is on the left. In VHDL <= and := assign what is on the right to what is on the left so there really isn't any difference.
The problem is that in a programming language, assignment happens at a specific moment. In asynchronous logic, the assignment is effectively happening continuously.

#136 Reply
Posted by Cerebus on 25 Jun, 2017 16:44
Quote from: rstofer on 25 Jun, 2017 15:32
Verilog has the '=' symbol for 'blocking' assignment and '<=' for 'non-blocking' assignments (whatever that may mean).

A 'blocking' assignment blocks anything else from happening (simultaneously) in the same code block while the assignment is happening, a 'non-blocking' one doesn't.

So, if we start off with three registers and their initial values A=1, B=2 and C=3.

If we execute the following sequence of blocking assignments:

begin B = A; C = B; end

we get the result A=1, B=1, C=1. That is, the first statement executed in its entirety before the second, each blocking assignment is 'executed' in sequence. Now let's do the same thing with non-blocking assignments, and the same initial values as before:

begin B <= A; C <= B; end

This time the result is A=1, B=1, C=2. The values for the right hand sides were taken as we 'passed' 'begin', all the assignments happened simultaneously, and they all finished at the same time, just as we reached 'end'.

That's slightly simplistic and probably wouldn't satisfy a language lawyer, but it gives the essential flavour of what's going on.

The blocking assignment is useful in writing test beds and the like but dangerous, and usually wrong, in writing code that you actually expect to be implemented in hardware. You can fake up quite a complex signal for a test bed by combining blocking assignments with delays but that kind of usage is not synthesizable and so will never make it to real hardware.
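
For anyone who finds it easier to see this in ordinary software, the two examples above can be mimicked with a minimal Python sketch. It is only a toy model of a single pass through the begin/end block, not a Verilog simulator, and the names run_blocking, run_nonblocking and the dict of registers are made up for illustration.

```python
# Toy model of one pass through a Verilog begin/end block.
# Registers live in a dict; this is an illustration, not a simulator.

def run_blocking(regs, assignments):
    """Apply each (dest, src) assignment in order; later ones see earlier results."""
    regs = dict(regs)
    for dest, src in assignments:
        regs[dest] = regs[src]
    return regs

def run_nonblocking(regs, assignments):
    """Sample every right-hand side first, then update all destinations together."""
    sampled = [(dest, regs[src]) for dest, src in assignments]
    regs = dict(regs)
    for dest, value in sampled:
        regs[dest] = value
    return regs

initial = {"A": 1, "B": 2, "C": 3}
block = [("B", "A"), ("C", "B")]  # B gets A, then C gets B

blocking_result = run_blocking(initial, block)
nonblocking_result = run_nonblocking(initial, block)
print(blocking_result)     # {'A': 1, 'B': 1, 'C': 1}
print(nonblocking_result)  # {'A': 1, 'B': 1, 'C': 2}
```

The only difference between the two functions is when the right-hand sides are read, which is the whole point of the blocking versus non-blocking distinction.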

#137 Reply
Posted by Bruce Abbott on 25 Jun, 2017 16:50
Quote from: Cerebus on 25 Jun, 2017 11:15
I think Bruce's questions were meant to be rhetorical.
At the time I read the tutorial that was what I was thinking. I now know better, but this thread is helping to clarify some things in my mind.

#138 Reply
Posted by NorthGuy on 25 Jun, 2017 17:11
Quote from: Cerebus on 25 Jun, 2017 16:44
Quote from: rstofer on 25 Jun, 2017 15:32
Verilog has the '=' symbol for 'blocking' assignment and '<=' for 'non-blocking' assignments (whatever that may mean).

A 'blocking' assignment blocks anything else from happening (simultaneously) in the same code block while the assignment is happening, a 'non-blocking' one doesn't.

So, if we start off with three registers and their initial values A=1, B=2 and C=3.

If we execute the following sequence of blocking assignments:

begin B = A; C = B; end

we get the result A=1, B=1, C=1. That is, the first statement executed in its entirety before the second, each blocking assignment is 'executed' in sequence. Now let's do the same thing with non-blocking assignments, and the same initial values as before:

begin B <= A; C <= B; end

This time the result is A=1, B=1, C=2. The values for the right hand sides were taken as we 'passed' 'begin', all the assignments happened simultaneously, and they all finished at the same time, just as we reached 'end'.

That's slightly simplistic and probably wouldn't satisfy a language lawyer, but it gives the essential flavour of what's going on.

The blocking assignment is useful in writing test beds and the like but dangerous, and usually wrong, in writing code that you actually expect to be implemented in hardware. You can fake up quite a complex signal for a test bed by combining blocking assignments with delays but that kind of usage is not synthesizable and so will never make it to real hardware.

The first one infers a flip-flop with A as input and both B and C as outputs. Sequential (blocking) execution of Verilog statements produces parallel wiring.

The second one infers two flip-flops connected in a chain. A->FF->B->FF->C. Parallel (non-blocking) execution of Verilog statements produces serial wiring.

This is certainly a case of weird terminology.

I use VHDL because I started with it (pure coincidence). I have no intention of using Verilog. VHDL lets me do everything I would want it to do. I'm absolutely sure that if I had started with Verilog, the situation would be reversed and I would never want to use VHDL. Just as rstofer suggested. Imprinting.

#139 Reply
Posted by hans on 25 Jun, 2017 17:36
in VHDL:

"<=" is used in assignments of signals.
":=" is used for assignment of variables.

Signals can exist in an architecture, process and procedures.
Variables can exist in process and functions.

A signal at an architecture level is basically a wire. It connects signals together with perhaps a few gates, like:
Code:
ARCHITECTURE ... OF ... IS
  SIGNAL a, b, c : STD_LOGIC;
BEGIN
  a <= b AND c;
END ARCHITECTURE;
This way you can compute new values within an entity (not shown in example).

Using a process you could compute new values of a at the rising edge of a clock, i.e. sequential logic:
Code:
ARCHITECTURE ... OF ... IS
  SIGNAL a, b, c : STD_LOGIC;
BEGIN
  PROCESS(clk)
  BEGIN
    IF rising_edge(clk) THEN
      a <= b AND c;
    END IF;
  END PROCESS;
END ARCHITECTURE;
Why have variables when we have signals? Because if you assign a new value to a signal, its new value will not take effect immediately. Only after the process has finished running is the new value used.

A variable, however, is updated instantly, so you can assign a value and then read that new value from it. A variable does hold its value after you "exit" the process as well. But you cannot use them in an architecture, so they are best used as intermediate values.

In terms of simulation this is a key difference. Signals are simulated using delta delays. That means that if a new value is assigned to a signal, it's delayed to take that value at t+1 'delta'. If new values for other signals need to be computed (e.g. b or c changed in first example) it will happen at t+2 delta, t+3 delta, etc. Delta is an arbitrary time stamp, just to differentiate it will happen slightly later in the future.

Because all statements in a process happen at one timestamp, time can only advance when the process is left or a wait statement has been hit (unusual to do if you target hardware, especially using free tools).

In terms of synthesis onto real hardware, either a signal or a variable in a process can result in a wire or a D flip-flop. This depends on whether the value is first written and then read (= wire) or first read and then written (= flip-flop).

I'm sure this is very similar to Verilog blocking and non-blocking statements, but I haven't programmed much Verilog, mostly read code. Both languages are very similar: VHDL is strongly typed, Verilog is loosely typed. Verilog has some unique features, but so does VHDL...
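
The delta-delay idea can be sketched in a few lines of Python. This is a rough model under stated assumptions, not how a real simulator is written: the settle function is made up for illustration, the second signal d (d <= NOT a) is a hypothetical addition to the first example above, and every concurrent assignment is re-evaluated on every delta instead of only when its inputs change.

```python
def settle(signals, processes, max_deltas=100):
    """Run delta cycles until no signal changes; return (signals, history)."""
    signals = dict(signals)
    history = [dict(signals)]  # history[n] is the state at t + n delta
    for _ in range(max_deltas):
        # Every assignment reads the values of the current delta...
        scheduled = {target: fn(signals) for target, fn in processes.items()}
        # ...and new values only become visible in the next delta.
        changed = {t: v for t, v in scheduled.items() if signals[t] != v}
        if not changed:
            return signals, history
        signals.update(changed)
        history.append(dict(signals))
    raise RuntimeError("signals did not settle")

processes = {
    "a": lambda s: s["b"] & s["c"],  # a <= b AND c;
    "d": lambda s: 1 - s["a"],       # d <= NOT a;  (hypothetical extra signal)
}

# The design was stable with b = 0; b has just changed to 1.
start = {"a": 0, "b": 1, "c": 1, "d": 1}
final, history = settle(start, processes)
for n, state in enumerate(history):
    print("t +", n, "delta:", state)
```

In this run a picks up its new value one delta after b changes, and d follows one delta after a, while no simulated time passes at all.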

#140 Reply
Posted by mark03 on 25 Jun, 2017 17:55
Quote from: hamster_nz on 22 Jun, 2017 21:27
Quote from: Bruce Abbott on 22 Jun, 2017 19:47
Quote from: sporadic on 22 Jun, 2017 18:58
For all the Python haters, yes.. you can design hardware with Python - http://www.myhdl.org
As if VHDL and Verilog weren't confusing enough, now we have another HDL to learn.

What advantages does MyHDL have over the other two?

All these High Level Synthesis (HLS) HDLs seem to have common threads to address these (and other) problems:

MyHDL is not HLS. It is a bona fide HDL which just so happens to be implemented within the [very flexible] syntax of Python. Everything you have in Verilog and VHDL you get in MyHDL too, and there is very little in MyHDL which does not map 1:1 back into the incumbent languages.

As to why MyHDL and not an incumbent HDL, I think the author would claim to have avoided some of the mistakes that were made in Verilog/VHDL, in the same way that *any* second try usually comes out better, simply because it is informed by experience. He (Jan) is more of a VHDL guy, and that definitely shows in MyHDL, but the verbosity and archaisms many people dislike in VHDL are much reduced in MyHDL.

Another big reason: writing test benches in Python is going to be almost infinitely better than writing them in Verilog/VHDL. You can take advantage of the Python unit-test frameworks, simulate your DSP flow using NumPy/SciPy, make actually useful plots, and so on. I think this aspect alone would tip the scales in MyHDL's favor were it not for...

The biggest reason NOT to use MyHDL: It's not directly supported by FPGA vendors, and never will be. The generated Verilog/VHDL output is fine as long as you are using vanilla HDL, but as soon as you need to work with and simulate a vendor-specific hard block, it becomes a major headache.

#141 Reply
Posted by hamster_nz on 25 Jun, 2017 21:50
Quote from: nctnico on 25 Jun, 2017 10:22
Actually you must forget about the hardware otherwise you'll be writing way too much code.

I don't think that is 100% true - if you forget that you are working in h/w you can drop into writing code that does not map well to H/W.

Thought experiment - A design requires a module that takes a clk signal and four 8-bit numbers (a_in, b_in, c_in, d_in) and sorts them low to high, to generate four outputs (a_out, b_out, c_out and d_out). Design it with a software mindset, and then a H/W mindset.

Software is easy:
Code:
// copy them all across
a_out <= a_in;
b_out <= b_in;
c_out <= c_in;
d_out <= d_in;
// bubble sort them
if(a_out > b_out) swap(a_out, b_out);
if(b_out > c_out) swap(b_out, c_out);
if(c_out > d_out) swap(c_out, d_out);
if(a_out > b_out) swap(a_out, b_out);
if(b_out > c_out) swap(b_out, c_out);
if(a_out > b_out) swap(a_out, b_out);
// Should now be in order (I might have 20% more code / cycles than needed)

The H/W mindset has additional factors
- Latency - can it be done in a single cycle? how many cycles are needed?
- Speed - what will clock fastest? - fastest is most likely three cycles.
- Logic resource used
- Maximizing concurrency
- Can it efficiently scale when the need for five or more inputs inevitably comes along?

So when it comes to "which is the best way" for H/W there are more factors in play, even for as simple a task as ranking four numbers in order.

#142 Reply
Posted by nctnico on 25 Jun, 2017 22:12
Just as in software, things like speed, resources and size only become relevant in corner cases, and optimising for them takes a lot of time and effort. Why should you suddenly optimise all facets of an FPGA design if you have lots of gates and lots of speed?

#143 Reply
Posted by hamster_nz on 25 Jun, 2017 22:56
Quote from: nctnico on 25 Jun, 2017 22:12
Why should you suddenly optimise all facets of an FPGA design if you have lots of gates and lots of speed?
Plenty of reasons, some of which may or may not apply.

- If you didn't have constraints that you need to hit (speed, power, latency, cost, size) then you wouldn't be using FPGAs, and you would do it in S/W.

- If your design is even somewhat well thought out, you know the bits you have to worry about performance-wise before you even start implementing, and you know what is fluff where you don't even have to try.

- Battery life. Making the bulk more efficient is the best way to reduce power demands.

- If you are working on a product, the device will usually be selected well before the design is finished, and all the economics pretty much fixed. If you are in the nice place of using 60% of the resources then you can let the design bloat. If you are using 85% or 90% then bloat might force you to use a bigger part with a compatible footprint.

- Spare resources = can add more features = better product for same price

- The easier the bulk of a design is to place and route, the more flexibility the tools have for placing and routing the toughest parts of the design.

- Changing pipeline depths late in the development process to improve timing is costly (redesign, retest, reintegrate)

- 6.73ns - The common FullHD pixel clock is 148.5MHz. If you are working on a video design you need to hit this and have a wee bit of slack.

- A sharp tool is better than a dull one

#144 Reply
Posted by Cerebus on 25 Jun, 2017 23:31
Quote from: hamster_nz on 25 Jun, 2017 21:50
Thought experiment - A design requires a module that takes a clk signal and four 8-bit numbers (a_in, b_in, c_in, d_in) and sorts them low to high, to generate four outputs (a_out, b_out, c_out and d_out). Design it with a software mindset, and then a H/W mindset.

Oh, I like that. I might give that a crack in Verilog tomorrow and see where I get.

#145 Reply
Posted by NorthGuy on 26 Jun, 2017 00:42
Quote from: hamster_nz on 25 Jun, 2017 21:50
Thought experiment - A design requires a module that takes a clk signal and four 8-bit numbers (a_in, b_in, c_in, d_in) and sorts them low to high, to generate four outputs (a_out, b_out, c_out and d_out). Design it with a software mindset, and then a H/W mindset.

I don't think you gain a lot in terms of efficiency, but I would argue it is easier to design with hardware mindset. You can synthesise your "software mindset" design and see how much resources it uses. You can then compare to what you would get with "hardware mindset".

Assuming Xilinx 7-series 6-input LUTs, you would need:

- 6 modules to do 6 comparisons - 4 LUTs each = 24 LUTs. It'll take 2 layers of combinatory logic. You'll get 6 outputs from this representing the results of the comparisons

- For each 8 bit output - 6 x 2 table which converts 6 outputs from the previous layer into the 2-bit index. The 2-bit index will select which input you want to multiplex to the given output. 2 LUTs each = 8 LUTs. One layer of combinatory logic.

- For each bit of the outputs (32 bits total) a mux which uses 2-bit index from the previous layer to select one of the 4 inputs. 1 LUT each = 32 LUTs. One layer of combinatory logic.

Bottom line:

24 + 8 + 32 = 64 LUTs = 16 slices.

2 + 1 + 1 = 4 layers of combinatory logic roughly 0.7 ns each (including intra-layer routing) = 2.8 ns. I'd expect it would run fine with 4 ns clock period - 250 MHz.
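
The three layers above can be checked for correctness with a small Python model. This is only a behavioural sketch under my own assumptions, not the LUT mapping itself: the function name sort4 is invented, the "6 x 2 table" is modelled by counting how many inputs must come before each one, and ties are broken by input position so that equal values still get distinct indexes.

```python
from itertools import combinations, product

def sort4(inputs):
    """Sort four values using six pairwise comparisons, then four 4:1 muxes."""
    assert len(inputs) == 4
    # Layer 1: the six comparators, one per pair (i, j) with i < j.
    greater = {(i, j): inputs[i] > inputs[j] for i, j in combinations(range(4), 2)}

    # Layer 2: turn the six comparison bits into a 2-bit select per output.
    # rank[i] counts how many inputs must appear before input i; ties are
    # broken by position, so the four ranks are always 0, 1, 2 and 3.
    rank = []
    for i in range(4):
        before = sum(not greater[(j, i)] for j in range(i))
        before += sum(greater[(i, j)] for j in range(i + 1, 4))
        rank.append(before)
    select = [rank.index(k) for k in range(4)]

    # Layer 3: each output is a 4:1 mux driven by its select index.
    return [inputs[select[k]] for k in range(4)]

# Exhaustive check over a reduced value range (0..4 instead of 0..255),
# which still covers every ordering and every pattern of equal values.
all_ok = all(sort4(list(v)) == sorted(v) for v in product(range(5), repeat=4))
print(sort4([200, 3, 77, 3]), all_ok)  # [3, 3, 77, 200] True
```

Note that nothing in the model depends on an earlier swap having happened, which is why all six comparisons can run side by side in hardware.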

#146 Reply
Posted by hamster_nz on 26 Jun, 2017 04:31
Quote from: NorthGuy on 26 Jun, 2017 00:42
Quote from: hamster_nz on 25 Jun, 2017 21:50
Thought experiment - A design requires a module that takes a clk signal and four 8-bit numbers (a_in, b_in, c_in, d_in) and sorts them low to high, to generate four outputs (a_out, b_out, c_out and d_out). Design it with a software mindset, and then a H/W mindset.

I don't think you gain a lot in terms of efficiency, but I would argue it is easier to design with hardware mindset. You can synthesise your "software mindset" design and see how much resources it uses. You can then compare to what you would get with "hardware mindset".

Assuming Xilinx 7-series 6-input LUTs, you would need:

- 6 modules to do 6 comparisons - 4 LUTs each = 24 LUTs. It'll take 2 layers of combinatory logic. You'll get 6 outputs from this representing the results of the comparisons

- For each 8 bit output - 6 x 2 table which converts 6 outputs from the previous layer into the 2-bit index. The 2-bit index will select which input you want to multiplex to the given output. 2 LUTs each = 8 LUTs. One layer of combinatory logic.

- For each bit of the outputs (32 bits total) a mux which uses 2-bit index from the previous layer to select one of the 4 inputs. 1 LUT each = 32 LUTs. One layer of combinatory logic.

Bottom line:

24 + 8 + 32 = 64 LUTs = 16 slices.

2 + 1 + 1 = 4 layers of combinatory logic roughly 0.7 ns each (including intra-layer routing) = 2.8 ns. I'd expect it would run fine with 4 ns clock period - 250 MHz.

Pretty much the same idea I had - get all the comparisons out the way, then select the outputs.

I asked a software friend how they would do it. First reply was to put an "ORDER BY" clause on the SQL query used to get the items.

The second one was along the lines of

Code:
array items = [a_in, b_in, c_in, d_in];
sort(items);
a_out = items[0];
b_out = items[1];
c_out = items[2];
d_out = items[3];

#147 Reply
Posted by AndyC_772 on 26 Jun, 2017 06:23
Quote from: hamster_nz on 26 Jun, 2017 04:31
I asked a software friend how they would do it. First reply was to put an "ORDER BY" clause on the SQL query used to get the items.
That's scary on so many levels

In an FPGA, I'd do it one of two ways depending on the required clock speed and latency.

To do it in a single cycle, I'd make use of VHDL variables, and translate your 'software mindset' example more or less directly.

If that method ended up too slow to meet the required fmax, then it would need to be pipelined. On the first clock, perform three of the compare/swap operations, store the intermediate results in internal registers, and set a flag. Then, on the second, perform the other three compare/swaps, assign the final result to the outputs, and clear the flag again.
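
As a rough behavioural model of that two-clock version, here is a Python sketch. The class name TwoCycleSorter, the busy flag and the register names are all invented for illustration, and the split of the six compare/swaps (three per clock, in the order of the earlier bubble-sort example) is one possible reading of the description, not a tested design.

```python
from itertools import product

def compare_swap(values, i, j):
    """Order values[i] and values[j] so the smaller one ends up at index i."""
    if values[i] > values[j]:
        values[i], values[j] = values[j], values[i]

class TwoCycleSorter:
    """Toy model: three compare/swaps per clock, with a flag marking the phase."""

    def __init__(self):
        self.stage = [0, 0, 0, 0]    # internal registers holding the half-sorted data
        self.busy = False            # the flag: set after the first clock
        self.outputs = [0, 0, 0, 0]

    def clock(self, inputs):
        if not self.busy:
            # First clock: (a,b), (b,c), (c,d) pushes the largest value to the end.
            tmp = list(inputs)
            for i, j in ((0, 1), (1, 2), (2, 3)):
                compare_swap(tmp, i, j)
            self.stage = tmp
            self.busy = True
        else:
            # Second clock: (a,b), (b,c), (a,b) finishes the remaining three.
            tmp = list(self.stage)
            for i, j in ((0, 1), (1, 2), (0, 1)):
                compare_swap(tmp, i, j)
            self.outputs = tmp
            self.busy = False

def sort_in_two_clocks(values):
    sorter = TwoCycleSorter()
    sorter.clock(values)   # inputs are sampled on the first clock
    sorter.clock(values)   # inputs are ignored on the second clock
    return sorter.outputs

pipeline_ok = all(sort_in_two_clocks(list(v)) == sorted(v)
                  for v in product(range(5), repeat=4))
print(sort_in_two_clocks([9, 4, 7, 1]), pipeline_ok)  # [1, 4, 7, 9] True
```

The local list tmp plays the role of the VHDL variables (updated immediately within one clock), while stage and outputs play the role of registered signals.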

#148 Reply
Posted by Someone on 26 Jun, 2017 07:59
Quote from: AndyC_772 on 26 Jun, 2017 06:23
To do it in a single cycle, I'd make use of VHDL variables, and translate your 'software mindset' example more or less directly.

If that method ended up too slow to meet the required fmax, then it would need to be pipelined. On the first clock, perform three of the compare/swap operations, store the intermediate results in internal registers, and set a flag. Then, on the second, perform the other three compare/swaps, assign the final result to the outputs, and clear the flag again.
I'm trying to find the reference but one of the big open source processor/SoC teams were using a strict coding style where the work was all done in functions, and registers were directly inferred as a discrete block with nothing else in it. Very tidy style when you're doing algorithm intensive work.

#149 Reply
Posted by BrianHG on 26 Jun, 2017 08:39
Quote from: AndyC_772 on 26 Jun, 2017 06:23
Quote from: hamster_nz on 26 Jun, 2017 04:31
I asked a software friend how they would do it. First reply was to put an "ORDER BY" clause on the SQL query used to get the items.
That's scary on so many levels

In an FPGA, I'd do it one of two ways depending on the required clock speed and latency.

To do it in a single cycle, I'd make use of VHDL variables, and translate your 'software mindset' example more or less directly.

If that method ended up too slow to meet the required fmax, then it would need to be pipelined. On the first clock, perform three of the compare/swap operations, store the intermediate results in internal registers, and set a flag. Then, on the second, perform the other three compare/swaps, assign the final result to the outputs, and clear the flag again.

To speed this one up with pipelining, here is how I would do it:

(a)
Compare all inputs with each other and generate 4 sets of 2 bit selection flags/words.
--- and --- Store all 4 inputs in D-flipflop registers.

(b) The output of all 4 D-flipflop registers would feed 4 x 4:1 mux selection units, each unit receiving the 2 bit selection flags generating 4 sorted outputs.

At a basic level, (a) can be done with 4x4 'if/else' statements generating the 4 sets of 2 bit selection registers, + 4 temporary storage registers.
(b) can be done with 4 x 'case' or 'if' statements creating the 4 sorted output registers. There are better, more compact, advanced coding methods to achieve the same results; this would just be the simple, in-your-face version.

With Altera FPGAs, doing it this way with 4 inputs of up to 16 bits each, sorted to 4 outputs, your results will be delayed by 2 clocks instead of 1, but this would achieve the best reasonable fmax and you can feed in a new set of 4 numbers on every single clock. To achieve the best fmax with 32 bit numbers, or when sorting more than four 16 bit numbers, you will need a multi-step pipeline that breaks down the magnitude comparison of the numbers. Even the mux selection feeding the sorted result will then need multiple pipelined clock steps, because with the size of Altera's logic cells the FMAX seems to deteriorate badly when some operations squeeze more than a 2x32 bit comparison, or even a mux selection, into one clock.

(NOTE, this is not an example of clean coding, I chose this strategy based on experience with Altera's Quartus knowing that the fitter will synthesize this code for top FMAX, not for the tightest possible gate count, and, I know there are many other methods to achieve the same results.)

I'm sure a hardwired ASIC could do much larger magnitude sorts at full speed in a single clock & the VHDL/Verilog code would be down to the few lines described a few posts above.