Author Topic: Learning FPGAs: wrong approach?  (Read 55293 times)


Offline westfwTopic starter

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Learning FPGAs: wrong approach?
« on: June 15, 2017, 01:02:51 am »
Off in another thread:
Quote
  Do you know if there's any document which would describe the structure of the UDB in details?
(UDBs are the little FPGA-like blocks in a Cypress PSoC microcontroller.  But this question is generic to all FPGAs, CPLDs, and similar devices.)

So I (a software engineer with an EE degree from pre-FPGA times) have looked at various FPGAs at various times, and there always seems to be a hump that I have trouble getting over.  And I'm wondering if that's because of where I start - with the datasheet that describes the internal structure of the device.  Usually they go on about product terms and LUTs and output macrocells and so on.  And I know what each of those is (more or less) and how they work as individual pieces, but I lose the thread when I try to figure out how they might be combined to build larger structures.  (I mean, in principle I can build a UART out of shift registers, and I know how to build a shift register from a PAL, but...)

But that's completely the wrong approach, isn't it?
For the most part, if you're designing with an FPGA or CPLD, you should be designing at a MUCH higher level, with Verilog or VHDL or some schematic-entry tool.  And the tools know about the resources available on a given chip, and will combine them appropriately or tell you to get a bigger chip with more macrocells (or something.)  Yeah, I can probably eliminate the 512-byte FIFO from the chip that only has 320 bytes of embedded RAM, and at some point knowing a bit about the internals can help me optimize my thinking ("there's no penalty for adding extra terms to THAT equation.")  But otherwise ... it's like when you think about designing a SW algorithm, reading up on the multiple internal buses of your microcontroller is not the best starting point...

Am I getting closer?
 

Offline daybyter

  • Frequent Contributor
  • **
  • Posts: 397
  • Country: de
Re: Learning FPGAs: wrong approach?
« Reply #1 on: June 15, 2017, 01:35:46 am »
I would just read a verilog tutorial.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #2 on: June 15, 2017, 01:43:32 am »
To start, both Altera and Xilinx are the top dogs here.  They both can be programmed in VHDL and Verilog.  I personally prefer Verilog since it is a simpler language and you can still do all you want in Altera's Quartus.

Altera's Quartus allows you to enter gates, flipflops, RAM, FIFOs - anything you like - graphically, and wire them together as if it were a digital schematic.  This includes Verilog code you have written, whose inputs and outputs would be represented in Quartus as a block device with ins and outs.  On your schematic in Quartus, you then wire these function blocks to IO pins, select a chip for the project & compile.  Then your chosen FPGA configuration will be created and programmed to do what you laid out.  The compiler will also create a report telling you how much of the FPGA's gates, memory & IO pins you used, and how fast the clock can run.

Now, as for the USART: a simple set of DFFs can shift serial data in to make a serial decoder, but if you want there are pre-made blocks to do this, or even public-domain Verilog/VHDL blocks which already conform to the RS-232 standard, and which you can add anywhere in your Quartus schematic layout which eventually becomes your chip.  Don't worry about the size of what you are doing for these smaller functions like gates, serial decoders and even small dual-port RAMs or FIFOs; I doubt you will fill even a few percent of the smallest FPGA.
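For concreteness, a minimal sketch of that kind of DFF shift register (written in VHDL here, with made-up entity and signal names; start/stop bits, framing and baud timing are left out):
Code: [Select]
-- Hypothetical example: shift one serial bit in per 'sample' pulse.
library ieee;
use ieee.std_logic_1164.all;

entity serial_shift_in is
    port (
        clk    : in  std_logic;
        sample : in  std_logic;                       -- '1' when rx_bit should be captured
        rx_bit : in  std_logic;                       -- serial data in
        data   : out std_logic_vector(7 downto 0)     -- parallel data out
    );
end entity;

architecture rtl of serial_shift_in is
    signal sr : std_logic_vector(7 downto 0) := (others => '0');
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if sample = '1' then
                sr <= rx_bit & sr(7 downto 1);        -- shift right, new bit into MSB
            end if;
        end if;
    end process;
    data <= sr;
end architecture;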

Learning to read the datasheets on these devices helps, so you understand what it means when they say the IC has a total of 256 kbit of RAM plus 50k logic gates, and how many IOs at what speed and voltage.

Also, look on YouTube for Quartus tutorial videos so you can see some examples of a user creating a device.  I can't speak for the quality or complexity of such videos, so search for beginner ones, try watching a few at 2x speed, and if something looks interesting, restart the video at 1x speed.

Best of luck.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #3 on: June 15, 2017, 01:59:38 am »
Oh, one additional thing: Quartus II Web Edition is free and fully functional, and you can play with it without any programmer or IC.

When searching YouTube for Quartus tutorials, look for schematic entry & how to create things like RAM.
Don't worry about Quartus version numbers; it's basically the same thing from version 9 and up, with only minor visual improvements.

https://www.youtube.com/results?search_query=quartus+tutorial

You will find basic schematic entry and how to add Verilog code, as well as plenty of other stuff like simulation.  (Note that simulation has changed slightly across Quartus versions over the years; for that, look at how it's done in the latest Quartus version.)
« Last Edit: June 15, 2017, 02:03:27 am by BrianHG »
 

Offline hans

  • Super Contributor
  • ***
  • Posts: 1640
  • Country: nl
Re: Learning FPGAs: wrong approach?
« Reply #4 on: June 15, 2017, 06:44:22 am »
I wouldn't study the cells of an FPGA too long if you want to get started. The cell is the least common denominator, and they just put a lot of them on one chip so it is flexible to use.

What is most important to remember about a cell:

- Each cell has a LUT. Most entry-level devices have 4-input LUTs. This is what actually encodes the logic you programmed.
- Each cell has one flip-flop, i.e. one bit of high-speed 'memory'. This is rarely used for memory as such, but more for state information.

Then there are a ton of switches to route signals. Most FPGAs also contain an adder (carry) block in each cell, since adders tend to be used so often.
FPGAs also have hard logic these days - functions you could implement with cells, but because they are used so often the vendors have baked them onto the chip. The most common options are hardware multipliers and embedded SRAM blocks. More advanced FPGAs even integrate complete ARM Cortex CPUs or DDR controllers on-chip at fixed locations.
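As an aside, you usually don't instantiate those SRAM blocks by hand: a simple synchronous RAM written in plain HDL is normally recognized by the tools and mapped onto an embedded block. A minimal VHDL sketch (entity and signal names are made up):
Code: [Select]
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: 256 x 8 synchronous RAM, usually inferred as block RAM.
entity small_ram is
    port (
        clk  : in  std_logic;
        we   : in  std_logic;
        addr : in  unsigned(7 downto 0);
        din  : in  std_logic_vector(7 downto 0);
        dout : out std_logic_vector(7 downto 0)
    );
end entity;

architecture rtl of small_ram is
    type ram_t is array (0 to 255) of std_logic_vector(7 downto 0);
    signal ram : ram_t;
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                ram(to_integer(addr)) <= din;
            end if;
            dout <= ram(to_integer(addr));   -- registered read: block-RAM friendly
        end if;
    end process;
end architecture;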

Programming is best done in a high-level language. Although you could do it graphically, I wouldn't recommend it - it gets too tedious after a short while.

I started out with VHDL and am still doing that today. It's much more strongly typed, as opposed to Verilog which is loosely typed. Pick your poison.

In HDL you should describe how signals should behave and change at (clock) events. In VHDL processes you can actually write rather high level code, complete with functions and other abstractions. VHDL is actually a typical programming language in that respect: you can also write non-synthesizable code (useful for test benches for example).

I think one important trick to understand is how statements are synthesized to hardware. The RTL viewer is your friend. Usually if you write more complex statements, more hardware is added. If you try to do more computation in one go, combinational paths will be longer and thus the maximum clock frequency of the design will go down.
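A minimal sketch of that trade-off (a made-up multiply-accumulate example in VHDL): registering the intermediate result splits one long combinational path into two shorter ones, raising the achievable clock frequency at the cost of a cycle of latency.
Code: [Select]
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: compute (a * b) + c.
entity mac_demo is
    port (
        clk    : in  std_logic;
        a, b   : in  unsigned(7 downto 0);
        c      : in  unsigned(15 downto 0);
        result : out unsigned(15 downto 0)
    );
end entity;

architecture rtl of mac_demo is
    signal prod : unsigned(15 downto 0) := (others => '0');
begin
    -- Pipelined version: each clocked stage only has to settle a*b OR prod+c,
    -- not both, so the combinational paths are shorter and fmax is higher.
    process(clk)
    begin
        if rising_edge(clk) then
            prod   <= a * b;       -- stage 1
            result <= prod + c;    -- stage 2 (result arrives one cycle later)
        end if;
    end process;

    -- Doing it all in one go, e.g.  result <= (a * b) + c;  in a single clocked
    -- assignment, creates one long multiply-then-add path that must settle
    -- within a single clock period.
end architecture;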

This is actually no different than MCU design. You'll also look at the assembly. You'll also worry about execution times and sizes.
« Last Edit: June 15, 2017, 08:56:14 am by hans »
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #5 on: June 15, 2017, 07:55:14 am »
I agree that looking in too much detail at the underlying hardware structure of the FPGA isn't useful, beyond the high level parameters that tell you how much logic resource you have available.

For example, if an FPGA is sold with 10,000 logic cells, that means roughly 10,000 bits that can be registered and stored. The look-up tables associated with them mean you can have, to a good first approximation, any logical relationship between one set of bits and another that you like.

FPGAs also include 'hard' logic blocks, like dual-port RAM, multipliers and PLLs. These are useful, and you should definitely learn how to instantiate, configure and use them, but there are other things that are worth learning first and getting to grips with.

Don't use schematic entry to design your logic. Seriously. It's fine for academic purposes as a way to introduce the basic concept of a configurable system, but it's not portable, doesn't scale, takes much longer to do non-trivial designs, and is difficult to maintain. Walk away before you even start, and instead, learn VHDL (my preference) or Verilog.

By far the most important things to get your head around are:

- writing 'code' for an FPGA is not like writing code for a microprocessor. It's NOT a sequence of instructions to be executed one after the other, even if it occasionally looks as though it might be. Each process you create describes a piece of hardware which exists in parallel with all the others you've defined, and is always carrying out its prescribed function independently. Nothing at all inherently happens sequentially. If you want different things to happen on consecutive clock edges, you need to make sure something changes on one edge which can then be read and taken into account on the next edge.

- always, always, always be aware of when signals get updated, and when they are required to be valid with respect to other signals. Get to grips on day one with the concept of a clock domain. After lunch (but still on day one), read up on metastability, how it happens, and how to stop it becoming a problem in your design. Be in no doubt whatsoever that FPGA vendors and their tool chains do NOT solve this problem for you, but they do give you everything you need to solve it yourself.

I make a big deal of this because it's a ridiculously easy way to mess up a design, in a way which is not obvious to look at, and which causes occasional (or perhaps frequent) errors that make a board unreliable. Avoid a world of hurt later on by taking clock relationships seriously right from the very start.

Here's an example to spoil your day. Suppose you have an FPGA design which is a simple square wave generator, and the period of the square wave needs to be programmable.

Generating the wave is easy. You create a counter, clocked from some master reference clock (call it FCLK), which counts up from 0 to some programmed value CLK_PERIOD, and when it matches, you toggle the output and zero the counter.

CLK_PERIOD needs to be set, let's say via an SPI interface. So, you create a simple SPI slave, which can be written by an external microcontroller. That SPI slave is clocked by an external pin (SCK). The new value of CLK_PERIOD is updated when the last active edge of SCK is received.

Now consider what happens just at that moment. Let's suppose CLK_PERIOD is changed from 0x7F to 0x80.

On every edge of FCLK, the value of CLK_PERIOD is being read. That's fine if all the bits in CLK_PERIOD are actually valid at that instant. But what if SCK and FCLK have just the wrong phase relationship, so at the time the counter is being compared with it, some bits have their new values and some have the old value? And what if a bit is just on the point of changing, which makes the comparator metastable?

At best, you get a single cycle with the wrong period, ie. neither the old value nor the new one. At worst, your counter runs off into the weeds and it takes 2^n clocks before the output starts toggling again (where 'n' is the number of bits in your counter).

The logic of your code might be completely fine. Functional simulation will never show a problem. But once every few seconds, minutes or hours, your real hardware will malfunction.

Ways to solve this problem include:

- make FCLK fast enough that SCK can be sampled as though it's an asynchronous signal, ie. don't use it as a clock at all, but instead, look for changes of state in SCK in a process that's driven from FCLK (sketched below).

- use a dual-port RAM as a FIFO. Push setting changes into the FIFO from the SCK side, and read them out in the FCLK domain. The vendor's FIFO logic includes robust features for crossing clock domains. (Typically, this involves converting addresses to and from Gray codes, but you never see this in your own code as it's done for you when you instantiate a FIFO).
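To sketch the first of those fixes (a minimal, hypothetical VHDL fragment; port and signal names are invented): the SPI shift register lives entirely in the FCLK domain, SCK is re-timed through two flip-flops, and its rising edge is detected synchronously, so CLK_PERIOD is only ever updated on an FCLK edge. This assumes FCLK is several times faster than SCK; chip-select and framing are left out.
Code: [Select]
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: sample SCK/MOSI as asynchronous inputs in the FCLK domain.
entity spi_period_capture is
    port (
        fclk       : in  std_logic;                    -- master reference clock
        sck        : in  std_logic;                    -- external SPI clock (not used as a clock here)
        mosi       : in  std_logic;
        clk_period : out unsigned(7 downto 0)          -- safe to compare against the period counter
    );
end entity;

architecture rtl of spi_period_capture is
    signal sck_meta, sck_q1, sck_q2 : std_logic := '0';
    signal shift_reg                : unsigned(7 downto 0) := (others => '0');
    signal bit_cnt                  : unsigned(2 downto 0) := (others => '0');
    signal clk_period_r             : unsigned(7 downto 0) := (others => '0');
begin
    process(fclk)
    begin
        if rising_edge(fclk) then
            -- two-flop synchronizer: sck_meta may go metastable, sck_q1/sck_q2 are clean
            sck_meta <= sck;
            sck_q1   <= sck_meta;
            sck_q2   <= sck_q1;

            -- rising edge of SCK as seen from the FCLK domain
            if sck_q1 = '1' and sck_q2 = '0' then
                shift_reg <= shift_reg(6 downto 0) & mosi;
                if bit_cnt = 7 then                    -- last bit of the byte
                    clk_period_r <= shift_reg(6 downto 0) & mosi;
                    bit_cnt      <= (others => '0');
                else
                    bit_cnt <= bit_cnt + 1;
                end if;
            end if;
        end if;
    end process;

    clk_period <= clk_period_r;                        -- only ever changes on an FCLK edge
end architecture;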
 
The following users thanked this post: hans, marshallh, Yansi, Joeri_VH

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #6 on: June 15, 2017, 09:34:34 am »
There are a few things that are worth thinking about. Or maybe it is just rambling.

FPGAs are not CPUs
CPUs hold their state in RAM and registers, and change this state slowly, at best only a few words at a time. FPGAs can change a lot of state all the time if you let them. If you code for FPGAs as though they are CPUs, then you will miss the point of FPGAs.

A design for an FPGA is static - for the most part you can't add more state information at 'runtime' - you do not have the FPGA equivalent of "malloc()" or "new" to add more logic. Instead try to think of your data flowing through your designs, much like how signals flow through a circuit.

Overthinking and Overdesigning
For the most part, FPGAs are just chips with inputs, outputs and clocks. Your job is to get the output pins to change as required by the inputs and the passage of time (as measured by the clocks). Your design does this by keeping track of information in a hidden, internal state vector, which evolves from cycle to cycle.

The simpler and more concise your description of how this happens, the better the end result will be.

It is easy to overthink the problem, esp for a newbie. Ask yourself often "is this the simplest way to do this?".

As a general rule deeply nested HDL code with lots of IF statements is bad. If you do this, you are missing something about the problem and should see if you can decompose the problem more.

A simple design is easier to debug than a complex one.

Structural or behavioral code
Learn the distinction between structural and behavioral design. Your design is made up of little bits that behave in a specific way, connected together to make your design. Try to keep these aspects of your designs as separate as possible, at least while you are starting out.

Structural is like designing a schematic or PCB - how things are connected. Graphical tools work well at this, and it works well when using IP blocks. But HDL code isn't too good at this - it gets too verbose.

Behavioral is like writing the 'model' that describes how an op-amp or other complex component works. This is somewhat hard to do graphically, but works well in HDL code.

If you try to describe how things are connected and how they behave at the same time, it doesn't work well in either code or in graphical tools. Avoid this ugly middle ground!

Different FPGAs from different vendors
If you ignore the more magical parts of an FPGA (e.g. PLLs), at the bottom of the heap are the FPGA's primitives. LUTs and FlipFlops, ALMs, Slices, whatever - these change between vendors and devices, but they are very simple and pretty generic.

You could build anything you want with enough four-input LUTs and D flip-flops. It might not be the most efficient, but you can do it. Likewise you can also build anything you like with an (impractically) wide enough RAM and a single address register (e.g. a 256x8 bit RAM can act as an 8-bit counter). For the most part, the different FPGA architectures have just decided to put a different stake in the ground along this continuum.

A design that is optimal for one FPGA architecture is usually pretty close to optimal for another - changing vendor or the part is not usually going to make your design much easier.

Loops
If you have written software for a while you have the power of unbounded loops fixed in your head. You need to retrain yourself to achieve your design goal without them.

This is like learning to write non-blocking code, but taking it to the next level. Don't try to be cunning and fight it or work around it - you can't win. :)

Time
Time in FPGA designs is very different from time in software or electronics. Everything happens all at once (in parallel), and yet things happen slowly (you can only do so much in one clock cycle).

Much like when you shift from DC to AC design in electronics, or start working with maths in the frequency domain, or move to writing real-time software, you didn't know how much you didn't know about what time really means until you finally get to grips with it. Once you do understand it, you can't see how it could really be any other way.
« Last Edit: June 15, 2017, 09:36:18 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: soFPG

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #7 on: June 15, 2017, 09:56:31 am »
PLLs

PLLs are bad beasts. They are only present in recent devices and they are usually vendor specific. E.g. Spartan2 (obsolete, but still good as it's 5V tolerant and therefore very useful for designing PCI boards) and Spartan3 don't have PLLs, whereas Spartan6 devices include a few of them, but in order to use them you need to pass through the IP wizards which automatically instantiate resources according to the user's requirements. Nothing wrong with this, it simply means the approach is specific to Xilinx (or Altera, or others...).

I say this because, from my point of view, everything related to HDL at the RTL level is just 'HDL' - I mean a couple of VHDL files plus constraints for a simulator.

I don't play with vendor tools until I have a working set of sources. Instead I spend 70% of the time in the simulator, where a few constraints are different from those required by the final target. This is the first thing one should learn as it's the main rule of the approach, and it also means I can't simulate PLLs (nor does it make sense to) until I move to the synthesizer, where the block is physically defined and properly instantiated.

In the simulator I usually describe the PLL block as a black-box entity, with its behavior idealized by a function (I can write it in C or Matlab and pass it to the simulator through wrappers).

In other words, playing with HDL is a mashup of pure logic behavior and PSpice. This is enough for a preliminary working set; then (the last 30% of your development time) you need to experiment and verify on the physical target whether the timing constraints are really all satisfied.
 

Offline chris_leyson

  • Super Contributor
  • ***
  • Posts: 1541
  • Country: wales
Re: Learning FPGAs: wrong approach?
« Reply #8 on: June 15, 2017, 10:08:26 am »
Quote
writing 'code' for an FPGA is not like writing code for a microprocessor. It's NOT a sequence of instructions to be executed one after the other
Good advice from AndyC_772 - it's an easy trap that new players can fall into. Always clock logic from a single clock source if you can, and never ever use asynchronous logic. Use clock enable inputs to slow down a master clock if your logic needs a slower clock source. Always register or latch input signals to avoid metastability issues, and if you need to cross clock boundaries then instantiate an asynchronous FIFO.
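To illustrate the clock-enable point, a minimal VHDL sketch (made-up names): the "slow" logic still runs from the single master clock and only advances when the enable is high, instead of being fed a separately generated slow clock.
Code: [Select]
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: a 1-in-100 clock enable derived from the master clock.
entity clk_enable_demo is
    port (
        clk      : in  std_logic;                     -- single master clock
        slow_out : out std_logic
    );
end entity;

architecture rtl of clk_enable_demo is
    signal div_cnt : unsigned(7 downto 0) := (others => '0');
    signal ce      : std_logic := '0';                -- clock enable, one clk period wide
    signal toggle  : std_logic := '0';
begin
    -- generate the enable: asserted for one clk cycle every 100 cycles
    process(clk)
    begin
        if rising_edge(clk) then
            if div_cnt = 99 then
                div_cnt <= (others => '0');
                ce      <= '1';
            else
                div_cnt <= div_cnt + 1;
                ce      <= '0';
            end if;
        end if;
    end process;

    -- the "slow" logic is still clocked by the master clock, but only advances when ce = '1'
    process(clk)
    begin
        if rising_edge(clk) then
            if ce = '1' then
                toggle <= not toggle;
            end if;
        end if;
    end process;

    slow_out <= toggle;
end architecture;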
These days you don't need to do 'bare metal' design and you don't need to focus on the internal architecture of a particular FPGA family; it's all done with core generators now, whereas back in the day, with older-generation silicon and limited resources, you might have had to hand-craft logic to save on them.
« Last Edit: June 15, 2017, 10:35:10 am by chris_leyson »
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #9 on: June 15, 2017, 10:11:35 am »
Always remember that you are not writing code, you are building hardware. Think about how you would use logic ICs to do the job.
Even things that look like sequential code, specifically VHDL process blocks, express priority, not sequence.
You don't need to know anything about LUTs, Slices etc. until you're getting into advanced optimisation.
 
Start with a devboard that has a fairly big device. Even for simple designs, place & route will be faster - you can just ignore the stuff you won't be using.
You will be using state machines a lot, so you need to understand them.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #10 on: June 15, 2017, 10:19:41 am »
legacy, it sounds as though you're making life difficult for yourself. Do you really not know which device - or at least, family of devices - a given design will target? And what do you have against PLLs? They're essential tools that have been present in every major device family for the last decade.

My preferred family is Altera Cyclone IV E. Older parts are functionally similar but less good in every quantifiable way. Cyclone V parts are bigger and more costly, and Cyclone 10 isn't yet readily available. I probably could switch to another vendor, but in the absence of a compelling reason to do so, it would be a lot of work for little or no benefit.

With that in mind, I usually start a new design along the following lines...

- how many I/O's do I need? Design the rest of the schematic, then see how many pins end up on the empty FPGA page. Add a few more for test points, and for the feature I'll find I need by the time the design is at rev C.

- create a dummy FPGA project in Quartus, with all the I/O pins defined with their correct direction and I/O standard (LVDS, 3.3V CMOS, 1.8V CMOS and so on).

- allocate the pin-out of the device, ensuring all the rules about which pins can go where are met. For example, on Cyclone IV E, LVDS pins can only go in a 2.5V bank, differential inputs must be at least a certain number of pads away from single-ended outputs, and so on.

- estimate the logic capacity requirement of the design. This is hard. Often I'll actually write a complete first draft of the code at this point, and hold off completing the PCB until it's done. It's amazing how many bugs get spotted and fixed at this stage.

- finish the PCB and send off for manufacturing

- simulate the VHDL in ModelSim. The Altera free version of this includes complete behavioural simulation models for all the hard IP blocks (memory, PLLs, DSP and so on), so I can simulate the entire system without having to worry about these. Yes, it's vendor specific, but so is my board, so I really don't care.

- debug the VHDL. This is probably the most coffee intensive part of the whole process, up to this point.

- write the SDC file. This takes over as the most coffee intensive part of the process.

- start writing test code for the main processor. Keep doing this until real hardware arrives.

- when real hardware arrives, plug it in and begin testing. By this point, I should already know that my FPGA will, to a good first approximation, work as intended. Minor changes and enhancements can be tested functionally on real hardware. More significant changes require a return to ModelSim.
« Last Edit: June 15, 2017, 10:23:44 am by AndyC_772 »
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #11 on: June 15, 2017, 10:29:20 am »
The first hardware you should use when starting out with FPGAs should ALWAYS be a devboard. Any devboard. Even better if it has an on-board programmer. Most manufacturers ( and some third parties) do very cheap boards for most FPGA families.

There are so many other things that can make it not work, that you really don't want to be wasting time messing around trying to figure out if it's a hardware, programming  or code  problem.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #12 on: June 15, 2017, 11:26:41 am »
They're essential tools that have been present in every major device family for the last decade.

Does Spartan2 have pll ? No!
Does Spartan3 have pll ? No!
Does Spartan6 have pll ? Yes!

Is Spartan2 5V tolerant? Yes!
Is Spartan3 5V tolerant? No!
Is Spartan6 5V tolerant? No!

Before babysitting people, understand what people need.

 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #13 on: June 15, 2017, 11:33:10 am »
They're essential tools that have been present in every major device family for the last decade.

Does Spartan2 have pll ? No!
Does Spartan3 have pll ? No!

Spartan 2 & 3 have DLLs, which serve the same purpose.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 
The following users thanked this post: hans, Someone

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #14 on: June 15, 2017, 02:37:47 pm »
I have used a simulator only once and it was a simple 4 bit counter.  I just had to try it...

If my primary state machine has, say, 100 states (which uses a 100-bit 1-HOT state vector) and controls a few dozen outputs that kick off other processes, I just can't see how the simulator is going to help.  It may be several thousand cycles in before I get to the part I want to see.  I may actually be using a logic analyzer at the FPGA level to analyze a hang in the operating system.  Maybe something to do with booting the system from the Compact Flash.  There are a lot of cycles before I get this far.  What sector did I read?  What did the data look like?  No, printf() is not a solution here!

So, I try to use a board with enough IO to feed a fairly wide logic analyzer.  Now I can create some kind of trigger that starts the capture just before the spot I am interested in and not have to wade through a bazillion nanoseconds of trace.

Bottom line:  I head straight to hardware.  This takes time because the system has to be fully synthesized, placed and routed.  It is probably not the most productive way to design FPGA projects but it works for me in my hobby world.

I don't have enough time with Vivado and a real project to know if the in-circuit logic analyzer will be a help.  From what I have seen, it is a really nice feature.  Once you become an uber-guru of the constraints file.  What a PITA!

I have seen designs where the code writes straight to the LUT.  All of the logic is specified around LUTS and DFFs.  I don't tend to understand it...

I write my VHDL as though I want to understand it several years later.  Just simple code, no tricky bits.  I let the toolchain worry about the details.  Again, this is not the high performance approach but it works for me.

 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #15 on: June 15, 2017, 02:41:45 pm »
I'd agree that to get something happening quickly & get a feel for things, avoiding simulation is probably a good start as it's yet another tool to learn before you see anything working.
You may or may not choose to use it later on - personally I've never used one.
You need to trade off the time setting it all up versus the savings in compile/program time not having to do place & route every iteration.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline free_electron

  • Super Contributor
  • ***
  • Posts: 8517
  • Country: us
    • SiliconValleyGarage
Re: Learning FPGAs: wrong approach?
« Reply #16 on: June 15, 2017, 03:03:53 pm »
To start, both Altera and Xilinx are the top dogs here.  They both can be programmed in VHDL and Verilog.
and schematic capture as well, or ABEL or AHDL.

As an FPGA designer you don't deal with the 'guts' of the FPGA; that is handled by the synthesizer and mapper.
Simply make your schematic/code, or a mix thereof, click compile and blast it into the chip. Done.
The tools come with extensive libraries with almost anything you can think of (including whole CPUs and peripherals).
Professional Electron Wrangler.
Any comments, or points of view expressed, are my own and not endorsed , induced or compensated by my employer(s).
 

Offline Rasz

  • Super Contributor
  • ***
  • Posts: 2616
  • Country: 00
    • My random blog.
Re: Learning FPGAs: wrong approach?
« Reply #17 on: June 15, 2017, 03:37:13 pm »
The internal stuff gets important when optimizing your design, when you are a tight-ass with a low budget, or when you need every last MHz out of it.

http://zipcpu.com/blog/2017/06/12/minimizing-luts.html

If you don't care about $ you can "program" FPGAs in Python, or even Go like the JavaScript-fed kids do these days: https://reconfigure.io/ (on a $3K dev board, haha)
Who logs in to gdm? Not I, said the duck.
My fireplace is on fire, but in all the wrong places.
 

Offline julian1

  • Frequent Contributor
  • **
  • Posts: 735
  • Country: au
Re: Learning FPGAs: wrong approach?
« Reply #18 on: June 15, 2017, 10:54:08 pm »
I like the Lattice iCE40 FPGAs. I believe they're the only ones where the "guts" - meaning the LUTs and interconnect as well as the bitstream format - have actually been reverse engineered and documented, by Clifford Wolf. If you want to tinker at that low level, the code is available as an example. I don't believe any of the other vendors - e.g. Xilinx or Altera - document this stuff. In fact the opposite is true, and it's all encumbered by patent protections.

It helps that there's a lightweight and open-source Verilog compiler and place-and-route available - sufficiently mature to synthesize the pico RISC-V core. From memory, I believe that toolchain even beats Lattice's proprietary toolchain in terms of reduced LUT counts and better timing.
 

Online Sal Ammoniac

  • Super Contributor
  • ***
  • Posts: 1672
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #19 on: June 15, 2017, 11:26:07 pm »
I like the Digilent FPGA dev boards. They have a good selection and they're not too costly.

Don't worry about how the FPGA fabric works at first--you really don't need to know those low-level details as a beginner. Later, when you have more experience, you can explore the inner workings of the part.

Simulation is your friend. If something doesn't work in simulation, it's not likely to work on the chip. Learn how to write test benches at the same time you learn how to write Verilog or VHDL code. You'll save lots of time in the long run.
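For example, a bare-bones test bench skeleton looks something like the sketch below (the DUT entity "mydesign" and its ports are hypothetical); the idea is just to generate a clock, wiggle the inputs, and watch the waveforms:
Code: [Select]
library ieee;
use ieee.std_logic_1164.all;

entity tb_mydesign is
end entity;                         -- a test bench has no ports

architecture sim of tb_mydesign is
    signal clk  : std_logic := '0';
    signal rst  : std_logic := '1';
    signal din  : std_logic := '0';
    signal dout : std_logic;
begin
    -- device under test (assumed entity "work.mydesign" with these ports)
    dut: entity work.mydesign
        port map (clk => clk, rst => rst, din => din, dout => dout);

    clk <= not clk after 10 ns;     -- 50 MHz clock, runs forever

    stimulus: process
    begin
        wait for 50 ns;
        rst <= '0';                 -- release reset
        din <= '1';
        wait for 100 ns;
        din <= '0';
        wait for 200 ns;
        assert false report "end of simulation" severity failure;  -- stop the simulator
    end process;
end architecture;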

You probably already have a good grasp of state machines, but if not, bone up on them because you'll be using them a lot when working with FPGAs.

Start with simple projects, like a serial UART or SPI interface, before trying to tackle something like HDMI.

Strive for simplicity. Complex designs are rarely the best designs.
Complexity is the number-one enemy of high-quality code.
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #20 on: June 16, 2017, 12:50:49 am »
Off in another thread:
Quote
  Do you know if there's any document which would describe the structure of the UDB in details?
(UDBs are the little FPGA-like block in a Cypress PSoC microcontroller.  But this question is generic to all FPGAs, CLPDs, and similar devices.)

So I (a software engineer with an EE degree from pre-FPGA times) have looked at various FPGAs at various times, and there always seems to be a hump that I have trouble getting over.   And I'm wondering if that's because of where I start - with the datasheet that describes the internal structure of the device. 

Yah, not the best place, as you've figured out for yourself.

If you're old school I presume you shouldn't have any trouble with the actual logic design for what you want to do. So start there. Start with the familiar and work towards the unfamiliar.

Design your logic with discrete flip-flops, registers, gates, whatever you need - but don't reach for your dog-eared TI 7400 logic guide; just invent your own parts as you need them, because you can have any part you want. With the magic of HDLs and FPGAs you can make, and interconnect, those parts inside the FPGA. Seventeen-bit adder? No problem. Twenty-nine input 'and' gate? No problem. You get the idea.

Next step would be to learn one of the HDLs. My recommendation would be for Verilog, but in 30 seconds there will be 10 fan-boys coming along to tell you that I'm muddle-headed and VHDL is the only true way to the light. (If HDLs were church denominations VHDL would be the Calvinists and Verilog the Pentecostal Baptists; although I suspect some of the VHDL guys would quite like to find a Plymouth Brethren HDL.  :))

Once you've got the beginning of a grip on your chosen HDL, take the 'discrete' design you already made and implement the discrete parts you 'made up' in HDL, interconnect them in HDL, scribble a little HDL test-bed and hit the simulator.

In the process of doing this I think you'll find what I did, that you start thinking in HDL instead of flip-flops, gates, etc and you'll start to be able to do your design work directly in HDL.

The vendor tools for actual FPGAs can be quite a struggle to set up and get running with - not what you want at the 'hello world' stage. I'd recommend that if you're going Verilog that you grab the open source Icarus iverilog simulator and have a play with that and get yourself comfortable with some working results in simulation before you try and get them anywhere near an actual FPGA. If you want to go VHDL I'm sure someone can point you at some tools.

But otherwise ... it's like when you think about designing a SW algorithm, reading up on the multiple internal buses of your microcontroller is not the best starting point...

I'd prefer cache as the programming analogy. Some algorithms are going to suck unless you understand cache coherency, cache occupancy etc. You can always get something that works on any architecture, but getting it working well may mean tuning for the cache implementation on each architecture that you run it on.

Similarly, you can probably get a design in Verilog to work on any FPGA, but you may need to dig into the specific FPGA architecture to get it to work well, or to fit it into a smaller-capacity chip, etc.

One area where it is worth knowing the ins and outs of your particular FPGA is literally the ins and the outs. You can save quite a lot of grief by understanding how to use the I/O cells to your advantage, and how to get the right clock into the right pin and distributed around inside the FPGA the right way.

As a software engineer you're going to have to keep reminding yourself that the HDL you're writing represents wires, gates and registers. It's all parallel and any temptation to fall back onto classic iterative programing habits will bite you in the backside. Every time your instincts tell you to write a for loop you're almost always looking for a Mealy/Moore state machine instead.
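To make that concrete, a minimal sketch (shown in VHDL, with invented names) of the mental translation: a software-style "for i in 0..15: sum += byte[i]" becomes a little state machine that processes one element per clock cycle.
Code: [Select]
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: sum 16 bytes, one per clock, instead of "looping".
entity sum_bytes is
    port (
        clk   : in  std_logic;
        start : in  std_logic;
        byte  : in  unsigned(7 downto 0);            -- byte(index), supplied by a RAM/stream
        index : out unsigned(3 downto 0);            -- which byte we are asking for
        sum   : out unsigned(11 downto 0);
        done  : out std_logic
    );
end entity;

architecture rtl of sum_bytes is
    type state_t is (IDLE, RUNNING, FINISHED);
    signal state : state_t := IDLE;
    signal i     : unsigned(3 downto 0)  := (others => '0');
    signal acc   : unsigned(11 downto 0) := (others => '0');
begin
    process(clk)
    begin
        if rising_edge(clk) then
            case state is
                when IDLE =>
                    if start = '1' then
                        i     <= (others => '0');
                        acc   <= (others => '0');
                        state <= RUNNING;
                    end if;
                when RUNNING =>                      -- one "loop iteration" per clock
                    acc <= acc + byte;
                    if i = 15 then
                        state <= FINISHED;
                    else
                        i <= i + 1;
                    end if;
                when FINISHED =>
                    state <= IDLE;
            end case;
        end if;
    end process;

    index <= i;
    sum   <= acc;
    done  <= '1' when state = FINISHED else '0';
end architecture;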
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 
The following users thanked this post: chickenHeadKnob

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #21 on: June 16, 2017, 08:02:07 am »
if you dont care about $ you can "program" fpgas in python, or even GO like javascript fed kids do these days https://reconfigure.io/ (on $3K dev board haha)

LOL  :-DD :-DD :-DD
 

Offline jprozas

  • Newbie
  • Posts: 3
  • Country: es
Re: Learning FPGAs: wrong approach?
« Reply #22 on: June 16, 2017, 11:14:47 am »
This link is for learning digital design with FPGAs (Verilog) using open tools. (In Spanish.)

https://github.com/Obijuan/open-fpga-verilog-tutorial/wiki

Sent from my Aquaris_A4.5 using Tapatalk
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #23 on: June 16, 2017, 02:27:09 pm »
if you dont care about $ you can "program" fpgas in python ...

You probably think you're joking, but human madness is already past that. Xilinx marketers take python very seriously, and even "scientists" from California University believe that python is the most efficient way to program FPGAs:

https://forums.xilinx.com/t5/Xcell-Daily-Blog/Best-Short-Paper-at-FCCM-2017-gets-30x-from-Python-based-PYNQ/ba-p/765899

https://arxiv.org/pdf/1705.05209.pdf
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #24 on: June 16, 2017, 02:37:52 pm »

You probably already have a good grasp of state machines, but if not, bone up on them because you'll be using them a lot when working with FPGAs.


A state machine is just a C switch statement inside the while(1) loop.  But, just ahead of the switch(), you need to define a default output state for every output signal you create.  Otherwise, you have to define the output state of every signal at every state.

Like this:
Code: [Select]

    process(Reset,Clk) is
    begin
        if Reset = '1' then
            state <= s0;
        elsif rising_edge(clk) then
            state <= NextState;
        end if;
    end process;


    process (state, FullEA, FetchOpnd, F, TAG, IA, CO, OFL, OVFLInd, COtemp, CSET, VSET,
                r_Button0, CCC, CondMet, BOSC_Flag, SavedSign, A_BUS(15), ShiftCount,
                SZ, ZR, DVDS, Result, Ones, OVR,
                CountShifts, ACC, IncludeEXT, EXTN, Rotate, AFR,
                BitCount, XIO_Device, XIO_Function, XIO_Modifier,
                DisplaySwitch,
                ConsoleXIOCmdBusy, ConsoleXIOCmdAck,
                PrinterXIOCmdAck, PrinterXIOCmdBusy,
                ReaderXIOCmdBusy, ReaderXIOCmdAck,
                DiskXIOCmdBusy, DiskXIOCmdAck,
                DiskReady, IAR,
                SingleStep, BreakPointActive, BreakPoint,
                PendingInterrupt, ReturnState_r, StartState) is
    begin
        A_BusCtrl <= A_BUS_NOP;
        ACC_Ctrl <= ACC_NOP;
        ACC_ShiftIn <= '0';
        Add <= '0';
        AFR_Ctrl <= AFR_NOP;
        BitCountCtrl <= BitCount_NOP;
        CI <= '0';
        CIn <= '0';
        CIX <= '0';
        CarryIndCtrl <= CARRY_IND_NOP;

        <and so on...>
       
        case state is
            when s0    => NextState <= s0a; -- use this to IPL
            when s0a  => if DiskReady = '0' then -- wait for disk to go not ready
                                      NextState <= s0b;
                                else
                                      NextState <= s0a;
                                end if;
            when s0b => if DiskReady = '1' and ColdstartHold = '0' then -- wait for disk to go ready and
                                                                                                     -- coldstart code to be copied

            <and so on>
 

There are two processes to create this FSM:  The first just changes the state according to the NextState value on every clock cycle.  In the case of a loop, the state may not actually change.  See the second process...

The second process does all the work and it is not clocked.  It is just a huge collection of combinatorial logic.

Here I defined default outputs for 10 signals (although they aren't shown in the snippet of FSM code).  In the real world, there are 49 of these default outputs and 117 states.

I didn't say anything about the 'sensitivity list' that starts out as
Code: [Select]
process (state, FullEA, FetchOpnd, F, TAG, IA, CO, OFL, OVFLInd, COtemp, CSET, VSET,

This sensitivity list tells the simulator which signals to monitor to decide to actually run the process.  If there are no changes to any signals in the list, the simulator won't evaluate the process.

This list is meaningless to synthesis but the synthesizer will whine if an input signal to the process is undeclared.  But it's just whine and snivel, the output works with or without the list.
« Last Edit: June 16, 2017, 02:44:42 pm by rstofer »
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #25 on: June 16, 2017, 02:50:06 pm »
if you dont care about $ you can "program" fpgas in python ...

You probably think you're joking, but human madness is already past that. Xilinx marketers take python very seriously, and even "scientists" from California University believe that python is the most efficient way to program FPGAs:

https://forums.xilinx.com/t5/Xcell-Daily-Blog/Best-Short-Paper-at-FCCM-2017-gets-30x-from-Python-based-PYNQ/ba-p/765899

https://arxiv.org/pdf/1705.05209.pdf

I think you're jumping to conclusions there. The first URL is talking about using Python (running an a dedicated or soft processor on the FPGA) with pre-packaged bitstreams for the FPGA fabric. So it's just about using Python to interface to things implemented on the FPGA. Just a short way down the page you'll find this quote:

Quote
PYNQ does not currently provide or perform any high-level synthesis or porting of Python applications directly into the FPGA fabric. As a result, a developer still must use create a design using the FPGA fabric. While PYNQ does provide an Overlay framework to support interfacing with the board’s IO, any custom logic must be created and integrated by the developer.

They do indirectly reference a Python to HDL tool, but the thrust of that page (and paper) is not on programming the FPGA fabric in Python.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #26 on: June 16, 2017, 03:10:22 pm »

You probably already have a good grasp of state machines, but if not, bone up on them because you'll be using them a lot when working with FPGAs.


A state machine is just a C switch statement inside the while(1) loop.  But, just ahead of the switch(), you need to define a default output state for every output signal you create.  Otherwise, you have to define the output state of every signal at every state.

Like this:
Code: [Select]

    process(Reset,Clk) is
    begin
        if Reset = '1' then
            state <= s0;
        elseif rising_edge(clk) then
            state <= NextState;
        end if;
    end process;


    process (state, FullEA, FetchOpnd, F, TAG, IA, CO, OFL, OVFLInd, COtemp, CSET, VSET,
                r_Button0, CCC, CondMet, BOSC_Flag, SavedSign, A_BUS(15), ShiftCount,
                SZ, ZR, DVDS, Result, Ones, OVR,
                CountShifts, ACC, IncludeEXT, EXTN, Rotate, AFR,
                BitCount, XIO_Device, XIO_Function, XIO_Modifier,
                DisplaySwitch,
                ConsoleXIOCmdBusy, ConsoleXIOCmdAck,
                PrinterXIOCmdAck, PrinterXIOCmdBusy,
                ReaderXIOCmdBusy, ReaderXIOCmdAck,
                DiskXIOCmdBusy, DiskXIOCmdAck,
                DiskReady, IAR,
                SingleStep, BreakPointActive, BreakPoint,
                PendingInterrupt, ReturnState_r, StartState) is
    begin
        A_BusCtrl <= A_BUS_NOP;
        ACC_Ctrl <= ACC_NOP;
        ACC_ShiftIn <= '0';
        Add <= '0';
        AFR_Ctrl <= AFR_NOP;
        BitCountCtrl <= BitCount_NOP;
        CI <= '0';
        CIn <= '0';
        CIX <= '0';
        CarryIndCtrl <= CARRY_IND_NOP;

        <and so on...>
       
        case state is
            when s0    => NextState <= s0a; -- use this to IPL
            when s0a  => if DiskReady = '0' then -- wait for disk to go not ready
                                      NextState <= s0b;
                                else
                                      NextState <= s0a;
                                end if;
            when s0b => if DiskReady = '1' and ColdstartHold = '0' then -- wait for disk to go ready and
                                                                                                     -- coldstart code to be copied

            <and so on>
 

There are two processes to create this FSM:  The first just changes the state according to the NextState value on every clock cycle.  In the case of a loop, the state may not actually change.  See the second process...

The second process does all the work and it is not clocked.  It is just a huge collection of combinatorial logic.

Here I defined default outputs for 10 signals (although they aren't shown in the snippet of FSM code).  In the real world, there are 49 of these default outputs and 117 states.

I didn't say anything about the 'sensitivity list' that starts out as
Code: [Select]
process (state, FullEA, FetchOpnd, F, TAG, IA, CO, OFL, OVFLInd, COtemp, CSET, VSET,

This sensitivity list tells the simulator which signals to monitor to decide to actually run the process.  If there are no changes to any signals in the list, the simulator won't evaluate the process.

This list is meaningless to synthesis but the synthesizer will whine if an input signal to the process is undeclared.  But it's just whine and snivel, the output works with or without the list.
This is pretty bad coding because it is prone to creating latches. As a rule of thumb you only have 2 signals at most in the sensitivity list of a process: clock and reset. If there are other signals then it smells fishy. The problem is likely better solved using a function instead of a process.
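To show what that latch risk looks like, a minimal, hypothetical VHDL example: an output that isn't assigned on every path through a combinational process has to hold its old value, which infers a latch; a default assignment at the top of the process, as in the quoted code, is one way to close every path.
Code: [Select]
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example of the latch problem being discussed.
entity latch_demo is
    port (
        busy      : in  std_logic;
        req       : in  std_logic;
        grant_bad : out std_logic;   -- version that infers a latch
        grant_ok  : out std_logic    -- version with a default assignment
    );
end entity;

architecture rtl of latch_demo is
begin
    -- Combinational process with a missing else branch: when the condition is
    -- false, grant_bad must "remember" its old value, so a latch is inferred.
    bad: process(busy, req)
    begin
        if busy = '1' and req = '1' then
            grant_bad <= '1';
        end if;
    end process;

    -- Same logic with a default value up front: every path assigns the output,
    -- so only plain combinational logic is inferred.
    good: process(busy, req)
    begin
        grant_ok <= '0';             -- default, overridden below where needed
        if busy = '1' and req = '1' then
            grant_ok <= '1';
        end if;
    end process;
end architecture;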
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #27 on: June 16, 2017, 04:01:13 pm »
They do indirectly reference a Python to HDL tool, but the thrust of that page (and paper) is not on programming the FPGA fabric in Python.

This is all semantics.

"At a system-level the skill set necessary to integrate multiple custom IP hardware cores, interconnects, memory interfaces, and now heterogeneous processing elements
is complex. Rather than drive FPGA development from the hardware up, we consider the impact of leveraging Python to accelerate application development."

This may not look as FPGA programming to you, but it is to them. Certainly, as they say, they're only at the beginning of that road, but they're on that road.

Note that the fabric per-se doesn't even appear in their description of the pre-Python FPGA programming. For them, FPGA programming is merely integration and interconnection of IPs. They see Python as the way forward to replace the process.

 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #28 on: June 16, 2017, 04:15:31 pm »
Xilinx marketers take python very seriously, and even "scientists" from California University believe that python is the most efficient way to program FPGA

So, we have HDL which means Hardware Description Language, and we need to use python?
Does it make sense?

 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #29 on: June 16, 2017, 04:23:16 pm »

This is pretty bad coding because it is prone to creating latches. As a rule of thumb you only have 2 signals at most in the sensitivity list of a process: clock and reset. If there are other signals then it smells fishy. The problem is likely better solved using a function instead of a process.

The point of defining default output values before the case statement is to guarantee that latches are NOT inferred.  In any event, XST complains when latches are inferred.  Just fix the problem and move on.

I guess I don't see the point of replacing a simple case structure with a multitude of functions, although I have seen similar implementations.  In my implementation, I can see by looking at a particular case exactly what outputs I am setting up, and it will only be a small subset of the 49 declared.  When I want to know what happens in each step of the Divide instruction, it is all in one place.  Sure, it takes 7 states but they are all written together, as neighbors, not split over several functions.  I could instead create a function for the Load Accumulator signal, for example, but that would require some kind of logic based on 18 values of the 'state' vector.  Basically, a big OR statement on the 'state' vector.  But that scatters the logic all over the place!

Actually, it wouldn't work well because my accumulator process takes 8 different values of the ACC_Ctrl signal to determine what it should do at each clock.  These need to be mutually exclusive and I can't imagine using an 'if-endif' on 8 discrete signals.

Code: [Select]
process(Reset, Clk, ACC_Ctrl)
begin
    if Reset = '1' then
        ACC <= (others => '0');
    elsif Clk'event and Clk = '1' then
        case ACC_Ctrl is
            when ACC_NOP         => null;
            when ACC_LOAD        => ACC <= A_BUS;
            when ACC_AND         => ACC <= ACC and A_BUS;
            when ACC_OR          => ACC <= ACC or A_BUS;
            when ACC_EOR         => ACC <= ACC xor A_BUS;
            when ACC_SHIFT_LEFT  => ACC <= ACC(14 downto 0) & ACC_ShiftIn;
            when ACC_SHIFT_RIGHT => ACC <= ACC_ShiftIn & ACC(15 downto 1);
            when ACC_XCHG        => ACC <= EXTN;
            when others          => null;
        end case;
    end if;
end process;

But, yes, it is possible to create functions using the 'state' vector as one of the parameters.

There's always another way...

 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #30 on: June 16, 2017, 04:37:20 pm »
Xilinx marketers take python very seriously, and even "scientists" from California University believe that python is the most efficient way to program FPGA

So, we have HDL which means Hardware Description Language, and we need to use python?
Does it make sense?

Not in my world!

One of my favorite quotes (in "A Compiler Generator" McKeeman, Horning & Wortman, 1970, page 11):
Quote

"It is possible by ingenuity and at the expense of clarity..[to do almost anything in any language].  However, the fact that it is possible to push a pea up a mountain with your nose does not mean that this is a sensible way of getting it there.  Each of these techniques of language extension should be used in its proper place."

Christopher Strachey
NATO Summer School in Programming (1969?)
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #31 on: June 16, 2017, 04:43:29 pm »
So, we have HDL which means Hardware Description Language, and we need to use python?
Does it make sense?

It doesn't.

But you cannot explain this to The Python programmer. VHDL programming would look totally bizarre to him, because programming in Python is easy (whatever that means).

Similarly, programming by connecting individual LUTs and FFs would look bizarre to a VHDL programmer (such as yourself). If you can imagine the feeling, you know how The Python programmer would feel about the VHDL.
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #32 on: June 16, 2017, 04:58:40 pm »
So, we have HDL which means Hardware Description Language, and we need to use python?
Does it make sense?
Yes, it makes perfect sense.

“The combining of both Python software and FPGA’s performance potential is a significant step in reaching a broader community of developers, akin to Raspberry Pi and Arduino. This work studied the performance of common image processing pipelines in C/C++, Python, and custom hardware accelerators to better understand the performance and capabilities of a Python + FPGA development environment. The results are highly promising, with the ability to match and exceed performances from C implementations, up to 30x speedup."

This is what we were promised 20 years ago - the ability to accelerate software performance using reconfigurable hardware.  But instead we just got faster general-purpose CPUs.


 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #33 on: June 16, 2017, 06:04:26 pm »
They do indirectly reference a Python to HDL tool, but the thrust of that page (and paper) is not on programming the FPGA fabric in Python.

This is all semantics.

It's got nothing to do with semantics which is "the branch of linguistics and logic concerned with meaning". That phrase would only make sense if we were quibbling over the precise meaning of words, which we weren't.

This may not look as FPGA programming to you, but it is to them.

The article says quite explicitly what they are doing and makes it explicitly clear that that does not include trying to program the FPGA fabric in Python, so I can see no basis for your assertion. It is quite clear that they understand the difference between programming the FPGA fabric and building an application framework around the FPGA in Python. It doesn't fit your narrative of 'Ho, ho, look at them, they think you can program an FPGA in Python'.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #34 on: June 16, 2017, 06:28:23 pm »
It's got nothing to do with semantics which is "the branch of linguistics and logic concerned with meaning". That phrase would only make sense if we were quibbling over the precise meaning of words, which we weren't.

We are. The words are "programming FPGAs". You interpret them as "programming the fabric with VHDL or the like". The other, broader meaning is "building applications with FPGAs".

It doesn't fit your narrative of 'Ho, ho, look at them, they think you can program an FPGA in Python'.

That's not my narrative. My narrative is:

"Look. Python came to FPGAs too. The guys who are deemed to be scientists, but in fact know very little, write papers where they misinterpret their own facts and come to a wrong conclusions about Python efficiency and suitability. This false interpretation is spread and promoted by Xilinx as an established fact. Now more people will believe all this gibberish. So sad."

 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #35 on: June 16, 2017, 06:28:56 pm »

So I (a software engineer with an EE degree from pre-FPGA times) have looked at various FPGAs at various times, and there always seems to be a hump that I have trouble getting over.   And I'm wondering if that's because of where I start - with the datasheet that describes the internal structure of the device.

I was in A very similar boat to you, of a certain vintage EE well before FPGAs.

I must've had a dozen false starts. I (usually) could get as far as running a demo on a dev board and getting a tool chain to work as a script monkey, but beyond that I floundered. There was still a gap between what I wanted to do, and where the tutorials finished.

This was the video that changed my understanding and got me in a position to write my own HDL:



 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #36 on: June 16, 2017, 06:58:50 pm »
It's got nothing to do with semantics which is "the branch of linguistics and logic concerned with meaning". That phrase would only make sense if we were quibbling over the precise meaning of words, which we weren't.

We are. The words are "Programming FPGA". You interpret them as "Programming fabric with VHDL or alike". The other, broader meaning is "Building applications with FPGA".

It doesn't fit your narrative of 'Ho, ho, look at them, they think you can program an FPGA in Python'.

That's not my narrative. My narrative is:

"Look. Python came to FPGAs too. The guys who are deemed to be scientists, but in fact know very little, write papers where they misinterpret their own facts and come to a wrong conclusions about Python efficiency and suitability. This false interpretation is spread and promoted by Xilinx as an established fact. Now more people will believe all this gibberish. So sad."

I think I'll just leave it at: People can follow the link and decide for themselves what the authors said and whether it agrees with my interpretation or yours.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #37 on: June 16, 2017, 07:27:12 pm »
That MachXO2 video is great!  I really like the Lattice Diamond toolchain.
The fact that the board has a ton of IO on pads is a real selling point.
I wonder if I just changed vendors?

Alas, no...  That device doesn't have anywhere near enough BlockRAM and I'm pretty sure it would be short of LUTs for my main project.

OTOH, as a starter board, with LOTS of IO and a compelling price, it seems like an excellent choice!
« Last Edit: June 16, 2017, 07:39:20 pm by rstofer »
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4539
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #38 on: June 16, 2017, 07:45:35 pm »
That MachXO2 video is great!  I really like the Lattice Diamond toolchain.
The fact that the board has a ton of IO on pads is a real selling point.
I wonder if I just changed vendors?

The amazing and really nice thing about it is that you can buy them, as shown in the video (now the MachXO3 series), for only about $25/£19 from Digikey (there are probably other sellers). It has about 6,900 LEs, so it is reasonably powerful for many things.
It even has about 8 programmable LEDs + a few more LEDs and a few tiny DIL switches for messing with.
Most other FPGA kits cost considerably more (there are exceptions, I know). They even have configuration storage onboard and/or within the chip, as necessary. The programmer (USB) is included as well (built into the board), along with any voltage regulators, crystals etc., as needed.
I.e. it is all ready to run, as is.

But at that price, it is practicable to design it into your FPGA-powered projects, without having to worry about soldering big pin-count BGA parts or designing complicated BGA-ready PCBs.

EDIT: You edited your post.
I agree, very big projects (FPGA complexity wise), would need much more powerful chips. Lattice seem to be aiming for the low end.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #39 on: June 16, 2017, 08:59:54 pm »
That MachXO2 video is great!  I really like the Lattice Diamond toolchain.
The fact that the board has a ton of IO on pads is a real selling point.
I wonder if I just changed vendors?

The amazing and really nice thing about it is that you can buy them, as shown in the video (now the MachXO3 series), for only about $25/£19 from Digikey (there are probably other sellers). It has about 6,900 LEs, so it is reasonably powerful for many things.
It even has about 8 programmable LEDs + a few more LEDs and a few tiny DIL switches for messing with.
Most other FPGA kits cost considerably more (there are exceptions, I know). They even have configuration storage onboard and/or within the chip, as necessary. The programmer (USB) is included as well (built into the board), along with any voltage regulators, crystals etc., as needed.
I.e. it is all ready to run, as is.

But at that price, it is practicable to design it into your FPGA-powered projects, without having to worry about soldering big pin-count BGA parts or designing complicated BGA-ready PCBs.

EDIT: You edited your post.
I agree, very big projects (FPGA complexity wise), would need much more powerful chips. Lattice seem to be aiming for the low end.

One thing I believe as a newcomer:  The toolchain is more important than the device.  Assuming, of course, that the device is large enough for the project.  I really liked the presentation on Lattice Diamond.  Looking at the MachXO3, I think I'll order a board just so I can play with the tools.  The toolchain looks a lot like Xilinx ISE with the added Logic Analyzer feature of Vivado.  This really might be the ultimate startup board.

I never underestimate the number of things that have to work right in order to blink LEDs.  The "HelloWorld" 'program' for FPGAs is every bit the equal of getting it running in C.

---

OK, I ordered the board direct from Lattice - they had stock.  But, damn, they don't offer anything like a reasonable shipping rate.  I'll quit whining...  Soon...
 
The following users thanked this post: MK14

Offline Mattjd

  • Regular Contributor
  • *
  • Posts: 230
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #40 on: June 16, 2017, 09:32:50 pm »

You probably already have a good grasp of state machines, but if not, bone up on them because you'll be using them a lot when working with FPGAs.


A state machine is just a C switch statement inside the while(1) loop.  But, just ahead of the switch(), you need to define a default output state for every output signal you create.  Otherwise, you have to define the output state of every signal at every state.

Like this:
Code: [Select]

    process(Reset,Clk) is
    begin
        if Reset = '1' then
            state <= s0;
        elsif rising_edge(clk) then
            state <= NextState;
        end if;
    end process;


    process (state, FullEA, FetchOpnd, F, TAG, IA, CO, OFL, OVFLInd, COtemp, CSET, VSET,
                r_Button0, CCC, CondMet, BOSC_Flag, SavedSign, A_BUS(15), ShiftCount,
                SZ, ZR, DVDS, Result, Ones, OVR,
                CountShifts, ACC, IncludeEXT, EXTN, Rotate, AFR,
                BitCount, XIO_Device, XIO_Function, XIO_Modifier,
                DisplaySwitch,
                ConsoleXIOCmdBusy, ConsoleXIOCmdAck,
                PrinterXIOCmdAck, PrinterXIOCmdBusy,
                ReaderXIOCmdBusy, ReaderXIOCmdAck,
                DiskXIOCmdBusy, DiskXIOCmdAck,
                DiskReady, IAR,
                SingleStep, BreakPointActive, BreakPoint,
                PendingInterrupt, ReturnState_r, StartState) is
    begin
        A_BusCtrl <= A_BUS_NOP;
        ACC_Ctrl <= ACC_NOP;
        ACC_ShiftIn <= '0';
        Add <= '0';
        AFR_Ctrl <= AFR_NOP;
        BitCountCtrl <= BitCount_NOP;
        CI <= '0';
        CIn <= '0';
        CIX <= '0';
        CarryIndCtrl <= CARRY_IND_NOP;

        <and so on...>
       
        case state is
            when s0    => NextState <= s0a; -- use this to IPL
            when s0a  => if DiskReady = '0' then -- wait for disk to go not ready
                                      NextState <= s0b;
                                else
                                      NextState <= s0a;
                                end if;
            when s0b => if DiskReady = '1' and ColdstartHold = '0' then -- wait for disk to go ready and
                                                                                                     -- coldstart code to be copied

            <and so on>
 

There are two processes to create this FSM:  The first just changes the state according to the NextState value on every clock cycle.  In the case of a loop, the state may not actually change.  See the second process...

The second process does all the work and it is not clocked.  It is just a huge collection of combinatorial logic.

Here I defined default outputs for 10 signals (although they aren't shown in the snippet of FSM code).  In the real world, there are 49 of these default outputs and 117 states.

I didn't say anything about the 'sensitivity list' that starts out as
Code: [Select]
process (state, FullEA, FetchOpnd, F, TAG, IA, CO, OFL, OVFLInd, COtemp, CSET, VSET,

This sensitivity list tells the simulator which signals to monitor to decide to actually run the process.  If there are no changes to any signals in the list, the simulator won't evaluate the process.

This list is meaningless to synthesis but the synthesizer will whine if an input signal to the process is undeclared.  But it's just whine and snivel, the output works with or without the list.
This is pretty bad coding because it is prone to creating latches. As a rule of thumb you only have 2 signals at most in the sensitivity list of a process: clock and reset. If there are other signals then it smells fishy. The problem is likely better solved using a function instead of a process.

I'm guessing what Sal wrote is in Python? Explain to me why latches are bad. The overall structure of his code looks similar to what would be found in a state machine written in Verilog. In Verilog, to create a state machine you need a control and a datapath. The datapath has all the logic you would need for the input and output signals. The control defines your states using parameters, followed by three always blocks: 1) for transitioning states on clock edges, 2) for defining the transitions, and 3) for defining the outputs for each state. Here is code I wrote to display "EXTRA CREDIT PLZ" on a Hitachi HD44780, written in Verilog. I know the code is ugly, and I've been working on making my code more legible, but there are striking resemblances between what I wrote and what Sal wrote. When I compile this code, Quartus infers latches, but as far as I know, they're necessary.

The Control
Code: [Select]
module LCD_SM(Clock,Reset,
  Delay45ms,Delay80ns,Delay240ns,Delay_TO,Inst_Cnt32,FinalWrite,
  Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,
  Reset40us,Reset100us,
  CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,
  EN,
  FirstWrite);

input Clock, Reset, Delay45ms, Delay80ns, Delay240ns, Delay_TO;
input [4:0] Inst_Cnt32;
input FinalWrite;

output reg Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us;
output reg CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us;
output reg EN;
output reg FirstWrite;

parameter Pwr_Up = 4'b0000;
parameter Pwr_Up_Delay = 4'b0001;
parameter Off_Pwr_Up_Delay = 4'b0010;
parameter Write_Data = 4'b0011;
parameter Data_Setup_Delay = 4'b0100;
parameter E_Pulse_Hi = 4'b0101;
parameter E_Hi_Time = 4'b0110;
parameter E_Pulse_Lo = 4'b0111;
parameter Proc_Comp_Delay = 4'b1000;
parameter Load_Next_Data = 4'b1001;
parameter End0 = 4'b1010;
parameter End1 = 4'b1011;
parameter End2 = 4'b1100;
parameter End3 = 4'b1101;
parameter End4 = 4'b1110;
parameter End5 = 4'b1111;

reg [3:0] state, next_state;

always@(posedge Clock or posedge Reset)
begin
if(Reset)
state <= Pwr_Up;
else
state <= next_state;
end

always@(state or Delay45ms or Delay80ns or Delay240ns or Delay_TO or FinalWrite) //need to add transition signals to go with state
begin
case(state)

default: next_state <= Pwr_Up;

Pwr_Up: next_state <= Pwr_Up_Delay;

Pwr_Up_Delay: if (Delay45ms)
next_state <= Off_Pwr_Up_Delay;
else
next_state <= Pwr_Up_Delay;

Off_Pwr_Up_Delay: next_state <= Write_Data;

Write_Data: next_state <= Data_Setup_Delay;

Data_Setup_Delay: if(Delay80ns)
next_state <= E_Pulse_Hi;
else
next_state <= Data_Setup_Delay;

E_Pulse_Hi: next_state <= E_Hi_Time;

E_Hi_Time: if(Delay240ns)
next_state <= E_Pulse_Lo;
else
next_state <= E_Hi_Time;

E_Pulse_Lo: next_state <= Proc_Comp_Delay;

Proc_Comp_Delay: if(Delay_TO)
next_state <= Load_Next_Data;
else
next_state <= Proc_Comp_Delay;

Load_Next_Data: if(FinalWrite)
next_state <= End0;
else
next_state <= Write_Data;

End0: next_state <= End1;

End1: next_state <= End2;

End2: next_state <= End3;

End3: next_state <= End4;

End4: next_state <= End5;

End5: next_state <= End5;

endcase
end

always@(state or Inst_Cnt32)
begin
case(state)

default:       {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

Pwr_Up:          {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

Pwr_Up_Delay:     {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000010000000;

Off_Pwr_Up_Delay: {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

Write_Data:       {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

Data_Setup_Delay: {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000100000000;

E_Pulse_Hi:       {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000001000000010;

E_Hi_Time:        {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000001000000010;

Proc_Comp_Delay:
begin
if (Inst_Cnt32 == 0)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000001;
end
else if (Inst_Cnt32 == 1)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000100000;
end
else if (Inst_Cnt32 == 2)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000100;
end
else if (Inst_Cnt32 == 3)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 4)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 5)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 6)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000010000;
end
else if (Inst_Cnt32 == 7)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 8)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 9)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 10)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 11)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 12)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 13)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 14)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 15)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 16)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 17)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 18)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 19)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 20)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 21)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 22)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 23)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 24)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 25)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 26)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 27)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 28)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 29)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 30)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 31)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else if (Inst_Cnt32 == 32)
begin
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000001000;
end
else
{Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;
end

Load_Next_Data: {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b011000000001001000;

End0:       {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

End1:       {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

End2:       {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

End3:       {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

End4:       {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

End5:       {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

endcase
end

endmodule


The Datapath
Code: [Select]
module LCD_Datapath(CE240ns,CE80ns,CE45ms,CE32,
  CE4ms,CE2ms,CE40us,CE100us,
  Clock,
  Delay45ms,Delay80ns,Delay240ns,Inst_Cnt32,Delay_TO,
  Reset45ms,Reset80ns,Reset240ns,ResetPC,
  Reset4ms,Reset2ms,Reset40us,Reset100us,
  FinalWrite,
  FirstWrite);
 
input Clock;
input Reset45ms,Reset80ns,Reset240ns,ResetPC;
input Reset4ms,Reset2ms,Reset40us,Reset100us;
input CE240ns,CE80ns,CE45ms,CE32;
input CE4ms,CE2ms,CE40us,CE100us;
input FirstWrite;

output [4:0] Inst_Cnt32;
output Delay45ms,Delay80ns,Delay240ns,Delay_TO;
output FinalWrite;

wire [4:0] Eighty_ns,TwoForty_ns;
wire [21:0] FortyFive_ms,Four_ms,Two_ms,Forty_us,Hundred_us;
wire Delay4ms,Delay2ms,Delay40us,Delay100us;
wire FirstWrite;

assign Delay_TO = Delay4ms|Delay2ms|Delay40us|Delay100us|FirstWrite;

//module Counter_22bit(Clock,Reset,CE,Counter);
Counter_22bit FortyFiveMilSec(Clock,Reset45ms,CE45ms,FortyFive_ms),
  FourMilSec(Clock,Reset4ms,CE4ms,Four_ms),
  TwoMilSec(Clock,Reset2ms,CE2ms,Two_ms),
  FortyMicSec(Clock,Reset40us,CE40us,Forty_us),
  HundredMicSec(Clock,Reset100us,CE100us,Hundred_us);


//module Counter_5bit(Clock,Reset,CE,Counter);
Counter_5bit EightyNanSec(Clock,Reset80ns,CE80ns,Eighty_ns),
TwoFourtyNanSec(Clock,Reset240ns,CE240ns,TwoForty_ns),
WriteCounter(Clock,ResetPC,CE32,Inst_Cnt32);

//module comparator_standalone(A,B,G,E,L);
comparator_5bit FinalWriteCompar(Inst_Cnt32,22,G,FinalWrite,L),
Eightyns(Eighty_ns,4,g,Delay80ns,L),
TwoFortyns(TwoForty_ns,12,g,Delay240ns,L);
comparator_22bit FortyFivems(FortyFive_ms,2250000,G,Delay45ms,L),
  Fourms(Four_ms,200000,G,Delay4ms,L),
  Twoms(Two_ms,100000,G,Delay2ms,L),
  Fortyus(Forty_us,2000,G,Delay40us,L),
  Hundredus(Hundred_us,5000,G,Delay100us,L);



endmodule
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #41 on: June 16, 2017, 10:09:40 pm »
Latches are bad because you can't control their timing and they could oscillate before settling to a state, or not get into the right state at all. The key word is asynchronous logic. In an FPGA you want to avoid using asynchronous logic unless you really know what you are doing. The logic inside an FPGA does not receive all its inputs simultaneously, and most architectures use (strings of) lookup tables to create a combinatorial output, so you can get a wild variety of signals at the output of a LUT.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #42 on: June 16, 2017, 11:01:25 pm »

Code: [Select]
Pwr_Up:          {Reset45ms,Reset80ns,Reset240ns,ResetPC,Reset4ms,Reset2ms,Reset40us,Reset100us,CE240ns,CE80ns,CE45ms,CE32,CE4ms,CE2ms,CE40us,CE100us,EN,FirstWrite} <= 18'b000000000000000000;

This type of coding eliminates the requirement to define default values for the 18 signals but it sure takes a lot of typing when only one or two signals are changing.  Furthermore, if the 8th bit is set, I have to wander through the signal list and count until I figure out which signal has been set.  It isn't immediately obvious.

I have no idea how to assign default values to individual signals in Verilog.

In some ways, your coding looks a lot like microcode.  So, you could create an array of 18 bit values and alias the bits to signal names (does Verilog have aliases?).  Then you could just index into the array as a function of state.
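
Something like this rough, untested Verilog sketch is what I have in mind - the module and table names are invented, only two of the real control words are filled in, and while Verilog has no true alias, named slices of the control word come close:

Code: [Select]
// Rough sketch of the "microcode" idea (module/table names invented, untested).
module lcd_ucode (
    input  wire [3:0] state,      // same 4-bit state encoding as the FSM
    output wire       EN,
    output wire       FirstWrite
    // ... the other 16 control bits would be sliced out the same way ...
);
    reg [17:0] ucode [0:15];      // one 18-bit control word per state
    reg [17:0] ctrl;
    integer i;

    initial begin
        for (i = 0; i < 16; i = i + 1)
            ucode[i] = 18'b0;                      // default: everything off
        ucode[4'b0001] = 18'b000000000010000000;   // Pwr_Up_Delay
        ucode[4'b0101] = 18'b000000001000000010;   // E_Pulse_Hi
        // ... only the states that differ from the default need an entry ...
    end

    always @*                     // combinational lookup replaces the big case
        ctrl = ucode[state];

    // Verilog has no VHDL-style alias, but named slices serve the purpose.
    assign EN         = ctrl[1];
    assign FirstWrite = ctrl[0];
endmodule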

I like microcode and I have thought about building a CPU using that scheme.  It worked for the IBM 360 and a lot of other machines.  Microcoding brought structure to CPU design.

I have always wanted to write a meta-assembler like was used on the AMD bit slice devices.  Those were fun days!

 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #43 on: June 16, 2017, 11:02:13 pm »
I think it's not that latches are per se bad, it's that in HDL it's very easy to get a latch inferred without you realizing it. If you're looking at a device that has a data sheet that starts 'Transparent D-type latch' you choose it only when it is appropriate, whereas with case statements you have to be careful to make sure that you get what you're asking for.
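
A contrived little Verilog illustration of the trap (signal names invented; the first always block is the accident, the second is what was probably intended):

Code: [Select]
module latch_trap (
    input  wire en,
    input  wire d,
    output reg  q_latched,
    output reg  q_comb
);
    // Incomplete if: q_latched must hold its old value when en == 0,
    // so the tools quietly infer a transparent latch.
    always @* begin
        if (en)
            q_latched = d;
    end

    // Every path assigns q_comb, so this is plain combinational logic.
    always @* begin
        if (en)
            q_comb = d;
        else
            q_comb = 1'b0;
    end
endmodule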
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline Mattjd

  • Regular Contributor
  • *
  • Posts: 230
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #44 on: June 16, 2017, 11:31:19 pm »
For a case statement in Verilog, the default is just "default". I have one in two of my always blocks: a default for state transitions and a default for state outputs. I suppose writing it all out could be a lot of typing; I use a combination of Excel and Sublime Text to do my typing, and it really speeds things up. For debugging purposes I use a combination of testbenches, compilation reports, and the RTL (netlist) viewer. The RTL view is nice because it creates a visual, so I can easily find those output bits there.

For example



That's from the RTL viewer. Say I want to know which bits change from state Pwr_Up to Pwr_Up_Delay: I go into the RTL view, find the State instance and select the net that belongs to the Pwr_Up_Delay output; the entire net highlights and can be easily traced. That's just me.

Btw, as far as I know, one must account for the outputs and transitions for EVERY state, regardless of whether those outputs change or not. The "default" case (for outputs) is simply what the signals are going to be upon start-up. The "default" case (for transitions) is what the initial state of the state machine is. If you forget to include a state because the outputs don't change, you will get an error.
« Last Edit: June 16, 2017, 11:36:54 pm by Mattjd »
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #45 on: June 16, 2017, 11:44:52 pm »
I think it's not that latches are per se bad, it's that in HDL it's very easy to get a latch inferred without you realizing it. If you're looking at a device that has a data sheet that starts 'Transparent D-type latch' you choose it only when it is appropriate, whereas with case statements you have to be careful to make sure that you get what you're asking for.
One way to avoid that is not to use clock-less processes in VHDL (besides being careful with x when a else y).
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Mattjd

  • Regular Contributor
  • *
  • Posts: 230
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #46 on: June 16, 2017, 11:52:05 pm »
I think it's not that latches are per se bad, it's that in HDL it's very easy to get a latch inferred without you realizing it. If you're looking at a device that has a data sheet that starts 'Transparent D-type latch' you choose it only when it is appropriate, whereas with case statements you have to be careful to make sure that you get what you're asking for.

From the compilation report.

Info (10041): Inferred latch for "y[0]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[1]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[2]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[3]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[4]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[5]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[6]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[7]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[8]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[9]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[10]" at Mux_9_bit_32_to_1_behavorial.v(19)
Info (10041): Inferred latch for "y[11]" at Mux_9_bit_32_to_1_behavorial.v(19)
Warning (14026): LATCH primitive "Mux_9_bit_32_to_1_behavorial:MUX_DUT|y[0]" is permanently enabled
Warning (14026): LATCH primitive "Mux_9_bit_32_to_1_behavorial:MUX_DUT|y[1]" is permanently enabled
Warning (14026): LATCH primitive "Mux_9_bit_32_to_1_behavorial:MUX_DUT|y[2]" is permanently enabled
Warning (14026): LATCH primitive "Mux_9_bit_32_to_1_behavorial:MUX_DUT|y[3]" is permanently enabled
Warning (14026): LATCH primitive "Mux_9_bit_32_to_1_behavorial:MUX_DUT|y[4]" is permanently enabled
Warning (14026): LATCH primitive "Mux_9_bit_32_to_1_behavorial:MUX_DUT|y[5]" is permanently enabled
Warning (14026): LATCH primitive "Mux_9_bit_32_to_1_behavorial:MUX_DUT|y[6]" is permanently enabled
Warning (14026): LATCH primitive "Mux_9_bit_32_to_1_behavorial:MUX_DUT|y[8]" is permanently enabled


Now I want those because of how I am using the Mux. I don't know what other software is like, but Quartus by Altera gives a nice detailed report of stuff like that. It also tells you about any optimizations, like the removal of registers because of redundant or otherwise bad logic (basically the value of the register is always the same, so Quartus removes it), along with loads of other stuff.


edit: I'm not arguing about the latches, but that, in my experience, there is a load of information provided upon compilation that is sooo helpful.
« Last Edit: June 16, 2017, 11:58:51 pm by Mattjd »
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #47 on: June 17, 2017, 12:07:29 am »
IMHO Xilinx' tools output so many messages for a reasonably sized project that it all becomes useless noise.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Mattjd

  • Regular Contributor
  • *
  • Posts: 230
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #48 on: June 17, 2017, 12:16:51 am »
To each their own, I suppose. I don't know what "reasonably sized" is, but I have built a 64-bit processor on a DE0 (Cyclone III EP3C16F484C6), using the LCD I spoke of earlier as a peripheral. I even wrote a program for it: the ROM would be read, and it would take keyboard input from PS/2, then send the input to the MUX that controls the output to the LCD. Had maybe 500 messages. I found going through them easy. *Shrugs*

This processor was an academic requirement of course and may very well still be considered small. It overclocked to 66 MHz; the base clock was 50 MHz. It was not pipelined, did not have an FPU, and could not perform multiplication or division. Very basic processor. The clock multiplier was provided through Altera's megafunction IPs. For the RAM, I made two: one using Altera-provided megafunctions, and one described directly in Verilog.

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #49 on: June 17, 2017, 12:39:54 am »
Explain to me why latches are bad.

Latches are bad because they are ambiguous.

Code: [Select]
  signal a: std_logic := '0';
  signal b: std_logic := '0';

-- Setting up the test cases
process(clk)
  begin
    if rising_edge(clk) then
       a <= not a;
       b <= not b;
    end if;
  end process;

-- and now a latch.
process(a,b)
  begin
     if a = '1' then
      latch <= b;
     end if;
  end process;

So, if 'a' changes from 1 to 0, and 'b' also changes from 1 to 0 at the same time, what value ends up in 'latch'?

Ambiguity is evil in digital design.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #50 on: June 17, 2017, 12:54:28 am »

Btw, as far as I know, one must account for the outputs and transitions for EVERY state, regardless of whether those outputs change or not. The "default" case (for outputs) is simply what the signals are going to be upon start-up. The "default" case (for transitions) is what the initial state of the state machine is. If you forget to include a state because the outputs don't change, you will get an error.

Not exactly...

In VHDL you specify default values before the case statement and then you only need to change the value if a particular state needs to do that.

Code: [Select]
process(state, PlotterXIOCmdReq, PlotterXIOCmd, XIOFunction)
begin
    SetIntBusy      <= '0';
    ClearIntBusy    <= '0';
    ClearInterrupts <= '0';
    case state is
        when ACK =>
            PlotterXIOCmdAck_i <= '1';
            if PlotterXIOCmdReq = '1' then
                next_state <= ACK;
            else
                case XIOFunction is
                    when XIO_SenseDevice => ClearInterrupts <= PlotterXIOCmd(0);

<clip>


The signal ClearInterrupts is defined to be '0' just before 'case state'.  It will always have a value of '0' unless overridden as it is in case XIOFunction.  Since the value is defined for all states, it will never infer a latch.

This code makes no sense as it was hacked from a much larger FSM.  Nevertheless, it shows the proper technique for declaring default values for the FSM outputs.  The trick is to add new states during development and then add new outputs while not forgetting to declare the default value.  Nothing works if latches are inferred.
« Last Edit: June 17, 2017, 01:07:52 am by rstofer »
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #51 on: June 17, 2017, 01:31:10 am »
To each their own, I suppose. I don't know what "reasonably sized" is, but I have built a 64-bit processor on a DE0 (Cyclone III EP3C16F484C6), using the LCD I spoke of earlier as a peripheral. I even wrote a program for it: the ROM would be read, and it would take keyboard input from PS/2, then send the input to the MUX that controls the output to the LCD. Had maybe 500 messages. I found going through them easy. *Shrugs*

This processor was an academic requirement of course and may very well still be considered small. It overclocked to 66 MHz; the base clock was 50 MHz. It was not pipelined, did not have an FPU, and could not perform multiplication or division. Very basic processor. The clock multiplier was provided through Altera's megafunction IPs. For the RAM, I made two: one using Altera-provided megafunctions, and one described directly in Verilog.

As a rough cut, your device had 15k logic elements (LUTs?) while the MachXO3 we were talking about earlier is less than half that size at 6900 elements.  My Digilent Nexys2 board has 19,512 logic elements so somewhat larger.

The Digilent board costs a lot more than the Lattice board but it has switches, LEDs, 7-Segment display, PS/2 input and VGA output.  It also has parallel flash and RAM on board.  The Lattice board is a LOT cheaper but you're on your own for peripherals.

Still, if we are talking about a beginner board, the Lattice board will do a lot of things.  It just won't hold my CPU project...

 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #52 on: June 17, 2017, 04:55:53 am »
I think the illustration code in these latest posts has already gone a little too far for what the OP asked, being someone who has never developed on an FPGA before and doesn't yet know what the languages are or how they work.
 

Offline westfwTopic starter

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #53 on: June 17, 2017, 05:07:57 am »
(I'm very pleased with the discussion that is being generated here.  A big "thank you" to everyone who is participating!)


Quote
Next step would be to learn one of the HDLs.

Once you've got the beginning of a grip on your chosen HDL, take the 'discrete' design you already made and implement the discrete parts you 'made up' in HDL, interconnect them in HDL, scribble a little HDL test-bed and hit the simulator.

The vendor tools for actual FPGAs can be quite a struggle to set up and get running with - not what you want at the 'hello world' stage. I'd recommend that if you're going Verilog that you grab the open source Icarus iverilog simulator

Wait!  I can write verilog/VHDL and simulate it without picking some vendor tool?  I was thinking that the only compiler/simulators around were in one vendor tool or another...  How does this work, without having the limitations of a particular chip in mind?   I just write my design files, and "later" move it to some vendor chip, tie in pin definitions and such, and see if it fits?  Very interesting!
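
So, in principle, something as tiny as this (a made-up blinker plus a throwaway testbench) could be pushed through the open-source Icarus iverilog mentioned above, before a chip has even been chosen:

Code: [Select]
`timescale 1ns/1ps

// blink.v - a made-up design under test: toggles 'led' every 10 clocks
module blink (input wire clk, input wire rst, output reg led);
    reg [3:0] count;
    always @(posedge clk) begin
        if (rst) begin
            count <= 4'd0;
            led   <= 1'b0;
        end else if (count == 4'd9) begin
            count <= 4'd0;
            led   <= ~led;
        end else begin
            count <= count + 4'd1;
        end
    end
endmodule

// blink_tb.v - throwaway testbench: clock, reset, waveform dump
module blink_tb;
    reg  clk = 1'b0, rst = 1'b1;
    wire led;

    blink dut (.clk(clk), .rst(rst), .led(led));

    always #5 clk = ~clk;              // 100 MHz-ish clock

    initial begin
        $dumpfile("blink.vcd");        // open the result in GTKWave
        $dumpvars(0, blink_tb);
        #20 rst = 1'b0;
        #2000 $finish;
    end
endmodule

(Something like "iverilog -o blink.vvp blink.v blink_tb.v" followed by "vvp blink.vvp" should run it; pin assignments and the device choice only come into play later.)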
 
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #54 on: June 17, 2017, 05:44:30 am »
Wait!  I can write verilog/VHDL and simulate it without picking some vendor tool?  I was thinking that the only compiler/simulators around were in one vendor tool or another...  How does this work, without having the limitations of a particular chip in mind?   I just write my design files, and "later" move it to some vendor chip, tie in pin definitions and such, and see if it fits?  Very interesting!

Yes - exactly this. The only proviso is that as soon as you use a single vendor-specific doohickey and don't somehow isolate it from the rest of your design, you have lost that freedom (much the same as mixing OS-specific calls into your software). It is very easy to get seduced by things like "Block RAM" macros, IP wizards and/or "megafunctions".

The more portable way is to find out how to infer them (e.g. write code where the tools go "Oh, I know that pattern! I can optimize that pattern into a RAM block!").

For Altera, have a look at http://www.gstitt.ece.ufl.edu/courses/spring10/eel4712/lectures/vhdl/qts_qii51007.pdf

For Xilinx have a look at https://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_1/ug901-vivado-synthesis.pdf

The tools are very picky about how they match the patterns; the closer you get to the vendor's published code, the more likely you are to get the result you really want.
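
For instance, a generic RAM written along these lines (a sketch in the spirit of those two documents; module and parameter names are made up) is the kind of pattern both vendors' tools will normally recognise and map onto block RAM rather than LUTs:

Code: [Select]
// Inferred simple dual-port RAM: synchronous write, registered read.
module infer_ram #(
    parameter DATA_W = 32,
    parameter ADDR_W = 10                     // 2**10 = 1024 words
)(
    input  wire                 clk,
    input  wire                 we,
    input  wire [ADDR_W-1:0]    waddr,
    input  wire [DATA_W-1:0]    wdata,
    input  wire [ADDR_W-1:0]    raddr,
    output reg  [DATA_W-1:0]    rdata
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
        rdata <= mem[raddr];                  // registered read is what makes it block-RAM friendly
    end
endmodule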

There is nothing worse than changing your toolset and realizing that you have used, in 30+ files, a design pattern that doesn't work. This usually shows up as the design not fitting, because it hasn't used hard blocks for RAM and multipliers. You then have to recode and retest everything again. (Yes, this happened on a video design that went from Altera to Xilinx.)

I would even go as far as suggesting that any parts of the design where you do these sorts of things should be isolated out into a sub-directory of vendor specific code.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #55 on: June 17, 2017, 05:54:26 am »
Third-party stand-alone VHDL and Verilog compilers and simulators do exist, though they are not my area & they may cost money.
Whether you choose Altera's Quartus or Xilinx's free tool suite and write your code in Verilog or VHDL, the program you write will be compatible with both tool suites except when you try to use a vendor-specific library.  In fact, your code is even more cross-compatible than a C program written for a PIC vs an ATmega.  Remember, your Verilog/VHDL code describes nothing more than clocked boolean logic, with inputs and outputs.  The FPGA vendor's editor suite just allows you to wire the inputs and outputs of each of your Verilog/VHDL source files to the pins of the FPGA.  There are optimized IO pins in some cases, like dedicated clock inputs, but this is the same for whichever FPGA type you choose.

Now, when I say multiple VHDL/Verilog source files, this means that in one chip you can wire up multiple copies of your code, or multiple different modules wired together or to different IO pins, or anything you can imagine.  For example, in my FPGA-based video scaler, I have these Verilog programs:

DDR3_Ram_sequencer.v  (State machine which drives the RAS/CAS/WE/DQS... and RD_RDY and WR_RDY and DQ_OE)
Ram_8port_priority_bridge.v  (Has 8 read address, 8 write address inputs, sends the next one in the queue to the DDR3 ram controller)
Video_Line_Cache_in.v   (works with the above 2 codes.v for DDR ram 128 bit access, takes an input video stream at 32 bit at input pixel_in clock speed)
Video_Line_Cache_out.v   (works with the above 2 codes.v for DDR ram 128 bit access, sends video out at 32 bit at pixel_out clock speed)
Video_color-space-converter.v  (Works on the 32 bit video pipe, in between the input/output pins and the Video_Line_Cache_xxx.v; has brightness, contrast, saturation & hue controls.)
MCU_pic24_emulator.v     (Uses onchip FPGA ram to run code for onscreen menus and system operations like listen to the Ethernet and front panel, instructs all the other .v modules which have configuration inputs.)
RS232_bidir-fifo_com_port.v
Master_Raster_Sync_Generator.v
Others....v

You may think of each of these .v modules as a new digital IC, and they can be wired just to IOs or to each other internally.
These .v programs (they could be described as modules) will compile in both Altera's and Xilinx's IDE tools, except for two minor inconveniences: setting up the custom PLL, which differs between the two chips, and defining the FPGA's internal dual-port RAM memories, since I want to use dedicated enhanced features.  But this is a lesser problem, since those configured functions are nothing more than another verilog_special_memory.v file personalized to the vendor's chip, which, for example, my MCU_pic24_emulator.v would be wired to.  But this shouldn't be anything you need to worry about at this stage.
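
As a trivial illustration of that wiring idea (made-up modules, nothing to do with the scaler above), a top level is really just instances and nets:

Code: [Select]
// Two copies of the same home-made "IC", chained together and wired to pins.
module shift8 (input wire clk, input wire din, output wire dout);
    reg [7:0] sr = 8'h00;
    always @(posedge clk)
        sr <= {sr[6:0], din};        // shift one bit per clock
    assign dout = sr[7];
endmodule

module top (input wire clk, input wire serial_in, output wire serial_out);
    wire mid;                        // internal net "between the two chips"
    shift8 first  (.clk(clk), .din(serial_in), .dout(mid));
    shift8 second (.clk(clk), .din(mid),       .dout(serial_out));
endmodule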
« Last Edit: June 17, 2017, 06:06:06 am by BrianHG »
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #56 on: June 17, 2017, 08:46:42 am »
simulate it without picking some vendor tool? 

Well, here I use ModelSim, but it's not the version included with Xilinx's tools; it's an external tool. As editor & checker I use Sigasi, another external tool. It's very productive as it has a deep understanding of what you write.

So, I write HDL with Sigasi, I simulate it with ModelSim, then I move to the Vendor's toolchain (Xilinx in my case) for two new purposes

-1- timing constraints and their analysis
-2- synthesis (and optionally, optimization)


 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #57 on: June 17, 2017, 08:59:58 am »
(I'm very pleased with the discussion that is being generated here.  A big "thank you" to everyone who is participating!)


Quote
Next step would be to learn one of the HDLs.

Once you've got the beginning of a grip on your chosen HDL, take the 'discrete' design you already made and implement the discrete parts you 'made up' in HDL, interconnect them in HDL, scribble a little HDL test-bed and hit the simulator.

The vendor tools for actual FPGAs can be quite a struggle to set up and get running with - not what you want at the 'hello world' stage. I'd recommend that if you're going Verilog that you grab the open source Icarus iverilog simulator

Wait!  I can write verilog/VHDL and simulate it without picking some vendor tool?  I was thinking that the only compiler/simulators around were in one vendor tool or another...  How does this work, without having the limitations of a particular chip in mind?   I just write my design files, and "later" move it to some vendor chip, tie in pin definitions and such, and see if it fits?  Very interesting!
There is a free one: GHDL. I use that to simulate VHDL which I later use in a Xilinx FPGA. However, as usual with simulation, you have to be aware that it is only as good as the stimuli you feed into it.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #58 on: June 17, 2017, 09:06:12 am »
p.s.
As I said before, there are some features which are vendor-specific.

e.g. Spartan-6 comes with a useful built-in DDR controller. To use it ... you need to invoke an IP wizard which automatically instantiates it for you, resulting in an interface entity with the implementation hidden in a black box. It's hardware, implemented inside the FPGA as a special block which you can't change; you can only use it the way Xilinx designed it.

Keep in mind, it's vendor-specific, and technology specific: not portable!

In this case, I take the interface entity and I try to idealize its behavior in ModelSim, just to be able to simulate the whole system. In practice the DDR block is not simulated in detail (at RTL level): I assume it works (Xilinx's homework) and that it has been correctly instantiated (my homework), so I just add a large, ideal memory block to ModelSim.

Of course, I then have to verify these two hypotheses.
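
The "ideal memory" stand-in is usually nothing fancier than a behavioural model of this sort (sketched here in Verilog with invented port names and widths; simulation only, never for synthesis):

Code: [Select]
// Simulation-only stand-in for the memory behind the vendor's DDR controller.
module ideal_mem #(
    parameter DATA_W = 128,
    parameter ADDR_W = 20
)(
    input  wire                clk,
    input  wire                wr_en,
    input  wire                rd_en,
    input  wire [ADDR_W-1:0]   addr,
    input  wire [DATA_W-1:0]   wdata,
    output reg  [DATA_W-1:0]   rdata,
    output reg                 rd_valid
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];    // "large and ideal" storage

    always @(posedge clk) begin
        if (wr_en)
            mem[addr] <= wdata;
        if (rd_en)
            rdata <= mem[addr];
        rd_valid <= rd_en;                     // data is valid one cycle after the request
    end
endmodule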
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #59 on: June 17, 2017, 09:12:27 am »
GHDL

-1- it depends on GNAT, which is perpetually full of problems and bugs
-2- it doesn't cover the full VHDL specification, just a subset
-3- too much effort is required, since you need to adapt your source to it
-4- the error messages are silly; you can never understand what is wrong, you have to guess
-5- stimuli are a mess, and very error-prone, as you have to write a lot of test-bench code
-6- all of which, especially points {3, 4, 5}, reduces productivity by five orders of magnitude

Conclusion:
GHDL is good if you don't have money, if you are a student, and if your project is a university homework assignment.

For professional projects done for business (when someone checks how long your job takes, and how complex it can get in a working team), ModelSim is *THE* simulator to go for.
« Last Edit: June 17, 2017, 09:59:01 am by legacy »
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #60 on: June 17, 2017, 09:57:49 am »
If you simulate complete designs then going for Modelsim is a no-brainer but I don't simulate large designs. I only simulate small pieces and for that GHDL is good enough.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #61 on: June 17, 2017, 10:06:09 am »
small pieces and for that GHDL is good enough.

Even for small pieces, GHDL is defective for the above reasons. I spent three years on it; frankly, I wish someone had pointed those points out to me instead of letting me waste my time trying to fix/use it.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #62 on: June 17, 2017, 02:40:06 pm »
Vendor specific...

The video above that demonstrates how to install and use the Lattice toolchain is very good and as good a place to start as any.  However, right out of the gate, the author uses the internal oscillator provided on the chip and this absolutely won't be portable to any other device family.  The good news is that the MachXO3 board itself does have an external 12 MHz oscillator.  What do you want to bet that the PLL used to kick up the speed won't be portable either?

I have decided to use the features provided and worry about porting later.  My hobby projects just aren't complex enough to worry about.  Portability is an illusion!  We can't even get the clock to work without vendor specific gadgets!

One thing I would hate is porting initialized BlockRAM.  I have written external programs that grab the memory contents from some file and write the entire VHDL file.  Just one more task when porting...
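
For what it's worth, both big toolchains will usually accept an inferred memory initialised straight from a hex file - on the Verilog side that is a one-line $readmemh - which avoids regenerating the whole HDL source. A sketch (file name and sizes invented):

Code: [Select]
// Inferred ROM whose initial contents come from a hex file at synthesis time.
module boot_rom #(
    parameter DATA_W = 16,
    parameter ADDR_W = 9                      // 512 words
)(
    input  wire                clk,
    input  wire [ADDR_W-1:0]   addr,
    output reg  [DATA_W-1:0]   data
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    initial
        $readmemh("boot_rom.hex", mem);       // one hex word per line

    always @(posedge clk)
        data <= mem[addr];                    // registered read infers block RAM
endmodule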

Back to the video... It covers:

1) Toolchain installation
2) License management - don't worry, the license is free!
3) Project creation
4) Verilog design entry
5) Testbench creation
6) Simulation
7) Synthesis
8) Pin assignment
9) Device programming
10) Virtual logic analyzer

Of course the coverage depth is quite shallow but it's a short video.  It is enough to get started!  The board itself is cheap enough, it's the shipping that I snivel about!



« Last Edit: June 17, 2017, 04:30:33 pm by rstofer »
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4539
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #63 on: June 17, 2017, 02:52:26 pm »
it's the shipping that I snivel about!

I get free shipping (from the US to the UK) with Digikey, who sell it, as long as the order value is at least £33, which is quite easy to achieve.
Hopefully within the US, it is similar.

But too late for now, as you seemed to say you already bought it from Lattice.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #64 on: June 17, 2017, 04:29:56 pm »
it's the shipping that I snivel about!

I get free shipping (from the US to the UK) with Digikey, who sell it, as long as the order value is at least £33, which is quite easy to achieve.
Hopefully within the US, it is similar.

But too late for now, as you seemed to say you already bought it from Lattice.

I don't get free shipping from Digikey but it is usually Priority Mail and that is very cheap and FAST.  I looked for stock at Mouser and they didn't have any.  I didn't look at Digikey and probably should have as they do have stock.  All my bad...

Digikey is a great supplier.


Late breaking news:  The board has shipped - from Mouser.  The very place I looked for stock.  I must have had a serious bout of 'senior moments' yesterday!
« Last Edit: June 17, 2017, 04:33:56 pm by rstofer »
 
The following users thanked this post: MK14

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #65 on: June 17, 2017, 04:56:31 pm »
Portability is an illusion!  We can't even get the clock to work without vendor specific gadgets!

Yup, sadly the Truth, especially if you use the Digital Clock Manager (DCM) primitive in Xilinx FPGA parts to implement delay locked loops, PLLs, digital frequency synthesizers, digital phase shifters, etc. This point is also relevant during timing-constraints analysis, which is both vendor and device specific, and it's a MUST-BE-DONE if you have to check low-level requirements from your customers.

p.s.
why Lattice? Never used, I am curious.
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4539
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #66 on: June 17, 2017, 05:03:31 pm »
I don't get free shipping from Digikey but it is usually Priority Mail and that is very cheap and FAST.  I looked for stock at Mouser and they didn't have any.  I didn't look at Digikey and probably should have as they do have stock.  All my bad...

Digikey is a great supplier.


Late breaking news:  The board has shipped - from Mouser.  The very place I looked for stock.  I must have had a serious bout of 'senior moments' yesterday!

Don't worry, similar/same things would wind me up. I find the Amazon system of almost constantly fluctuating prices, on many things, annoying.

Sometimes I buy something, and while it is being shipped, the price drops, and I find that annoying. But I'm kind of philosophical about it, and accept I will gain sometimes, and lose other times.

I know in theory, some people claim you can hassle Amazon customer services, and get the price dropped, on your order. Because the price dropped just after you ordered it. But I don't want to bother them and/or waste their and my time, over what is usually quite small amounts of money.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #67 on: June 17, 2017, 05:35:03 pm »

p.s.
why Lattice? Never used, I am curious.


You're right, why Lattice?  Beats me...  I have a large assortment of Digilent-Xilinx boards and I certainly don't need a low end board.  So, why am I interested?

Well, I watched the video.  I REALLY like the toolchain.  The licensing scheme is pretty painless and not nearly as obtuse as Xilinx's.  I like the way pins are configured with a spreadsheet.  I like the touch and feel of Diamond as it is similar to Xilinx's ISE (sort of).  In any event, the startup curve is a lot flatter than Vivado's (does anybody really understand the .XDC file?).  I like the Just In Time syntax analysis - save the file and syntax analysis is automatic and FAST (at least for small projects).

I can see the value in a $25 startup board; I don't personally have any use for it, but I'm sure something will come up.  I like the high pin count on the headers; I don't really like Digilent's PMOD connectors, as there simply aren't enough pins.  I do understand that the board has no peripheral gadgets except a bank of LEDs.  If I need SRAM, I'm on my own!

For the newcomers, this setup is all they really need to start creating logic.  The regrettable lack of switches and buttons is something of a bother but I imagine they can figure out something.  If they can't, well, maybe golf is a better hobby.

In the back of my mind, I am thinking about Caxton C Foster's 'minicomputer' - BLUE.  I have been thinking about this trivial 16 bit CPU for about 40 years.  As a CPU, it implements only the most trivial operations but it's a good first project.  In my case, it is just something I want to play with.  One thing it needs is a lot of IO for the switches and LEDs.  IO Expanders are one option but for the MachXO3 board, there is no need.  There are plenty of pins.  Maybe I'll finally get around to implementing it.  Al Williams https://www.awce.com/ did a vastly expanded version a few years ago but I don't see it around on his site.

ETA:  The BLUE project is available on OpenCores http://opencores.org/project,blue

Why do all this?  Well, I hope my grandson gets into EE or CS as a major.  It might be useful to have a trivial computer around just to discuss elementary architecture and 'the way it used to be'.  I also suspect that Vivado will sink a newcomer.  Just guessing...

« Last Edit: June 17, 2017, 05:48:27 pm by rstofer »
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #68 on: June 17, 2017, 05:40:34 pm »
But I don't want to bother them and/or waste their and my time, over what is usually quite small amounts of money.

In the bigger scheme of things, the amount is trivial.  If I didn't want to pay it, I wouldn't have bought it.  Money is not one of my larger problems.  Old age is a much larger concern.
 
The following users thanked this post: MK14

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #69 on: June 17, 2017, 06:09:09 pm »
I like the way pins are configured with a spreadsheet.  I like the touch and feel of Diamond as it is similar to Xilinx's ISE (sort of).  In any event, the startup curve is a lot flatter than Vivado's (does anybody really understand the .XDC file?).

You can do this in Vivado too. Open "Elaborated Design" and it has a similar pin table. Once you select pins, it'll create an XDC file with the definitions (or update an existing one). You'll have to re-run synthesis though :(

I like the Just In Time syntax analysis - save the file and syntax analysis is automatic and FAST (at least for small projects).

Vivado does continuous syntax check for VHDL files. If something is wrong it draws a red squiggle and you can hover over it to see the error message. It is fast enough for me. Very handy when the synthesis is so slow.

In the Lattice video, the synthesis is rather fast, but I couldn't figure out if it was normal speed or fast forward.

 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #70 on: June 17, 2017, 06:12:19 pm »
I like the way pins are configured with a spreadsheet.  I like the touch and feel of Diamond as it is similar to Xilinx's ISE (sort of).  In any event, the startup curve is a lot flatter than Vivado's (does anybody really understand the .XDC file?).

You can do this in Vivado too. Open "Elaborated Design" and it has a similar pin table. Once you select pins, it'll create an XDC file with the definitions (or update an existing one). You'll have to re-run synthesis though :(

I like the Just In Time syntax analysis - save the file and syntax analysis is automatic and FAST (at least for small projects).

Vivado does continuous syntax check for VHDL files. If something is wrong it draws a red squiggle and you can hover over it to see the error message. It is fast enough for me. Very handy when the synthesis is so slow.

In the Lattice video, the synthesis is rather fast, but I couldn't figure out if it was normal speed or fast forward.
Diamond will synthesise &  place & route a simple design in about 10 seconds.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #71 on: June 17, 2017, 06:18:56 pm »
If you have the possibility ( = if your boss/customers pay it ), switch to Sigasi. It's the Eclipse-like for HDL :D
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #72 on: June 17, 2017, 06:50:49 pm »
If you have the possibility ( = if your boss/customers pay it ), switch to Sigasi. It's the Eclipse-like for HDL :D
Sigasi is nice but what I don't like is the time limited node locked license. For me such software is a no-go. What if they go out of business, or my PC breaks just when I need to finish a project and I can't afford to wait until they change the license to a new PC? It would be great if Sigasi offered a perpetual license and someone cracked it so it is no longer node locked. I'd buy it in a heartbeat.

A reasonable alternative is the (open source) Eclipse plugin called Veditor. It can do much less than Sigasi but combined with Eclipse it is lightyears better than the editor in Xilinx ISE.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #73 on: June 17, 2017, 06:53:41 pm »
In the Lattice video, the synthesis is rather fast, but I couldn't figure out if it was normal speed or fast forward.

For the simple counter LEDs, synthesis is about 1/2 second and building both the bitstream and JEDEC file, from 'Rerun All', takes about 15 seconds.  Pretty impressive!
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #74 on: June 17, 2017, 06:55:42 pm »
If you have the possibility ( = if your boss/customers pay it ), switch to Sigasi. It's the Eclipse-like for HDL :D

I'll check with Social Security and see what they have to say (not!).
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #75 on: June 17, 2017, 07:09:05 pm »
Late breaking news:  The board has shipped - from Mouser.  The very place I looked for stock.  I must have had a serious bout of 'senior moments' yesterday!

Rather than a 'senior moment', it's more likely that Lattice have some reserved fulfilment stock at Mouser that doesn't show up as stock available for sale.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #76 on: June 17, 2017, 07:19:11 pm »
p.s.
why Lattice? Never used, I am curious.

For the equivalent sized parts to those offered by Xilinx or Altera I don't think that Lattice necessarily offers parts with any particular advantages. Where I think they have a winner is in the ICE40 range where there are a number of FPGAs in the £3-5 bracket (one off prices) with 1k to 8k LEs/cells/pick-your-own-terminology available in prototyping friendly QFP and QFN packages.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #77 on: June 17, 2017, 07:53:42 pm »
To the recent comment re: Vivado and its capability, yes, it really will do everything.  And, in many cases, there are multiple ways to get things done.  But, if you were a brand new EE student, would you want to use Vivado for your very first project?

Part of my problem with Vivado is that I am used to ISE.  I have been using ISE for 13 years or so and I still use it for Spartan 3 projects.  I haven't spent enough time with Vivado to get comfortable.  Lattice Diamond doesn't do everything that Vivado does and, in my view, Diamond is an easier way to start.  Or maybe I just like it because it is closer to ISE.

But, yes, Vivado is a tremendous upgrade from ISE.

 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #78 on: June 17, 2017, 09:16:30 pm »
p.s.
why Lattice? Never used, I am curious.

For the equivalent sized parts to those offered by Xilinx or Altera I don't think that Lattice necessarily offers parts with any particular advantages. Where I think they have a winner is in the ICE40 range where there are a number of FPGAs in the £3-5 bracket (one off prices) with 1k to 8k LEs/cells/pick-your-own-terminology available in prototyping friendly QFP and QFN packages.
Not familiar with ICE40 but an advantage of the XO2 family is onboard flash, plus  core voltage regulator, and even an internal oscillator, so they are very useable on 2-layer PCBs with no additional support parts - just a 3.3v supply, a JTAG header and off you go.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #79 on: June 17, 2017, 09:58:30 pm »
p.s.
why Lattice? Never used, I am curious.

For the equivalent sized parts to those offered by Xilinx or Altera I don't think that Lattice necessarily offers parts with any particular advantages. Where I think they have a winner is in the ICE40 range where there are a number of FPGAs in the £3-5 bracket (one off prices) with 1k to 8k LEs/cells/pick-your-own-terminology available in prototyping friendly QFP and QFN packages.
Not familiar with ICE40 but an advantage of the XO2 family is onboard flash, plus  core voltage regulator, and even an internal oscillator, so they are very useable on 2-layer PCBs with no additional support parts - just a 3.3v supply, a JTAG header and off you go.

Some, but not all, of the ICE40 range have those features with the exception of an on-board core voltage regulator - they need a nominal 1.2V plus whatever your I/O standard requires. Lattice have always been good at integrating features that get you closer to the ideal of 'just needs a supply and a programming header'. Anybody else remember their in system programmable PALs, when everybody else's PALs needed dedicated out of circuit, high voltage programming?
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline Bassman59

  • Super Contributor
  • ***
  • Posts: 2501
  • Country: us
  • Yes, I do this for a living
Re: Learning FPGAs: wrong approach?
« Reply #80 on: June 20, 2017, 10:47:22 pm »
If you have the possibility ( = if your boss/customers pay it ), switch to Sigasi. It's the Eclipse-like for HDL :D

What does Sigasi cost these days? Their web site has the usual "contact me with pricing information" form, which usually indicates an expensive product. I seem to remember that it was $80 a month, but that was a few years ago.
 

Offline jefflieu

  • Contributor
  • Posts: 43
  • Country: au
Re: Learning FPGAs: wrong approach?
« Reply #81 on: June 20, 2017, 10:57:18 pm »
Would learning FPGAs by writing peripherals for a NIOS system be interesting to you?
I learned quite a lot during an internship when I had to modify a peripheral of an existing Microblaze system to extend its functionality.
Everything else had been set up: timing constraints, pin configuration ... etc ... etc.
I only needed to work out how the bus worked and write simple code to let the bus read and write registers. Clear registers on read ... etc
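
To give a flavour, here is a cut-down sketch of that kind of register peripheral in VHDL. The bus signal names (bus_addr, bus_rd, bus_wr, ...) are invented for illustration - they are not the real Avalon or AXI names - and the register map is made up.

Code: [Select]
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity simple_regs is
  port (
    clk      : in  std_logic;
    bus_addr : in  unsigned(1 downto 0);
    bus_rd   : in  std_logic;                      -- read strobe from the bus
    bus_wr   : in  std_logic;                      -- write strobe from the bus
    bus_din  : in  std_logic_vector(31 downto 0);
    bus_dout : out std_logic_vector(31 downto 0);
    event_in : in  std_logic                       -- hardware event that sets a status bit
  );
end entity;

architecture rtl of simple_regs is
  signal control : std_logic_vector(31 downto 0) := (others => '0');
  signal status  : std_logic_vector(31 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if event_in = '1' then
        status(0) <= '1';                          -- latch the event
      end if;
      if bus_wr = '1' and bus_addr = 0 then
        control <= bus_din;                        -- writable control register
      end if;
      if bus_rd = '1' then
        case to_integer(bus_addr) is
          when 0      => bus_dout <= control;
          when 1      => bus_dout <= status;
                         status   <= (others => '0');   -- clear on read
          when others => bus_dout <= (others => '0');
        end case;
      end if;
    end if;
  end process;
end architecture;

(If the event and the read land on the same clock, the clear wins here - working out that kind of corner case is exactly what you learn by poking at a running system.)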

I have a project here and always need new peripherals then verification on different boards.
www.github.com/jefflieu/recon
If you've been doing software then most of the stuff should be familiar to you.

Cheers,
Jeff


i love Melbourne
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #82 on: June 21, 2017, 07:42:34 am »
p.s.
why Lattice? Never used, I am curious.

For the equivalent sized parts to those offered by Xilinx or Altera I don't think that Lattice necessarily offers parts with any particular advantages. Where I think they have a winner is in the ICE40 range where there are a number of FPGAs in the £3-5 bracket (one off prices) with 1k to 8k LEs/cells/pick-your-own-terminology available in prototyping friendly QFP and QFN packages.
Not familiar with ICE40 but an advantage of the XO2 family is onboard flash, plus  core voltage regulator, and even an internal oscillator, so they are very useable on 2-layer PCBs with no additional support parts - just a 3.3v supply, a JTAG header and off you go.

Some, but not all, of the ICE40 range have those features with the exception of an on-board core voltage regulator - they need a nominal 1.2V plus whatever your I/O standard requires. Lattice have always been good at integrating features that get you closer to the ideal of 'just needs a supply and a programming header'. Anybody else remember their in system programmable PALs, when everybody else's PALs needed dedicated out of circuit, high voltage programming?
I thought ICE40 had  OTP memory, or are there now some flash versions?
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #83 on: June 21, 2017, 08:02:54 am »
Would learning FPGA by writing peripherals for NIOS system be an interesting to you?
I did learn quite a lot when I was doing intern and I had to modify a peripheral of an existing Microblaze sytem to extend its functionality.
Everything else had been setup, timing constraints, pins configuration ... etc ... etc.
I only needed to work out how the bus worked and wrote simple codes to let the bus read registers and write registers. Clear registers on read ... etc

I have a project here and always need new peripherals then verification on different boards.
www.github.com/jefflieu/recon
If you've been doing software then most of the stuff should be familiar to you.

Cheers,
Jeff
Out of interest, what's the compile/run/debug cycle time doing that? Does including the NIOS stuff add a lot ?
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline jefflieu

  • Contributor
  • Posts: 43
  • Country: au
Re: Learning FPGAs: wrong approach?
« Reply #84 on: June 21, 2017, 08:48:30 am »
Would learning FPGA by writing peripherals for NIOS system be an interesting to you?
I did learn quite a lot when I was doing intern and I had to modify a peripheral of an existing Microblaze sytem to extend its functionality.
Everything else had been setup, timing constraints, pins configuration ... etc ... etc.
I only needed to work out how the bus worked and wrote simple codes to let the bus read registers and write registers. Clear registers on read ... etc

I have a project here and always need new peripherals then verification on different boards.
www.github.com/jefflieu/recon
If you've been doing software then most of the stuff should be familiar to you.

Cheers,
Jeff
Out of interest, what's the compile/run/debug cycle time doing that? Does including the NIOS stuff add a lot ?
Can you please be more specific, doing what? (this could be off topic though)
When you say "add a lot": if you mean resources, then the NIOS stuff costs about 1000 to 1500 LUTs + flops for a simple CPU core and the Avalon bus.
If you mean a lot of effort, then yeah, it takes some effort to set up the hardware and software correctly, but it's not so bad.
I think once the NIOS system is set up, the FPGA can be learnt by adding/creating new peripherals; especially if you're familiar with software, it'll be more interesting.
The coding for CPU peripherals is mostly RTL design, I'd say.
i love Melbourne
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #85 on: June 21, 2017, 09:45:53 am »
Would learning FPGA by writing peripherals for NIOS system be an interesting to you?
I did learn quite a lot when I was doing intern and I had to modify a peripheral of an existing Microblaze sytem to extend its functionality.
Everything else had been setup, timing constraints, pins configuration ... etc ... etc.
I only needed to work out how the bus worked and wrote simple codes to let the bus read registers and write registers. Clear registers on read ... etc

I have a project here and always need new peripherals then verification on different boards.
www.github.com/jefflieu/recon
If you've been doing software then most of the stuff should be familiar to you.

Cheers,
Jeff
Out of interest, what's the compile/run/debug cycle time doing that? Does including the NIOS stuff add a lot ?
Can you please be more specific, doing what? (this could be off topic though)
When you say add a lot, if you mean resources then NIOS stuff costs about 1000 to 1500 LUs + Flops, simple CPU core and Avalon bus
If you mean add a lot of effort, then yeah, it takes some effort to setup hardware and software correctly, but not so bad.
I think if the NIOS system is setup, FPGA can be learnt by adding/creating new peripherals, especially if you're familiar with software, it'll be more interesting.
The coding for CPU peripherals is mostly RTL design I'd say.
No I mean comparing developing a standalone function versus hanging something off a NIOS processor, what is the time penalty of the synthesize/place & route time doing the latter?
I've not used Altera, but IME with ISE and Diamond, for small designs, compile cycles of low tens of seconds are typical, and tolerable for a write/compile/debug/repeat workflow.
 If adding a processor makes this a lot longer, any benefit of using a processor to simplify testing may be outweighed by the extended debug cycle times. 
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #86 on: June 21, 2017, 10:23:57 am »
What does Sigasi cost these days?

I don't know, nor do I want to know :D

What is the benefit of being employed (even if not permanently - say on a one year contract)? When you work freelance you might be expected to look after your own tools (software, laptop, DSO, LA, RLC-meter, etc.), which means you have to phone vendors and waste your time with their marketing office; sometimes you also have to phone your bank asking for funds to buy them, since your customer will only refund you once the job is done. I mean money advanced up front, refunded with a margin.

As an employee there is always a wonderful secretary (yes, I have a secretary now, and two tulip plants in my office) who does that job for you, and a person on the staff who pays for your tools.

Awesome!!! So, who cares? Now, I am more interested in productivity, since more productivity means more plants in my office, maybe a bigger office with two secretaries and an aquarium with tropical fishes  :D :D :D
 

Offline jefflieu

  • Contributor
  • Posts: 43
  • Country: au
Re: Learning FPGAs: wrong approach?
« Reply #87 on: June 21, 2017, 11:47:39 am »
No I mean comparing developing a standalone function versus hanging something off a NIOS processor, what is the time penalty of the synthesize/place & route time doing the latter?
I've not used Altera, but IME with ISE and Diamond, for small designs, compile cycles of low tens of seconds are typical, and tolerable for a write/compile/debug/repeat workflow.
 If adding a processor makes this a lot longer, any benefit of using a processor to simplify testing may be outweighed by the extended debug cycle times.
Compilation time is about 4 minutes for a design of 3K LUTs. I wouldn't say there's any penalty for using NIOS; it depends on what you want to do. Generally, compilation time is related to the size of the design and the chip. An embedded processor lets you do certain stuff quickly once the hardware is done. Interesting systems often consist of a processor running the control stuff and FPGA fabric implementing the custom stuff.
i love Melbourne
 

Offline JPortici

  • Super Contributor
  • ***
  • Posts: 3461
  • Country: it
Re: Learning FPGAs: wrong approach?
« Reply #88 on: June 21, 2017, 12:29:31 pm »
Awesome!!! So, who cares? Now, I am more interested in productivity since more productivity means more plants in my office, may be a bigger office with two secretaries and an aquarium with tropical fishes  :D :D :D

[OT]
And Fantozzi's rise in ranks scene comes to mind :D
[/OT]
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #89 on: June 21, 2017, 02:36:58 pm »
I thought ICE40 had  OTP memory, or are there now some flash versions?

Sorry, in an effort at writing economy I stuffed that up. Some have OTP, some don't, all can work with external SPI flash, all can be configured over SPI by an MPU. Some have an on-board oscillator, some don't.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline sporadic

  • Regular Contributor
  • *
  • Posts: 72
  • Country: us
    • forkineye.com
Re: Learning FPGAs: wrong approach?
« Reply #90 on: June 22, 2017, 06:58:37 pm »
For all the Python haters, yes.. you can design hardware with Python - http://www.myhdl.org :)
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #91 on: June 22, 2017, 07:46:33 pm »
No I mean comparing developing a standalone function versus hanging something off a NIOS processor, what is the time penalty of the synthesize/place & route time doing the latter?
I've not used Altera, but IME with ISE and Diamond, for small designs, compile cycles of low tens of seconds are typical, and tolerable for a write/compile/debug/repeat workflow.
 If adding a processor makes this a lot longer, any benefit of using a processor to simplify testing may be outweighed by the extended debug cycle times.
You can always simulate. I like that better for complex designs because it allows you to see ANY signal in detail and figure out what is wrong (very similar to stepping through a piece of C code with a debugger).
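
For anyone who hasn't tried it: a bare-bones testbench is only a clock generator, a reset and the unit under test. This is just a sketch - 'my_design' and its ports are placeholders, not a real core.

Code: [Select]
library ieee;
use ieee.std_logic_1164.all;

entity tb_my_design is
end entity;

architecture sim of tb_my_design is
  signal clk  : std_logic := '0';
  signal rst  : std_logic := '1';
  signal dout : std_logic_vector(7 downto 0);
begin
  clk <= not clk after 5 ns;          -- 100 MHz simulation clock

  uut: entity work.my_design          -- placeholder unit under test
    port map (clk => clk, rst => rst, dout => dout);

  stimulus: process
  begin
    wait for 100 ns;
    rst <= '0';                       -- release reset and let it run
    wait for 10 us;
    assert false report "end of simulation" severity failure;
    wait;
  end process;
end architecture;

Every internal signal of the design is then visible in the wave window, which is what makes this so much more informative than staring at a handful of pins on a scope.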
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #92 on: June 22, 2017, 07:47:05 pm »
For all the Python haters, yes.. you can design hardware with Python - http://www.myhdl.org :)
As if VHDL and Verilog weren't confusing enough, now we have another HDL to learn.

What advantages does MyHDL have over the other two?
 

Offline sporadic

  • Regular Contributor
  • *
  • Posts: 72
  • Country: us
    • forkineye.com
Re: Learning FPGAs: wrong approach?
« Reply #93 on: June 22, 2017, 07:52:00 pm »
For all the Python haters, yes.. you can design hardware with Python - http://www.myhdl.org :)
As if VHDL and Verilog weren't confusing enough, now we have another HDL to learn.

What advantages does MyHDL have over the other two?
It actually translates into VHDL or Verilog for synthesis.  The site does a better job of explaining the pros and cons than I ever could.  It's legit though - it's been used for ASICs.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #94 on: June 22, 2017, 09:18:59 pm »
For all the Python haters, yes.. you can design hardware with Python - http://www.myhdl.org :)
As if VHDL and Verilog weren't confusing enough, now we have another HDL to learn.

What advantages does MyHDL have over the other two?

I don't see it either!  Anything I can do with Python HDL, I can do with VHDL and skip a couple of steps.  Perhaps the Python simulation is a little faster (maybe even a lot faster) but I don't usually bother with simulation.  If I did do simulation, I would use the chip vendor's simulator.  It's the only opinion that matters.

I look at it as "just because I can".

Maybe somebody can make the case that I should care about this but, at the moment, I don't.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #95 on: June 22, 2017, 09:27:01 pm »
For all the Python haters, yes.. you can design hardware with Python - http://www.myhdl.org :)
As if VHDL and Verilog weren't confusing enough, now we have another HDL to learn.

What advantages does MyHDL have over the other two?

All these High Level Synthesis (HLS) HDLs seem to have common threads to address these (and other) problems:

- Couldn't hardware design be more like programming?
- The level of abstraction in HDLs is too low
- I don't want to micromanage bits - I just want it to work like integers and floats
- Productivity of HDLs is too low - e.g. testing through simulation is slow.
- I want to use programmers, not hardware designers

I have played with a couple.

- "It can't be more like programming?". There is a solid barrier that makes one not like the other. Programming updates a little bit of data every cycle. To make the most of FPGAs you can not use them like that, you need to make pipelines and have data flow through your design in a way programming can't do.

- "The level of abstraction in HDLs is too low". You can break out of low level HDL programming if you want, but at the cost of doing things somebody else's way, and most likely paying a lot for IP blocks that are huge, complex and costly. However if your needs are unique, then you need to work at low levels of abstraction, for at least part of the design. The 80/20 rule applies

- "I don't want to micromanage bits" - If you want to burn through FPGA resources at an alarming rate, and have minimal performance, all your 'variables' can be 64-bit integers. Sometimes the tools will pick up a reduced range and optimize unused bits way, sometimes it wont. The tighter you constrain the design (e.g. size of counters) the better the design will perform.

- "Productivity of HDLs is too low compared to programming" - Very valid. Being able to efficiently test designs like software is awesome. But then you have to verify that the resulting design actually is equivalent to the software...

- "I want to use programmers, not hardware designers" - If the programmer can't envisage what the design will look like in hardware, then they are just fumbling around in the dark. They will spend a lot of time trying to find an efficient way to express what they are trying to do in a way that the tool set likes and produces an efficient design.

In short - it seems to be great when cost is no object (e.g. research), performance is no object (e.g. research) and rapidly testing new things (e.g. research).

You can also paint yourself into a dead end. If your design tests out OK and fits into the target chip, but does not meet timing requirements, then what can you do? You need a skilled HDL coder to re-write the slow bit.

For some commercial use it is also workable, but requires a skilled hardware designer who knows the HLS tools and the problem space intimately, rather than a generic C/Python/whatever hack.

So in short it ends up with high-level code that is written in a quirky, ungainly way, but a 'normal programmer' can read and maybe make sense of - but a normal programmer will have minimal understanding of why it is like that. A single "refactor" of a module to make it "more normal" will break everything.
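
To make the pipeline point above concrete, here is a trivial three-stage multiply-add, written as a sketch (assume a, b and c are 8-bit unsigned signals, prod and result are 16-bit unsigned, and the _r/_rr signals are the pipeline registers):

Code: [Select]
-- a*b + c computed over three clock cycles, one new input accepted every clock
process(clk)
begin
  if rising_edge(clk) then
    -- stage 1: register the raw inputs
    a_r <= a;
    b_r <= b;
    c_r <= c;
    -- stage 2: multiply, and carry c along with it
    prod <= a_r * b_r;
    c_rr <= c_r;
    -- stage 3: add
    result <= prod + c_rr;
  end if;
end process;

Three different data items are being worked on in the same clock cycle - that overlap is the thing a sequential program can't express directly.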

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: MK14

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #96 on: June 22, 2017, 09:54:51 pm »
It actually processes into VHDL or Verilog for synthesis.   The site does a better job explaining the pros and cons better than I ever could. 
Apart from 'empowering hardware designers with the elegance and simplicity of the Python language' the only advantage I could see is that you can apparently quickly create and simulate a design interactively. I say 'apparently' because the website is full of 'page not founds' everywhere.

"For more information about installing on non-Linux platforms such as Windows, read about Installing Python Modules." - 404 Not Found.

Great! can't even get started...
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #97 on: June 22, 2017, 10:10:12 pm »
Perhaps the Python simulation is a little faster (maybe even a lot faster) but I don't usually bother with simulation.  If I did do simulation, I would use the chip vendor's simulator.  It's the only opinion that matters.

For general purpose verilog simulation (i.e. not process specific verification) there's verilator, an open source simulator that 'compiles' the verilog into C, which can then itself be compiled into machine code. It's fast, and on the right source material it's blazingly fast.

Also you can get at the internals of the simulation in a controlled fashion. I've used this to do mixed model simulation writing the analogue side of the simulation in C.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4531
  • Country: au
    • send complaints here
Re: Learning FPGAs: wrong approach?
« Reply #98 on: June 23, 2017, 01:08:41 am »
- "The level of abstraction in HDLs is too low". You can break out of low level HDL programming if you want, but at the cost of doing things somebody else's way, and most likely paying a lot for IP blocks that are huge, complex and costly. However if your needs are unique, then you need to work at low levels of abstraction, for at least part of the design. The 80/20 rule applies.
Well, you can go to extremely high levels of abstraction in VHDL, so it's possible to have a higher level language just by using the existing tools better. But the core issue is that programming for simultaneous execution is radically different to programming for sequential execution.

There have been some good attempts at C-to-HDL and they work well at matching some patterns, but remain poor at handling arbitrary code. So even with the high level tools you still end up needing to understand the flow and the patterns that fit into logic, just as if you were writing HDL to begin with.
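
As a small example of what I mean by using the existing tools better (my own sketch, nothing from a library): VHDL handles unconstrained vectors, so one function covers every bus width.

Code: [Select]
-- parity of a std_logic_vector of any width
function parity(v : std_logic_vector) return std_logic is
  variable p : std_logic := '0';
begin
  for i in v'range loop
    p := p xor v(i);
  end loop;
  return p;
end function;

You can call parity() on an 8-bit byte or a 64-bit word and the loop unrolls into a plain XOR tree - but you still have to understand that it becomes a tree of gates, not something that iterates at run time, which is exactly the point about simultaneous versus sequential execution.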
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #99 on: June 23, 2017, 02:09:53 am »
- "The level of abstraction in HDLs is too low". You can break out of low level HDL programming if you want, but at the cost of doing things somebody else's way, and most likely paying a lot for IP blocks that are huge, complex and costly. However if your needs are unique, then you need to work at low levels of abstraction, for at least part of the design. The 80/20 rule applies.
Well you can go to extremely high levels of abstraction in VHDL, so its possible to have a higher level language by using the existing tools better. But the core issue is that programming for simultaneous execution is radically different to programming for sequential execution.

There have been some good attempts at C-hdl and they work well at matching some patterns, but remain poor at improving all code. So even with the high level tools you still end up needing to understand the flow and patterns that fit into logic, just as if you were programming HDL to begin with.

This is exactly the point.  There is a difference between writing sequential C code and designing hardware, and hardware design is, well, hard.  That's why it's called hardware.

Software speaks for itself  - soft.

It doesn't seem to me that CS majors are going to do well with HDL unless they also took some EE courses.  HDL is an entirely different thing.
 

Offline Mattjd

  • Regular Contributor
  • *
  • Posts: 230
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #100 on: June 23, 2017, 07:25:33 am »
- "The level of abstraction in HDLs is too low". You can break out of low level HDL programming if you want, but at the cost of doing things somebody else's way, and most likely paying a lot for IP blocks that are huge, complex and costly. However if your needs are unique, then you need to work at low levels of abstraction, for at least part of the design. The 80/20 rule applies.
Well you can go to extremely high levels of abstraction in VHDL, so its possible to have a higher level language by using the existing tools better. But the core issue is that programming for simultaneous execution is radically different to programming for sequential execution.

There have been some good attempts at C-hdl and they work well at matching some patterns, but remain poor at improving all code. So even with the high level tools you still end up needing to understand the flow and patterns that fit into logic, just as if you were programming HDL to begin with.

This is exactly the point.  There is a difference between writing sequential C code and designing hardware  and hardware design is, well, hard.  That's why it's called hardware.

Software speaks for itself  - soft.

It doesn't seem to me that CS majors are going to do well with HDL unless they also took some EE courses.  HDL is an entirely different thing.


Yes, without a course in digital logic/design a CS major will not be able to do well with HDL at all. You don't necessarily have to know how to design an IC at the transistor level, but you sure as shit need to know the boolean algebra to be able to multiplex, decode, encode, create registers, etc., and most importantly the graph theory for state machines.

I don't think people realize that when doing HDL, you create a module, and that module is essentially an IC - let's call it IC1. IC1 can be dropped into a solder-less breadboard or be put on a surface mount board, or whatever. Every time you "instantiate" a module, you're plugging another copy of IC1 into the breadboard.


See this guy (video embedded in the original post):

He built an 8-bit computer on a huge breadboard. When I created a 64-bit processor on my FPGA, I described each and every one of the ICs he has on that board, and described how to wire them together using HDL. The tools then interpreted what I described, synthesized it, and configured the FPGA's fabric to create those ICs and connections.
« Last Edit: June 23, 2017, 07:28:44 am by Mattjd »
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #101 on: June 23, 2017, 08:23:15 am »
Yes, without a course in digital logic/design a CS major will not be able to do well with HDL at all. You don't necessarily have to know how to design an IC at the transistor level, but you sure a shit need to know the boolean algebra to be able to multiplex, decode, encode, create registers, etc. and most importantly the graph theory for state machines.

I'm not sure I'd agree with that. When you design using HDL, you're describing the behaviour of the finished design in terms of what you want it to do. The synthesis tool might infer the need for multiplexers and D-types, but that's not the way the designer has to think. We're a level abstracted.

For example, suppose you're writing an SPI slave, which needs to be able to return one of a number of different values depending on which address is being read. If you're well versed in fundamental digital building blocks, then you might start thinking about how this would be realised using multiplexers and latches. It's fine to be aware, at a very general level, that these are the components which will be required, but you don't need to actually work out how to implement your desired logic using them.

Your code might look something like this:
Code: [Select]
IF sclk'event AND sclk = '1' THEN
  IF reset_n = '0' THEN
    spi_result <= 0;
  ELSE
    CASE spi_addr IS
    WHEN 0 =>
      spi_result <= version_register;
    WHEN 1 =>
      spi_result <= bytes_remaining;
    WHEN 2 =>
      spi_result <= irq_outstanding;
      irq_clear_sig <= NOT irq_clear_ack;
      counter <= 0;
    WHEN 3 =>
      spi_result <= measured_value (counter);
      counter <= counter + 1;
    WHEN 4 =>
      counter <= spi_written_value;
    WHEN OTHERS =>
      NULL;    -- addresses with no special behaviour: nothing changes
    END CASE;
  END IF;
END IF;

In this example, reading different register addresses should return different values, so clearly a multiplexer is required. Some values stored in latches are also clearly going to be needed.

But: there's quite a bit more to it than that. Reading the interrupt flag at address 2 also has the effect of clearing the flag and resetting a counter, so the design also requires a comparator and some reset logic for the counter. The counter also increments every time a result is read, so we need an adder, and it can also be directly updated by writing another register address.

Trying to work out the underlying building blocks required to implement this rapidly gets out of hand, but thankfully that's the job of the synthesis tool. I only need to be vaguely aware that X number of bits need to be preserved from one clock to the next, so I can estimate the logic usage of the design - and only then if it's big enough compared to the capacity of the chip for that to even possibly be an issue.

Quote
When I created a 64 bit processor on my FPGA, I described each and every one of those IC he has on that board and described how to wire them together using HDL. The HDL then interpreted what I described, synthesized it, and wired the transistors of the FPGA to create those IC and connections that it interpreted.

Oh, my. I do hope that something has got lost in translation somewhere, because describing individual discrete ICs (ie. standard, low level logic functions) and then joining them up is a terrible, terrible way to program an FPGA. HDL allows us to describe what we actually want a device to do, not how we think the thing we want could be built up out of basic logic elements.

Offline chris_leyson

  • Super Contributor
  • ***
  • Posts: 1541
  • Country: wales
Re: Learning FPGAs: wrong approach?
« Reply #102 on: June 23, 2017, 09:43:55 am »
Quote
Oh, my. I do hope that something has got lost in translation somewhere, because describing individual discrete ICs (ie. standard, low level logic functions) and then joining them up is a terrible, terrible way to program an FPGA. HDL allows us to describe what we actually want a device to do, not how we think the thing we want could be built up out of basic logic elements.
I once described the Cinematronics CPU as a bunch of individual TTL ICs for a simulation test bench but gave up trying to write a behavioral model as it was just taking far too long. Ended up turning the simulation model into something that could be synthesized and it worked. I totally agree it's the wrong approach but it was quick and dirty and just for fun.
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #103 on: June 23, 2017, 03:11:02 pm »
When you design using HDL, you're describing the behavior of the finished design in terms of what you want it to do. The synthesis tool might infer the need for multiplexers and D-types, but that's not the way the designer has to think. We're a level abstracted.
When designing any logic circuit you should start out with what you want it to do. HDL just skips the boring part of having to decide what gates to use and how to wire them.

Quote
describing individual discrete ICs (ie. standard, low level logic functions) and then joining them up is a terrible, terrible way to program an FPGA.
It has one valid use - converting an existing discrete design. But for a new design it's working backwards. The purpose of HDL is to avoid having to describe the circuit at the individual gate level. Being aware of what type of logic circuit you are creating is good, but trying to reproduce specific 'discrete' logic chips is unnecessarily limiting. Unfortunately many tutorials start out by doing that, presumably to give the student something they are familiar with (not a bad thing in itself, but it may give a wrong impression of how best to create a design).

However as a rank beginner who so far has only used WinCUPL - and tried to understand VHDL - the main problem I have is getting to grips with the language itself. Most of the tutorials I have tried  threw in new concepts without adequate explanation, and assumed you will pick up clues to the required syntax just by looking at examples. The result is I can read a piece of VHDL code and almost understand what is going on, but little details trip me up.
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #104 on: June 23, 2017, 03:36:03 pm »
It has one valid use - converting an existing discrete design. But for a new design it's working backwards.

I can see that there might be justification for working this way if:
  • a known working design already exists, and
  • the new design must be a drop-in functional equivalent of the existing one, and
  • it must be formally proved, for some reason, that the new functionality exactly replicates the old under all conditions

Then I can see an argument for building a new HDL design based on existing circuits. In any other case, I do think that describing the desired behaviour is the way to go.

Quote
the main problem I have is getting to grips with the language itself. Most of the tutorials I have tried  threw in new concepts without adequate explanation, and assumed you will pick up clues to the required syntax just by looking at examples. The result is I can read a piece of VHDL code and almost understand what is going on, but little details trip me up.

There's a lot to be tripped up on, especially if you try to read VHDL the same way as you might try to read and interpret a sequentially executed language that runs on a microprocessor. Most of us do, of course, because it's entirely natural to read it from top to bottom, and in some instances things which happen towards the end of the source file (note: most definitely not "later" in the file!) do take precedence over things which happen nearer the beginning (not "earlier"!).

The complete absence of any correlation between time of execution, and position in the source file, can easily do anyone's head in.

I'm not sure there's an easy way round this, other than asking questions when you get stuck - sorry.

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #105 on: June 23, 2017, 03:53:50 pm »
It has one valid use - converting an existing discrete design. But for a new design it's working backwards.

I can see that there might be justification for working this way if:
  • a known working design already exists, and
  • the new design must be a drop-in functional equivalent of the existing one, and
  • it must be formally proved, for some reason, that the new functionality exactly replicates the old under all conditions

Then I can see an argument for building a new HDL design based on existing circuits. In any other case, I do think that describing the desired behaviour is the way to go.

Lest we forget, this is the very reason that VHDL exists. The US Department of Gung-ho and Killin' Furruners found that it was increasingly relying on systems full of VLSI chips that might disappear from the supply chain before the weapons system they were in reached end-of-life. They wanted a way of formally documenting chip designs to allow them to re-create these chips as necessary to keep systems in operation. The use of HDL as primary design tool and fodder for synthesis came later.

I'm not sure there's an easy way round this, other than asking questions when you get stuck - sorry.

You can make it less painful by picking Verilog instead of VHDL. [fx: ducks]
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #106 on: June 23, 2017, 04:02:55 pm »
I've been doing a new FPGA design from scratch all this week. It's now beer o'clock on Friday evening and I'm completely frazzled. Every time I close my eyes I see traces wiggling in ModelSim.

Does it really show that badly?  :-BROKE

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #107 on: June 23, 2017, 04:28:11 pm »
However as a rank beginner who so far has only used WinCUPL - and tried to understand VHDL - the main problem I have is getting to grips with the language itself. Most of the tutorials I have tried  threw in new concepts without adequate explanation, and assumed you will pick up clues to the required syntax just by looking at examples. The result is I can read a piece of VHDL code and almost understand what is going on, but little details trip me up.
IMHO one of the problems of VHDL is that many don't know how to really take advantage of it and do stupid things like using the std_logic_vector for all multi-bit signals and/or describe logic instead of functionality. That leads to longwinded incomprehensible code very quickly. Above all VHDL is a parallel programming language. Treat it as such and you will discover it has great power.
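A quick illustration of the std_logic_vector point - a sketch only, assuming 'count_slv' is declared as std_logic_vector(15 downto 0) and 'count' as unsigned(15 downto 0) with numeric_std:

Code: [Select]
-- counter kept as std_logic_vector: every operation needs conversions
count_slv <= std_logic_vector(unsigned(count_slv) + 1);

-- counter declared as unsigned: says what it means
count <= count + 1;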
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #108 on: June 23, 2017, 04:47:39 pm »
There's a lot to be tripped up on, especially if you try to read VHDL the same way as you might try to read and interpret a sequentially executed language that runs on a microprocessor. Most of us do, of course, because it's entirely natural to read it from top to bottom, and in some instances things which happen towards the end of the source file (note: most definitely not "later" in the file!) do take precedence over things which happen nearer the beginning (not "earlier"!).

They do happen earlier (or later). However, the things in VHDL don't happen at run time (as in C, for example); they happen at compile (synthesis) time. VHDL is more like Basic, where an interpreter reads the code and executes it immediately. The circuit being built is the result of this execution.

This is completely different from traditional languages (such as C) where the compiler builds a program, but doesn't execute it. The program executes later at run time.

 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #109 on: June 23, 2017, 06:24:59 pm »
That's not a distinction I'd have made. BASIC and C may normally be processed into op-codes using different methods, but it's physically possible to compile BASIC and to interpret C if you were so inclined. Regardless of how each is parsed and processed, they both represent a set of instructions to be executed one at a time in a particular order.

Not so with VHDL. The example I like to use is the classic 'how not to swap two values' example found in beginner level textbooks:

Code: [Select]
a <= b;
b <= a;

We're all familiar with how this fails to work. The first assignment makes a equal to b, and the previous value of a is lost forever. The second assignment makes b equal to a, which it was already. Both end up equal to the original value of b.

In VHDL that's not the case. Put these signal assignments into a clocked process, and they do indeed switch values, because the meaning is quite different. "a <= b" means "signal a must, at a time a very short distance in the future, take the value which signal b has right now".

Since no time elapses between one line of code and the next, both signals do indeed switch.

Moreover, it doesn't matter in which order these two lines are written, the meaning is identically the same thing regardless.

However: the order in which commands are placed does affect their precedence in the event of a conflict. For example:

Code: [Select]
a <= b;
a <= a;

...has absolutely no effect whatsoever. The value of a remains completely unchanged, because the later assignment ('a' in the future takes the value of whatever 'a' is right now) simply overrides the earlier one. Synthesize this code, and precisely no FPGA resources at all will be required. There certainly won't be a glitch on the output, as there would be if a similar sequence of operations were to be carried out in order by a microprocessor.

I use this type of construct a lot when dealing with FIFOs. For example:

Code: [Select]
fifo_we <= '0';

IF <interesting set of conditions> THEN
  fifo_data <= new_data_value;
  fifo_we <= '1';
END IF;

This ensures that the FIFO is only ever advanced when there really is new data, and under no other possible conditions.

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #110 on: June 23, 2017, 07:02:40 pm »
I don't like to rely on how expressions are ordered. It will confuse people who are less familiar with VHDL and thus it costs time in the long run. For similar reasons I avoid certain constructs in C like the comma operator.

As a rule I code VHDL in a way so a signal assignment has a single condition. Sometimes this makes things harder at first but after some thinking about what I'm trying to achieve it usually results in less lines of code and a solution which is much easier to follow.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #111 on: June 23, 2017, 07:05:00 pm »
However as a rank beginner who so far has only used WinCUPL - and tried to understand VHDL - the main problem I have is getting to grips with the language itself. Most of the tutorials I have tried  threw in new concepts without adequate explanation, and assumed you will pick up clues to the required syntax just by looking at examples. The result is I can read a piece of VHDL code and almost understand what is going on, but little details trip me up.
IMHO one of the problems of VHDL is that many don't know how to really take advantage of it and do stupid things like using the std_logic_vector for all multi-bit signals and/or describe logic instead of functionality. That leads to longwinded incomprehensible code very quickly. Above all VHDL is a parallel programming language. Treat it as such and you will discover it has great power.

Most, if not all, entry level tutorials will use std_logic_vector rather than unsigned and that's how folks get started using std_logic_arith.all in order to implement counters or add vectors.  OTOH, the more pedantic approach of using unsigned simply means that I spend a lot of time writing casts between the types. 

As an example, I can't write the 32 bit unsigned output of an adder to a BlockRam (just to contrive an example) because the BlockRAM, over which I have no control of the definition, expects std_logic_vector.

http://www.synthworks.com/papers/vhdl_math_tricks_mapld_2003.pdf

And, yes, I am moving toward unsigned but I sure don't know why.
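
For reference, with numeric_std those casts are at least mechanical. A sketch of the BlockRAM case, with invented signal names:

Code: [Select]
-- 32-bit unsigned adder result onto a std_logic_vector RAM data port
bram_din <= std_logic_vector(adder_result);

-- and back to unsigned when reading the RAM
adder_in <= unsigned(bram_dout);

-- an integer address onto a std_logic_vector address port
bram_addr <= std_logic_vector(to_unsigned(addr, bram_addr'length));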

 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #112 on: June 23, 2017, 07:10:09 pm »
I don't like to rely on how expressions are ordered. It will confuse people who are less familiar with VHDL and thus it costs time in the long run. For similar reasons I avoid certain constructs in C like the comma operator.


Assuming that the code above (FIFO) was part of a larger FSM, the choice is to define a default condition for fifo_we and then override it when necessary, or to define it in every single state, which is a lot of useless typing.

I have been recommending the default value throughout this topic and this is another example where it applies.

Different people use different styles.  I don't shy away from the C comma operator if it means I can avoid an 'else' block containing just a single line.

Again, different styles...
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #113 on: June 23, 2017, 07:14:13 pm »
However as a rank beginner who so far has only used WinCUPL - and tried to understand VHDL - the main problem I have is getting to grips with the language itself. Most of the tutorials I have tried  threw in new concepts without adequate explanation, and assumed you will pick up clues to the required syntax just by looking at examples. The result is I can read a piece of VHDL code and almost understand what is going on, but little details trip me up.
IMHO one of the problems of VHDL is that many don't know how to really take advantage of it and do stupid things like using the std_logic_vector for all multi-bit signals and/or describe logic instead of functionality. That leads to longwinded incomprehensible code very quickly. Above all VHDL is a parallel programming language. Treat it as such and you will discover it has great power.

Most, if not all, entry level tutorials will use std_logic_vector rather than unsigned and that's how folks get started using std_logic_arith.all in order to implement counters or add vectors.  OTOH, the more pedantic approach of using unsigned simply means that I spend a lot of time writing casts between the types. 

As an example, I can't write the 32 bit unsigned output of an adder to a BlockRam (just to contrive an example) because the BlockRAM, over which I have no control of the definition, expects std_logic_vector.

http://www.synthworks.com/papers/vhdl_math_tricks_mapld_2003.pdf

And, yes, I am moving toward unsigned but I sure don't know why.
That is why you should use numeric_std. Basic rule: if it is a number then use the types signed and unsigned. And don't instantiate blockrams (or any other primitives) directly. Just create an array and the synthesizer will decide whether to use blockrams or other resources.

Then you can do stuff like this to read data from a memory:
Code: [Select]
if rising_edge(clk) then
  a <= ram_data(read_pointer);
end if;
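
Fleshing that out a little, here is a sketch of the whole inferred RAM (clk, we, data_in and a are assumed to be declared elsewhere; most synthesizers will map this pattern onto block RAM):

Code: [Select]
type ram_type is array (0 to 1023) of std_logic_vector(31 downto 0);
signal ram_data      : ram_type;
signal read_pointer  : integer range 0 to 1023;
signal write_pointer : integer range 0 to 1023;

process(clk)
begin
  if rising_edge(clk) then
    if we = '1' then
      ram_data(write_pointer) <= data_in;   -- synchronous write port
    end if;
    a <= ram_data(read_pointer);            -- synchronous read port
  end if;
end process;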
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #114 on: June 23, 2017, 07:31:51 pm »
I like the general approach of having the usual, most likely value of a signal be defined as a default, then have the code describe those things which are interesting or noteworthy under various conditions.

The FIFO example is one which works well. Another might be a counter, which increments on every clock edge except when some event causes it to reset to zero. You could end up with a lot of paths through a FSM all of which boil down to "do something interesting, and yes, don't forget to increment the bl**dy counter".
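
In code, that counter looks something like this (a sketch - 'counter' is assumed to be an unsigned signal and 'sync_reset' whatever event clears it):

Code: [Select]
IF rising_edge (clk) THEN
  counter <= counter + 1;          -- the usual case, stated once as the default

  IF sync_reset = '1' THEN
    counter <= (others => '0');    -- the interesting case overrides it
  END IF;
END IF;

The later assignment wins, so there's no glitch and no duplicated 'increment' lines scattered through the states.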

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #115 on: June 23, 2017, 08:34:52 pm »
"a <= b" means "signal a must, at a time a very short distance in the future, take the value which signal b has right now".

What do you mean by "right now?"
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #116 on: June 23, 2017, 08:50:07 pm »

That is why you should use std_numeric. Basic rule: if it is a number then use the types signed and unsigned. And don't infer blockrams (or any other primitives). Just create an array and the synthesizer will decide whether to use blockrams or other resources.


I don't usually infer BlockRams, I instantiate them and, more often than not, specify the initial contents.  Furthermore, it is easy to initialize contents post bit file generation by using Xilinx's data2mem utility.  This means I don't need to re-synthesize or rerun place/route just to test a different program (assuming a CPU project).

At one time, I was specifying the contents in the .ucf file simply because it eliminated having to re-synthesize.  Later on I ran across data2mem and now I don't even need to place/route or generate the bitfile.

These days, playing with the Lattice toolchain, it seems simpler to use their IPexpress to create the memory block and attach a filename for the contents.  Unfortunately, this implies I will have to re-synthesize to update the contents.  I'm not too sure what to think about that.  The good news is that I am only messing around with the Lattice MachXO3 board.  I'll spend most of my time in the Xilinx world.
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #117 on: June 23, 2017, 10:19:37 pm »
"a <= b" means "signal a must, at a time a very short distance in the future, take the value which signal b has right now".

What do you mean by "right now?"

I mean, at the precise instant when either:

a) an active clock edge occurs, or
b) a signal in a process's sensitivity list changes state, causing the process to (for want of a better term) execute.

For example:

Code: [Select]
IF rising_edge (clk) THEN
  b <= a;
  a <= b;
END IF;

...causes the values stored in registers 'a' and 'b' to switch places at the precise time when a rising edge of the clock occurs.

In a real device, there are of course propagation delays to consider, so the values will actually switch a few nsec after the edge. Nevertheless, the meaning of the code is unambiguous, and the synthesis tool will work out the necessary layout to make the real logic behave correctly, including any signals which depend on those which have just been assigned.

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #118 on: June 23, 2017, 10:32:22 pm »
Nitpicking mode: For synchronous processes it will work, but for asynchronous processes (sensitive to signals other than a clock) it will result in a mess and the synthesizer won't be able to do anything about it. You can't rule out the nature of the FPGA fabric entirely, so signals won't arrive at the same time.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #119 on: June 23, 2017, 11:23:00 pm »
Quote
What do you mean by "right now?"

I mean, at the precise instant when either:

a) an active clock edge occurs, or
b) a signal in a process's sensitivity list changes state, causing the process to (for want of a better term) execute.

You relate your time to the events which are going to happen in the FPGA when you run your design. From this timing viewpoint, VHDL looks weird and non-sequential, which might be a cause of confusion. To see the sequence in VHDL, try instead to relate to the time when the VHDL compiler goes through your VHDL code.

Imagine you're building a circuit from ICs on a breadboard. You have inserted the ICs and now you need to connect some wires. You will do this sequentially, but the exact sequence doesn't really matter - you can do it in any order as long as the end result is the same. The sequence of connecting wires has nothing to do with the sequence of events which will transpire in the circuit when you turn it on.

Same with VHDL. As the VHDL compiler reads your code, it connects wires accordingly (or makes other changes to the circuit being built). These operations are perfectly sequential. But there's no direct relation between the sequence of VHDL statements and the sequence of events in the FPGA.

For example, look at the following VHDL code:

Code: [Select]
process(clk)
begin
  if rising_edge(clk) then
    -- Connect 7 wires to produce a shift
    for i in 0 to 6 loop
      a(i) <= a(i+1);
    end loop;
    -- Right Now the 8-th wire is still unconnected, so connect it
    a(7) <= a(0);
  end if;
end process;

Here VHDL goes through the "for" loop, making a single connection on every pass. This loop exists only in VHDL - nothing is going to loop in the FPGA. The "i" variable will not exist in the FPGA either. The words "Right Now" in the comment refer to the time during the process of connecting wires (building the circuit). This time point will not exist in the FPGA. If you look at it this way, VHDL is perfectly sequential. That's how I look at it.

 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #120 on: June 23, 2017, 11:28:24 pm »
"a <= b" means "signal a must, at a time a very short distance in the future, take the value which signal b has right now".

What do you mean by "right now?"

This is how I like to think about it.  The '<=' means the logic, or math, will be performed once every clock cycle.  It's basically a set of D flipflops.  The variable to the left of '<=' is the output of those D flipflops.

So, if we say:

a <= b + c;

This means that the variables 'b' and 'c' go through addition logic and feed the data inputs of the flipflops which create variable 'a'.  So at the next clock cycle, the value of 'a' will change to the sum of 'b+c'.

If we say:
a <= a + 1;

This still looks like a simple C or Basic program line.  The data inputs of the 'a' flipflops are tied to the outputs of the 'a' flipflops plus 1.  Once again, with each clock, the outputs of 'a' will take a new value.

Now, if we say:
a <= b;
b <= a;

What's going on here is that the 'a' data inputs are tied to the 'b' flipflops' outputs, and the 'b' data inputs are tied to the 'a' outputs.  Since Verilog/VHDL runs all the logic in parallel, when a single clock edge comes, 'a' latches the 'b' outputs and 'b' latches the 'a' outputs.  This effectively swaps the contents of 'a' & 'b' on each and every clock.

Now, if we say:
a <= b;
b <= c;
c <= a;

Just follow the above rules and you will see we have sort of made a circular 3-word buffer, where on every clock all 3 variables a, b, c move at once, not one after the other.

Now, for that pesky phrase 'at a time a very short distance in the future'.  This has to do with how fast the clock is going.  If the clock period is longer than the time the wiring and logic gates on the silicon need to settle all the data inputs (the expression on the right-hand side of the '<='), then on the clock edge the registers on the left-hand side of the '<=' will capture the correct result.  If your clock is going too fast, some or all of the bits feeding the data inputs of those flipflop registers will not have settled yet, and the registers will capture the wrong result.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #121 on: June 24, 2017, 01:35:47 am »
Just a note on HDL developers vs software developers.  I develop in both, and I do see how some of my HDL debugging can drive me nuts finding that tiny logic error which always creeps up somewhere in a sophisticated design, and it does take me longer to work with because of the compile times for designs which grow too large to simulate on a small home setup.  However, I thoroughly enjoy the work and effort put into a working FPGA, just for the raw awesome potential to make anything happen, even with all the costs involved.

How many of you feel this way?
Or, do you go to FPGA just because you have no other choice and would prefer a simple MCU only solution?
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #122 on: June 24, 2017, 08:07:33 pm »
There's a lot to be tripped up on, especially if you try to read VHDL the same way as you might try to read and interpret a sequentially executed language that runs on a microprocessor... The complete absence of any correlation between time of execution, and position in the source file, can easily do anyone's head in.
I don't have a problem with that. It's obvious that hardware operates sequentially only when it is wired sequentially, and why should the position in the source file matter? 

What I am talking about is stuff like this:-

A VHDL Tutorial from Green Mountain Computing Systems, Inc. introduces you to VHDL with this:-

Quote
entity latch is
  port (s,r: in bit;
        q,nq: out bit);
end latch;

The first line indicates a definition of a new entity, whose name is latch. The last line marks the end of the definition. The lines in between, called the port clause, describe the interface to the design. The port clause contains a list of interface declarations. Each interface declaration defines one or more signals that are inputs or outputs to the design.

Each interface declaration contains a list of names, a mode, and a type. In the first interface declaration of the example, two input signals are defined, s and r. The list to the left of the colon contains the names of the signals, and to the right of the colon is the mode and type of the signals. The mode specifies whether this is an input (in), output (out), or both (inout). The type specifies what kind of values the signal can have. The signals s and r are of mode in (inputs) and type bit. Next the signals q and nq are defined to be of the mode out (outputs) and of the type bit (binary). Notice the particular use of the semicolon in the port clause. Each interface declaration is followed by a semicolon, except the last one, and the entire port clause has a semicolon at the end.
That's a lot to take in for your very first exposure to VHDL - a dense paragraph that introduces many new concepts but expects you to infer others (what is the meaning of 'is'? where is whitespace permitted? etc.) 

Then they follow it with this:-
Quote
architecture dataflow of latch is
  signal q0 : bit := '0';
  signal nq0 : bit := '1';
begin
  q0<=r nor nq0;
  nq0<=s nor q0;

  nq<=nq0;
  q<=q0;
end dataflow;

The first line of the declaration indicates that this is the definition of a new architecture called dataflow and it belongs to the entity named latch. So this architecture describes the operation of the latch entity. The lines in between the begin and end describe the latch's operation.
What does 'signal' mean in this context? Why is there no mode? What does ':=' mean? Why do we need 'begin'? nq is less than or equal to nq0?

I am currently reading this tutorial, which is making a lot more sense to me so far...
 
 

Offline Mattjd

  • Regular Contributor
  • *
  • Posts: 230
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #123 on: June 24, 2017, 11:59:27 pm »
Now i feel like everything i've learned about HDL is wrong.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #124 on: June 25, 2017, 01:48:22 am »
Now i feel like everything i've learned about HDL is wrong.
I use Verilog to make my life soooo much easier.  Especially if you use a simple single synchronous clock for everything, nothing asynchronous.  Coding this way makes for very portable designs across all FPGAs and PLDs.  As for the above VHDL example, it twists my head and I avoid it at all costs and won't ever use it.
 

Offline Amazing

  • Regular Contributor
  • *
  • Posts: 59
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #125 on: June 25, 2017, 03:41:36 am »
How many of you feel this way?
Or, do you go to FPGA just because you have no other choice and would prefer a simple MCU only solution?

I got dragged into FPGA work kicking and screaming due to a contractor who bailed on the project after creating the hardware but before writing the VHDL.  So we were stuck with an FPGA-based board and no one to program it.

I got lucky in that I found another contractor who was a wiz at VHDL and he got our board going.  He also taught me a ton and now I really enjoy being able to harness the power of FPGAs in my designs.

One thing that I think is really fun is getting deeply involved in breaking down a problem, designing cores (e.g. ALUs) specifically for that problem, and pipelining out the wazoo to increase efficiency.

Sadly though I tend not to have time for that sort of thing on a paying gig -- then I just buy the next size up, describe the logic in state machines, and let the synthesizer do its thing.  Much more cost efficient for low volume production that way.

What I learned about writing VHDL is that it's all about the mindset.  EE's love to remind us software folks to "remember, you're creating hardware, not writing a program".  But it's really much deeper and less obvious than that.

To everyone learning for the first time, I'd say, persevere, take small steps, don't worry about simulation or test benches at first, and read as much as you can on different styles of programming VHDL.  Eventually it will soak in and you will "get it".

I use Verilog to make my life soooo much easier.  Especially if you use a simple single synchronous clock for everything, nothing asynchronous.  Coding this way makes for very portable designs across all FPGAs and PLDs.  As for the above VHDL example, it twists my head and I avoid it at all costs and won't ever use it.

That's funny, I learned VHDL first and I think that Verilog is incomprehensible.
« Last Edit: June 25, 2017, 03:46:08 am by Amazing »
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #126 on: June 25, 2017, 08:53:18 am »
You must never forget that you are describing hardware.
<= means 'is connected to', not 'becomes equal to'.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #127 on: June 25, 2017, 09:55:04 am »
You must never forget that you are describing hardware.
<= means 'is connected to', not 'becomes equal to'.

"is connected to" doesn't work in clocked processes. e.g:

Code: [Select]
  if rising_edge(clk) then
     a <= b;
  end if;

I don't think of it as "on the rising edge of 'clk', 'a' is connected to 'b'" - if asked to describe it, I would say "on each 'clk' tick, store the value of 'b' in 'a'".

I can't actually find words that match what "<=" does in each of the different contexts in which it is used. I think of "=>" more as 'connected to', for example

Code: [Select]
i_counter: counter port map (
    clk     => sys_clk,
    count => cycle_count);

I think of that as "a counter, with 'clk' connected to 'sys_clk' and 'count' connected to 'cycle_count'"...
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #128 on: June 25, 2017, 10:08:34 am »
<= means 'is connected to', not 'becomes equal to'.

Sorry, Mike, I don't agree with you there. The only symbol which means "is connected to" is "=>", when used to map the ports of a component to signals at a higher level of hierarchy:

Code: [Select]
my_logic_gate: d_type PORT MAP (
  d_in => my_data,
  q_out => my_output,
  clk => master_clock
);

I think of "=>" as meaning "takes its new value from", or indeed, "becomes equal to" (at a point in the future one time quantum from now, but not actually now)

In a clocked process:
Code: [Select]
PROCESS (clk)
-- exchange the values of a and b on every clock edge
BEGIN
  IF clk'event AND clk = '1' THEN
    a <= b;
    b <= a;
  END IF;
END PROCESS;

...or in asynchronous logic...

Code: [Select]
PROCESS (a)
BEGIN
  b <= NOT a;
END PROCESS;


Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #129 on: June 25, 2017, 10:22:07 am »
You must never forget that you are describing hardware.
Actually you must forget about the hardware, otherwise you'll be writing way too much code. When programming in C you also don't bother about whether a variable is stored in register r1 or r2, or where exactly it sits in RAM. VHDL is the same. For example: you can write a <= a*(b+d) + c; in VHDL and the synthesizer will figure out it needs a multiplier and how it needs to be connected. No need to instantiate one yourself and deal with how it is actually connected.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #130 on: June 25, 2017, 10:27:04 am »
What does 'signal' mean in this context? Why is there no mode? What does ':=' mean? Why do we need 'begin'? nq is less than or equal to nq0?

I hate this, when tutorials are written by people a little too familiar with the subject matter, and they begin with material that should have been on about page 5, leaving out the important introduction to the subject (definitions, context, general explanation of what the heck is going on) which should have filled pages 1 to 4.

A "signal" is any value which needs to be stored, or output from the device. Almost every piece of data which your FPGA handles will be a "signal". The values of signals" are generally retained in the D-type latches which form part of the FPGA fabric.

I don't know what you mean by "mode" in this context.

":=" is a symbol used, in this context, to assign a default value to a signal, which it will have at the point when the FPGA has just been powered up and configured. It's a method often used to ensure that counters start at zero, state machines initialise to a valid 'idle' state, and so on.

"<=" does indeed mean "less than or equal" when used in the context of a comparison, but here, it means assignment (see long rambling posts above).

"Begin" just means "by this point, we've declared all the signals we're going to use... now here's the logic which defines their behaviour". It's just semantics. Some things must go before the 'begin', and some after. Don't read too much into it, just copy an example and structure your code the same way.

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #131 on: June 25, 2017, 11:15:58 am »
What does 'signal' mean in this context? Why is there no mode? What does ':=' mean? Why do we need 'begin'? nq is less than or equal to nq0?

I hate this, when tutorials are written by people a little too familiar with the subject matter, and they begin with material that should have been on about page 5, leaving out the important introduction to the subject (definitions, context, general explanation of what the heck is going on) which should have filled pages 1 to 4.

A "signal" is ...

I don't know what you mean by "mode" in this context.

":=" is a ...

"<=" does indeed mean ....


"Begin" just means "by this point, we've declared all the signals we're going to use... now here's the logic which defines their behaviour". It's just semantics. Some things must go before the 'begin', and some after. Don't read too much into it, just copy an example and structure your code the same way.

I think Bruce's questions were meant to be rhetorical. And I think you mean 'syntax' not "semantics".
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #132 on: June 25, 2017, 01:02:04 pm »
"<=" doesn't mean "connect", but it infers connection(s).

The only way to make things work on a breadboard is to place ICs and connect them with wires.

An FPGA is a huge collection of elements (LUTs, FFs, RAM etc.). They're connected through configuration switches. The bitstream is simply a collection of bits.  Each bit controls a switch (or switches), thus making or breaking a connection.

The VHDL code is simply a mechanism to convey which connections are needed.

Code: [Select]
PROCESS (clk)
-- exchange the values of a and b on every clock edge
BEGIN
  IF clk'event AND clk = '1' THEN -- A signal which changes in this block is going to be a flip-flop clocked by clk
    a <= b; -- connect the output of flip-flop b to the input of flip-flop a
    b <= a; -- connect the output of flip-flop a to the input of flip-flop b
  END IF;
END PROCESS;

Code: [Select]
PROCESS (a)
BEGIN
  b <= NOT a; -- build an inverter. Connect its input to a and output to b.
END PROCESS;
« Last Edit: June 25, 2017, 01:37:00 pm by NorthGuy »
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #133 on: June 25, 2017, 01:38:55 pm »
What is wrong with seeing <= and := as assignment operators? Just like in C the = assigns the value from what is on the right to what is on the left. In VHDL <= and := assign what is on the right to what is on the left so there really isn't any difference.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #134 on: June 25, 2017, 03:32:15 pm »
Now i feel like everything i've learned about HDL is wrong.
I use Verilog to make my life soooo much easier.  Especially if you use a simple single synchronous clock for everything, nothing asynchronous.  Coding this way makes for very portable designs across all FPGAs and PLDs.  As for the above VHDL example, it twists my head and I avoid it at all costs and won't ever use it.

It's odd how the language you start with becomes your language of choice.  I started with VHDL and, for the life of me, I can't figure out Verilog.  VHDL tends to be more Pascal like in that it is quite verbose.  Verilog, in my view, is C like in that it can be quite terse.

I have made several half-hearted attempts to understand Verilog and I can't get there.  What I need to do is design an entire project using only Verilog and force myself to work with it.  But, no, I will get to the point where all I want is the finished project and it will be coded in VHDL.

I have NEVER understood the difference between blocking and non-blocking assignments in an 'always' block and whether it matters if the block is clocked.  I read this and get completely confused...

https://electronics.stackexchange.com/questions/91688/difference-between-blocking-and-nonblocking-assignment-verilog

In VHDL, it's a simple concept:  If the block is clocked, all assignments in the block are registered.  If the block isn't clocked, all assignments are combinatorial.  THIS I can understand!
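A minimal sketch of that rule (made-up signal names):
Code: [Select]
-- clocked process: q becomes a register
process(clk)
begin
  if rising_edge(clk) then
    q <= d;
  end if;
end process;

-- unclocked process: y is pure combinatorial logic
process(a, b)
begin
  y <= a and b;
end process;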

Verilog has the '=' symbol for 'blocking' assignment and '<=' for 'non-blocking' assignments (whatever that may mean).  But the idea that one creates sequential logic and the other creates parallel logic within the 'always' block escapes me.  It's ALL parallel inside the chip!

I think I'm just too old to catch on...

 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Learning FPGAs: wrong approach?
« Reply #135 on: June 25, 2017, 04:16:06 pm »
What is wrong with seeing <= and := as assignment operators? Just like in C the = assigns the value from what is on the right to what is on the left. In VHDL <= and := assign what is on the right to what is on the left so there really isn't any difference.
The problem is that in a programming language, assignment happens at a specific moment. In asynchronous logic, the assignment is effectively happening continuously.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #136 on: June 25, 2017, 04:44:01 pm »
Verilog has the '=' symbol for 'blocking' assignment and '<=' for 'non-blocking' assignments (whatever that may mean).

A 'blocking' assignment blocks anything else from happening (simultaneously) in the same code block while the assignment is happening; a 'non-blocking' one doesn't.

So, if we start off with three registers and their initial values A=1, B=2 and C=3.

If we execute the following sequence of blocking assignments:

begin
   B = A;
   C = B;
end


we get the result A=1, B=1, C=1. That is, the first statement executed in its entirety before the second, each blocking assignment is 'executed' in sequence. Now let's do the same thing with non-blocking assignments, and the same initial values as before:

begin
   B <= A;
   C <= B;
end


This time the result is A=1, B=1, C=2. The values for the right hand sides were taken as we 'passed' 'begin', all the assignments happened simultaneously, and they all finished at the same time, just as we reached 'end'.

That's slightly simplistic and wouldn't probably satisfy a language lawyer, but it gives the essentially flavour of what's going on.

The blocking assignment is useful in writing test beds and the like but dangerous, and usually wrong, in writing code that you actually expect to be implemented in hardware. You can fake up quite a complex signal for a test bed by combining blocking assignments with delays but that kind of usage is not synthesizeable and so will never make it to real hardware.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #137 on: June 25, 2017, 04:50:22 pm »
I think Bruce's questions were meant to be rhetorical.
At the time I read the tutorial that was what I was thinking. I now know better, but this thread is helping to clarify some things in my mind.   
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #138 on: June 25, 2017, 05:11:48 pm »
Verilog has the '=' symbol for 'blocking' assignment and '<=' for 'non-blocking' assignments (whatever that may mean).

A 'blocking' assignment blocks anything else from happening (simultaneously) in the same code block while the assignment is happening; a 'non-blocking' one doesn't.

So, if we start off with three registers and their initial values A=1, B=2 and C=3.

If we execute the following sequence of blocking assignments:

begin
   B = A;
   C = B;
end


we get the result A=1, B=1, C=1. That is, the first statement executed in its entirety before the second, each blocking assignment is 'executed' in sequence. Now let's do the same thing with non-blocking assignments, and the same initial values as before:

begin
   B <= A;
   C <= B;
end


This time the result is A=1, B=1, C=2. The values for the right hand sides were taken as we 'passed' 'begin', all the assignments happened simultaneously, and they all finished at the same time, just as we reached 'end'.

That's slightly simplistic and probably wouldn't satisfy a language lawyer, but it gives the essential flavour of what's going on.

The blocking assignment is useful in writing test beds and the like but dangerous, and usually wrong, in writing code that you actually expect to be implemented in hardware. You can fake up quite a complex signal for a test bed by combining blocking assignments with delays but that kind of usage is not synthesizeable and so will never make it to real hardware.

The first one infers a flip-flop with A as input and both B and C as outputs. Sequential (blocking) execution of Verilog statements produces parallel wiring.

The second one infers two flip-flops connected in a chain. A->FF->B->FF->C. Parallel (non-blocking) execution of Verilog statements produces serial wiring.

This is certainly a case of weird terminology.

I use VHDL because I started with it (pure coincidence). I have no intention of using Verilog. VHDL lets me do everything I would want it to do. I'm absolutely sure that if I had started with Verilog, the situation would be reversed and I would never want to use VHDL. Just as rstofer suggested. Imprinting :)
 

Offline hans

  • Super Contributor
  • ***
  • Posts: 1640
  • Country: nl
Re: Learning FPGAs: wrong approach?
« Reply #139 on: June 25, 2017, 05:36:28 pm »
in VHDL:

"<=" is used in assignments of signals.
":=" is used for assignment of variables.

Signals can exist in an architecture, process and procedures.
Variables can exist in process and functions.

A signal at an architecture level is basically a wire. It connects signals together with perhaps a few gates, like:
Code: [Select]
ARCHITECTURE ... OF ... IS
SIGNAL a, b, c : STD_LOGIC;
BEGIN
a <= b AND c;
END ARCHITECTURE;

This way you can compute new values within an entity (not shown in example).

Using a process you could compute new values of a at the rising edge of a clock, i.e. sequential logic:
Code: [Select]
ARCHITECTURE ... OF ... IS
SIGNAL a, b, c : STD_LOGIC;
BEGIN
PROCESS(clk)
BEGIN
IF rising_edge(clk) THEN
a <= b AND c;
END IF;
END PROCESS;
END ARCHITECTURE;

Why have variables when we have signals? Because if you assign a new value to a signal, its new value will not take effect immediately. Only after the process has finished running is the new value used.

A variable, however, is updated instantly, so you can assign a value and then read that new value straight back. A variable does hold its value after you "exit" the process as well. But you cannot use them in an architecture, so they are best used for intermediate values.
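A minimal sketch of the difference (made-up names; a, b, c, d and clk are signals):
Code: [Select]
process(clk)
  variable tmp : std_logic;
begin
  if rising_edge(clk) then
    tmp := b and c;    -- variable: updated immediately, usable on the very next line
    a   <= tmp or d;   -- signal: takes its new value only after the process suspends
  end if;
end process;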

In terms of simulation this is a key difference. Signals are simulated using delta delays. That means that if a new value is assigned to a signal, it is scheduled to take that value at t+1 'delta'. If new values for other signals then need to be computed (e.g. b or c changed in the first example), that will happen at t+2 delta, t+3 delta, etc. A delta is an arbitrarily small time step, just to indicate that something happens slightly later in the future.

Because all statements in a process happen at one timestamp, time can only advance when the process is left or a wait statement has been hit (unusual to use if you target hardware, especially with the free tools).

In terms of synthesis onto real hardware, either a signal or a variable in a process can result in a wire or a D flip-flop. This depends on whether the value is first written and then read (= wire) or first read and then written (= flip-flop).


I'm sure this has strong similarities to Verilog's blocking and non-blocking assignments, but I haven't programmed much Verilog, mostly read code. Both languages are very similar; VHDL is strongly typed, Verilog is loosely typed. Verilog has some unique features, but so does VHDL...
« Last Edit: June 25, 2017, 05:39:46 pm by hans »
 

Offline mark03

  • Frequent Contributor
  • **
  • Posts: 711
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #140 on: June 25, 2017, 05:55:11 pm »
For all the Python haters, yes.. you can design hardware with Python - http://www.myhdl.org :)
As if VHDL and Verilog weren't confusing enough, now we have another HDL to learn.

What advantages does MyHDL have over the other two?

All these High Level Synthesis (HLS) HDLs seem to have common threads to address these (and other) problems:

MyHDL is not HLS.  It is a bona fide HDL which just so happens to be implemented within the [very flexible] syntax of Python.  Everything you have in Verilog and VHDL you get in MyHDL too, and there is very little in MyHDL which does not map 1:1 back into the incumbent languages.

As to why MyHDL and not an incumbent HDL, I think the author would claim to have avoided some of the mistakes that were made in Verilog/VHDL, in the same way that *any* second try usually comes out better, simply because it is informed by experience.  He (Jan) is more of a VHDL guy, and that definitely shows in MyHDL, but the verbosity and archaisms many people dislike in VHDL are much reduced in MyHDL.

Another big reason: writing test benches in Python is going to be almost infinitely better than writing them in Verilog/VHDL.  You can take advantage of the Python unit-test frameworks, simulate your DSP flow using NumPy/SciPy, make actually useful plots, and so on.  I think this aspect alone would tip the scales in MyHDL's favor were it not for...

The biggest reason NOT to use MyHDL:  It's not directly supported by FPGA vendors, and never will be.  The generated Verilog/VHDL output is fine as long as you are using vanilla HDL, but as soon as you need to work with and simulate a vendor-specific hard block, it becomes a major headache.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #141 on: June 25, 2017, 09:50:59 pm »
Actually you must forget about the hardware otherwise you'll be writing way too much code.

I don't think that is 100% true - if you forget that you are working in h/w you can drop into writing code that does not map well to H/W.

Thought experiment - A design requires a module that takes a clk signal and four 8-bit numbers (a_in, b_in, c_in, d_in) and sorts them low to high, to generate four outputs (a_out, b_out, c_out and d_out). Design it with a software mindset, and then a H/W mindset.

Software is easy:
Code: [Select]
  // copy them all across
  a_out <= a_in;  b_out <= b_in;  c_out <= c_in; d_out <= d_in;
  // bubble sort them
  if(a_out > b_out) swap(a_out, b_out);
  if(b_out > c_out) swap(b_out, c_out);
  if(c_out > d_out) swap(c_out, d_out);

  if(a_out > b_out) swap(a_out, b_out);
  if(b_out > c_out) swap(b_out, c_out);

  if(a_out > b_out) swap(a_out, b_out);
  // Should now be in order
(I might have 20% more code / cycles than needed)

The H/W mindset has additional factors
- Latency - can it be done in a single cycle? how many cycles are needed?
- Speed - what will clock fastest? - fastest is most likely three cycles.
- Logic resource used
- Maximizing concurrency
- Can it efficiently scale when the need for five or more inputs inevitably comes along?

So when it comes to "which is the best way" for H/W there are more factors in play, even for as simple a task as ranking four numbers in order.



Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #142 on: June 25, 2017, 10:12:13 pm »
Just like in software, things like speed, resources and size only become relevant in corner cases, and optimising them takes a lot of time & effort. Why should you suddenly optimise all facets of an FPGA design if you have lots of gates and lots of speed?
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #143 on: June 25, 2017, 10:56:53 pm »
Why should you suddenly optimise all facets of an FPGA design if you have lots of gates and lots of speed?
Plenty of reasons, some of which may or may not apply.

- If you didn't have constraints that you need to hit (speed, power, latency, cost, size) then you wouldn't be using FPGAs, and you would do it in S/W.

- If your design is even somewhat well thought out, you know the bits you have to worry about performance-wise before you even start implementing, and you know what is fluff where you don't even have to try.

- Battery life. Making the bulk of the design more efficient is the best way to reduce power demands.

- If working on a product, the device will usually be selected well before the design is finished, and all the economics are pretty much fixed. If you are in the nice place of using 60% of the resources then you can let the design bloat. If you are using 85% or 90% then bloat might force you to use a bigger part with a compatible footprint.

- Spare resources = can add more features = better product for same price

- The easier the bulk of a design is to place and route, the more flexibility the tools have for placing and routing the toughest parts of the design.

- Changing pipeline depths late in the development process to improve timing is costly (redesign, retest, reintegrate)

- 6.73ns - The common FullHD pixel clock is 148.5MHz. If you are working on a video design you need to hit this and have a wee bit of slack.

- A sharp tool is better than a dull one
« Last Edit: June 25, 2017, 10:59:58 pm by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #144 on: June 25, 2017, 11:31:54 pm »
Thought experiment - A design requires a module that takes a clk signal and four 8-bit numbers (a_in, b_in, c_in, d_in) and sorts them low to high, to generate four outputs (a_out, b_out, c_out and d_out). Design it with a software mindset, and then a H/W mindset.

Oh, I like that. I might give that a crack in Verilog tomorrow and see where I get.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 
The following users thanked this post: hamster_nz

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #145 on: June 26, 2017, 12:42:34 am »
Thought experiment - A design requires a module that takes a clk signal and four 8-bit numbers (a_in, b_in, c_in, d_in) and sorts them low to high, to generate four outputs (a_out, b_out, c_out and d_out). Design it with a software mindset, and then a H/W mindset.

I don't think you gain a lot in terms of efficiency, but I would argue it is easier to design with a hardware mindset. You can synthesise your "software mindset" design and see how many resources it uses. You can then compare to what you would get with a "hardware mindset".

Assuming Xilinx 7-series 6-input LUTs, you would need:

- 6 modules to do 6 comparisons - 4 LUTs each = 24 LUTs. It'll take 2 layers of combinatory logic. You'll get 6 outputs from this representing the results of the comparisons

- For each 8 bit output - 6 x 2 table which converts 6 outputs from the previous layer into the 2-bit index. The 2-bit index will select which input you want to multiplex to the given output. 2 LUTs each = 8 LUTs. One layer of combinatory logic.

- For each bit of the outputs (32 bits total) a mux which uses 2-bit index from the previous layer to select one of the 4 inputs. 1 LUT each = 32 LUTs. One layer of combinatory logic.

Bottom line:

24 + 8 + 32 = 64 LUTs = 16 slices.

2 + 1 + 1 = 4 layers of combinatory logic roughly 0.7 ns each (including intra-layer routing) = 2.8 ns. I'd expect it would run fine with 4 ns clock period - 250 MHz.

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #146 on: June 26, 2017, 04:31:10 am »
Thought experiment - A design requires a module that takes a clk signal and four 8-bit numbers (a_in, b_in, c_in, d_in) and sorts them low to high, to generate four outputs (a_out, b_out, c_out and d_out). Design it with a software mindset, and then a H/W mindset.

I don't think you gain a lot in terms of efficiency, but I would argue it is easier to design with a hardware mindset. You can synthesise your "software mindset" design and see how many resources it uses. You can then compare to what you would get with a "hardware mindset".

Assuming Xilinx 7-series 6-input LUTs, you would need:

- 6 modules to do 6 comparisons - 4 LUTs each = 24 LUTs. It'll take 2 layers of combinatory logic. You'll get 6 outputs from this representing the results of the comparisons

- For each 8 bit output - 6 x 2 table which converts 6 outputs from the previous layer into the 2-bit index. The 2-bit index will select which input you want to multiplex to the given output. 2 LUTs each = 8 LUTs. One layer of combinatory logic.

- For each bit of the outputs (32 bits total) a mux which uses 2-bit index from the previous layer to select one of the 4 inputs. 1 LUT each = 32 LUTs. One layer of combinatory logic.

Bottom line:

24 + 8 + 32 = 64 LUTs = 16 slices.

2 + 1 + 1 = 4 layers of combinatory logic roughly 0.7 ns each (including intra-layer routing) = 2.8 ns. I'd expect it would run fine with 4 ns clock period - 250 MHz.

Pretty much the same idea I had - get all the comparisons out the way, then select the outputs.

I asked a software friend how they would do it. The first reply was to put an "ORDER BY" clause on the SQL query used to get the items.

The second one was along the lines of

Code: [Select]
   array items = [a_in, b_in, c_in, d_in];
   sort(items);
   a_out = items[0];
   b_out = items[1];
   c_out = items[2];
   d_out = items[3];


Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #147 on: June 26, 2017, 06:23:45 am »
I asked a software friend how they would do it. The first reply was to put an "ORDER BY" clause on the SQL query used to get the items.
That's scary on so many levels  :scared:

In an FPGA, I'd do it one of two ways depending on the required clock speed and latency.

To do it in a single cycle, I'd make use of VHDL variables, and translate your 'software mindset' example more or less directly.
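Something like this, as a rough sketch (untested; assuming numeric_std, with a_in..d_in and a_out..d_out declared as unsigned(7 downto 0) signals):
Code: [Select]
process(clk)
  variable va, vb, vc, vd, t : unsigned(7 downto 0);
begin
  if rising_edge(clk) then
    va := a_in; vb := b_in; vc := c_in; vd := d_in;
    -- bubble-sort the variables, all inside one clock
    if va > vb then t := va; va := vb; vb := t; end if;
    if vb > vc then t := vb; vb := vc; vc := t; end if;
    if vc > vd then t := vc; vc := vd; vd := t; end if;
    if va > vb then t := va; va := vb; vb := t; end if;
    if vb > vc then t := vb; vb := vc; vc := t; end if;
    if va > vb then t := va; va := vb; vb := t; end if;
    -- registered, sorted outputs
    a_out <= va; b_out <= vb; c_out <= vc; d_out <= vd;
  end if;
end process;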

If that method ended up too slow to meet the required fmax, then it would need to be pipelined. On the first clock, perform three of the compare/swap operations, store the intermediate results in internal registers, and set a flag. Then, on the second, perform the other three compare/swaps, assign the final result to the outputs, and clear the flag again.

Online Someone

  • Super Contributor
  • ***
  • Posts: 4531
  • Country: au
    • send complaints here
Re: Learning FPGAs: wrong approach?
« Reply #148 on: June 26, 2017, 07:59:16 am »
To do it in a single cycle, I'd make use of VHDL variables, and translate your 'software mindset' example more or less directly.

If that method ended up too slow to meet the required fmax, then it would need to be pipelined. On the first clock, perform three of the compare/swap operations, store the intermediate results in internal registers, and set a flag. Then, on the second, perform the other three compare/swaps, assign the final result to the outputs, and clear the flag again.
I'm trying to find the reference, but one of the big open source processor/SoC teams was using a strict coding style where the work was all done in functions, and registers were directly inferred as a discrete block with nothing else in it. Very tidy style when you're doing algorithm-intensive work.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #149 on: June 26, 2017, 08:39:31 am »
I asked a software friend how they would do it. The first reply was to put an "ORDER BY" clause on the SQL query used to get the items.
That's scary on so many levels  :scared:

In an FPGA, I'd do it one of two ways depending on the required clock speed and latency.

To do it in a single cycle, I'd make use of VHDL variables, and translate your 'software mindset' example more or less directly.

If that method ended up too slow to meet the required fmax, then it would need to be pipelined. On the first clock, perform three of the compare/swap operations, store the intermediate results in internal registers, and set a flag. Then, on the second, perform the other three compare/swaps, assign the final result to the outputs, and clear the flag again.

To pipeline speed this one up, here is how I would do it:

(a)
Compare all inputs with each other and generate 4 sets of 2 bit selection flags/words.
--- and ---  Store all 4 inputs in D-flipflop registers.

(b) The output of all 4 D-flipflop registers would feed 4 x  4:1 mux selection units, each unit receiving the 2 bit selection flags generating 4 sorted outputs.

At a basic level, (a) can be done with 4x4 'if/else' statements generating the 4 sets of 2-bit selection registers, plus 4 temporary storage registers.
(b) can be done with 4 x 'case' or 'if' statements creating the 4 sorted output registers.  There are better, more compact, advanced coding methods to achieve the same results, but this would be simple and right in your face.

With Altera FPGAs, doing it this way with 4 inputs of up to 16 bits each sorted to 4 outputs, your sorts will be delayed by 2 clocks instead of 1, but this would achieve the best reasonable fmax & you can feed in a new set of 4 numbers every single clock.  To achieve the best fmax with 32-bit numbers, or when sorting more than 4 16-bit numbers, you will need a multi-step pipeline that breaks down the magnitude of the numbers, and even the mux selection of the sorted result will need to be piped over multiple clocks.  That is due to the size of Altera's logic cells, where the FMAX seems to deteriorate badly once an operation squeezes in more than a 2x32-bit comparison, or even a mux selection, per clock.

(NOTE: this is not an example of clean coding. I chose this strategy based on experience with Altera's Quartus, knowing that the fitter will synthesize this code for top FMAX rather than for the tightest possible gate count, and I know there are many other methods to achieve the same results.)

I'm sure a hardwired ASIC could do much larger magnitude sorts at full speed in a single clock & the VHDL/Verilog code would be down to the few lines described a few posts above.
« Last Edit: June 26, 2017, 09:01:49 am by BrianHG »
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #150 on: June 26, 2017, 09:30:46 am »
To pipeline speed this one up, here is how I would do it:

(a)
Compare all inputs with each other and generate 4 sets of 2 bit selection flags/words.
--- and ---  Store all 4 inputs in D-flipflop registers.

(b) The output of all 4 D-flipflop registers would feed 4 x  4:1 mux selection units, each unit receiving the 2 bit selection flags generating 4 sorted outputs.

I think that's a better algorithm, thanks.

My method requires three comparators on the first cycle, three more on the second cycle, and the inputs to some depend on the outputs of others, so there's an extra propagation delay to consider, which might limit fmax.

Your method also requires six comparators, but all their inputs are known at the start of the first cycle, so they can operate faster.

You also require multiplexers, but I'm willing to bet they're faster than logical comparators.

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #151 on: June 26, 2017, 10:02:10 am »
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #152 on: June 26, 2017, 11:32:10 am »
A purely combinatorial version of the sorting problem in Verilog. Code actually simulated against random numbers and it works.

Because it's purely combinatorial one hopes that a decent synthesizer would mush this down to the minimum possible number of gates. If you want to count up discrete circuit elements it's 12 8-bit comparators, 12 1-bit adders, 16 2-bit comparators, 128 2-input AND gates and 8 4-bit OR gates.



Code: [Select]
module sorter (input wire [7:0] A, B, C, D, output wire [7:0] E, F, G, H);

wire AgtB = (A > B);
wire AgtC = (A > C);
wire AgtD = (A > D);
wire [1:0] Apos = (AgtB + AgtC + AgtD); // population count of how many other inputs this input is greater than

wire BgtA = (B > A);
wire BgtC = (B > C);
wire BgtD = (B > D);
wire [1:0] Bpos = (BgtA + BgtC + BgtD);

wire CgtA = (C > A);
wire CgtB = (C > B);
wire CgtD = (C > D);
wire [1:0] Cpos = (CgtA + CgtB + CgtD);

wire DgtA = (D > A);
wire DgtB = (D > B);
wire DgtC = (D > C);
wire [1:0] Dpos = (DgtA + DgtB + DgtC);

// For all you VHDL-only crowd the {8{aBit}} 'widens' the single bit to 8 bits
assign E = A & {8{Apos==3}} | B & {8{Bpos==3}} | C & {8{Cpos==3}} | D & {8{Dpos==3}};
assign F = A & {8{Apos==2}} | B & {8{Bpos==2}} | C & {8{Cpos==2}} | D & {8{Dpos==2}};
assign G = A & {8{Apos==1}} | B & {8{Bpos==1}} | C & {8{Cpos==1}} | D & {8{Dpos==1}};
assign H = A & {8{Apos==0}} | B & {8{Bpos==0}} | C & {8{Cpos==0}} | D & {8{Dpos==0}};

endmodule

Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #153 on: June 26, 2017, 11:35:51 am »
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?  :)
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #154 on: June 26, 2017, 11:41:41 am »
Spent a night watching Doctor Who and fiddling with code.

All solutions are single-cycle, and outputs are registered, constrained for 200MHz. Results:

1) Bubble sort 
   96.833 MHZ
   10.327 ns
   124 LUTs

2) A bit like a shell sort -
  148.65 MHZ
  6.727ns
  105 LUTs


3) H/W optimized design (six tests to index a lookup table, that is then used to MUX the outputs), as per NorthGuy -
   234.19 MHz 
   4.027ns
   61 LUTs
 
So the last design is twice as fast and well under half the size, but it took 4x longer to write :-)
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #155 on: June 26, 2017, 11:49:02 am »
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?  :)
No, it lets the synthesizer deal with the problem. You might be surprised by the results.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #156 on: June 26, 2017, 12:06:06 pm »
Spent a night watching Doctor Who and fiddling with code.

All solutions are single-cycle, and outputs are registered, constrained for 200MHz. Results:

1) Bubble sort 
   96.833 MHZ
   10.327 ns
   124 LUTs

2) A bit like a shell sort -
  148.65 MHZ
  6.727ns
  105 LUTs


3) H/W optimized design (six tests to index a lookup table, that is then used to MUX the outputs), as per NorthGuy -
   234.19 MHz 
   4.027ns
   61 LUTs
 
So the last design is twice as fast and well under half the size, but it took 4x longer to write :-)

The 2 stage HW optimized recommendation was not NorthGuy, it was me BrianHG...
As for the longer writes of my optimized designs: after making 1080p video mixers and filters on really old, slow Cyclone 1 devices a decade ago, with a buggy, crashing Quartus at the time and slow compiles, you could imagine my frustrations.  But to get such old FPGAs running 2-channel 30-bit color at 148.5MHz with simple DDR ram, you'd better believe the ingenious chunks of Verilog I created were as compact & as fast as can be, without having to resort to AHDL and without any special Altera functions other than the PLL clock function block and their pipelined multiply/add and dual-port ram megafunctions.
« Last Edit: June 26, 2017, 12:28:23 pm by BrianHG »
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #157 on: June 26, 2017, 12:09:45 pm »
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?  :)
No, it lets the synthesizer deal with the problem. You might be surprised by the results.

I doubt the synthesizer is going to "write a VHDL function" for you. Sounds like the Montgomery Scott solution - [Fx: pick up mouse, use as microphone] "Computer: write me a VHDL function that sorts a variable number of 8 bit numbers, and pour me a nice single malt in the replicator."
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #158 on: June 26, 2017, 02:13:10 pm »
To pipeline speed this one up, here is how I would do it:

(a)
Compare all inputs with each other and generate 4 sets of 2 bit selection flags/words.
--- and ---  Store all 4 inputs in D-flipflop registers.

(b) The output of all 4 D-flipflop registers would feed 4 x  4:1 mux selection units, each unit receiving the 2 bit selection flags generating 4 sorted outputs.


About pipelining.

The speed of combinatorial logic depends on the number of layers. Simple design has only one layer:

IN->LUT->OUT

Of course, there may be many parallel paths like that, but all LUTs are fed directly from the input. This makes it the fastest.

Then you introduce LUTs which depends on the values produced by other LUTs, like this:

IN->LUT->LUT->OUT

Here you have two layers of LUTs. You need to wait until the LUTs of the first layer settle and provide stable outputs to the LUTs of the second layer. Then you must wait for the LUTs of the second layer. Therefore, it takes longer. Each layer adds roughly 0.7ns on Xilinx.

The design we're discussing has 4 layers:

IN->LUT->LUT->LUT->LUT->OUT

Only the longest path affects the overall speed. For example, in this design there's a shorter path which goes from an input to the final MUX. It only has one LUT. It could be done faster, but the presence of longer paths doesn't let the design run faster. The speed is roughly determined by the number of layers on the longest path.

Any combinatorial design can be pipelined.

You don't do it as AndyC suggested by splitting things which already can run in parallel. You do it by inserting flip-flops between combinatorial layers:

IN->LUT->LUT->FF->LUT->LUT->OUT

Now the clock doesn't need to wait for all four layers to complete. Once two layers are done, the flip-flop can clock and remember the intermediate result. On the next clock, the next two layers of LUTs finish the job. You've turned a 4-layer design into a 2-layer design, but now there's one clock of extra delay.
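A generic sketch of that flip-flop insertion (nothing to do with the sorter specifically; made-up signals, numeric_std assumed). Two products and their sum are split across two register stages, so y lags the inputs by two clocks:
Code: [Select]
-- declarations:
signal a, b, c, d : unsigned(7 downto 0);
signal p1, p2     : unsigned(15 downto 0);
signal y          : unsigned(16 downto 0);

-- clocked process:
process(clk)
begin
  if rising_edge(clk) then
    -- stage 1: first layers of logic, captured in flip-flops
    p1 <= a * b;
    p2 <= c * d;
    -- stage 2: remaining layer, fed from the stage-1 registers
    y <= resize(p1, y'length) + resize(p2, y'length);
  end if;
end process;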

One flip-flop must be inserted in every path, be it a simple wire or a LUT.

To maximize the clock speed, you need to minimize the number of layers. This can be done by inserting flip-flops exactly in the middle of LUT chain. In the example above, two layers go before the flip-flop and two layers go after it.

You don't do it as BrianHG did:

IN->LUT->LUT->LUT->FF->LUT->OUT

In his design, he put 3 layers (2 layers of comparison and one layer to generate the MUX inputs) before the flip-flops, and only one layer (the MUX) after the flip-flops. If you do this, the first stage is a 3-layer design and the second stage a 1-layer design. Since they're clocked by the same clock, the overall design is still 3-layer. It is faster than the 4-layer design, but slower than a 2-layer design.

To get 2 layer design you need this:

IN->LUT->LUT->FF->LUT->LUT->OUT

Which means the 2 layers of comparisons go before the flip-flop, and everything else goes after, as this:

Stage 1. The 6 bits of comparison results are saved using 6 flip-flops. Since flip-flops must go into every path, we also need 32 flip-flops to save the original inputs.

Stage 2. MUX input is generated from comparison results (one layer) and MUX selects the appropriate input (second layer).

This produces fast 2-layer design.

We can pipeline even further:

IN->LUT->FF->LUT->FF->LUT->FF->LUT->OUT

Now we've got a one-layer design, which is as fast as it gets, but you need to wait 3 extra clocks to get the result. Also, this will be tedious to program - you'll have to pipeline the comparison operations themselves.

« Last Edit: June 26, 2017, 02:16:33 pm by NorthGuy »
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #159 on: June 26, 2017, 02:24:26 pm »
You don't do it as AndyC suggested by splitting things which already can run in parallel.

Just for the sake of clarity, what I had in mind was an implementation of the 'bubble sort' method, not the 'rank-then-multiplex' method:

- on the first clock, perform the first three compare/swap operations (a-b, b-c, c-d). The outcome of each of these depends on the previous operation, so it takes 3 levels' worth of delay time

- on the second clock, perform the second set of three compare/swaps (a-b, b-c, a-b) on the intermediate results which were stored after the first clock.

The overall effect is to take a logical operation that would have needed 6 levels' worth of delay and split it into two operations, each of which takes only 3. It would, of course, be possible to split this into a 6-stage pipe, each stage of which does just one compare/swap, and that might not be a bad implementation at all if you don't mind the latency or storage requirement.
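In VHDL the two-clock version might look roughly like this (an untested sketch: 'quad' is assumed to be an array of four unsigned 8-bit values, and in_array, mid_array and sorted_array are signals of that type):

Code: [Select]
process(clk)
    -- plain compare-and-swap on two 8-bit values
    procedure cswap(variable x, y : inout unsigned(7 downto 0)) is
        variable t : unsigned(7 downto 0);
    begin
        if x > y then
            t := x;  x := y;  y := t;
        end if;
    end procedure;
    variable v : quad;
begin
    if rising_edge(clk) then
        -- clock 1: first three compare/swaps (a-b, b-c, c-d)
        v := in_array;
        cswap(v(0), v(1));
        cswap(v(1), v(2));
        cswap(v(2), v(3));
        mid_array <= v;

        -- clock 2: second three compare/swaps (a-b, b-c, a-b) on the
        -- intermediate result stored on the previous clock
        v := mid_array;
        cswap(v(0), v(1));
        cswap(v(1), v(2));
        cswap(v(0), v(1));
        sorted_array <= v;
    end if;
end process;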

That's not splitting things that can run in parallel... is it?

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #160 on: June 26, 2017, 02:42:33 pm »
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?  :)
No, it lets the synthesizer deal with the problem. You might be surprised by the results.

I doubt the synthesizer is going to "write a VHDL function" for you. Sounds like the Montgomery Scott solution - [Fx: pick up mouse, use as microphone] "Computer: write me a VHDL function that sorts a variable number of 8 bit numbers, and pour me a nice single malt in the replicator."
Duhhu  :palm: .  You are supposed to write the VHDL function yourself but let the synthesizer deal with the actual implementation.
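Something along these lines, as a rough sketch (8-bit unsigned elements assumed; the type and function go in a package):

Code: [Select]
type u8_array is array (natural range <>) of unsigned(7 downto 0);

function sort(a : u8_array) return u8_array is
    variable v : u8_array(a'range) := a;
    variable t : unsigned(7 downto 0);
begin
    for j in v'low to v'high - 1 loop
        for i in v'low to v'high - 1 loop
            if v(i) > v(i + 1) then
                t := v(i);
                v(i) := v(i + 1);
                v(i + 1) := t;
            end if;
        end loop;
    end loop;
    return v;
end function;

Then inside a clocked process you just write sorted_array <= sort(in_array); with however many elements you have, and the synthesizer unrolls the loops into whatever logic it sees fit.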
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #161 on: June 26, 2017, 02:54:16 pm »
Regarding sorting: write a VHDL function with a variable number of inputs which does the sorting in a for-loop.

Wouldn't it have been quicker to write "Wave magic wand." or "Assign task to minion."?  :)
No, it lets the synthesizer deal with the problem. You might be surprised by the results.

I doubt the synthesizer is going to "write a VHDL function" for you. Sounds like the Montgomery Scott solution - [Fx: pick up mouse, use as microphone] "Computer: write me a VHDL function that sorts a variable number of 8 bit numbers, and pour me a nice single malt in the replicator."
Duhhu  :palm: .  You are supposed to write the VHDL function yourself but let the synthesizer deal with the actual implementation.

Indeed one is, but you just waved your hand and regally said 'Let it be done', that's what I'm poking fun at.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #162 on: June 26, 2017, 03:19:11 pm »
Spent a night watching Doctor Who and fiddling with code.

All solutions are single-cycle, and outputs are registered, constrained for 200MHz. Results:

1) Bubble sort 
   96.833 MHZ
   10.327 ns
   124 LUTs

2) A bit like a shell sort -
  148.65 MHZ
  6.727ns
  105 LUTs


3) H/W optimized design (six tests to index a lookup table, that is then used to MUX the outputs), as per NorthGuy -
   234.19 MHz 
   4.027ns
   61 LUTs
 
So the last design is twice as fast and well under half the size, but it took 4x longer to write :-)
I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl
Result: 73 LUTs when optimised for speed and 70 LUTs when optimised for area (Spartan6)

The result speaks for itself. The synthesizer does a way better job than off-the-cuff hardware-like implementations in HDL, so just describe the problem and let the synthesizer deal with it. These discussions remind me of the endless C versus assembly arguments.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #163 on: June 26, 2017, 04:05:10 pm »
I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl
Result: 73 LUTs when optimised for speed and 70 LUTs when optimised for area (Spartan6)

What clock speed are you getting with this?
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #164 on: June 26, 2017, 04:08:56 pm »
I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl
Result: 73 LUTs when optimised for speed and 70 LUTs when optimised for area (Spartan6)
What clock speed are you getting with this?
That depends entirely on the FPGA so I didn't include that.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #165 on: June 26, 2017, 04:28:26 pm »
That depends entirely on the FPGA so I didn't include that.

Please tell us.

LUTs also depend on the FPGA. Spartan-6 has 6-input LUTs. Others have 4-input LUTs, so you would need a lot more of them for the same design.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #166 on: June 26, 2017, 05:00:56 pm »
That depends entirely on the FPGA so I didn't include that.
Please tell us.

LUTs also depend on the FPGA. Spartan-6 has 6-input LUTs. Others have 4-input LUTs, so you would need a lot more of them for the same design.
On a Spartan6 speed grade 2 device I can go slightly over 100MHz (while making sure all paths are constrained by adding extra input and output registers). If I enable 'register balancing' (more or less automatic pipelining) I can get it to run at over 400MHz. Both frequencies come from place&routed designs.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Yansi

  • Super Contributor
  • ***
  • Posts: 3893
  • Country: 00
  • STM32, STM8, AVR, 8051
Re: Learning FPGAs: wrong approach?
« Reply #167 on: June 26, 2017, 05:39:06 pm »
I am currently reading this tutorial, which is making a lot more sense to me so far...

Thank you very much for that book. It might be really helpful for me, a totally dumb CPLD/FPGA beginner who spent all of his previous life with sequential MCUs.  :)
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #168 on: June 26, 2017, 05:40:02 pm »
On a Spartan6 speed grade 2 device I can go slightly over 100MHz (while making sure all paths are constrained by adding extra input and output registers). If I enable 'register balancing' (more or less automatic pipelining) I can get it to run at over 400MHz. Both frequencies come from place&routed designs.

This is a similar result to what hamster_nz has posted. The "hardware mindset" produces about 2x the speed for combinatorial logic compared to the "software mindset" optimized with tools. This is about the same speed difference as the difference between Xilinx UltraScale+ and Spartan-6.

I'm surprised that the tools didn't do a better job. They take so much time to get from code to bitstream. What the hell are they doing all this time? I expected their optimizations to be nearly perfect.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #169 on: June 26, 2017, 05:49:31 pm »
On a Spartan6 speed grade 2 device I can go slightly over 100MHz (while making sure all paths are constrained by adding extra input and output registers). If I enable 'register balancing' (more or less automatic pipelining) I can get it to run at over 400MHz. Both frequencies come from place&routed designs.
This is a similar result to what hamster_nz has posted. The "hardware mindset" produces about 2x the speed for combinatorial logic compared to the "software mindset" optimized with tools. This is about the same speed difference as the difference between Xilinx UltraScale+ and Spartan-6.
Without knowing which FPGA Hamster_nz targeted and what synthesis settings he used you can't make this comparison. So where do you get a 2x speed improvement from? Also 400MHz is more than 234MHz so I'd say the 'software approach' is ahead for now.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #170 on: June 26, 2017, 06:36:23 pm »
Without knowing which FPGA Hamster_nz targeted and what synthesis settings he used you can't make this comparison. So where do you get a 2x speed improvement from?

Whatever he used was the same FPGA and he's got roughly 2x difference. Your numbers are similar to his, and why wouldn't they be - you did the same thing.

Also 400MHz is more than 234MHz so I'd say the 'software approach' is ahead for now.

As I explained few posts ago, you can pipeline any pure combinatorial design.

The speed of the design depends on the number of combinatorial layers. You can either run all layers in a single clock - then your clock speed gets limited. Or you can pipeline the layers (by inserting flip-flops between them). If completely pipelined, the clock speed will be roughly the same for any design, but there'll be one extra clock of delay for every combinatorial layer you remove by pipelining.

It is meaningless to compare a pipelined design with a purely combinatorial design in terms of clock speed (or in terms of clock cycles, for that matter).

 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #171 on: June 26, 2017, 07:06:17 pm »
The speed of the design depends on the number of combinatorial layers. You can either run all layers in a single clock - then your clock speed gets limited. Or you can pipeline the layers (by inserting flip-flops between them). If completely pipelined, the clock speed will be roughly the same for any design, but there'll be one extra clock of delay for every combinatorial layer you remove by pipelining.

It is meaningless to compare a pipelined design with a purely combinatorial design in terms of clock speed (or in terms of clock cycles, for that matter).

It would be helpful if you didn't use 'speed' for both 'latency' and 'throughput'; what you're trying to say would be much clearer if you used the two separate terms.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #172 on: June 26, 2017, 07:09:27 pm »
Spent a night watching Doctor Who and fiddling with code.

All solutions are single-cycle, and outputs are registered, constrained for 200MHz. Results:

1) Bubble sort 
   96.833 MHZ
   10.327 ns
   124 LUTs

2) A bit like a shell sort -
  148.65 MHZ
  6.727ns
  105 LUTs


3) H/W optimized design (six tests to index a lookup table, that is then used to MUX the outputs), as per NorthGuy -
   234.19 MHz 
   4.027ns
   61 LUTs
 
So the last design is twice as fast and well under half the size, but it took 4x longer to write :-)
I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl
Result: 73 LUTs when optimised for speed and 70 LUTs when optimised for area (Spartan6)

The result speaks for itself. The synthesizer does a way better job than off-the-cuff hardware-like implementations in HDL, so just describe the problem and let the synthesizer deal with it. These discussions remind me of the endless C versus assembly arguments.

TLDR: Can you check that the result is actually a LUT count, and not an occupied-slice count?

Really interesting! Your results literally kept me awake at night.... :)

A 4-element bubble sort is six identical compare-then-maybe-swap stages. Each stage requires an 8-bit comparison and two 2:1 8-bit MUXes - around 4+2*8 = 20 LUTs. That checks out with my numbers, as 124 is close to 6 x 20. The second method only uses five of these stages, hence it uses about 5/6 of the resources.

Performance-wise the critical path of the bubble sort is through all six compare-then-maybe-swap stages, and in my second method it is only four stages, hence the second method clocking around 50% faster.

The final method uses six 8-bit compares, a 32x8-bit memory, and four 8-bit 4:1 MUXes, so should use around 6*4+8+32 = 64 LUTs. It gets its efficiency by having the pre-computed (and somewhat error prone) values in the 32x8 memory. It removes some of the work required and everything fits nicely with a LUT-6 architecture. As the critical path is only through a comparison and two LUTs, it should be about 3x faster than the bubble sort (as it may well be, if I constrain it harder).

If your method is a bubble sort (and I don't doubt it is), and does use 73 LUTs (which I slightly doubt), then it has taken less than 11 LUTs to do what should take at least 20, and I want to know why!

If it is a slice count, then the LUT count is most likely around the 120 number that I would expect, and my universe is back in balance, and I will sleep well.

The performance is also pretty good for what is a generation older silicon, but not so good that I suspect a bug.
« Last Edit: June 26, 2017, 07:12:00 pm by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #173 on: June 26, 2017, 07:33:20 pm »
Actually my earlier LUT number is for an Artix7. Somehow ISE didn't catch I wanted to use a Spartan 6! The other numbers (speed) are for the Spartan6 design. The Spartan 6 design uses 79 Slice LUTs and occupies 33 slices (optimised for speed). I think your reasoning goes off the trail because the synthesizer turns the problem into logic equations which are then minimized keeping the architecture of the FPGA in mind. This means that some of the hardware you describe is probably combined in a way you can't see when designing 'in hardware'. I think it is very similar to a C compiler optimising for pre-fetching and caching.
« Last Edit: June 26, 2017, 07:34:53 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #174 on: June 26, 2017, 07:48:22 pm »
I just ran this VHDL software approach bubble sort through the Xilinx synthesizer using 4 inputs each 8 bits wide:
https://stackoverflow.com/questions/42420983/bubble-sort-in-vhdl
I have a question about that code.

This:-
Code: [Select]
        if rising_edge(clk) then
            for j in bubble'LEFT to bubble'RIGHT - 1 loop
                for i in bubble'LEFT to bubble'RIGHT - 1 - j loop
                    if unsigned(var_array(i)) > unsigned(var_array(i + 1)) then
                        temp := var_array(i);
                        var_array(i) := var_array(i + 1);
                        var_array(i + 1) := temp;
                    end if;
                end loop;
            end loop;
            sorted_array <= var_array;
        end if;

unfolds into multiple iterations (with different array indexes) of this, right?
Code: [Select]
if unsigned(var_array(0)) > unsigned(var_array(1)) then
                        temp := var_array(0);
                        var_array(0) := var_array(1);
                        var_array(1) := temp;

So we have a comparator whose output determines whether the two array entries are either 1. swapped, or 2. left alone. This is all happening during one clock cycle, and the ':=' means that the operation occurs immediately, i.e. the logic is not clocked but simply runs as fast as it can, right? What stops the values in temp, var_array(0) and var_array(1) from continuously cycling around until the comparator changes state?

     
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #175 on: June 26, 2017, 08:13:31 pm »
It would be helpful if you didn't use 'speed' for both 'latency' and 'throughput'; what you're trying to say would be much clearer if you used the two separate terms.

Sorry for the confusion. I'll try to re-write in your terms.

When running in pure combinatorial form (in one clock cycle), a design with more combinatorial layers will require a longer clock period and a lower clock frequency. Thus, it will have lower throughput and longer latency. The latency will be equal to one clock period.

If fully pipelined, any design will have the same maximum clock frequency and the same throughput. However, a design with more combinatorial layers will have longer latency. Its latency will be equal to the number of combinatorial layers multiplied by the clock period.
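Or, as rough formulas (ignoring routing and flip-flop overhead):

Code: [Select]
Tclk(min)  ~  layers_per_stage x Tlut     (roughly 0.7 ns per layer on Xilinx, as above)
Fmax       =  1 / Tclk(min)
Throughput =  1 result per clock, once the pipeline is full
Latency    =  number_of_stages x Tclk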

Is this more understandable?
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #176 on: June 26, 2017, 09:04:07 pm »
What stops the values in temp, var_array(0) and var_array(1) from continuously cycling around until the comparator changes state?
It's worth taking a moment to re-iterate: VHDL code is *not* a sequence of instructions that are executed one at a time in order by the target device.

When you write an algorithm using variables, think of it as saying "Hey, compiler! I want you to synthesize some logic for me. Here's a method which describes what outputs I want for a given set of inputs. Now *you* go off and work out the best set of logic gates to give me the outputs I want, OK?"

All that this nested loop is doing, is providing a way - some way - of determining what the outputs should be for a given set of inputs. The compiler executes the loops, works out what the eventual relationship will be between input and output on a given clock edge, then programs this into look-up tables.
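For the 4-element case, what the compiler effectively evaluates on each clock edge is the fully unrolled sequence below (a sketch, where cswap stands for the compare/temp/swap block from the code above):

Code: [Select]
v := in_array;
cswap(v(0), v(1));  cswap(v(1), v(2));  cswap(v(2), v(3));  -- pass 1
cswap(v(0), v(1));  cswap(v(1), v(2));                      -- pass 2
cswap(v(0), v(1));                                          -- pass 3
sorted_array <= v;   -- only this final value ever reaches a flip-flop

Nothing 'cycles around' because the intermediate values of v exist only as nodes in the eventual logic network, not as storage.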

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #177 on: June 26, 2017, 09:16:53 pm »
Actually my earlier LUT number is for an Artix7. Somehow ISE didn't catch I wanted to use a Spartan 6! The other numbers (speed) are for the Spartan6 design. The Spartan 6 design uses 79 Slice LUTs and occupies 33 slices (optimised for speed). I think your reasoning goes off the trail because the synthesizer turns the problem into logic equations which are then minimized keeping the architecture of the FPGA in mind. This means that some of the hardware you describe is probably combined in a way you can't see when designing 'in hardware'. I think it is very similar to a C compiler optimising for pre-fetching and caching.
Can you post/PM me the code you are using?

The code I saw on Stack Exchange only made one pass over the items per clock cycle, so to sort four items would take three cycles.

It would also explain the low LUT usage
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #178 on: June 26, 2017, 09:30:41 pm »
Actually my earlier LUT number is for an Artix7. Somehow ISE didn't catch I wanted to use a Spartan 6! The other numbers (speed) are for the Spartan6 design. The Spartan 6 design uses 79 Slice LUTs and occupies 33 slices (optimised for speed). I think your reasoning goes off the trail because the synthesizer turns the problem into logic equations which are then minimized keeping the architecture of the FPGA in mind. This means that some of the hardware you describe is probably combined in a way you can't see when designing 'in hardware'. I think it is very similar to a C compiler optimising for pre-fetching and caching.
Can you post/PM me the code you are using?

The code I saw on Stack Exchange only made one pass over the items per clock cycle, so to sort four items would take three cycles.

It would also explain the low LUT usage

Here it is but it uses a nested loop so it seems to me it is doing a full bubble-sort.

Code: [Select]
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.numeric_std.all;

package array_type is
    type bubble is array (0 to 3) of unsigned(7 downto 0);
end package;

library ieee;
use ieee.std_logic_1164.all;
use work.array_type.all;

entity bubblesort is
    port (
        signal clk:             in  std_logic;
        signal reset:           in  std_logic;
        signal in_array_in:        in  bubble;
        signal sorted_array_out:    out bubble
    );
end entity;


architecture foo of bubblesort is
    use ieee.numeric_std.all;

--signals to allow optimal routing
    signal in_array: bubble;
    signal sorted_array: bubble;

begin


BSORT:
    process (clk)
        variable temp:      unsigned (7 downto 0);
        variable var_array:     bubble;       
    begin

--move inside if rising_edge... to catch the whole thing inside the clock constraint and
--not depend on routing delays between input & output pads.
in_array<=in_array_in;
sorted_array_out <=sorted_array;
--

        var_array := in_array;
        if rising_edge(clk) then

            for j in bubble'LEFT to bubble'RIGHT - 1 loop
                for i in bubble'LEFT to bubble'RIGHT - 1 - j loop
                    if var_array(i) > var_array(i + 1) then
                        temp := var_array(i);
                        var_array(i) := var_array(i + 1);
                        var_array(i + 1) := temp;
                    end if;
                end loop;
            end loop;
            sorted_array <= var_array;
        end if;
    end process;
end architecture foo;
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #179 on: June 26, 2017, 09:46:57 pm »
Am I wrong (I'm a Verilog guy, not VHDL), but isn't this bubble sort sorting 4 variables 'array (0 to 3)' only 3 bit 'of unsigned(7 downto 0)' ?
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #180 on: June 26, 2017, 09:50:11 pm »
Am I wrong (I'm a Verilog guy, not VHDL), but isn't this bubble sort sorting 4 variables 'array (0 to 3)' only 3 bit 'of unsigned(7 downto 0)' ?
No, it is sorting an array with 4 elements where each element is an 8 bit unsigned int.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: BrianHG

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #181 on: June 27, 2017, 12:38:57 am »
Am I wrong (I'm a Verilog guy, not VHDL), but isn't this bubble sort sorting 4 variables 'array (0 to 3)' only 3 bit 'of unsigned(7 downto 0)' ?
No, it is sorting an array with 4 elements where each element is an 8 bit unsigned int.
Sorry, I'm used to seeing 255 downto 0, as in my default internal RAM bus size.  It's been over 5 years since I've done any HDL.
Or in Verilog, I skimp out and just use wire[255:0] or reg[255:0] for a single 256-bit word/bus, or just 'integer' and let the compiler work out how many bits it needs to be to finalize the logic...
Back when I started in 2004, Quartus' internal compiler was very crappy at just decoding a bus and would crash on anything too complex, so at the time the recommendation was to use a third-party HDL compiler with Quartus, or to learn AHDL, Altera's hardware description language.  This is why my coding style leans more towards 'assembly' rather than letting the compiler work for you as in 'C' coding.
« Last Edit: June 27, 2017, 12:46:43 am by BrianHG »
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #182 on: June 27, 2017, 02:37:22 am »
It's worth taking a moment to re-iterate: VHDL code is *not* a sequence of instructions that are executed one at a time in order by the target device.
Yes, and that's where I was confused because there seemed to be a circular assignment. Now I can see that while ':=' assignments are immediate, statements using them are executed sequentially by the compiler as it builds up the logical relationship between them.

Quote
All that this nested loop is doing, is providing a way - some way - of determining what the outputs should be for a given set of inputs. The compiler executes the loops, works out what the eventual relationship will be between input and output on a given clock edge, then programs this into look-up tables.
Right. So after examining all the statements sequentially it figures out what logic is required to bubble sort the entire array in a single clock cycle?
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #183 on: June 27, 2017, 06:26:43 am »
Right. So after examining all the statements sequentially it figures out what logic is required to bubble sort the entire array in a single clock cycle?

Almost. It figures out what logic is required to *sort* the entire array in a single clock - not necessarily *bubble* sort.

The details of how the outputs were derived from each possible set of inputs are lost. The compiler executes the loop, builds up a table of outputs vs inputs, then uses that table to generate the necessary logic.

That's not to say it can't make use of the original code to get some hints as to how the logic might work, but it doesn't have to.

There might even be an interesting exercise here. For example, a bubble sort algorithm to sort 'n' elements has to do (n-1) compare/swap operations on the first pass, then (n-2) on the second, and (n-3) on the third, and so on. In a software implementation, shortening each subsequent pass by one element is a trivial and worthwhile optimisation. In VHDL, though, it really shouldn't make any difference at all whether this is done, because the outcome of the nested loop is exactly the same whether each subsequent pass gets shortened or not.

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #184 on: June 27, 2017, 09:35:45 am »
Actually my earlier LUT number is for an Artix7. Somehow ISE didn't catch I wanted to use a Spartan 6! The other numbers (speed) are for the Spartan6 design. The Spartan 6 design uses 79 Slice LUTs and occupies 33 slices (optimised for speed). I think your reasoning goes off the trail because the synthesizer turns the problem into logic equations which are then minimized keeping the architecture of the FPGA in mind. This means that some of the hardware you describe is probably combined in a way you can't see when designing 'in hardware'. I think it is very similar to a C compiler optimising for pre-fetching and caching.
Can you post/PM me the code you are using?

The code I saw on Stack Exchange only made one pass over the items per clock cycle, so to sort four items would take three cycles.

It would also explain the low LUT usage

Here it is but it uses a nested loop so it seems to me it is doing a full bubble-sort.

Code: [Select]
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.numeric_std.all;

package array_type is
    type bubble is array (0 to 3) of unsigned(7 downto 0);
end package;

library ieee;
use ieee.std_logic_1164.all;
use work.array_type.all;

entity bubblesort is
    port (
        signal clk:             in  std_logic;
        signal reset:           in  std_logic;
        signal in_array_in:        in  bubble;
        signal sorted_array_out:    out bubble
    );
end entity;


architecture foo of bubblesort is
    use ieee.numeric_std.all;

--signals to allow optimal routing
    signal in_array: bubble;
    signal sorted_array: bubble;

begin


BSORT:
    process (clk)
        variable temp:      unsigned (7 downto 0);
        variable var_array:     bubble;       
    begin

--move inside if rising_edge... to catch the whole thing inside the clock constraint and
--not depend on routing delays between input & output pads.
in_array<=in_array_in;
sorted_array_out <=sorted_array;
--

        var_array := in_array;
        if rising_edge(clk) then

            for j in bubble'LEFT to bubble'RIGHT - 1 loop
                for i in bubble'LEFT to bubble'RIGHT - 1 - j loop
                    if var_array(i) > var_array(i + 1) then
                        temp := var_array(i);
                        var_array(i) := var_array(i + 1);
                        var_array(i + 1) := temp;
                    end if;
                end loop;
            end loop;
            sorted_array <= var_array;
        end if;
    end process;
end architecture foo;

Got back home and tested it, using the same testbench - a 32-bit counter feeding the inputs, outputting to pins.

With the Vivado default Strategy & PerfOptimized_High:
Code: [Select]
1. Utilization by Hierarchy
---------------------------

+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
|    Instance    |     Module     | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | DSP48 Blocks |
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
| top_sort_4     |          (top) |        125 |        125 |       0 |    0 |  96 |      0 |      0 |            0 |
|   (top_sort_4) |          (top) |          1 |          1 |       0 |    0 |  64 |      0 |      0 |            0 |
|   uut          | sort_4_wrapper |        124 |        124 |       0 |    0 |  32 |      0 |      0 |            0 |
|     uut        |     bubblesort |        124 |        124 |       0 |    0 |  32 |      0 |      0 |            0 |
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
Timing is 10.563ns / 94.67 MHz (when constrained for 200MHz)

With the Vivado "Area Optimized" Strategy:
Code: [Select]
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
|    Instance    |     Module     | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | DSP48 Blocks |
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
| top_sort_4     |          (top) |        109 |        109 |       0 |    0 |  96 |      0 |      0 |            0 |
|   (top_sort_4) |          (top) |          1 |          1 |       0 |    0 |  64 |      0 |      0 |            0 |
|   uut          | sort_4_wrapper |        108 |        108 |       0 |    0 |  32 |      0 |      0 |            0 |
|     uut        |     bubblesort |        108 |        108 |       0 |    0 |  32 |      0 |      0 |            0 |
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
Interestingly, timing is a slightly faster 10.205 ns / 97.99 MHz (when constrained for 200MHz)

"sort_4_wrapper.vhd" just makes the interface compatible with the interface I used at the top level:

Code: [Select]
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

use work.array_type.all;


entity sort_4_wrapper is
    Port ( clk   : in STD_LOGIC;
           a_in  : in STD_LOGIC_VECTOR (7 downto 0);
           b_in  : in STD_LOGIC_VECTOR (7 downto 0);
           c_in  : in STD_LOGIC_VECTOR (7 downto 0);
           d_in  : in STD_LOGIC_VECTOR (7 downto 0);
           a_out : out STD_LOGIC_VECTOR (7 downto 0);
           b_out : out STD_LOGIC_VECTOR (7 downto 0);
           c_out : out STD_LOGIC_VECTOR (7 downto 0);
           d_out : out STD_LOGIC_VECTOR (7 downto 0));
end sort_4_wrapper;

architecture Behavioral of sort_4_wrapper is
    signal in_array_in : bubble;
    component bubblesort is
    port (
        signal clk              : in  std_logic;
        signal reset            : in  std_logic;
        signal in_array_in      : in  bubble;
        signal sorted_array_out : out bubble
    );
    end component;
    signal sorted_array_out : bubble;
begin

    in_array_in(0)   <= unsigned(a_in);
    in_array_in(1)   <= unsigned(b_in);
    in_array_in(2)   <= unsigned(c_in);
    in_array_in(3)   <= unsigned(d_in);
   
uut: bubblesort port map (
        clk              => clk,
        reset            => '0',
        in_array_in      => in_array_in,
        sorted_array_out => sorted_array_out);
    a_out  <= std_logic_vector(sorted_array_out(0));
    b_out  <= std_logic_vector(sorted_array_out(1));
    c_out  <= std_logic_vector(sorted_array_out(2));
    d_out  <= std_logic_vector(sorted_array_out(3));
end Behavioral;

So now I am at a loss - I can't recreate your results. I get exactly what my unrolled bubble sort does (which is what I expected). Do you have any hints as to what I have missed?

What are you using as the 'source' for your inputs? As mentioned, I'm just using a 32-bit counter, and the outputs just go to external pins. It was the easiest way to ensure that nothing can get optimized away...

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #185 on: June 27, 2017, 09:54:29 am »
I have I/O pins at the inputs & outputs. It might be that the synthesizer from ISE 14.7 is better than the one from Vivado. Last news I heard is that Vivado's synthesizer isn't quite there yet. Besides that, I also placed & routed the design, which gives some extra logic optimisation.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #186 on: June 27, 2017, 10:58:33 am »
I have I/O pins at the inputs & outputs. It might be that the synthesizer from ISE 14.7 is better than the one from Vivado. Last news I heard is that Vivado's synthesizer isn't quite there yet. Besides that, I also placed & routed the design, which gives some extra logic optimisation.

Humm, fully P+R under ISE with defaults, for Spartan 6 LX 4. I get 255 LUTs (see attached image).

One thing I do see as "odd" is that as written you need to add two more signals to the process's sensitivity list - it should be:

    process (clk, in_array_in, sorted_array)

I haven't made the change, but I wonder if that is the cause for our differences?
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #187 on: June 27, 2017, 11:12:57 am »
No, not to the sensitivity list but inside the 'if rising_edge...' clause so the logic is caught by the clock constraint. Otherwise the input to logic and flipflop to output routing would add extra delays.
Did you enable optimise across hierarchy / keep hierarchy? AFAIK it is off by default but it produces lesser results but it would clutter your results because they would include the counter.
« Last Edit: June 27, 2017, 11:16:01 am by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #188 on: June 27, 2017, 06:05:22 pm »
Almost. It figures out what logic is required to *sort* the entire array in a single clock - not necessarily *bubble* sort.
Yes, I understand that. The compiler creates logic that performs the requested function, but it decides how to do that.  The source algorithm is a bubble sort, but the logic doesn't have to look anything like a bubble sort - it just has to produce the same output as a bubble sort.

In software we think of an array as a block of memory with numbers stored in it, and sorting the array changes the order of its contents. An advantage of Bubble Sort over other algorithms is that since elements are swapped 'in place' its memory footprint can be very low. I had incorrectly assumed that the VHDL code was also sorting the array 'in place'. However it actually takes an array as input and then fills another array with the sorted data. IOW the output is a (separate) sorted version of the input. If the array data came from some memory (registers or RAM) then it could be stored back into that memory if desired, or it could be used elsewhere without affecting the original array's contents.
 
The disadvantage of Bubble Sort is that as the array size increases, the processing time increases quadratically. In a hardware implementation this is not necessarily true because an entire array can be sorted in one clock cycle, but (I presume) in real hardware the number of gates required will increase just as quickly, which may increase latency and reduce the maximum permitted clock frequency. Also the largest array that can be sorted in 1 clock cycle is limited by the maximum number of bits that can be operated on in parallel. Large arrays would have to be stored in RAM and sorted in several passes just like in software.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #189 on: June 27, 2017, 06:34:57 pm »
Also the largest array that can be sorted in 1 clock cycle is limited by the maximum number of bits that can be operated on in parallel. Large arrays would have to be stored in RAM and sorted in several passes just like in software.

Exactly. If you wanted it to be scalable, or if you wanted to sort arrays of variable size, it would be much better to store the array in BRAM and sort it in place. You could create a specialized soft core for this (or a state machine if you will), which would do it sequentially. But it would be much, much slower.

If you wanted to make it faster, you could use a more advanced software sorting algorithm, such as quicksort. Or you could come up with an FPGA-friendly algorithm which manages to establish a pipeline and make the bubble sort faster. But such an approach would be much more complex than sorting a small array with combinatorial logic.
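A rough sketch of such a sequential sorter is below - one compare/swap per clock, with the storage shown as a plain register array to keep it short (a real BRAM version would add a cycle of read latency and an address register; the entity, the names and the packed-vector interface are all mine):

Code: [Select]
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- N >= 2; din/dout are the N 8-bit elements packed into one vector
entity seq_bubble_sort is
    generic (N : positive := 8);
    port (
        clk   : in  std_logic;
        start : in  std_logic;                       -- pulse to begin sorting
        din   : in  std_logic_vector(8*N-1 downto 0);
        dout  : out std_logic_vector(8*N-1 downto 0);
        done  : out std_logic
    );
end entity;

architecture rtl of seq_bubble_sort is
    type word_array is array (0 to N-1) of unsigned(7 downto 0);
    signal mem    : word_array;
    signal i      : integer range 0 to N-2 := 0;     -- position within a pass
    signal pass   : integer range 0 to N-2 := 0;
    signal busy   : std_logic := '0';
    signal done_i : std_logic := '0';
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if busy = '0' then
                if start = '1' then
                    for k in 0 to N-1 loop           -- load the array
                        mem(k) <= unsigned(din(8*k+7 downto 8*k));
                    end loop;
                    i      <= 0;
                    pass   <= 0;
                    busy   <= '1';
                    done_i <= '0';
                end if;
            else
                -- one compare/swap per clock
                if mem(i) > mem(i+1) then
                    mem(i)   <= mem(i+1);
                    mem(i+1) <= mem(i);
                end if;
                -- walk the indices: i covers one pass, pass counts the passes
                if i < N-2-pass then
                    i <= i + 1;
                else
                    i <= 0;
                    if pass < N-2 then
                        pass <= pass + 1;
                    else
                        busy   <= '0';
                        done_i <= '1';
                    end if;
                end if;
            end if;
        end if;
    end process;

    done <= done_i;

    unpack : process(mem)
    begin
        for k in 0 to N-1 loop
            dout(8*k+7 downto 8*k) <= std_logic_vector(mem(k));
        end loop;
    end process;
end architecture;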
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #190 on: June 27, 2017, 07:33:04 pm »
Also the largest array that can be sorted in 1 clock cycle is limited by the maximum number of bits that can be operated on in parallel. Large arrays would have to be stored in RAM and sorted in several passes just like in software.

Exactly. If you wanted it to be scalable. or if you wanted to sort arrays of variable size, it would be much better to store the array in BRAM and sort it in place. You could create a specialized soft core for this (or a state machine if you will), which would do it sequentially. But it would be much much slower.

If you wanted to make it faster, you could use a more advanced software sorting algorithms, such as quicksort. Or you could come up with FPGA-friendly algorithm which manages to establish a pipeline and make bubble-sort faster. But such approach would be much more complex than sorting a small array with combinatorial logic.
One of the things I did to the sorting example I posted was adding extra buffer (flipflop) stages between the input & output. The synthesizer can use that to do the pipelining for you. IOW you can let the tools do a lot of work for you before you need to resort to getting into the nitty-gritty bits of an FPGA.
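Concretely, the 'extra buffer stages' are nothing more exotic than something like this wrapped around the existing sort process (a sketch; in_reg1, in_reg2, out_reg1 and sorted_comb are signal names I've made up, all of the 'bubble' array type), combined with switching on register balancing in the synthesis options:

Code: [Select]
-- spare registers before and after the combinatorial sort; with register
-- balancing enabled the tools are free to push these into the logic
process(clk)
begin
    if rising_edge(clk) then
        in_reg1          <= in_array_in;    -- spare input stage
        in_reg2          <= in_reg1;        -- spare input stage
        out_reg1         <= sorted_comb;    -- sorted_comb: output of the sort logic, fed from in_reg2
        sorted_array_out <= out_reg1;       -- spare output stage
    end if;
end process;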

However sorting large amounts of data is better done using an iterative approach.
« Last Edit: June 27, 2017, 07:36:39 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Mattjd

  • Regular Contributor
  • *
  • Posts: 230
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #191 on: June 27, 2017, 08:09:12 pm »
How are you going about doing the pipelining? Are you running all your in/outs through always blocks that represent a register or what?
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #192 on: June 27, 2017, 08:49:03 pm »
How are you going about doing the pipelining? Are you running all your in/outs through always blocks that represent a register or what?

I don't use Verilog. In VHDL it is very simple.

Combinatorial within one clock cycle:

Code: [Select]
process(clk)
begin
  if rising_edge(clk) then
    x <= (a+b)+c;
  end if;
end process;

Pipelined:

Code: [Select]
process(clk)
begin
  if rising_edge(clk) then
    -- stage 1
    a_plus_b <= a+b;
    c_stage_2 <= c;

    -- stage 2
    x <= a_plus_b + c_stage_2;
  end if;
end process;

Signals a_plus_b and c_stage_2 are added flip-flops.
« Last Edit: June 27, 2017, 09:02:37 pm by NorthGuy »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #193 on: June 28, 2017, 12:01:04 am »
No, not to the sensitivity list but inside the 'if rising_edge...' clause so the logic is caught by the clock constraint. Otherwise the input to logic and flipflop to output routing would add extra delays.
Did you enable optimise across hierarchy / keep hierarchy? AFAIK it is off by default but it produces lesser results but it would clutter your results because they would include the counter.
So... mystery deepens. I see a learning experience ahead for me.

I'm now using ISE. I've promoted your module to be the top level, so all 32 inputs are on pins, and the outputs are registered before the pins - usage is now 76 LUTs  :o :scared:

As everything except the output registers is async, the timing looks to be > 20ns for the inputs to the output registers to be valid. Not quite sure how to read the numbers in the timing report... However, registering the  inputs makes the LUT count go up dramatically to 222, with a Fmax of 89.8 MHz.

The "designed with hardware in mind" version is 76 LUTs (with an Fmax of > 213MHz), so without trawling through the technology schematic it seems that the inputs-unregistered bubble sort version has the freedom to be optimized down to the same design, but the restrictions placed on it by having registers on the inputs prevent this from occurring. (wonder why that would be?...)

So the "designed with hardware in mind" version can be almost 3x smaller, and > 2x faster, than the "simple bubble sort" version. It also delivers consistent performance and usage no matter how it is used.

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #194 on: June 28, 2017, 01:53:06 am »
Though not with this code: with Altera's Quartus II v9 & above, just adding an extra stage of DFFs, without any additional logic or deliberate pipelining, before or after such a piece of HDL code will actually have the same effect, potentially slimming the LUT count and doubling the Fmax.  This may just be the way I was coding at the time, but there are features in the compiler to decompose & reconstruct logic to achieve the best possible Fmax, both in the compiler stage and in the fitting/physical synthesis stage.

Darn, if I had quartus installed on one of my PC's today, I would have played with this VHDL code already and posted the results...
« Last Edit: June 28, 2017, 01:59:32 am by BrianHG »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #195 on: June 28, 2017, 03:22:15 am »
Though not with this code: with Altera's Quartus II v9 & above, just adding an extra stage of DFFs, without any additional logic or deliberate pipelining, before or after such a piece of HDL code will actually have the same effect, potentially slimming the LUT count and doubling the Fmax.  This may just be the way I was coding at the time, but there are features in the compiler to decompose & reconstruct logic to achieve the best possible Fmax, both in the compiler stage and in the fitting/physical synthesis stage.

Darn, if I had quartus installed on one of my PC's today, I would have played with this VHDL code already and posted the results...

There are a lot of undocumented tricks on how you can get great results with inference - how to cast your code 'just right' so a DSP block is inferred, with all the right pipeline registers, or so it uses block RAM, or so LUTs become shift registers rather than a chain of FFs.

The thing that annoys me is that the patterns that work are not well defined. For example, in a clocked process

 data <= memory(to_integer(unsigned(address)));

should infer a block RAM if memory is big enough, but as a general rule, anything with an expression for the array index won't:

  data <= memory(to_integer(unsigned(address)+1));

Will only infer LUTs and flip-flops. It leaves you with 'land mines' in your code:

  addr_temp := unsigned(address)+1;  -- assign address to variable
  data <= memory(to_integer(addr_temp)); -- look up address

They have big flags waving away saying "Who wrote this junk! Make me shiny!", and when you touch them your design blows up.

(the example is somewhat contrived, I would have to test to find an exact case when I can prove this to be the case, but you get the idea)
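For what it's worth, the shape that I find infers a block RAM reliably is the boring one below - synchronous write, registered read, plain index - with any 'address + 1' style arithmetic done into its own register on the previous clock (a rough sketch; names are mine):

Code: [Select]
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity bram_template is
    port (
        clk  : in  std_logic;
        we   : in  std_logic;
        addr : in  std_logic_vector(9 downto 0);
        din  : in  std_logic_vector(7 downto 0);
        dout : out std_logic_vector(7 downto 0)
    );
end entity;

architecture rtl of bram_template is
    type ram_t is array (0 to 1023) of std_logic_vector(7 downto 0);
    signal ram : ram_t;
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                ram(to_integer(unsigned(addr))) <= din;
            end if;
            -- registered read with a plain index: the shape the tools recognise
            dout <= ram(to_integer(unsigned(addr)));
            -- whereas something like
            --   dout <= ram(to_integer(unsigned(addr) + 1));
            -- risks falling back to LUTs and flip-flops; register the
            -- incremented address into its own signal a clock earlier instead.
        end if;
    end process;
end architecture;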
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #196 on: June 28, 2017, 03:48:02 am »
There are a lot of undocumented tricks on how you can get great results with inference - how to cast your code 'just right' so a DSP block is inferred, with all the right pipeline registers, or so it uses block RAM, or so LUTs become shift registers rather than a chain of FFs.
That pretty much boils down to what's been going on...
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #197 on: June 28, 2017, 04:04:45 am »
The thing that annoys me is that the patterns that work are not well defined. For example, in a clocked process

 data <= memory(to_integer(unsigned(address)));

should infer a block RAM if memory is big enough, but as a general rule, anything with an expression for the array index won't:

  data <= memory(to_integer(unsigned(address)+1));

On Xilinx BRAM must be clocked. You cannot calculate the address, give it to BRAM and get a combinatorial result. Therefore, when you try to do that (as in your second expression above), you will never get BRAM. You may get "distributed memory" instead, because the distributed memory can get you the combinatorial result you want.

If "address" is registered, then you may get the BRAM by removing "+1".
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #198 on: June 28, 2017, 06:01:04 am »
No, not to the sensitivity list but inside the 'if rising_edge...' clause so the logic is caught by the clock constraint. Otherwise the input to logic and flipflop to output routing would add extra delays.
Did you enable optimise across hierarchy / keep hierarchy? AFAIK it is off by default but it produces lesser results but it would clutter your results because they would include the counter.
So... mystery deepens. I see a learning experience ahead for me.

I'm now using ISE. I've promoted your module to be the top level, so all 32 inputs are on pins, and the outputs are registered before the pins - usage is now 76 LUTs  :o :scared:

As everything except the output registers is async, the timing looks to be > 20ns for the inputs to the output registers to be valid. Not quite sure how to read the numbers in the timing report... However, registering the  inputs makes the LUT count go up dramatically to 222, with a Fmax of 89.8 MHz.

The "designed with hardware in mind" version is 76 LUTs (with an Fmax of > 213MHz), so without trawling through the technology schematic it seems that the inputs-unregistered bubble sort version has the freedom to be optimized down to the same design, but the restrictions placed on it by having registers on the inputs prevent this from occurring. (wonder why that would be?...)

So the "designed with hardware in mind" version can be almost 3x smaller, and > 2x faster, than the "simple bubble sort" version. It also delivers consistent performance and usage no matter how it is used.
That is the wrong conclusion. By adding the registers you add more logic to the design. Also: did you P&R the design? There is an extra logic optimisation stage in there as well.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #199 on: June 28, 2017, 08:03:47 am »
That is the wrong conclusion. By adding the registers you add more logic to the design. Also: did you P&R the design? There is an extra logic optimisation stage in there as well.

I checked the Technology Schematic - Inputs go straight into a FF from the IBUF, outputs straight from the FF to OBUF.

And then I checked the FPGA editor, and all 32 inputs run into a slice, and in that slice they run directly into a FF's D input (the input MUX on the FF that acts as CE is set to a fixed value).

As for the 32 outputs, they are all directly from the output of a flipflop to the output buffer.

So that is all 64 FFs accounted for. All the logic is between these two sets of FFs, and no retiming has occurred.

As far as I have seen for Xilinx, the P+R optimization makes zero difference to what generic primitives are actually used - only where they are placed on the die and how they are connected (hence the name place and route). The logic of the design at that point is fixed by the Implementation step.

(And of course the timing of a design depends on how well it is P+Red due to how well it minimizes routing delays, and there are a few little corner cases like route-throughs, which do consume additional LUTs by running signals through them like a buffer to tweak timing, but resource usage can only go up during P+R, and the logical design is not transformed at all)
« Last Edit: June 28, 2017, 08:31:50 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #200 on: June 28, 2017, 10:25:46 am »
So the "designed with hardware in mind" version can be almost 3x smaller, and > 2x faster, than the "simple bubble sort" version. It also delivers consistent performance and usage no matter how it is used.
That is the wrong conclusion. By adding the registers you add more logic to the design. Also: did you P&R the design? There is an extra logic optimisation stage in there as well.

Nah, that conclusion is bang on: the two designs are nothing alike even when they use the same quantity of resources. The inferred design is complete crap. Just to show how different they are, here is the slowest path in each design:

Slowest path in the "outputs registered only, inferred design":

Code: [Select]
Slack (setup path):     -13.355ns (requirement - (data path - clock path - clock arrival + uncertainty))
  Source:               b_in<1> (PAD)
  Destination:          uut/sorted_array_out_1_6 (FF)
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          10.000ns
  Data Path Delay:      25.822ns (Levels of Logic = 25)
  Clock Path Delay:     2.492ns (Levels of Logic = 2)
  Clock Uncertainty:    0.025ns

  Clock Uncertainty:          0.025ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.050ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.000ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: b_in<1> to uut/sorted_array_out_1_6
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    P139.I               Tiopi                 0.790   b_in<1>
                                                       b_in<1>
                                                       b_in_1_IBUF
                                                       ProtoComp29.IMUX.10
    SLICE_X1Y59.C1       net (fanout=4)        1.933   b_in_1_IBUF
    SLICE_X1Y59.C        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o1
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o3
    SLICE_X2Y59.B5       net (fanout=1)        0.764   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o1
    SLICE_X2Y59.B        Tilo                  0.203   N30
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o1_SW2
    SLICE_X2Y59.A5       net (fanout=1)        0.222   N30
    SLICE_X2Y59.A        Tilo                  0.203   N30
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o1
    SLICE_X6Y54.C5       net (fanout=1)        1.238   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o2
    SLICE_X6Y54.C        Tilo                  0.204   uut/in_array_in[0][7]_in_array_in[1][7]_mux_1_OUT<4>
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o21
    SLICE_X4Y59.CX       net (fanout=18)       1.004   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_1_o
    SLICE_X4Y59.CMUX     Tcxc                  0.164   N28
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o5
    SLICE_X4Y59.A2       net (fanout=1)        0.624   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o1
    SLICE_X4Y59.A        Tilo                  0.203   N28
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o1_SW2
    SLICE_X7Y50.A6       net (fanout=1)        1.058   N28
    SLICE_X7Y50.A        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[2][7]_mux_4_OUT<0>
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o1
    SLICE_X7Y50.B6       net (fanout=1)        0.118   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o2
    SLICE_X7Y50.B        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[2][7]_mux_4_OUT<0>
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o2
    SLICE_X5Y52.D4       net (fanout=12)       0.987   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_4_o
    SLICE_X5Y52.D        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[1][7]_mux_5_OUT<3>
                                                       uut/Mmux_in_array_in[1][7]_in_array_in[2][7]_mux_4_OUT141
    SLICE_X6Y51.D6       net (fanout=4)        0.668   uut/in_array_in[2][7]_in_array_in[1][7]_mux_5_OUT<3>
    SLICE_X6Y51.CMUX     Topdc                 0.368   uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o1
                                                       uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o1_F
                                                       uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o1
    SLICE_X7Y49.C4       net (fanout=1)        0.513   uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o2
    SLICE_X7Y49.C        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[3][7]_mux_7_OUT<3>
                                                       uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o2
    SLICE_X7Y50.C5       net (fanout=10)       0.372   uut/in_array_in[3][7]_in_array_in[2][7]_LessThan_7_o
    SLICE_X7Y50.C        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[2][7]_mux_4_OUT<0>
                                                       uut/Mmux_in_array_in[2][7]_in_array_in[3][7]_mux_7_OUT18
    SLICE_X7Y51.D2       net (fanout=3)        0.602   uut/in_array_in[2][7]_in_array_in[3][7]_mux_7_OUT<0>
    SLICE_X7Y51.D        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o3
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o2
    SLICE_X6Y48.B6       net (fanout=1)        0.468   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o3
    SLICE_X6Y48.B        Tilo                  0.203   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o4
    SLICE_X6Y48.D1       net (fanout=2)        0.482   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1
    SLICE_X6Y48.CMUX     Topdc                 0.368   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1_F
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o1
    SLICE_X5Y48.B6       net (fanout=1)        0.607   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o2
    SLICE_X5Y48.B        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[1][7]_mux_14_OUT<3>
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o24_SW0
    SLICE_X5Y48.A5       net (fanout=1)        0.187   N24
    SLICE_X5Y48.A        Tilo                  0.259   uut/in_array_in[2][7]_in_array_in[1][7]_mux_14_OUT<3>
                                                       uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o24
    SLICE_X5Y47.C3       net (fanout=14)       0.507   uut/in_array_in[2][7]_in_array_in[1][7]_LessThan_13_o
    SLICE_X5Y47.C        Tilo                  0.259   uut/in_array_in[1][7]_in_array_in[2][7]_mux_13_OUT<0>
                                                       uut/Mmux_in_array_in[1][7]_in_array_in[2][7]_mux_13_OUT12
    SLICE_X4Y49.C6       net (fanout=3)        0.489   uut/in_array_in[1][7]_in_array_in[2][7]_mux_13_OUT<0>
    SLICE_X4Y49.CMUX     Tilo                  0.361   uut/in_array_in[0][7]_in_array_in[1][7]_mux_1_OUT<0>
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o4_G
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o4
    SLICE_X5Y51.B1       net (fanout=2)        0.643   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o1
    SLICE_X5Y51.B        Tilo                  0.259   uut/in_array_in[0][7]_in_array_in[1][7]_mux_1_OUT<3>
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o1_SW1
    SLICE_X4Y50.C5       net (fanout=2)        0.352   N19
    SLICE_X4Y50.CMUX     Tilo                  0.361   N18
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o1_G
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o1
    SLICE_X5Y49.D4       net (fanout=1)        0.424   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o2
    SLICE_X5Y49.D        Tilo                  0.259   N26
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o24_SW0
    SLICE_X7Y45.A3       net (fanout=1)        0.838   N26
    SLICE_X7Y45.AMUX     Tilo                  0.313   uut/in_array_in[0][7]_in_array_in[1][7]_mux_16_OUT<2>
                                                       uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o24
    SLICE_X7Y46.D3       net (fanout=7)        0.543   uut/in_array_in[1][7]_in_array_in[0][7]_LessThan_16_o
    SLICE_X7Y46.DMUX     Tilo                  0.313   uut/in_array_in[0][7]_in_array_in[1][7]_mux_16_OUT<6>
                                                       uut/Mmux_in_array_in[0][7]_in_array_in[1][7]_mux_16_OUT171
    OLOGIC_X12Y23.D1     net (fanout=1)        2.214   uut/in_array_in[1][7]_in_array_in[0][7]_mux_17_OUT<6>
    OLOGIC_X12Y23.CLK0   Todck                 0.803   uut/sorted_array_out_1<6>
                                                       uut/sorted_array_out_1_6
    -------------------------------------------------  ---------------------------
    Total                                     25.822ns (7.965ns logic, 17.857ns route)
                                                       (30.8% logic, 69.2% route)



Slowest path in the "outputs registered only, coded for H/W design":

Code: [Select]
Paths for end point c_out_2 (OLOGIC_X11Y2.D1), 268 paths
--------------------------------------------------------------------------------
Slack (setup path):     0.064ns (requirement - (data path - clock path - clock arrival + uncertainty))
  Source:               a_in<1> (PAD)
  Destination:          c_out_2 (FF)
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          10.000ns
  Data Path Delay:      12.477ns (Levels of Logic = 6)
  Clock Path Delay:     2.566ns (Levels of Logic = 2)
  Clock Uncertainty:    0.025ns

  Clock Uncertainty:          0.025ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.050ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.000ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: a_in<1> to c_out_2
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    P111.I               Tiopi                 0.790   a_in<1>
                                                       a_in<1>
                                                       a_in_1_IBUF
                                                       ProtoComp13.IMUX.25
    SLICE_X4Y58.D4       net (fanout=7)        2.715   a_in_1_IBUF
    SLICE_X4Y58.D        Tilo                  0.203   a[7]_c[7]_LessThan_3_o22
                                                       a[7]_c[7]_LessThan_3_o23
    SLICE_X5Y40.C3       net (fanout=1)        1.505   a[7]_c[7]_LessThan_3_o22
    SLICE_X5Y40.C        Tilo                  0.259   a[7]_c[7]_LessThan_3_o23
                                                       a[7]_c[7]_LessThan_3_o24
    SLICE_X5Y40.B6       net (fanout=1)        0.285   a[7]_c[7]_LessThan_3_o23
    SLICE_X5Y40.B        Tilo                  0.259   a[7]_c[7]_LessThan_3_o23
                                                       a[7]_c[7]_LessThan_3_o25
    SLICE_X7Y36.C2       net (fanout=8)        1.240   a[7]_c[7]_LessThan_3_o
    SLICE_X7Y36.C        Tilo                  0.259   BUS_0001_d[7]_wide_mux_11_OUT<5>
                                                       Mram_table41
    SLICE_X7Y14.A6       net (fanout=8)        2.389   _n0044<4>
    SLICE_X7Y14.A        Tilo                  0.259   BUS_0003_d[7]_wide_mux_10_OUT<2>
                                                       Mmux_BUS_0003_d[7]_wide_mux_10_OUT31
    OLOGIC_X11Y2.D1      net (fanout=1)        1.511   BUS_0003_d[7]_wide_mux_10_OUT<2>
    OLOGIC_X11Y2.CLK0    Todck                 0.803   c_out_2
                                                       c_out_2
    -------------------------------------------------  ---------------------------
    Total                                     12.477ns (2.832ns logic, 9.645ns route)
                                                       (22.7% logic, 77.3% route)

The "coded for H/W" design beats the pants off the fully inferred design: 6 levels of logic vs 25, and it meets the timing requirement rather than missing it by over 130%.
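For anyone following along, "coded for H/W" here just means building the sort out of explicit compare-and-swap elements wired into a fixed network. A minimal VHDL sketch of such an element (not the actual code behind the report above, and the names are made up):

Code: [Select]
-- Compare-and-swap ("compare-exchange"): the building block of a sorting
-- network. Purely combinational; register the network outputs as needed.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity cmp_swap is
  generic ( WIDTH : positive := 8 );
  port (
    x  : in  std_logic_vector(WIDTH-1 downto 0);
    y  : in  std_logic_vector(WIDTH-1 downto 0);
    lo : out std_logic_vector(WIDTH-1 downto 0);  -- smaller of the two
    hi : out std_logic_vector(WIDTH-1 downto 0)   -- larger of the two
  );
end entity;

architecture rtl of cmp_swap is
begin
  process(x, y)
  begin
    if unsigned(x) < unsigned(y) then
      lo <= x;  hi <= y;
    else
      lo <= y;  hi <= x;
    end if;
  end process;
end architecture;

One comparator plus two multiplexers per element; a handful of these wired into a fixed network, with registers where needed, is the whole sort.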

Can you supply any data to support your conclusion?
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #201 on: June 28, 2017, 01:20:05 pm »
... 6 levels of logic ...

It did the comparisons in 3 levels, but it certainly can be done in 2.
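For example, split the operands into small chunks so that each chunk's "less-than" and "equal" fit in a single 6-input LUT, then combine the chunk results in one more LUT. A quick VHDL sketch of the idea (my own illustration, assuming a LUT6 fabric; the synthesizer still decides the final packing):

Code: [Select]
-- 8-bit unsigned a < b in two LUT levels (sketch).
-- Level 1: per-chunk "less" and "equal", chunks sized so no function
--          needs more than 6 inputs.
-- Level 2: combine the three chunks in one LUT.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lt8_2level is
  port (
    a, b   : in  std_logic_vector(7 downto 0);
    a_lt_b : out std_logic
  );
end entity;

architecture rtl of lt8_2level is
  signal lt2, eq2 : std_logic;  -- bits 7..6 (2-bit chunk)
  signal lt1, eq1 : std_logic;  -- bits 5..3 (3-bit chunk)
  signal lt0      : std_logic;  -- bits 2..0 (its "equal" is never needed)
begin
  lt2 <= '1' when unsigned(a(7 downto 6)) < unsigned(b(7 downto 6)) else '0';
  eq2 <= '1' when a(7 downto 6) = b(7 downto 6) else '0';
  lt1 <= '1' when unsigned(a(5 downto 3)) < unsigned(b(5 downto 3)) else '0';
  eq1 <= '1' when a(5 downto 3) = b(5 downto 3) else '0';
  lt0 <= '1' when unsigned(a(2 downto 0)) < unsigned(b(2 downto 0)) else '0';

  a_lt_b <= lt2 or (eq2 and lt1) or (eq2 and eq1 and lt0);
end architecture;

Roughly six LUTs and two logic levels for an 8-bit unsigned compare; the tools may pack it differently, but the depth is the point.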
 

Offline mrflibble

  • Super Contributor
  • ***
  • Posts: 2051
  • Country: nl
Re: Learning FPGAs: wrong approach?
« Reply #202 on: July 02, 2017, 04:52:04 am »
Hey, fun exercise! :-+

... The inferred design is complete crap. Just to show how different they are, here are the slowest path in each design: ...
No kidding. I didn't even try to do an inferred design for this problem. Last time I did that it gave me a headache and made my hex vision act up for days.  :scared:  :o

Anyways, below are the timings of my attempt at a hardware-targeted design. Worst path:

Code: [Select]
================================================================================
 Timing constraint: TS_clk_400 = PERIOD TIMEGRP "clk_400" TS_GCLK / 4 HIGH 50% INPUT_JITTER 0.2 ns;
 For more information, see Period Analysis in the Timing Closure User Guide (UG612).
  416 paths analyzed, 278 endpoints analyzed, 0 failing endpoints
  0 timing errors detected. (0 setup errors, 0 hold errors, 0 component switching limit errors)
  Minimum period is   2.454ns.
 --------------------------------------------------------------------------------
 
 Paths for end point sort_four/select_sort_order/mux_this_2/out_7 (SLICE_X47Y81.D5), 1 path
 --------------------------------------------------------------------------------
 Slack (setup path):     0.046ns (requirement - (data path - clock path skew + uncertainty))
   Source:               sort_four/packed_evals_to_sels/sel_2_0 (FF)
   Destination:          sort_four/select_sort_order/mux_this_2/out_7 (FF)
   Requirement:          2.500ns
   Data Path Delay:      2.357ns (Levels of Logic = 1)
   Clock Path Skew:      -0.015ns (0.297 - 0.312)
   Source Clock:         clk_400 rising at 0.000ns
   Destination Clock:    clk_400 rising at 2.500ns
   Clock Uncertainty:    0.082ns
 
   Clock Uncertainty:          0.082ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
     Total System Jitter (TSJ):  0.070ns
     Discrete Jitter (DJ):       0.147ns
     Phase Error (PE):           0.000ns
 
   Maximum Data Path at Slow Process Corner: sort_four/packed_evals_to_sels/sel_2_0 to sort_four/select_sort_order/mux_this_2/out_7
     Location             Delay type         Delay(ns)  Physical Resource
                                                        Logical Resource(s)
     -------------------------------------------------  -------------------
     SLICE_X37Y83.CQ      Tcko                  0.430   sort_four/packed_evals_to_sels/sel_2<0>
                                                        sort_four/packed_evals_to_sels/sel_2_0
     SLICE_X47Y81.D5      net (fanout=8)        1.554   sort_four/packed_evals_to_sels/sel_2<0>
     SLICE_X47Y81.CLK     Tas                   0.373   sort_four/select_sort_order/mux_this_2/out<7>
                                                        sort_four/select_sort_order/mux_this_2/Mmux_sel[1]_d[7]_wide_mux_1_OUT81
                                                        sort_four/select_sort_order/mux_this_2/out_7
     -------------------------------------------------  ---------------------------
     Total                                      2.357ns (0.803ns logic, 1.554ns route)
                                                        (34.1% logic, 65.9% route)
 
 --------------------------------------------------------------------------------

I constrained it conservatively at 400 MHz, and with a decent amount of clock uncertainty. Did several runs, and it easily meets timing. And based on some things I noticed (el stupido routing decisions by PAR) I'm guessing that with some extra constraints it would probably do around 425 MHz. Still have margin left on the clock uncertainty as well....

This is using ISE 14.7 and targeting a spartan-6: xc6slx45-2csg324. The design is pipelined, 3 stages.
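To give a rough idea of the structure (a simplified sketch, not my actual code; the entity and signal names are invented): each pipeline stage is one rank of compare-exchanges with a register behind it, so the per-clock critical path is a single comparator plus a mux.

Code: [Select]
-- Sketch of a 3-stage pipelined 4-input sorting network (Batcher-style:
-- ranks (0,1)(2,3) -> (0,2)(1,3) -> (1,2)). Each rank is registered.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sort4_pipe is
  port (
    clk   : in  std_logic;
    d_in  : in  std_logic_vector(31 downto 0);  -- four packed 8-bit values
    d_out : out std_logic_vector(31 downto 0)   -- sorted, smallest first
  );
end entity;

architecture rtl of sort4_pipe is
  type vec4 is array (0 to 3) of unsigned(7 downto 0);

  -- compare-exchange helpers
  function cs_lo(a, b : unsigned(7 downto 0)) return unsigned is
  begin
    if a < b then return a; else return b; end if;
  end function;

  function cs_hi(a, b : unsigned(7 downto 0)) return unsigned is
  begin
    if a < b then return b; else return a; end if;
  end function;

  signal s1, s2, s3 : vec4;
begin
  process(clk)
    variable v : vec4;
  begin
    if rising_edge(clk) then
      -- unpack the input word
      for i in 0 to 3 loop
        v(i) := unsigned(d_in(8*i+7 downto 8*i));
      end loop;

      -- rank 1: (0,1) and (2,3)
      s1(0) <= cs_lo(v(0), v(1));  s1(1) <= cs_hi(v(0), v(1));
      s1(2) <= cs_lo(v(2), v(3));  s1(3) <= cs_hi(v(2), v(3));

      -- rank 2: (0,2) and (1,3)
      s2(0) <= cs_lo(s1(0), s1(2));  s2(2) <= cs_hi(s1(0), s1(2));
      s2(1) <= cs_lo(s1(1), s1(3));  s2(3) <= cs_hi(s1(1), s1(3));

      -- rank 3: (1,2), ends 0 and 3 just pass through
      s3(0) <= s2(0);
      s3(1) <= cs_lo(s2(1), s2(2));  s3(2) <= cs_hi(s2(1), s2(2));
      s3(3) <= s2(3);
    end if;
  end process;

  gen_out : for i in 0 to 3 generate
    d_out(8*i+7 downto 8*i) <= std_logic_vector(s3(i));
  end generate;
end architecture;

Latency is three clocks, with one new input word accepted per clock.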

Incidentally, do you have the project settings? Either .xise file or empty project .zip will work. Just to make sure that I am not using different settings that will give skewed results.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #203 on: July 02, 2017, 05:42:04 pm »


I am playing with old CPLDs, the XC9500 series; the above PCB is a matrix-keyboard controller. Nothing special, but it makes me appreciate what comes for free with CoolRunner: built-in "pullup" :D

I recycled what I happened to find at home, a few big CPLD chips. Good because they are 5V tolerant, but the constraints don't allow pullup/pulldown since the physical XC9500 hardware doesn't have them.

So that's the reason I added a big-and-long SIL pack to the PCB.
 

Offline Yansi

  • Super Contributor
  • ***
  • Posts: 3893
  • Country: 00
  • STM32, STM8, AVR, 8051
Re: Learning FPGAs: wrong approach?
« Reply #204 on: July 02, 2017, 05:51:15 pm »
I also have a ton of such old devices lying around, including some crazy old FPGAs. But why bother with those non-flash-based devices? Get yourself at least either an Altera MAX II device or a Xilinx XC9500XL. Both are flash based, and the latter is also 5V tolerant. Both are cheap too.  :)
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #205 on: July 02, 2017, 07:51:05 pm »
Yup, I also have a few XC9572 chips in PLCC84 packages as well as a couple of XC2C64A in SMD packages. Maybe I will build a second board.

What I really miss is ... a couple of Spartan-2 FPGA chips. They are 5V tolerant, which on some designs is easier than using a 5V <-> 3.3V level shifter. I have plenty of Spartan-3 and Spartan-6 chips, whose I/O is 3.3V maximum, but the last Spartan-2 chip I had ... was soldered onto a Nintendo ADV adapter (5V I/O), which I built several years ago when Spartan-2 was available everywhere.

My regret ... I didn't buy more chips. :palm: :palm: :palm:
« Last Edit: July 03, 2017, 09:02:16 am by legacy »
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4531
  • Country: au
    • send complaints here
Re: Learning FPGAs: wrong approach?
« Reply #206 on: July 03, 2017, 12:22:02 am »
Hey, fun exercise! :-+
...
This is using ISE 14.7 and targeting a spartan-6: xc6slx45-2csg324. The design is pipelined, 3 stages.
Fun indeed. I've access to fully licensed tools, so I might have a slight edge here (possibly some extra options/strategies unlocked), but I'm not running SmartXplorer to get the last few % out of the design, and yet there appears to be a lot of slack available from the attempts so far.

ISE 14.7 xc6slx45-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 106 LUTs >440 MHz
Fully pipelined with 4 stages. 118 LUTs >540 MHz
(requires using both edges of clock)

ISE 14.7 xc7a100t-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 132 LUTs >580 MHz
Fully pipelined with 4 stages. 141 LUTs >580 MHz
(both switching limited)

Vivado X.X xc7a35t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >400 MHz

Vivado X.X xc7a100t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >410 MHz

It's known that ISE can do a better synthesis job on many designs, but it's orphaned for device support now and harder to use going forward. 7-series parts, though, are easily 50-100% faster than Spartan-6, so many designs need to be reassessed for the area/speed tradeoff and can be adapted to the new Vivado synthesis at the same time. The results above use a sort algorithm better suited to FPGA implementation but still written as a high-level functional description in VHDL, so it's not necessary to get down to gate-level descriptions; rather, knowing how to map to resources lets you design for minimum area while still using high-level constructs.
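As a rough illustration of what I mean by a high-level functional description that still maps to a regular comparator network, here is a generic odd-even transposition network written with plain loops (my own sketch, not the algorithm or code behind the numbers above; the entity name, generics and ports are all invented):

Code: [Select]
-- Sketch: generic odd-even transposition sorting network, written as a
-- high level combinational description (plain loops over a regular
-- structure). Synthesis unrolls the loops into a fixed comparator network.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity oe_sort is
  generic ( N : positive := 4; WIDTH : positive := 8 );
  port (
    d_in  : in  std_logic_vector(N*WIDTH-1 downto 0);
    d_out : out std_logic_vector(N*WIDTH-1 downto 0)  -- sorted, smallest first
  );
end entity;

architecture rtl of oe_sort is
begin
  process(d_in)
    type arr_t is array (0 to N-1) of unsigned(WIDTH-1 downto 0);
    variable v : arr_t;
    variable t : unsigned(WIDTH-1 downto 0);
  begin
    -- unpack
    for i in 0 to N-1 loop
      v(i) := unsigned(d_in((i+1)*WIDTH-1 downto i*WIDTH));
    end loop;

    -- N rounds of alternating even/odd compare-exchange ranks
    for round in 0 to N-1 loop
      for i in 0 to N/2-1 loop
        if (round mod 2) = 0 then           -- even round: (0,1),(2,3),...
          if 2*i+1 <= N-1 then
            if v(2*i) > v(2*i+1) then
              t := v(2*i);  v(2*i) := v(2*i+1);  v(2*i+1) := t;
            end if;
          end if;
        else                                -- odd round: (1,2),(3,4),...
          if 2*i+2 <= N-1 then
            if v(2*i+1) > v(2*i+2) then
              t := v(2*i+1);  v(2*i+1) := v(2*i+2);  v(2*i+2) := t;
            end if;
          end if;
        end if;
      end loop;
    end loop;

    -- pack
    for i in 0 to N-1 loop
      d_out((i+1)*WIDTH-1 downto i*WIDTH) <= std_logic_vector(v(i));
    end loop;
  end process;
end architecture;

The loops have static bounds, so synthesis unrolls them into a fixed network of compare-exchange elements rather than anything resembling a sequential software sort; registers can then be inserted between ranks when speed matters more than area.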
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #207 on: July 03, 2017, 02:06:01 am »
Hey, fun exercise! :-+
...
This is using ISE 14.7 and targeting a spartan-6: xc6slx45-2csg324. The design is pipelined, 3 stages.
Fun indeed, I've access to fully licensed tools so might have a slight edge here (possibly some extra options/strategies unlocked) but I'm not running smart explorer to get the last few % out of the design and yet there appears to be a lot of slack available from the attempts so far.

ISE 14.7 xc6slx45-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 106 LUTs >440 MHz
Fully pipelined with 4 stages. 118 LUTs >540 MHz
(requires using both edges of clock)

ISE 14.7 xc7a100t-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 132 LUTs >580 MHz
Fully pipelined with 4 stages. 141 LUTs >580 MHz
(both switching limited)

Vivado X.X xc7a35t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >400 MHz

Vivado X.X xc7a100t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >410 MHz

It's known that ISE can do a better synthesis job on many designs but its orphaned for device support now and harder to use going forward. But 7 series parts are easily 50-100% faster than Spartan 6 so many designs need to be reassessed for area/speed tradeoff and can be adapted to the new Vivado synthesis at the same time. These results above are using a sort algorithm better suited for FPGA implementation but still written with a high level functional description in VHDL, so its not necessary to get down to gate level descriptions but rather knowing how to map to resources allows you to design for minimum area while still using high level constructs.

Wow - these are quite significant differences. I wonder what ISE knows that Vivado doesn't? 

Maybe they are using different underlying timing models... When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routes anything.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #208 on: July 03, 2017, 02:55:35 am »
When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routing anything.

If that was the case, then Vivado would work faster with smaller parts (e.g. the synthesis/implementation for XC7A50T would be faster than for XC7A200T), but this doesn't seem to be the case. I'd rather suspect the usual - poor design, overbloat. Vivado is just generally terribly slow.
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4531
  • Country: au
    • send complaints here
Re: Learning FPGAs: wrong approach?
« Reply #209 on: July 03, 2017, 03:04:29 am »
Wow - these are quite significant differences. I wonder what ISE knows that Vivado doesn't? 

Maybe they are using different underlying timing models... When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routing anything.
My understanding is that the move from ISE to Vivado was a radical redesign, importantly so that the tools could keep scaling to larger designs. ISE scales poorly when you use the larger devices at high utilisation, while Vivado can run in much less memory and route at higher utilisation (the iterative routing seems to do very well).
When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routing anything.
If that was the case, then Vivado would work faster with smaller parts (e.g. the synthesis/implementation for XC7A50T would be faster than for XC7A200T), but this doesn't seem to be the case. I'd rather suspect the usual - poor design, overbloat. Vivado is just generally terribly slow.
Slower, but with less memory, and it's able to close timing on designs that ISE couldn't.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Learning FPGAs: wrong approach?
« Reply #210 on: July 03, 2017, 09:19:51 am »
How much memory does Vivado usually eat during the synthesis?

p.s. about computing horsepower, the i9 has already been released by Intel, which means .... the i7 is going to have a price drop  :D :D :D !!!
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4531
  • Country: au
    • send complaints here
Re: Learning FPGAs: wrong approach?
« Reply #211 on: July 03, 2017, 09:33:56 am »
How much memory does Vivado usually eat during the synthesis?
The peak memory use is typically during routing, and Xilinx only suggests memory for the overall process rather than for each stage, as it would be unusual to run the stages on different machines:
https://www.xilinx.com/products/design-tools/vivado/memory.html
You can hunt down the ISE version with the Wayback Machine, but both tables are a little optimistic and real-world use is higher once you add in the other things running during a build.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #212 on: July 03, 2017, 10:39:35 am »
When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routing anything.
If that was the case, then Vivado would work faster with smaller parts (e.g. the synthesis/implementation for XC7A50T would be faster than for XC7A200T), but this doesn't seem to be the case. I'd rather suspect the usual - poor design, overbloat. Vivado is just generally terribly slow.
Slower, but with less memory and its able to close timing on designs that ISE couldn't.
When it comes to getting good results out of ISE, whether it can meet timing or not depends a lot on the placer cost table settings. With a poor setting the P&R can run for 24 hours without meeting timing, while with other settings the design goes through the P&R stage in less than 10 minutes and meets all timing constraints. Unfortunately it takes trial & error to find the right placer cost table settings.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #213 on: July 03, 2017, 01:42:42 pm »
Slower, but with less memory ...

That's one of the poor decisions. Since the world migrated to 64-bit, you can have huge amounts of memory. Speed, however, hasn't progressed much - my 6-year-old i5 is only 30% slower than the best modern mass-produced Intel CPU. How stupid is it to sacrifice speed in order to reduce memory usage?

I'm sure there were hundreds of bad decisions like that on different levels which made Vivado as slow as it is. It's funny that it's being marketed as Ultra-Fast.

... and its able to close timing on designs that ISE couldn't.

Maybe. I don't know.
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4531
  • Country: au
    • send complaints here
Re: Learning FPGAs: wrong approach?
« Reply #214 on: July 04, 2017, 01:24:07 am »
When compared to ISE, Vivado seems to spend half an age doing nothing when working on small designs. I assume it is dynamically building routing/timing models for the whole die before it places/routing anything.
If that was the case, then Vivado would work faster with smaller parts (e.g. the synthesis/implementation for XC7A50T would be faster than for XC7A200T), but this doesn't seem to be the case. I'd rather suspect the usual - poor design, overbloat. Vivado is just generally terribly slow.
Slower, but with less memory and its able to close timing on designs that ISE couldn't.
When it comes to ISE getting good results it depends a lot on the placing cost tables settings whether it can meet the timing or not. With a poor setting the P&R can run for 24 hours without meeting timing while with others settings the design goes through the P&R stage is less than 10 minutes and meet all timing constraints. Unfortunately it takes trial & error to get the right placing cost table settings.
That's not unique to ISE; Vivado suffers the same wildly variable results from the initial seeds.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #215 on: July 04, 2017, 02:07:00 am »
When it comes to ISE getting good results it depends a lot on the placing cost tables settings whether it can meet the timing or not. With a poor setting the P&R can run for 24 hours without meeting timing while with others settings the design goes through the P&R stage is less than 10 minutes and meet all timing constraints. Unfortunately it takes trial & error to get the right placing cost table settings.
Thats not unique to ISE, Vivado suffers the same wildly variable results from the initial seeds.
It is most likely unavoidable - you have to add enough randomness to prevent the P&R process from falling into the same local minima all the time (a.k.a. "getting stuck in a rut"). How often this happens is most likely design dependent, and something to do with the bisection width of the design. Highly connected designs will be more likely to suffer bad placement decisions, but flowing pipelines will usually play nice.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4531
  • Country: au
    • send complaints here
Re: Learning FPGAs: wrong approach?
« Reply #216 on: July 04, 2017, 03:25:22 am »
When it comes to ISE getting good results it depends a lot on the placing cost tables settings whether it can meet the timing or not. With a poor setting the P&R can run for 24 hours without meeting timing while with others settings the design goes through the P&R stage is less than 10 minutes and meet all timing constraints. Unfortunately it takes trial & error to get the right placing cost table settings.
Thats not unique to ISE, Vivado suffers the same wildly variable results from the initial seeds.
It is most likely unavoidable - you have to add enough randomness prevent the P+R process from falling into the same local minima all the time (a.k.a. "getting stuck in a rut"). How often this happens is most likely design dependant, and something to do with the bisection width of the design. Highly connected designs will be more likely to suffer bad placement decisions, but flowing pipelines will usually play nice.
I find it easier with Vivado, as there is a diverse group of directives ("strategies") which can be applied individually (usually iteratively) at each stage: much more flexibility, some feeling of control, and less reliance on the initial seed being lucky.
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: Learning FPGAs: wrong approach?
« Reply #217 on: July 04, 2017, 01:23:44 pm »
It is most likely unavoidable - you have to add enough randomness prevent the P+R process from falling into the same local minima all the time (a.k.a. "getting stuck in a rut"). How often this happens is most likely design dependant, and something to do with the bisection width of the design. Highly connected designs will be more likely to suffer bad placement decisions, but flowing pipelines will usually play nice.

That's a direct consequence of the underlying graph layout algorithms (graph as in vertices and edges, not squiggly lines on paper). I saw exactly the same phenomenon some years back when I was working on a network management tool that tried to draw a decent network diagram from the connectivity graph of the network. It was surprising how big a change in layout one would see from little tweaks to weightings and other parameters.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline mrflibble

  • Super Contributor
  • ***
  • Posts: 2051
  • Country: nl
Re: Learning FPGAs: wrong approach?
« Reply #218 on: January 31, 2018, 01:51:13 pm »
Fun indeed, I've access to fully licensed tools so might have a slight edge here (possibly some extra options/strategies unlocked) but I'm not running smart explorer to get the last few % out of the design and yet there appears to be a lot of slack available from the attempts so far.

ISE 14.7 xc6slx45-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 106 LUTs >440 MHz
Fully pipelined with 4 stages. 118 LUTs >540 MHz
(requires using both edges of clock)

ISE 14.7 xc7a100t-2csg324
Minimum area, combinatorial only. 58 LUTs
Logical pipeline of 3 stages. 132 LUTs >580 MHz
Fully pipelined with 4 stages. 141 LUTs >580 MHz
(both switching limited)

Vivado X.X xc7a35t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >400 MHz

Vivado X.X xc7a100t-2csg324
Minimum area, combinatorial only. 76 LUTs
Logical pipeline of 3 stages. 114 LUTs >380 MHz
Fully pipelined with 4 stages. 147 LUTs >410 MHz

Wow, that's pretty damn impressive. Especially the ISE result is ... well, fast!

Quote from: Someone
It's known that ISE can do a better synthesis job on many designs but its orphaned for device support now and harder to use going forward. But 7 series parts are easily 50-100% faster than Spartan 6 so many designs need to be reassessed for area/speed tradeoff and can be adapted to the new Vivado synthesis at the same time. These results above are using a sort algorithm better suited for FPGA implementation but still written with a high level functional description in VHDL, so its not necessary to get down to gate level descriptions but rather knowing how to map to resources allows you to design for minimum area while still using high level constructs.

Does this sort algorithm have a name? I'd guess maybe a bitonic sort network, but even then >540 MHz on a Spartan-6 is a neat result. :) A somewhat related question: do you know of any good books or other reference material where one can go and read up on the various parallel algorithms? Specifically with an eye to FPGA implementation, but if there's a good compendium of handy circuits for VLSI then that's certainly better than what I have now. I find that, as with any problem really, a large part of the job is "pick the right data structure and algorithm / representation and possible operators". I don't mind reinventing the wheel every now and then, provided the payoff is some extra insight that can be used in future projects. But every once in a while it would be nice just to be able to browse the catalog as it were, read up on several ways to get the computation of the day done, and then pick one. Then "all" you have to do is not fsck up the implementation. Which can be enough of a challenge already. Especially without coffee.
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4531
  • Country: au
    • send complaints here
Re: Learning FPGAs: wrong approach?
« Reply #219 on: February 01, 2018, 07:36:01 am »
Does this sort algorithm have a name? I'd guess maybe a bitonic sort network, but even then 580 MHz on a spartan-6 is neat result. :) A somewhat related question, do you know of any good books or other forms of reference material where one can go and read up on the various parallel algorithms? Specifically with an eye to fpga implementation, but if there's a good compendium of handy ciruits for VLSI then that's certainly better than what I have now.
Even working to a specific sort algorithm, it still takes a lot of experience to map it efficiently to primitives, and several networks can achieve the same result:
https://en.wikipedia.org/wiki/Bitonic_sorter
https://en.wikipedia.org/wiki/Batcher_odd–even_mergesort
https://en.wikipedia.org/wiki/Pairwise_sorting_network
(https://en.wikipedia.org/wiki/Sorting_network)
any one might be optimal for the particular network/data size or platform. For algorithm design there aren't canned examples like there are with analog circuits, as the assumptions/constraints that can be used to optimise a given problem are tightly intertwined with the implementation; it's always good to spend some time looking at possible ways to solve the problem before committing too much effort to any single one.
 

Offline mrflibble

  • Super Contributor
  • ***
  • Posts: 2051
  • Country: nl
Re: Learning FPGAs: wrong approach?
« Reply #220 on: February 01, 2018, 08:15:24 pm »
For algorithm design there aren't canned examples like with analog circuits as assumptions/constraints which can be used to optimise any given problem are tightly intertwined with the implementation, its always good to spend some time looking at possible ways to solve the problem before committing too much effort into any single one.
Oh, I don't expect any canned examples. Besides, where would be the fun in that? Fully agreed on spending some time on multiple different ways to solve it. I guess the point I was trying to make is that you can only spend time on those multiple different ways if you actually know they exist. Basically I would already be happy with a dictionary of algorithms usable on programmable logic, each with a one-line description. At least then I'd have a term I can google and hunt for papers to read. Right now it's a case of you don't know what you don't know... For example, a fat tree encoder is damn handy, but I don't think I'd ever have come across that on the software side of algorithms. And the hardware side is definitely less accessible. Well, that or I need glasses + a google refresher course or something...
 

Offline mrflibble

  • Super Contributor
  • ***
  • Posts: 2051
  • Country: nl
Re: Learning FPGAs: wrong approach?
« Reply #221 on: February 04, 2018, 04:26:51 pm »
While working out the logic bits for another project, I just realized that I totally missed something with the sorting circuit. :palm: At the time I was feeling all clever and stuff, because I had just optimized the way of doing a comparison. Before that I had actually described the hardware as per the hardware-description-language mantra, so (a < b). Not totally unexpected, that gave crap timing. So I worked out what a comparison actually is in arithmetic and implemented that. Yup, definitely better. Hence the feeling all clever and stuff. Only to realize now that while yes, that may have been better, I could have done in 1 slice what I did there using 2 whole frigging slices. Doh! Well great, now I have to try that as well. Curse you, curiosity!
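For anyone wondering what "what a comparison actually is in arithmetic" means: an unsigned a < b is just the borrow out of a - b, which the tools can drop straight onto the carry chain. Something along these lines (an illustrative sketch, not the exact circuit I ended up with; the names are made up):

Code: [Select]
-- "A comparison is just a subtraction": for unsigned values, a < b exactly
-- when the width-extended subtraction a - b goes negative (i.e. borrows).
-- On most FPGAs this maps onto the dedicated carry chain.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lt_carry is
  generic ( WIDTH : positive := 8 );
  port (
    a, b   : in  std_logic_vector(WIDTH-1 downto 0);
    a_lt_b : out std_logic
  );
end entity;

architecture rtl of lt_carry is
  signal diff : unsigned(WIDTH downto 0);
begin
  -- one extra bit catches the borrow
  diff   <= ('0' & unsigned(a)) - ('0' & unsigned(b));
  a_lt_b <= diff(WIDTH);  -- borrow set => a < b
end architecture;

Whether that actually beats a plain (a < b) depends on what the synthesizer was already doing with it - hence the slice-count surprise.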

Incidentally, I was just checking the timing report of the old circuit. What kind of clock uncertainty to use? Would be good to compare apples with apples. Or just reduce every inconvenience to zero and see how high the numbers get. Because if benchmarks have taught me anything, it is that higher numbers moar better.
 

