I'll preface this with the caveat that I have no idea what you're trying to debug, or whether the issues are in the design itself; also, you asked for simple, so this might be more in the realm of "complicated but extremely useful given the right scenario".
For the purpose at hand I did find the bug that prompted me to start this thread.
What I was doing is nothing too complicated FPGA-wise: a multi-channel, GPS-disciplined camera trigger controller listening on an I2C bus. An MCU could kind of do it, but it would be a bitch to get the timings tight to less than 100 ns. Because it's proprietary I couldn't just drop a GPL I2C core in; I had to write it myself, and the NXP I2C spec is not ultra clear (e.g. exactly when does the master release SDA so that the slave may ACK?).
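For what it's worth, my reading of the spec is that the master releases SDA after the falling SCL edge that ends the eighth data bit, so the slave can pull the line low during the ninth clock. A toy Python model of one byte transfer on an ideal open-drain bus (the function names and the ideal-bus resolution are mine, not from the spec; treat it as a sketch of the timing, not a reference):

```python
def bus(*drv):
    # Open-drain resolution: the wire reads high (pull-up) unless
    # some device actively drives it low.
    return 0 if 0 in drv else 1

def write_byte(byte, slave_acks=True):
    """One I2C byte, MSB first. The master drives SDA for bits 7..0,
    then releases SDA (i.e. lets the pull-up win) after the eighth
    falling SCL edge, so the slave can pull it low during the ninth
    clock to ACK. Returns the SDA levels sampled at each SCL-high
    phase, plus whether the master saw an ACK."""
    wire = []                        # SDA sampled at each SCL-high
    for i in range(7, -1, -1):
        m = (byte >> i) & 1          # master's SDA during data bits
        wire.append(bus(m, 1))       # slave keeps SDA released
    m = 1                            # ninth clock: master releases SDA
    s = 0 if slave_acks else 1       # slave pulls low to ACK
    wire.append(bus(m, s))
    ack = (wire[-1] == 0)            # ACK = SDA low on the 9th clock
    return wire, ack
```

A NACK then just falls out of the model: if the slave never drives SDA on the ninth clock, the pull-up keeps the line high and the master reads 1.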
Here's my typical steps for writing something.
1. Lint design as I write it (I like Verilator's linting tool a lot)
Check; I compile with yosys and iverilog with -Wall, but I couldn't get Verilator to like the SiliconBlue cell library simulation models.
2. Testbench for each individual module you have within the design (and the associated regression suite, something simple like Make with some grepping works fine)
Almost check: I have benches for the complex modules but not for simple ones where I don't suspect bugs (e.g. the clock divider or reset controller).
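The regression glue really doesn't need to be more than a few lines of "Make with some grepping". A Python sketch of the same idea, assuming the convention that every bench $display's PASS or FAIL (the iverilog/vvp flags are illustrative, not gospel):

```python
import glob
import subprocess

def bench_passed(log_text):
    """A bench passes iff its log contains at least one PASS
    and no FAIL -- the 'grep the output' convention."""
    lines = log_text.splitlines()
    return (any("PASS" in l for l in lines)
            and not any("FAIL" in l for l in lines))

def run_bench(tb_file):
    """Compile and run one iverilog testbench, return (name, ok)."""
    subprocess.run(["iverilog", "-Wall", "-o", "a.out", tb_file],
                   check=True)
    out = subprocess.run(["vvp", "a.out"],
                         capture_output=True, text=True)
    return tb_file, bench_passed(out.stdout)

def regression():
    """Run every tb_*.v in the current directory (naming is mine)."""
    return [run_bench(f) for f in sorted(glob.glob("tb_*.v"))]
```

The nice property is that adding a bench to the regression is just dropping a file with the right name; nothing else to maintain.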
3. Testbench for top level of design also added to regression.
Haven't been doing that but I think I should put the effort into it.
4. Check timing analysis as you go (Xilinx has this, Lattice must have it in their real tool, I'm not sure if the opensource tools support this at all)
icetime gives maximum frequencies and critical path delays, but with generated symbol names it's kind of unreadable; it looks like this:
2.793 ns net_26661 ($abc$52038$techmap\S_GRN.$3\q[31:0][14]_new_inv_)
odrv_7_18_26661_26803 (Odrv4) I -> O: 0.372 ns
t7449 (Span4Mux_v4) I -> O: 0.372 ns
t7448 (LocalMux) I -> O: 0.330 ns
inmux_9_17_38161_38188 (InMux) I -> O: 0.260 ns
lc40_9_17_3 (LogicCell40) in0 -> lcout: 0.449 ns
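Unreadable as it is, the report is regular enough to post-process. A quick parse of the per-cell lines above into (instance, cell type, delay) tuples; the pattern is guessed from that one report, so adjust it if your icetime output differs:

```python
import re

# Matches icetime critical-path lines like:
#   odrv_7_18_26661_26803 (Odrv4) I -> O: 0.372 ns
CELL = re.compile(r"(\S+)\s+\((\S+)\)\s+\S+\s+->\s+\S+:\s+([\d.]+)\s+ns")

def path_delays(report):
    """Return ([(instance, cell_type, delay_ns), ...], total_ns)
    for every per-cell hop found in an icetime report."""
    hops = []
    for line in report.splitlines():
        m = CELL.search(line)
        if m:
            hops.append((m.group(1), m.group(2), float(m.group(3))))
    return hops, round(sum(d for _, _, d in hops), 3)
```

From there it's easy to sort hops by delay or histogram the cell types to see where the path is actually spending its time.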
5. Gate level sim with generated .sdf file (depending on the scope of your project, doing GLS on the entire thing might be a bit much)
So with laugensalm's tip I can run LUT-level sims with iverilog, but I'm not sure it really adds anything over a top-level Verilog simulation. I'll have to check whether the SiliconBlue cell libs have timing information and how well iverilog makes use of it.
With all of those steps done when you put the design on the part it should work assuming your testbenches properly tested the behavior. If your design doesn't work and you truly can't sim it then there are a few options depending on the complexity of the design.
Like I was saying, the main issue is when your FPGA has to talk to the outside world (and it usually does, otherwise why are you using an FPGA?), and then what? You have to model all the external things in Verilog too, if you can. Some people said "co-simulation" but I'm not sure I understand the concept.
1. Simple Designs - UART or Blinky or Pins and logic analyzer for debug as you mentioned
Done
2. Complicated Bus Based Design - Internal Bus Scope and UART.
UART done; for the next time I think I'll define a JTAG-like daisy-chained debug bus with a fixed number of signals and some preprocessor macros to make it easier. If needed I can modify the Yosys code to have it output what's necessary.
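The daisy-chain idea, roughly: each module gets a shift register snapshotting some internal signal, the registers are chained TDI-to-TDO style, and you clock the whole chain out serially through one pin pair. A toy Python model of the readout (the class names and widths are mine, this is just to show the mechanics):

```python
class ChainNode:
    """One debug node: a shift register holding a captured value."""
    def __init__(self, width, value):
        self.width = width
        self.reg = value            # snapshot to shift out

    def shift(self, tdi):
        """Shift one bit in (LSB-first), return the bit pushed out."""
        tdo = self.reg & 1
        self.reg = (self.reg >> 1) | (tdi << (self.width - 1))
        return tdo

def read_chain(nodes):
    """Clock the whole chain: feed zeros in at the head, collect every
    bit that falls out of the tail. Bits come out LSB-first, last node
    in the chain first."""
    total = sum(n.width for n in nodes)
    out = []
    for _ in range(total):
        bit = 0
        for n in nodes:             # tdi of node k+1 is tdo of node k
            bit = n.shift(bit)
        out.append(bit)
    return out
```

The appeal is that adding a module to the chain costs two wires and a macro instantiation, and the host side only ever needs to know the per-node widths to slice the bitstream back apart.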
Elaborating on 2, most designs make use of some internal bus like AXI or APB or Wishbone, etc. Designs this complex usually have issues within the communications of different masters and slave devices. The Bus Scope is effectively an internal logic analyzer storing every transaction that occurred on the bus for N cycles. The caveat is that it uses resources on your chip, and the depth is directly related to how much ram you can give up. Your UART will act as a master and allow you to probe the bus scope for its internal data. You can set an internal trigger wire to tell the scope to record. A simple python script then formats that data for you to debug.
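To give an idea of the "simple python script" end of that flow, here's the sort of formatter I mean. The capture word layout (32-bit words with a write flag, address, and data field) is entirely made up for the example; map the slicing to whatever your bus scope actually stores per cycle:

```python
def format_trace(words):
    """Pretty-print raw bus-scope capture words.
    Assumed (made-up) layout per 32-bit word:
      bit  31    : 1 = write, 0 = read
      bits 30-16 : address
      bits 15-0  : data
    """
    rows = []
    for i, w in enumerate(words):
        kind = "WR" if (w >> 31) & 1 else "RD"
        addr = (w >> 16) & 0x7FFF
        data = w & 0xFFFF
        rows.append(f"{i:4d}  {kind}  addr=0x{addr:04x}  data=0x{data:04x}")
    return "\n".join(rows)
```

Pipe the UART dump through this and the trace becomes greppable, which matters once the capture depth gets into the thousands of cycles.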
Like I was saying there's a nice 512 kiB SRAM chip on the dev board, I could use it to store traces.
Another option is to debug it on an ECP5, which has a JTAG port.
ZipCPU has a nice writeup on his Wishbone bus scope (https://github.com/ZipCPU/wbscope) that I enjoyed, and it worked when I tried it out.
I'll check. Instantiating a CPU (a PicoRV32 or something) for debugging/self-test feels a bit excessive, but that's more of a gut feeling; rationally it's not necessarily a stupid idea.
Alternatively, I know Xilinx has an internal scope IP you can drag and drop into your design (https://www.xilinx.com/products/intellectual-property/chipscope_ila.html), and I've been told Altera does as well. Maybe Lattice does too (this seems like it: https://www.latticesemi.com/-/media/LatticeSemi/Documents/UserManuals/RZ/Reveal34UserGuide.ashx?document_id=50887)? The Xilinx one even allows you to bind to individual wires rather than just the bus itself.
Again, this might be a bit much, but I don't really know what you're trying to debug here; knowing might help a bit.
Thanks for the tips. I may end up switching to the proprietary toolchain, but I'm slightly allergic to Windows-style clicky-feely interfaces; I'm more of an Emacs+make-in-a-terminal guy.
BTW, thanks to everyone for the useful information (and the guilt trips about not writing full benches).
The core works, so I'll switch back to the analog side of things.