EEVblog #58 – Warm and Fuzzy FPGA Troubleshooting
Posted on January 31st, 2010 · 23 comments
Look, up in the sky! Is it a hardware fault? Is it a software fault? No, it’s a bloody FPGA!
Wooo… you’re asking not just for a weasel but a lightly grilled weasel. Software expects hardware to be the same, and if you have to account for all the glitches in silicon, you’re going to get software bloat.
Interesting blog. It sounds like some weird kind of race condition when everything boots up.
As for the differences between batches: why not call the application engineers at Altera? They might know something about what you are running into between different batches. Even if it doesn’t have an effect on your current work, it’s probably something useful to know.
Another great vblog. I really liked the walk-through of your debugging process.
But your conclusion is wrong. The problem isn’t the software. The problem isn’t even the different batches of FPGAs. The problem is the FPGA design methodology someone is using for your project.
Your problem stems from the FPGA designer’s failure to do a careful timing analysis of the FPGA design after place and route. (It is also possible that he mistakenly classified a path as multi-cycle). Note that this can also lead to strange temperature related failures.
I believe, no, in fact I am certain, that there is a timing path somewhere in the FPGA design that is not meeting the required propagation delay determined by your FPGA’s clock rate. Or, even worse, the designer may be using asynchronous logic and some data path is losing the race with the control path. Using synchronous design techniques is very important in FPGAs.
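To illustrate the synchronous technique Silicon Farmer is advocating: any signal that arrives asynchronously (from a pin or another clock domain) should pass through a two-flop synchronizer before touching the rest of the design. This is a generic sketch; the module and signal names are mine, not from the project under discussion.

```verilog
// Two-flop synchronizer: brings an asynchronous input safely into
// the clk domain. The first flop may go metastable; the second flop
// gives it a full clock period to resolve before the rest of the
// design ever sees the signal.
module sync2 (
    input  wire clk,
    input  wire async_in,   // from a pin or another clock domain
    output reg  sync_out    // safe to use anywhere in the clk domain
);
    reg meta;
    always @(posedge clk) begin
        meta     <= async_in;  // may violate setup/hold -> metastable
        sync_out <= meta;      // resolved by the next clock edge
    end
endmodule
```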
It is extremely likely that both batches of FPGAs meet Altera’s timing specs, but one batch is slightly faster than the other. This difference allows one batch to work even with a timing issue in the design, while the other batch does not.
Check the timing analysis of the FPGA design after place and route, not the estimated, pre-route numbers. You will find the root of your problem. You may find dozens of other problems (timing errors) that haven’t popped up yet, but will with various FPGA batches and/or warmer temperatures.
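For Altera’s flow, those post-route numbers only mean something if every clock is actually constrained; an unconstrained path is silently ignored by the analyzer. A minimal sketch of a timing constraint (SDC) file follows. The clock name, period, port name, and register names are placeholders, not values from Dave’s project.

```tcl
# Declare the real board clock so the timing analyzer checks every
# register-to-register path against it.
create_clock -name sys_clk -period 20.000 [get_ports clk_50mhz]

# Derive constraints for PLL output clocks from the input clock above,
# and add the recommended clock uncertainty margins.
derive_pll_clocks
derive_clock_uncertainty

# Only after analysis proves a path genuinely has two clock cycles
# available should it be declared multi-cycle. Misusing this exception
# is exactly the mistake mentioned above.
# set_multicycle_path -setup 2 -from [get_registers slow_src*] -to [get_registers slow_dst*]
```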
Did you try reducing the FPGA clock speed? If the original “firmware” works at a reduced clock speed, then the design’s timing margin is inadequate.
If the hardware worked properly previously with the compiler switch enabled, then most likely you didn’t find the problem.
You can pretty much bet that it’s a timing issue. As Silicon Farmer said, the design needs to be run through the static timing analyzer, and any paths that violate your timing requirements need to be checked for validity.
If a valid violation is found, the path needs to be recoded using proper synchronous techniques, minimizing the propagation delay through the combinational logic.
That would be the correct way to do it.
Otherwise, if you ever recompile your FPGA code, you can’t guarantee that this problem won’t occur again.
As they say in software,
If you didn’t fix it, then it ain’t fixed.
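The recoding step described above usually means splitting one long combinational path across pipeline registers. A hypothetical before/after sketch (module, parameter, and signal names are mine):

```verilog
// Before: three operands multiplied in one clock cycle -- the whole
// combinational path must settle within a single period, which may
// fail post-route:
//   always @(posedge clk) result <= a * b * c;
//
// After: the same computation pipelined over two cycles. Each stage's
// logic depth is roughly halved, improving timing margin at the cost
// of one extra cycle of latency.
module pipelined_mult #(parameter W = 16) (
    input  wire           clk,
    input  wire [W-1:0]   a, b, c,
    output reg  [3*W-1:0] result
);
    reg [2*W-1:0] stage1;
    reg [W-1:0]   c_d;               // delay c to stay aligned with stage1
    always @(posedge clk) begin
        stage1 <= a * b;             // cycle 1
        c_d    <= c;
        result <= stage1 * c_d;      // cycle 2
    end
endmodule
```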
I have a lot of similar stories to tell. Something I hate is that we (the ones who have to debug the problem) always test everything possible (as you mention); sometimes we even double-check or use different test systems. In my case, when I report it as a software issue, the first questions they ask me are: Are you sure your test system is OK? Are you using the correct setup? Man!!! I hate that; I already double-checked it.
Good video blog.
I’m a micro guy. Now I’m thinking about buying an FPGA dev kit to start learning, but I keep hearing about this sort of thing, how FPGAs do weird stuff sometimes, especially soft-core CPUs.
Do you think I should forget FPGAs and continue my work on micros? Is it really worth learning Verilog, VHDL, installing Eclipse, buying books and all that to learn FPGAs?
BTW, can you point out some good FPGA dev kits for starters?
I have worked with this very same Altera chip for the last two projects (actually, three projects have been FPGA projects, but the last one used a different chip).
I think that when building the hardware, one should always think first about how the software is going to use a particular feature. If at all possible, I tend to test the features myself, so I can show the SW guys where the problem lies when they come to me complaining that something isn’t working properly.
Terminating unused inputs safely is one thing. If they are left floating, things like Dave mentioned do happen, and they can cause all sorts of weird behaviour, as the pin state then depends on the leakage currents in the chip, which can vary greatly. Altera FPGAs have a switchable pull-up feature which can often come to the rescue if the board design is already fixed.
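For reference, the switchable pull-up mentioned here is a per-pin assignment in Quartus; in the settings (.qsf) file it looks something like this (the pin name is a placeholder):

```tcl
# Enable the on-chip weak pull-up so a floating or unused input pin
# sits at a defined level instead of drifting with leakage current.
set_instance_assignment -name WEAK_PULL_UP_RESISTOR ON -to spare_in[0]
```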
I have also experienced weird behaviour with FPGAs. Here are a couple of stories.
The first strange one: there was a USB transceiver (an ISP1582) connected to the FPGA. Sometimes during operation the USB transceiver would just die and start to draw excessive current. Note that there were no connections to the USB transceiver other than the parallel bus to the FPGA and the USB itself. After some investigation we noticed that setting the Cyclone II I/Os to their maximum output current strength produced such fast rise times, with so much overshoot, that it killed the USB transceiver after some use. Reducing the output current strength from 24 mA down to 8 mA solved this.
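The drive-strength fix described above is also a per-pin Quartus assignment. A sketch of the .qsf lines, using assignment names as found in Cyclone II-era Quartus (the pin group name is a placeholder):

```tcl
# Drop the output driver from the 24 mA maximum to 8 mA to slow the
# edges and tame the overshoot that was stressing the USB transceiver.
set_instance_assignment -name CURRENT_STRENGTH_NEW "8MA" -to usb_data[*]
# A slow slew-rate setting further reduces ringing, if the bus timing
# budget allows the slower edges.
set_instance_assignment -name SLOW_SLEW_RATE ON -to usb_data[*]
```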
The second one turned out to be a signal integrity problem. The FPGA was driving a bus where three chips shared the same /WR signal. The problem was that the DSP connected to the bus sometimes performed two writes when there should have been only one.
I didn’t find the reason until I measured the signal using a 6 GHz oscilloscope with an active probe. And there it was: an approximately 1 ns long flat plateau on the /WR signal’s falling edge, right in the critical voltage region. So the DSP was fast enough to see that plateau as two separate edges, hence the double writes.
Ouch, that is very nasty. Good find!
It really annoys me when a datasheet doesn’t show all of the relevant package markings, so that you have numbers printed on the device that are impossible to identify. They could be relevant, or totally irrelevant. Who knows? But the seed of doubt is sown, with no way to get rid of it.
Most of my experience is with CPLDs. One of the weirdest faults I had was on a project a few years ago. The CPLD was doing various background tasks – monitoring inputs, outputting signals dependent on them, and reporting all of this to a microprocessor.
The problem I had: it could not count to 5. One task it did was to output up to five 80 ms pulses with 300 ms intervals, the number depending on the input. The logic worked fine until the microprocessor was connected.
Turned out that there was a noise problem either with the chip or on the board. I turned the slew rate on the outputs down and it all worked.
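A hypothetical sketch of the kind of pulse-train logic the CPLD was implementing: N pulses of 80 ms separated by 300 ms gaps, counted with a fully synchronous state machine. The clock rate, module, and signal names are all assumptions, not details from the project described above.

```verilog
// Emits `count` pulses of 80 ms, spaced 300 ms apart, on `pulse_out`.
// Fully synchronous: one clock, registered outputs, no gated clocks.
module pulse_train #(
    parameter CLK_HZ   = 1_000_000,          // assumed 1 MHz clock
    parameter HIGH_CYC = CLK_HZ / 1000 * 80, // 80 ms high time
    parameter GAP_CYC  = CLK_HZ / 1000 * 300 // 300 ms gap
) (
    input  wire       clk,
    input  wire       start,
    input  wire [2:0] count,      // 1..5 pulses requested
    output reg        pulse_out
);
    reg [2:0]  left;              // pulses remaining
    reg [31:0] timer;
    reg        busy;

    always @(posedge clk) begin
        if (start && !busy) begin
            busy      <= 1'b1;
            left      <= count;
            timer     <= HIGH_CYC;
            pulse_out <= 1'b1;
        end else if (busy) begin
            if (timer != 0)
                timer <= timer - 1;
            else if (pulse_out) begin        // end of a high phase
                pulse_out <= 1'b0;
                left      <= left - 1;
                if (left == 1)
                    busy <= 1'b0;            // that was the last pulse
                else
                    timer <= GAP_CYC;
            end else begin                   // end of a gap
                pulse_out <= 1'b1;
                timer     <= HIGH_CYC;
            end
        end
    end
endmodule
```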
Experience is best appreciated after your knuckles have healed…
A properly designed and constrained FPGA design should behave exactly the same every build, no matter which batch of device you use.
The timing figures Quartus uses _should_ be the worst case values for voltage, temperature and process, if the design is constrained properly it _will_ work.
I’d bet you a pint you aren’t constraining something on the FPGA properly.
BTW, are you using Altium Designer to put the FPGA together?
When to consider using an FPGA: when you need absolute determinism – X must happen exactly N clocks after Y. Or when nanoseconds matter. Or when there is a lot of parallelism. Or when there is a lot of pipelining opportunity. Or when the system is mostly idle but quickly needs a burst of speed. This last case is an example of where an FPGA can be lower power than a cluster of MCUs.
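A minimal Verilog illustration of the “exactly N clocks after Y” determinism: a fixed shift-register delay line, whose latency is guaranteed by construction rather than by software scheduling (names and the value of N are mine):

```verilog
// Output fires exactly N clock cycles after the input event, every
// time. A shift register has no data-dependent paths, so there is
// nothing to preempt and nothing to jitter. Assumes N >= 2.
module fixed_delay #(parameter N = 8) (
    input  wire clk,
    input  wire event_in,
    output wire event_out
);
    reg [N-1:0] shift;
    always @(posedge clk)
        shift <= {shift[N-2:0], event_in};
    assign event_out = shift[N-1];
endmodule
```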
One important difference between FPGAs and just about any other hardware is the need to understand routing delays. In most logic design (CPLD, ASIC, MSI) you worry about the propagation delay of the gates, with some allowance for capacitive load slowing down the signal transitions.
In FPGAs, the routing goes through programmable elements, which can have delays on the order of, or worse, and sometimes (randomly) much worse, than the logic delays. The routing delays change every time the FPGA design changes. Getting asynchronous logic to work can be a nightmare. Carefully checking post-route timing is an absolute must.
Hey Dave, I like your videos.
Your pictures of C3 devices show what look like two chips with different silicon lot numbers produced around the same time frame. In theory these chips should be the same.
I inferred this by reading the product change notifications (PCNs) for Cyclone III. The first line is the device; the second line contains the silicon type (65 nm vs 60 nm) and date code, among other things; the third line is the silicon lot number; and the last line is probably more manufacturing details, but I saw no PCNs referring to that line, so I cannot guess its purpose.
I know I’m late to the party, but I have to compliment you on your parting quote: “My favorite programming language is solder.” That quote is going right above my bench at work. Genius.