  • EEVblog #58 – Warm and Fuzzy FPGA Troubleshooting

    Posted on January 31st, 2010 EEVblog 23 comments

    Look, up in the sky… is it a hardware fault? Is it a software fault? No, it’s a bloody FPGA.

    • http://truthspew.wordpress.com Tony P

      Wooo.. you’re asking not just for a weasel but a lightly grilled weasel. Software expects hardware to be the same, and if you have to account for all the glitches in silicon you’re going to get software bloat.

      • http://www.eevblog.com EEVblog

        Software expects hardware to be the same and if you have to account for all the glitches in silicon you…

    • Richard Nienhuis

      Interesting blog. It sounds like some weird kind of race condition when everything boots up.

      As for the differences between batches, why not call the application engineers at Altera? They might know something about what you are running into between different batches. Even if it doesn’t affect your current work, it’s probably something useful to know.

    • SiliconFarmer

      Another great vblog. I really liked the walk-through of your debugging process.

      But your conclusion is wrong. The problem isn’t the software. The problem isn’t even the different batches of FPGAs. The problem is the FPGA design methodology someone is using for your project.

      Your problem stems from the FPGA designer’s failure to do a careful timing analysis of the FPGA design after place and route. (It is also possible that he mistakenly classified a path as multi-cycle.) Note that this can also lead to strange temperature-related failures.

      I believe, no, in fact I am certain, that there is a timing path somewhere in the FPGA design that is not meeting the required propagation delay determined by your FPGA’s clock rate. Or, even worse, the designer may be using asynchronous logic and some data path is losing the race with the control path. Using synchronous design techniques is very important in FPGAs.

      It is extremely likely that both batches of FPGAs meet Altera’s timing specs, but one batch is slightly faster than the other. This difference lets one batch work even with a timing issue in the design, while the other batch does not.

      Check the timing analysis of the FPGA design after place and route, not the estimated, pre-route numbers. You will find the root of your problem. You may find dozens of other problems (timing errors) that haven’t popped up yet, but will with various FPGA batches and/or warmer temperatures.
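
      As a rough illustration of what I mean by synchronous design (the module and signal names here are just made up for the example, not taken from Dave’s project), any signal that arrives from outside the clock domain should pass through something like a two-flop synchroniser before the control logic ever sees it, so nothing inside has to win a race:

      module sync_2ff (
          input  wire clk,       // the constrained system clock
          input  wire async_in,  // signal from a pin or another clock domain
          output reg  sync_out   // version that is safe to use in the clk domain
      );
          reg meta;              // first stage; this one is allowed to go metastable

          always @(posedge clk) begin
              meta     <= async_in;
              sync_out <= meta;
          end
      endmodule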

      • http://www.eevblog.com EEVblog

        What you say is correct in its own right, but it is not the entire issue here.
        My story didn’t actually have a technical conclusion; it’s just a story about what happened up to this point and the need to be extra careful with these sorts of “soft” FPGA hardware designs.
        The code running on the synthesized 32-bit soft-core processor can also inherently have problems.
        I’ve seen essentially the exact same problem on hard silicon processors, so the problem is not necessarily limited to FPGA timing issues.

    • http://robotics.ong.id.au Stephen

      Hi Dave,

      Did you try reducing the FPGA clock speed? If the original “firmware” works with a reduced clock speed, then the FPGA design’s timing margin is too tight.

    • http://www.freaklabs.org Akiba

      If the hardware worked properly previously with the compiler switch enabled, then most likely you didn’t find the problem.
      You can pretty much bet that it’s a timing issue. As Silicon Farmer said, the design needs to be run through the static timing analyzer, and any paths that violate your timing need to be checked for validity.
      If a valid violation is found, the path would need to be recoded using proper synchronous techniques, minimizing the propagation delay of the combinational logic.
      That would be the correct way to do it.
      Otherwise, if you ever recompile your FPGA code, you can’t guarantee that this problem won’t occur again.
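
      A minimal sketch of the kind of recoding I mean (names are illustrative, not from the actual design): if one combinational path, say a multiply feeding an add, is too slow, register the intermediate result so each stage gets a full clock period after place and route:

      module pipelined_mac (
          input  wire        clk,
          input  wire [15:0] a, b,
          input  wire [31:0] c,
          output reg  [31:0] result
      );
          reg [31:0] product;          // pipeline register splitting the long path

          always @(posedge clk) begin
              product <= a * b;        // stage 1: the multiply gets its own cycle
              result  <= product + c;  // stage 2: the add gets its own cycle
              // note: c is combined with the previous cycle's a*b; delay c
              // externally if that alignment matters in a real design
          end
      endmodule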

      • Machina

        As they say in software,

        If you didn’t fix it, then it ain’t fixed.

    • DavidDLC

      I have a lot of similar stories to tell. Something I hate is that we (the ones who have to debug the problem) always test everything possible (as you mention); sometimes we even double-check or use different test systems. And in my case, when I report that it is a software issue, the first question they ask me is: are you sure your test system is OK? Are you using the correct things? Man!!! I hate that, I already double-checked it.

      Good video blog.

    • Ben

      I’m a micro guy. Now I’m thinking about buying an FPGA dev kit to start learning, but I keep hearing about this sort of thing, about how FPGAs do weird stuff sometimes, especially soft-core CPUs.
      Do you think I should forget FPGAs and continue my work on micros? Is it really worth learning Verilog, VHDL, installing Eclipse, buying books and all that to learn FPGAs?
      BTW, can you point out some good FPGA dev kits for starters?
      Thanks!
      Ben.

      • http://www.eevblog.com EEVblog

        Hi Ben
        FPGAs are just another means to an end, unless you want to play for the sake of playing.
        If you are just running a CPU soft core in the FPGA and not much else, then really you are wasting your time, as the micro will be faster, more power efficient, and easier to use.
        FPGAs come into their own in more niche applications that require the integration of all sorts of special-purpose logic that would be much slower to implement in code than in hardware. And you get the advantage of having an essentially completely reconfigurable system. Massively parallel processing systems, or super-high-speed serial data processing, are two examples where FPGAs are worthwhile.
        But FPGAs are not magic and never will be; that’s why they haven’t eclipsed general-purpose microcontrollers for general-purpose applications.
        And FPGA system design and the associated tools are a complex business, but there are easier options like the Altium NB3000, for instance, which is trying to make FPGAs easy for the masses.

        • http://kaqelec.dyndns.org Kathy Quinlan

          Sigh. In the Kave I have an NB1 and an NB2, and I missed winning the NB3000 at the last Perth seminar…

          I have never used them for anything productive. I also have a Spartan 3 dev board from Memec. I have NEVER found a job that I cannot do in a uC (mainly Atmel AVR).

          I have seen newbies complain that, e.g., the AVR is not fast enough for a task… I can remember doing the same task on an 8051 running at 8 MHz.

          I think for MOST applications FPGAs are overkill. Most systems I have designed commercially that needed more power just had extra uCs put in: one product had two Mega128s, one Tiny26, and one 8051 (a custom IrDA processor and layer chipset), and I think the Bluetooth module had an MSP430-series chip.

          The project started out with an ATMEL flipchip (FPGA and Mega 128 AVR on one die) but we were able to rewrite the entire print head driver in ASM and do away with the FPGA side and just use the Mega128.

          I would love to get into FPGA and find a decent use for one (not just throwing it in for the hell of it)

          Regards,

          Kat.

          • Mr. X

            Convert the Spartan 3 into a logic analyzer http://www.sump.org/projects/analyzer/ or http://code.google.com/p/cheapla/

            • http://kaqelec.dyndns.org Kathy Quinlan

              This would be a “for the sake of it” project, as I already own an LA ;)

              If you look at the EEVblog forum, you will see a topic about a community bench meter; this will have an LA module (this idea is turning into an entire lab ;)

              Regards,

              Kat.

    • Janne

      I have worked with this very same Altera chip for the last two projects (actually, three projects have been FPGA projects, but the last uses a different chip).

      I think that when building the hardware, one should always think first about how the software is going to use a particular feature. If at all possible, I tend to test the features myself, so I can show the SW guys where the problem lies when they come to me complaining that something isn’t working properly :)

      Terminating the unused inputs safely is one thing. If they are left floating, things like Dave mentioned do happen, and they can cause all sorts of weird behaviour, as the pin state then depends on the leakage currents in the chip, which can vary a lot. Altera FPGAs have a switchable pull-up feature which can often come to the rescue if the board design is already fixed.
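
      As a hedged example (the exact assignment name can differ between Quartus versions, and the pin name here is made up, so treat it as a sketch and check the Assignment Editor), the weak pull-up can be requested from the HDL on the offending pin, or with the equivalent line in the .qsf:

      // Illustrative only: ask Quartus to enable the weak pull-up on an otherwise
      // floating spare input, so its level no longer depends on leakage current.
      // The .qsf equivalent would be something like:
      //   set_instance_assignment -name WEAK_PULL_UP_RESISTOR ON -to spare_in
      module spare_input_tieoff (spare_in, spare_level);
          (* altera_attribute = "-name WEAK_PULL_UP_RESISTOR ON" *)
          input  wire spare_in;
          output wire spare_level;

          assign spare_level = spare_in;  // route it somewhere harmless
      endmodule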

      I have also experienced weird behaviour with FPGAs. Here are a couple of stories.

      The first strange one involved an ISP1582 USB transceiver connected to the FPGA. Sometimes during operation the USB transceiver would just die and start to draw excessive current. Note that there were no connections to the USB transceiver other than the parallel bus to the FPGA and the USB itself. After some investigation we noticed that setting the Cyclone II IOs to their maximum output current strength produced such fast rise times, and so much overshoot, that it killed the USB transceiver after some use. Reducing the output current strength from 24 mA down to 8 mA solved this.

      The second one turned out to be a signal integrity problem. The FPGA was driving a bus where three chips shared the same /WR signal. The problem was that the DSP connected to the bus sometimes performed two writes, although there should have been only one.

      I didn’t find the reason until I measured the signal using a 6 GHz oscilloscope with an active probe. And there it was: an about 1 ns long flat plateau on the falling edge of the /WR signal, right in the critical voltage region. So the DSP was fast enough to see it as two separate edges.

    • Brian Hoskins

      It really annoys me when a datasheet doesn’t show all of the relevant package markings, so that you have numbers printed on the device that are impossible to identify. They could be relevant, or totally irrelevant. Who knows? But the seed of doubt is sown, with no way to get rid of it.

      Annoying.

      Brian

    • Neil

      Most of my experience is with CPLDs. One of the weirdest faults I had was on a project a few years ago. The CPLD was doing various background tasks – monitoring inputs, outputting signals dependent on them, and reporting all of this to a microprocessor.

      The problem I had: it could not count to 5. One task it did was to output up to five 80 ms pulses with 300 ms intervals, the number dependent on the input. The logic worked fine until the microprocessor was connected.

      Turned out that there was a noise problem either with the chip or on the board. I turned the slew rate on the outputs down and it all worked.

      Experience is best appreciated after your knuckles have healed…

    • http://www.nialstewartdevelopments.co.uk Nial Stewart

      Dave,

      A properly designed and constrained FPGA design should behave exactly the same every build, no matter which batch of device you use.

      The timing figures Quartus uses _should_ be the worst-case values for voltage, temperature and process; if the design is constrained properly, it _will_ work.

      I’d bet you a pint you aren’t constraining something on the FPGA properly.

      BTW, are you using Altium Designer to put the FPGA together?

      Nial

    • SiliconFarmer

      When to consider using an FPGA: when you need absolute determinism – X must happen exactly N clocks after Y. Or when nanoseconds matter. Or when there is a lot of parallelism. Or when there is a lot of pipelining opportunity. Or if the system is mostly idle, then quickly needs a burst of speed. This last case is an example of where an FPGA can be lower power than a cluster of MCUs.

      One important difference between FPGAs and just about any other hardware is understanding routing delays. In most logic design, whether with CPLDs, ASICs, or MSI, you worry about the propagation delay of the gates, with some allowance for capacitive load slowing down the signal transitions.

      In FPGAs, the routing goes through programmable elements, which can have delays on the order of the logic delays, or worse, and sometimes (seemingly at random) much worse. The routing delays change every time the FPGA design changes. Getting asynchronous logic to work can be a nightmare. Carefully checking post-route timing is an absolute must.
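
      To make the determinism point concrete (names invented for the example), this sort of fixed-latency block is trivial in an FPGA and takes exactly N clocks every time, with no interrupts or cache effects to worry about:

      module fixed_delay #(
          parameter N = 3                    // "X" happens exactly N clocks after "Y"; N >= 2 assumed here
      ) (
          input  wire clk,
          input  wire trigger_in,            // "Y"
          output wire pulse_out              // "X", exactly N clocks later
      );
          reg [N-1:0] shift;

          always @(posedge clk)
              shift <= {shift[N-2:0], trigger_in};  // simple shift-register delay line

          assign pulse_out = shift[N-1];
      endmodule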

    • Ken

      Hey Dave, I like your videos.

      Your pictures of C3 devices show what look like two chips with different silicon lot numbers produced around the same time frame. In theory these chips should be the same.

      I inferred this by reading the product change notifications (PCNs) for Cyclone 3. The first line is the device; the second line contains the silicon type (65 nm vs 60 nm) and date code, among other things; the third line is the silicon lot number; and the last line is probably more manufacturing details, but I saw no PCNs that referred to this line, so I cannot guess its purpose.

      Regards,
      Ken

    • Eric

      I know I’m late to the party, but I have to compliment you on your parting quote: “My favorite programming language is solder.” That quote is going right above my bench at work. Genius.