Author Topic: FPGA failure modes  (Read 4848 times)

Offline AussieBruceTopic starter

  • Regular Contributor
  • *
  • Posts: 60
  • Country: au
FPGA failure modes
« on: December 22, 2021, 04:45:13 am »
I have an application that requires backup to deal with device failure. It uses an ARM micro (LPC1768) and a small sidekick PIC acting as a watchdog. There is an SPI data loop between the two, and if an irregularity is detected, whichever CPU is still working initiates a safe state. There is an incentive to replace the larger CPU with an FPGA for various reasons, but I’m wondering whether the failure modes of that class of device might prejudice integrity. When a computer fails, it generally either stops completely or goes into some sort of chaotic loop; either event will be handled correctly by my present system. However, with a gate array, where functions are performed somewhat autonomously by different areas of the device, I wonder whether there is a risk that a fault may cause one area of function to fail while others continue operating correctly. Is that the case, or do some gate arrays somehow check for irregularities across the entire die?

Incidentally, this is not a ‘certified’ application, i.e. it doesn’t have to comply with official standards. It just has to be ‘significantly better’.
 

Offline james_s

  • Super Contributor
  • ***
  • Posts: 21611
  • Country: us
Re: FPGA failure modes
« Reply #1 on: December 22, 2021, 06:10:18 am »
I have never experienced a failure of any FPGA. I don't think they are any less reliable than most other parts, and it certainly isn't something I'd worry about so long as the part is operated within its specs. If reliability is critical you might want two entirely separate units.
 

Offline Berni

  • Super Contributor
  • ***
  • Posts: 5023
  • Country: si
Re: FPGA failure modes
« Reply #2 on: December 22, 2021, 07:00:40 am »
Chips don't tend to fail in just local areas of the die.

Some old chips are famous for randomly giving up the ghost from old age, but they come from a much more primitive time of semiconductor manufacturing. Things have improved a lot since then, so in general for a chip to stop working you have to expose it to enough external torture: too much voltage or current going through a pin, running it really hot, physical damage, etc. And usually when damage happens on the silicon die, there is a high chance it ends up reaching a power trace and shorting the whole supply rail. This is why dead chips usually get really hot while going completely dead.

In terms of resilience, FPGAs do tend to be better, since a single bit flip caused by a cosmic ray will usually not crash an FPGA in any big way. One could even build logic blocks in triplicate and then use voting to cut away any that misbehave.
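
To make the voting idea a bit more concrete, here is a rough software model in C. It is only a sketch: in a real FPGA this would be written in HDL, and the names `tmr_vote` and `faulty` are made up for illustration. It takes three copies of the same result, returns the bitwise 2-of-3 majority, and reports which copy (if any) disagrees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise 2-of-3 majority vote over three copies of the same 32-bit result.
 * On return, *faulty holds the index (0..2) of the single copy that
 * disagrees with the voted value, -1 if all three agree, or -2 if more
 * than one copy disagrees (the voted word should then not be trusted).
 * All names here are hypothetical. */
uint32_t tmr_vote(const uint32_t copy[3], int *faulty)
{
    assert(copy != NULL && faulty != NULL);

    /* Each output bit is 1 only if at least two of the inputs have it set. */
    uint32_t voted = (copy[0] & copy[1]) | (copy[0] & copy[2]) | (copy[1] & copy[2]);

    int bad = -1;
    for (int i = 0; i < 3; i++) {
        if (copy[i] != voted)
            bad = (bad == -1) ? i : -2;
    }
    *faulty = bad;
    return voted;
}
```

The `faulty` index is what lets you "cut away" a misbehaving copy, for example by resetting just that block while the other two keep the output valid.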

Failure is actually most likely to come from memory. Programmable logic with a burned-in ROM is a thing of the past; today most large programmable logic boots from just regular flash memory. This flash memory gets read in byte by byte to configure the logic blocks and all the junctions between them. The configuration itself is held in what are basically local SRAM cells, so it should be just as susceptible to a cosmic ray bit flip, but it does come back after a power cycle. More of an issue is flash bit rot, since more modern flash crams more bits in, to the point where flash bits might rot away in just a few decades. Once that happens the FPGA might refuse to load that flash image and just sit there uninitialized, doing nothing. However, since flash is only used at boot, a bit rotting away while the FPGA is running won't crash it (unlike a typical MCU that actively runs from flash all the time).

I did something similar before and ended up using an Altera MAX V CPLD for the job, although it's more like a tiny FPGA than a CPLD. The point is to use a simple chip that doesn't have many things that can fail, like a gazillion supply voltages, fancy sensitive PLLs, external memory, etc.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15250
  • Country: fr
Re: FPGA failure modes
« Reply #3 on: December 22, 2021, 05:41:11 pm »
Well, there are a couple of things specific to FPGAs compared to CPUs. Most FPGAs are RAM-based, meaning their whole configuration is held in RAM. So whether RAM content can be corrupted more easily by external events than general logic is a complex matter. A CPU-based system can run entirely from Flash or ROM, but will still need RAM at some point to store data/state. The consequences of corruption can be different, but not necessarily any less severe.

As said above, one thing you can do with FPGAs is implement redundant structures, something that most general-purpose CPUs do not have. So, from that POV, that would give FPGAs the edge, as long as you take those extra steps.

 

Offline x86guru

  • Regular Contributor
  • *
  • Posts: 51
  • Country: us
Re: FPGA failure modes
« Reply #4 on: December 22, 2021, 07:38:16 pm »
FPGAs that use volatile SRAM-based LUTs are susceptible to alpha and neutron radiation, which can corrupt the LUTs and the configuration in general. And since FPGA's configuration static RAM/LUTs do not have ECC, the corruption can't be corrected.
 

Offline vstrakh

  • Contributor
  • Posts: 23
  • Country: ua
Re: FPGA failure modes
« Reply #5 on: December 22, 2021, 07:54:01 pm »
Quote from: x86guru
And since FPGA's configuration static RAM/LUTs do not have ECC, the corruption can't be corrected.

Some FPGAs, like the Cyclone V for example, have countermeasures for Single Event Upsets.
The configuration RAM does have per-block CRCs, and in some cases the device is able to fix a detected error without reloading the whole configuration.

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/an/an866.pdf
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3243
  • Country: ca
Re: FPGA failure modes
« Reply #6 on: December 22, 2021, 11:52:14 pm »
Quote from: vstrakh
Some FPGAs, like the Cyclone V for example, have countermeasures for Single Event Upsets.
The configuration RAM does have per-block CRCs, and in some cases the device is able to fix a detected error without reloading the whole configuration.

Xilinx FPGAs store Hamming codes for configuration blocks. You can configure the FPGA to verify them against values supplied in the bitstream, and if something has changed you can correct it, or possibly reload.
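
For anyone who hasn't met Hamming codes, here is a toy software model in C of how a single flipped bit gets located and fixed. It uses the textbook Hamming(7,4) code, which I am assuming only for illustration; it is not the actual code or block size the devices use, and all the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Encode 4 data bits into a 7-bit codeword. Bit (p-1) of the result is
 * codeword position p (1..7): positions 1, 2 and 4 hold parity,
 * positions 3, 5, 6 and 7 hold data bits d0..d3. */
uint8_t hamming74_encode(uint8_t data)
{
    assert(data < 16);
    uint8_t d0 = data & 1u, d1 = (data >> 1) & 1u;
    uint8_t d2 = (data >> 2) & 1u, d3 = (data >> 3) & 1u;
    uint8_t p1 = d0 ^ d1 ^ d3;   /* parity over positions 3, 5, 7 */
    uint8_t p2 = d0 ^ d2 ^ d3;   /* parity over positions 3, 6, 7 */
    uint8_t p4 = d1 ^ d2 ^ d3;   /* parity over positions 5, 6, 7 */
    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* XOR of the positions of all set bits: 0 for a valid codeword,
 * otherwise the position (1..7) of a single flipped bit. */
uint8_t hamming74_syndrome(uint8_t cw)
{
    uint8_t s = 0;
    for (uint8_t pos = 1; pos <= 7; pos++)
        if ((cw >> (pos - 1)) & 1u)
            s ^= pos;
    return s;
}

/* Flip the bit the syndrome points at, if any. */
uint8_t hamming74_correct(uint8_t cw)
{
    uint8_t s = hamming74_syndrome(cw);
    return s ? (uint8_t)(cw ^ (1u << (s - 1))) : cw;
}

/* Pull the 4 data bits back out of a (corrected) codeword. */
uint8_t hamming74_decode(uint8_t cw)
{
    return (uint8_t)(((cw >> 2) & 1u) | (((cw >> 4) & 1u) << 1) |
                     (((cw >> 5) & 1u) << 2) | (((cw >> 6) & 1u) << 3));
}
```

A plain Hamming code like this can only fix one flipped bit per codeword. Two flips in the same codeword would be "corrected" to the wrong value, which is one reason reloading remains the fallback.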
 

Online jmelson

  • Super Contributor
  • ***
  • Posts: 2822
  • Country: us
Re: FPGA failure modes
« Reply #7 on: December 23, 2021, 12:01:10 am »
I have found FPGAs to be very reliable.  Other than getting 24 V into the board, they just keep working.  And, generally in cases like that, they fail and short the 1.2 and/or 3.3 V power supply rails, and therefore shut down that part of the system.
Jon
 

Offline Bassman59

  • Super Contributor
  • ***
  • Posts: 2501
  • Country: us
  • Yes, I do this for a living
Re: FPGA failure modes
« Reply #8 on: December 23, 2021, 05:40:51 am »
Quote from: x86guru
And since FPGA's configuration static RAM/LUTs do not have ECC, the corruption can't be corrected.
Quote from: vstrakh
Some FPGAs, like the Cyclone V for example, have countermeasures for Single Event Upsets.
The configuration RAM does have per-block CRCs, and in some cases the device is able to fix a detected error without reloading the whole configuration.

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/an/an866.pdf

Only a few of the Cyclone V devices actually support configuration RAM scrubbing. And when I looked at them a few years ago (well before the current sourcing issues) I couldn't actually buy them.
 

Offline Bassman59

  • Super Contributor
  • ***
  • Posts: 2501
  • Country: us
  • Yes, I do this for a living
Re: FPGA failure modes
« Reply #9 on: December 23, 2021, 05:47:18 am »
Quote from: AussieBruce
I have an application that requires backup to deal with device failure. It uses an ARM micro (LPC1768) and a small sidekick PIC acting as a watchdog. There is an SPI data loop between the two, and if an irregularity is detected, whichever CPU is still working initiates a safe state. There is an incentive to replace the larger CPU with an FPGA for various reasons, but I’m wondering whether the failure modes of that class of device might prejudice integrity. When a computer fails, it generally either stops completely or goes into some sort of chaotic loop; either event will be handled correctly by my present system. However, with a gate array, where functions are performed somewhat autonomously by different areas of the device, I wonder whether there is a risk that a fault may cause one area of function to fail while others continue operating correctly. Is that the case, or do some gate arrays somehow check for irregularities across the entire die?

Incidentally, this is not a ‘certified’ application, i.e. it doesn’t have to comply with official standards. It just has to be ‘significantly better’.

As you have probably guessed by now, if you've spent any time reading this forum, people here love to run off and speculate and give advice based on very limited information.

So to help us help you, you need to be more specific about your environment and what kind of failures or irregularities you have experienced in the past or that you might expect to occur.

Are you in a radiation environment, and if so, what sort of radiation? Heavy ions are different from cosmic rays, for example.

Do you expect power-supply glitches? Heavy RFI?

 

Offline AussieBruceTopic starter

  • Regular Contributor
  • *
  • Posts: 60
  • Country: au
Re: FPGA failure modes
« Reply #10 on: December 23, 2021, 09:30:48 am »
Thanks to all who replied. I've got some good ideas, in particular to replicate the implementation of the more critical areas, which are quite small.
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 8076
  • Country: ca
Re: FPGA failure modes
« Reply #11 on: December 23, 2021, 11:51:59 am »
Remember, an FPGA is not a CPU. Your HDL code describes wiring through logic gates. You can easily code a watchdog timer for your application which may first interrupt your design, then soft-reset your design at the next timeout, then, at a third timeout, reboot the FPGA from the boot PROM. A watchdog timer is nothing more than a big binary counter with a reset input that only clears the counter on a detected toggle edge of that signal. Unless there is a catastrophic failure of the FPGA's silicon itself, there isn't much that can go wrong when using such a piece of code as a freeze countermeasure.
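
Here is a small behavioural model of that escalating watchdog, written in C just to pin down the logic. In the FPGA it would be a few lines of HDL. The type and function names, and the idea of one shared timeout for all three stages, are my own assumptions for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { WD_OK, WD_INTERRUPT, WD_SOFT_RESET, WD_RECONFIGURE } wd_action;

typedef struct {
    uint32_t count;      /* the "big binary counter" */
    uint32_t timeout;    /* ticks allowed between kicks (hypothetical) */
    unsigned stage;      /* consecutive timeouts so far, 0..3 */
    bool     last_kick;  /* previous level of the kick input */
} watchdog;

void wd_init(watchdog *wd, uint32_t timeout)
{
    assert(wd != NULL && timeout > 0);
    wd->count = 0;
    wd->timeout = timeout;
    wd->stage = 0;
    wd->last_kick = false;
}

/* Advance the model by one clock tick and report what to do. */
wd_action wd_tick(watchdog *wd, bool kick)
{
    assert(wd != NULL);

    /* Only a change of level counts as a kick, so a kick line that is
     * stuck high or stuck low cannot hold the watchdog off. */
    bool edge = (kick != wd->last_kick);
    wd->last_kick = kick;
    if (edge) {
        wd->count = 0;
        wd->stage = 0;
        return WD_OK;
    }

    if (++wd->count < wd->timeout)
        return WD_OK;

    /* Timeout: restart the counter and escalate one stage, up to stage 3. */
    wd->count = 0;
    if (wd->stage < 3)
        wd->stage++;
    return (wd_action)wd->stage;
}
```

With no kicks at all, this returns WD_INTERRUPT at the first timeout, WD_SOFT_RESET at the second and WD_RECONFIGURE from the third onward. Any toggle of the kick input puts it back to stage 0.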
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3243
  • Country: ca
Re: FPGA failure modes
« Reply #12 on: December 23, 2021, 03:21:31 pm »
Quote from: AussieBruce
Thanks to all who replied. I've got some good ideas, in particular to replicate the implementation of the more critical areas, which are quite small.

The tools tend to optimize the design and most likely will remove duplicate logic. You need to make sure this doesn't happen.
 

Offline Berni

  • Super Contributor
  • ***
  • Posts: 5023
  • Country: si
Re: FPGA failure modes
« Reply #13 on: December 27, 2021, 08:16:44 am »
Quote from: AussieBruce
Thanks to all who replied. I've got some good ideas, in particular to replicate the implementation of the more critical areas, which are quite small.
Quote from: NorthGuy
The tools tend to optimize the design and most likely will remove duplicate logic. You need to make sure this doesn't happen.

Yep this is indeed a thing to watch out for.

When implementation matters I like to check the output of the compiler. This is usually easiest by having the IDE show the mapped result as a block diagram. If the OP is looking at one of these for the first time, it might look like a confusing mess. One way to make better sense of it is to look at the pre-synthesis block diagram first, as that one is more high-level and better ordered: you can generally find blocks that represent individual .v/.hdl files, and most wires have sensible net names. Some very basic optimizations can already show up here (like inputs that don't affect any outputs). You can find some wires of interest here and trace them down into the messy mapped block diagram to see exactly where they go in the actual physical FPGA. If you have duplicated logic with voting on the output, you should be able to find the voting logic and all the duplicate result signals feeding it (likely also keeping their original net names).

The way to keep the compiler from optimizing things away is usually to place extra compiler directives on the wires, forcing the compiler to keep them around. Check the documentation for your particular tools for this and, as said above, don't just trust that it did what you think it did; check the compiler's output.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15250
  • Country: fr
Re: FPGA failure modes
« Reply #14 on: December 27, 2021, 06:22:06 pm »
Quote from: AussieBruce
Thanks to all who replied. I've got some good ideas, in particular to replicate the implementation of the more critical areas, which are quite small.

Quote from: NorthGuy
The tools tend to optimize the design and most likely will remove duplicate logic. You need to make sure this doesn't happen.

Yes, implementing fully redundant structures will take some care here, otherwise they might get optimized and fused in unexpected ways.
Read your tools docs. There usually are attributes to force keeping components as black boxes, or something in that vein.
 

Offline Sal Ammoniac

  • Super Contributor
  • ***
  • Posts: 1763
  • Country: us
Re: FPGA failure modes
« Reply #15 on: January 14, 2022, 10:49:11 pm »
Another possibility is an MCU with lock-step processors, like the TI TMS570 or the Infineon TriCore.
"That's not even wrong" -- Wolfgang Pauli
 

Offline Tim_usernametooshort

  • Newbie
  • Posts: 1
  • Country: gb
Re: FPGA failure modes
« Reply #16 on: March 11, 2022, 12:17:13 pm »
Don't know whose chips you are using, but Xilinx XAPP197 might help. It discusses triple redundancy in Xilinx Virtex chips. The same principles will apply to any other device, even if the tools to do it automagically don't.
 

Offline mascotte

  • Newbie
  • Posts: 6
  • Country: nl
Re: FPGA failure modes
« Reply #17 on: March 11, 2022, 05:08:13 pm »
What environment is that? If you deal with radiation (space, particle accelerators, etc.), you need to scrub the configuration memory of the FPGA and consider applying TMR (triple modular redundancy), at least partially. This is usually done at the netlist level because, as mentioned by some, redundancy described in RTL can easily be optimized away by the toolchain. To answer your initial question: it's perfectly possible that some parts of the FPGA work fine, e.g. your SPI core and watchdog function, while other parts produce complete rubbish. A single bit upset in the configuration memory can already lead to this situation.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: FPGA failure modes
« Reply #18 on: March 11, 2022, 07:52:25 pm »
The problem with a watchdog for an FPGA is observability. In software something like this does a lot of heavy lifting:

Code: [Select]
   int main(void) {
       while(1) {
           do_a_little_bit_of_work();
           feed_the_watchdog();
       }
   }

If your watchdog is being fed you can assume that quite a lot of things are working correctly.

With FPGA designs there is no main() function. You have to actively monitor the health of your device - is it actually doing what it needs to do? Is the device healthy?

Dull brain :

Code: [Select]
output_pin <= input_pin;  -- Check FPGA is working and properly configured

Glowing brain:

Code: [Select]
     process(clk)
     begin
         if rising_edge(clk) then
             counter <= counter+1;  -- counter is assumed to be an unsigned signal
             output_pin <= counter(counter'high); -- Check FPGA is working/configured
         end if;
     end process;

Galaxy Brain:

Code: [Select]
i_fpga_monitor: fpga_monitor port map (
       -- A dependable clock source
       clk                      => clk,
       -- External Outputs: A generic 'system good' signal, and a system attention for interrupts
       system_good              => system_good,
       system_attention         => system_attention,

       -- External SPI interface for watchdog processor
       spi_clk                  => spi_clk,
       spi_miso                 => spi_miso,
       spi_mosi                 => spi_mosi,
       -- Inputs: Event_counters
       event_frame_seen         => event_frame_seen,
       event_pll_reset_occured  => event_pll_reset_occured,
       event_single_bit_mem_err => event_single_bit_mem_err,
       ...
       -- Internal Inputs: On-chip health monitoring
       fpga_temperature         => fpga_temperature,
       power_good_signals       => power_good_signals,
       ...
       -- Internal Outputs: Partial resets for fault recovery without full reconfigure
       reset_xrc_1              => reset_xrc_1,
       reset_data_pipeline      => reset_data_pipeline
    );
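
On the watchdog processor side, the matching check could look something like the sketch below. The status layout, field names and strike counting are all invented for illustration; the real contents depend on what your monitor block exposes over SPI. The point is that the micro checks that things are *advancing*, not just that a pin is wiggling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status frame read from the FPGA monitor over SPI. */
typedef struct {
    uint16_t heartbeat;    /* free-running counter inside the FPGA */
    uint16_t frames_seen;  /* event counter: advances while data flows */
    bool     system_good;  /* the FPGA's own summary flag */
} fpga_status;

typedef struct {
    fpga_status prev;      /* last status seen */
    bool        have_prev;
    unsigned    strikes;   /* consecutive bad polls */
} fpga_monitor;

/* Feed in one freshly read status. Returns true once max_strikes
 * consecutive polls looked unhealthy, i.e. time to force the safe state. */
bool monitor_poll(fpga_monitor *m, const fpga_status *now, unsigned max_strikes)
{
    assert(m != NULL && now != NULL);

    bool bad = !now->system_good;
    if (m->have_prev) {
        if (now->heartbeat == m->prev.heartbeat)
            bad = true;    /* counter frozen: clock or configuration lost */
        if (now->frames_seen == m->prev.frames_seen)
            bad = true;    /* pipeline stalled even though the clock runs */
    }

    m->prev = *now;
    m->have_prev = true;
    m->strikes = bad ? m->strikes + 1 : 0;
    return m->strikes >= max_strikes;
}
```

Start with a zeroed `fpga_monitor` and call `monitor_poll` at a rate slow enough that a healthy design is guaranteed to have advanced both counters between polls.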

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

