FPGA failure modes

AussieBruce:
I have an application that requires a backup to deal with device failure. It uses an ARM micro (LPC1768) and a small sidekick PIC acting as a watchdog; there is an SPI data loop between the two, and if an irregularity is detected, whichever CPU is still in action initiates a safe state. There is an incentive to replace the larger CPU with an FPGA for various reasons, but I'm wondering whether the failure modes of that class of device might prejudice integrity. When a computer fails, it generally either stops completely or goes into some sort of chaotic loop; either event will be handled correctly by my present system. However, with a gate array, where functions are performed by different areas of the device somewhat autonomously, I wonder whether there is a risk that a fault may cause one area of function to fail but others to continue operating correctly. Is that the case, or do some gate arrays somehow check for irregularities across the entire die?

Incidentally, this is not a ‘certified’ application, i.e. it doesn’t have to comply with official standards; it just has to be ‘significantly better’.
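
For concreteness, the sort of exchange I have in mind looks something like the sketch below; the frame format, period and thresholds are made up for illustration and the real loop differs in detail. The PIC side does the mirror image: it answers each challenge and starts its own safe-state sequence if the master goes quiet or sends garbage.

--- Code: ---
#include <stdint.h>
#include <stdbool.h>

#define HEARTBEAT_PERIOD_MS  10u
#define MAX_MISSED_REPLIES   3u

/* Platform-specific hooks assumed to exist elsewhere in the project. */
extern uint8_t spi_transfer_byte(uint8_t out);  /* blocking full-duplex byte exchange */
extern void    enter_safe_state(void);          /* drive outputs to safe levels */
extern void    delay_ms(uint32_t ms);

void heartbeat_loop(void)
{
    uint8_t seq = 0, missed = 0;

    for (;;) {
        uint8_t challenge = seq++;

        (void)spi_transfer_byte(challenge);         /* send the challenge       */
        uint8_t answer = spi_transfer_byte(0x00u);  /* clock out the PIC reply  */

        /* The expected answer is the bitwise complement, so a stuck-high or
         * stuck-low MISO line can never look like a healthy watchdog. */
        if (answer == (uint8_t)~challenge) {
            missed = 0;
        } else if (++missed >= MAX_MISSED_REPLIES) {
            enter_safe_state();
        }

        delay_ms(HEARTBEAT_PERIOD_MS);
    }
}
--- End code ---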

james_s:
I have never experienced a failure of an FPGA, and I don't think they are any less reliable than most other parts; it certainly isn't something I'd worry about as long as the device is operated within its specs. If reliability is critical you might want two entirely separate units.

evb149:
In terms of your microcontrollers, you're not using devices that are customarily chosen for high-reliability / high-integrity applications.
Sure, anything you do, like adding a watchdog function or implementing the watchdog in another independent device, helps with the assurance of availability by detecting / handling some failure modes, but as you mention there are other classes of failures, whether in microcontrollers or FPGAs.

Many systems that do need some level of functional-safety certification use ARM Cortex-R class parts, and sometimes Cortex-A or Cortex-M, but the general purpose of many Cortex-R products is to bring additional capabilities for real-time / high-availability / high-integrity / high-reliability applications, whether 'certified', 'safety critical', or just 'highly reliable / stable / available'.

FPGAs, on the other hand, are commonly used in avionics and space applications for control systems (as are suitable processors / microprocessors, but FPGAs have their niches where MPUs / MCUs can't easily tread). Look at the LEON / LEON3 / etc. processor and IP cores, for instance, and at the industry they were originally developed for.
Take a look at space-rated Xilinx devices, for example. I'm not saying you could or should use one, but being aware of the application architectures, reliability considerations, failure-mode considerations, and failure detection & mitigation strategies used for some of the aerospace-oriented FPGA cores and devices may inform your risk assessment & mitigation model for using industrial / commercial FPGAs & IP cores in a high-availability application.

Sure, as you say, some failure modes of MCUs, MPUs, or FPGAs are more common than others, and you can build that into your system's FMECA / risk assessment and into your mitigation / fail-safe architecture.

Murphy's law: if anything can go wrong, [assume] it will [with some statistical probability or putative possibility].
So the question becomes how to detect and fail safe against the various possibilities, from likely to unlikely and from minor consequence to major consequence.

For instance, cosmic rays, ESD, manufacturing defects, buggy firmware/software, loading the wrong fw/sw version, PCB errors, electrical problems, etc. can impact either MPU or FPGA systems. Many MPUs/MCUs actually have multiple fairly independent cores, peripherals, etc., so it is entirely possible for a peripheral, or even an entire core, to continue "working normally" for some sub-function of a system while another peripheral or core is totally stopped or malfunctioning. In concept that is no different between an FPGA and an MPU.
Also, with RTOS or multi-tasking systems you can have bugs, deadlocks, faults, etc. that totally halt or compromise several tasks / processes while others continue fine.
Ever had your web browser or PDF reader freeze while your system's terminal window or clock display kept working?
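
One common way to catch the "some tasks dead, others fine" case is to make the watchdog service depend on every task checking in, not just one. A rough sketch, where the task list, timing, and the kick / sleep hooks are placeholders rather than any particular RTOS API:

--- Code: ---
#include <stdint.h>

#define TASK_CONTROL   (1u << 0)
#define TASK_COMMS     (1u << 1)
#define TASK_LOGGING   (1u << 2)
#define ALL_TASKS      (TASK_CONTROL | TASK_COMMS | TASK_LOGGING)

extern void watchdog_kick(void);        /* service the HW or external watchdog */
extern void sleep_ms(uint32_t ms);

static volatile uint32_t alive_flags = 0;

/* Each task calls this from its main loop when it has made real progress
 * (finished a control cycle, processed a message, ...), not from a timer.
 * In real code, protect the read-modify-write (atomic op or critical section). */
void task_checkin(uint32_t task_bit)
{
    alive_flags |= task_bit;
}

/* Supervisor: the watchdog is only serviced when every registered task has
 * checked in during the last window, so a single hung or deadlocked task
 * drives the whole system to reset / safe state instead of failing silently. */
void supervisor_task(void)
{
    for (;;) {
        sleep_ms(100);                  /* must be shorter than the WDT timeout */
        if ((alive_flags & ALL_TASKS) == ALL_TASKS) {
            alive_flags = 0;            /* demand fresh check-ins next window   */
            watchdog_kick();
        }
    }
}
--- End code ---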

As you design more complex systems with different "threads", "tasks", "partitions", or "subsystems", whether entirely SW, FW, or gate-ware, or some mix, you simply have to assume that some sub-systems may be "workable" while others are not, due to some fault / bug. FPGA, MPU, or a whole rack of redundant card-level blade servers: it doesn't matter.

Look at the many historical system / software bugs that have actually happened in, say, space probes and satellites, and at the public stories about how those affected operations and how they were recovered from. To a great extent the pattern is: some unexpected problem compromised one or more sub-systems; a watchdog or other system-management mechanism noted the failure or malfunction; the sub-system, or the system as a whole, was partially shut down or put into some kind of safe / recovery mode; and the still-working aspects of the system (communications, control, diagnostics, logs, ...) were used to analyze the scope of the fault and to modify the configuration or code so that the fault was worked around, isolated / disabled, or remedied, based on whatever update / upgrade / redundancy was available.

There is no "magic answer" for how to make a reliable system with MPUs or FPGAs or whatever, only best practices to make good use of the strengths of each option and to put protections in place that prevent possible bugs / glitches from breaking something critical to the desired availability / integrity level. That starts at the single chip or single gate / LUT level, but it goes up from there to encompass all the SW, the whole circuit card, the whole subsystem, the whole system, and so on.

https://en.wikipedia.org/wiki/Triple_modular_redundancy

https://www.gaisler.com/index.php/products/processors/leon3

https://en.wikipedia.org/wiki/Integrated_modular_avionics

https://en.wikipedia.org/wiki/Failure_mode,_effects,_and_criticality_analysis

https://en.wikipedia.org/wiki/ARINC_653

https://en.wikipedia.org/wiki/List_of_software_bugs

https://en.wikipedia.org/wiki/Spirit_(rover)#Sol_17_flash_memory_management_anomaly

https://web.archive.org/web/20161230103247/http://research.microsoft.com/en-us/um/people/mbj/Mars_Pathfinder/Authoritative_Account.html

etc. etc.


--- Quote ---...
The architecture has evolved over time, and version seven of the architecture, ARMv7, defines three architecture "profiles":
    A-profile, the "Application" profile, implemented by 32-bit cores in the Cortex-A series and by some non-ARM cores
    R-profile, the "Real-time" profile, implemented by cores in the Cortex-R series
    M-profile, the "Microcontroller" profile, implemented by most cores in the Cortex-M series

--- End quote ---

Berni:
Chips don't tend to fail in just local areas of the die.

Some old chips are famous for randomly giving up the ghost from old age, but they come from a much more primitive era of semiconductor manufacturing. Things have improved a lot since then, so in general, for a chip to stop working you have to expose it to enough external abuse: too much voltage or current through a pin, running it really hot, physical damage, etc. And when damage does happen on the silicon die, there is a high chance it ends up reaching a power trace and shorting the whole supply rail, which is why dead chips usually get really hot while being otherwise completely dead.

In terms of resilience FPGAs do tend to be better, since a single bit flip caused by a cosmic ray will usually not crash an FPGA in any big way. One could even build logic blocks in triplicate and then use voting to cut away any block that misbehaves.
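
Just to illustrate the voting itself, here it is written in C for readability; in the fabric it is the same Boolean expression applied per bit or per flip-flop:

--- Code: ---
#include <stdint.h>

/* Bitwise 2-of-3 majority: each output bit is 1 iff at least two of the
 * three redundant copies agree on that bit. */
static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Optionally flag which bits had a dissenting replica, so the misbehaving
 * copy can be reported or resynchronised instead of being outvoted forever. */
static inline uint32_t tmr_disagreement(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t v = tmr_vote(a, b, c);
    return (a ^ v) | (b ^ v) | (c ^ v);   /* nonzero => at least one replica differs */
}
--- End code ---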

Failure is actually most likely to come from memory. Programmable logic with a burned-in ROM is a thing of the past; today most large programmable logic boots from regular flash memory. The flash is read in byte by byte to configure all the interconnect between logic blocks and the blocks themselves. That configuration is held in what are basically local SRAM cells, so it should be just as susceptible to a cosmic-ray bit flip, but it does come back after a power cycle. More of an issue is flash bit rot, since modern flash crams in so many bits that they might rot away in just a few decades. Once that happens the FPGA might refuse to load the flash image and just sit there uninitialized, doing nothing. However, since the flash is only read at boot, a bit rotting away while the FPGA is running won't crash it (unlike a typical MCU, which actively runs from flash all the time).
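
One mitigation for that, assuming there is a small supervisor MCU (or the watchdog PIC in the OP's setup) that can also read the configuration flash, is to verify a CRC over the stored bitstream before relying on it, and fall back to a known-good image or the safe state if it no longer matches. A sketch; the flash-access hook and image layout are made up for the example, and real golden-image / multiboot schemes are vendor-specific:

--- Code: ---
#include <stdint.h>
#include <stddef.h>

extern uint8_t flash_read_byte(uint32_t addr);   /* platform-specific hook */

/* Standard reflected CRC-32 (polynomial 0xEDB88320), bit-by-bit version. */
static uint32_t crc32_flash(uint32_t start, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= flash_read_byte(start + (uint32_t)i);
        for (int b = 0; b < 8; b++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}

/* Returns nonzero if the stored image still matches the CRC recorded when
 * it was programmed (kept elsewhere in flash or in the supervisor). */
int bitstream_image_ok(uint32_t image_start, size_t image_len,
                       uint32_t expected_crc)
{
    return crc32_flash(image_start, image_len) == expected_crc;
}
--- End code ---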

I did something similar before and ended up using an Altera MAX V CPLD for the job, although it's really more like a tiny FPGA than a CPLD. The point is to use a simple chip that doesn't have many things to fail, like a gazillion supply voltages, fancy sensitive PLLs, external memory, etc.

SiliconWizard:
Well, there are a couple of things specific to FPGAs compared to CPUs. Most FPGAs are RAM-based, meaning their whole configuration is essentially held in RAM. Whether RAM content can be corrupted more easily by external events than general logic is a complex matter. A CPU-based system can run entirely from Flash or ROM, but it will still need RAM at some point to store data and state. The consequences of corruption can be different, but not necessarily any less severe.

As said above, one thing you can do with FPGAs is implement redundant structures - something that most general-purpose CPUs do not have. So, from that point of view, that would give FPGAs the edge, as long as you take those extra steps.
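
On the CPU side, a cheap software analogue for the handful of variables that really matter is to store each one together with its bitwise complement and verify the pair before use, so a corrupted word is detected rather than silently acted on. A rough sketch, not taken from any particular safety library:

--- Code: ---
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t value;
    uint32_t inverse;   /* always kept as ~value */
} protected_u32;

static void protected_write(protected_u32 *p, uint32_t v)
{
    p->value   = v;
    p->inverse = ~v;
}

/* Returns false if the stored pair is inconsistent (corruption detected);
 * the caller decides whether to reload a default or go to the safe state. */
static bool protected_read(const protected_u32 *p, uint32_t *out)
{
    if (p->value != (uint32_t)~p->inverse)
        return false;
    *out = p->value;
    return true;
}
--- End code ---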
