EEVblog Electronics Community Forum

General => General Technical Chat => Topic started by: xaxaisme on September 24, 2019, 11:27:19 am

Title: Why there are lots of methods to correct storage error, but not CPU?
Post by: xaxaisme on September 24, 2019, 11:27:19 am
So there is CRC for disk drives and networks, ECC/parity for RAM, and many more, but I've never heard of any technique to correct or even detect errors in a CPU or GPU, even in server-grade hardware. Why?
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: Mr. Scram on September 24, 2019, 11:33:52 am
Quote from: xaxaisme on September 24, 2019, 11:27:19 am
So there is CRC for disk drives and networks, ECC/parity for RAM, and many more, but I've never heard of any technique to correct or even detect errors in a CPU or GPU, even in server-grade hardware. Why?
There are chips with ECC for the internal cache, and there is majority voting or lockstep CPUs to detect and mitigate computational errors. These features tend to be found in safety-critical processes or processors.
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: wraper on September 24, 2019, 11:42:21 am
With RAM it's simple: add some additional RAM and ECC. You cannot do that for a CPU core. To have something to compare the result with, you basically need to do the same computation in parallel and then compare the results, which, by the way, is how it's implemented in mission-critical systems, especially those operating in the presence of radiation.
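The compute-twice-and-compare idea can be sketched in a few lines of Python. This is only a toy illustration of the principle: real lockstep hardware compares the two cores' bus signals every clock cycle in silicon, not just the final outputs.

```python
def lockstep(compute, *args):
    """Run the same computation twice and compare the results.
    A mismatch means one run was corrupted, so the result cannot
    be trusted; with only two copies we cannot tell which one."""
    a = compute(*args)
    b = compute(*args)
    if a != b:
        raise RuntimeError("lockstep mismatch: result cannot be trusted")
    return a

# With two agreeing "cores" the result passes through unchanged.
print(lockstep(lambda x, y: x + y, 2, 3))  # 5
```

Note the limitation this sketch makes obvious: two copies can detect a fault but cannot arbitrate it, which is why the thread also mentions majority voting with three or more copies.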
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: Berni on September 24, 2019, 11:46:25 am
There are methods for CPUs.

CPUs for safety-critical applications feature dual lockstep processors: two identical CPUs execute the same code while dedicated hardware watches over them to make sure they both get the same results.

For even more critical applications like spacecraft, you typically have multiple whole computers, each with such a lockstep CPU, so that operation can continue if a computer dies.

But things like ECC are often used inside caches without this being shown as a banner feature in the marketing material.
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: coppice on September 24, 2019, 11:52:01 am
Quote from: xaxaisme on September 24, 2019, 11:27:19 am
So there is CRC for disk drives and networks, ECC/parity for RAM, and many more, but I've never heard of any technique to correct or even detect errors in a CPU or GPU, even in server-grade hardware. Why?
I guess you haven't seen techniques because you haven't looked. They are easy to find. You may have one in some of the key safety systems in your car.

There are numerous high reliability systems where at least two CPUs run in parallel, and discrepancies are picked up. When there is a discrepancy the system just stops and reports a fault.

There are systems with at least three CPUs, where a majority decision is taken when one of the CPUs disagrees with the others.

There are more elaborate systems, where CPUs operate in pairs with discrepancy detection, and the results from pairs of those pairs are then compared. A discrepancy between the two CPUs at the first stage stops them reporting results to the second stage, so bad results are not fed into the second decision.

The above approaches look for errors at a pretty low level. Other strategies work at a higher level, looking for errors over a larger area of the system. These are seldom entirely transparent to the programmer, but they do provide more system coverage in their error detection.
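The three-CPU arrangement described above is classic triple modular redundancy. A toy Python model of the voter (real systems vote in hardware, often bit-by-bit on the bus):

```python
from collections import Counter

def majority_vote(results):
    """Return the value that at least two of three 'CPUs' agree on.
    A single faulty CPU is simply outvoted. If all three disagree,
    there is no majority and the system must stop and report a fault."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: unrecoverable fault")
    return value

# One faulty core is outvoted by the other two.
print(majority_vote([42, 42, 17]))  # 42
```

This also shows why voting, unlike pairwise comparison, can keep running through a single fault: the majority identifies which copy was wrong.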
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: daqq on September 24, 2019, 12:04:12 pm
With memories it's fairly easy: add a parity/ECC bit and a small-ish bit of hardware; the overall power consumption increase is minimal.

With CPUs you need at least one other device that does the exact same operation, or you must do the same operation twice on the same hardware, compare the results, and hope that the bad result was just a non-repeatable fluke. The overall power consumption increase is double or more.
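The memory side really is that cheap. A single even-parity bit per word detects any single-bit flip, as this small Python sketch shows:

```python
def add_parity(word):
    """Append one even-parity bit to a data word, so the stored
    value always has an even number of 1 bits."""
    parity = bin(word).count("1") % 2
    return (word << 1) | parity

def check_parity(stored):
    """True if the stored word (data + parity bit) is consistent."""
    return bin(stored).count("1") % 2 == 0

stored = add_parity(0b1011_0010)
print(check_parity(stored))          # True: word is intact
print(check_parity(stored ^ 0b100))  # False: a bit flipped in storage
```

Parity only detects an error; correcting it takes a few more check bits (e.g. a Hamming code), which is what ECC RAM does.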
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: amyk on September 24, 2019, 12:06:10 pm
I believe normal PC CPUs do already have ECC on the cache and internal checksums on the data buses. Makes you wonder whether the process geometries are already so small that disturbances are more common than they want you to think... :-//
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: Berni on September 24, 2019, 12:10:13 pm
Oh, and to add to that, OSes also provide some protection against things getting out of hand.

Whenever an application crashes but everything else keeps running, that was the OS stepping in and saving the day. Applications run in a walled garden with land mines (in the form of hardware traps) placed around it. As soon as an application tries to access memory or execute code from a weird location it was not supposed to, it "sets off a land mine". This causes the CPU to interrupt into its trap handler, where the OS looks at the situation, kills the application safely, releases its resources, and then continues executing the other applications that are still alive.

The really bad crash is when the CPU hits a trap while it's not running inside the walled garden of safety. This is what triggers a blue screen or kernel panic. It means code that is part of the OS has done something bad, but there is no OS above it to catch it, so instead a little crash handler is called that displays the blue screen, dumps memory to disk (so someone can figure out why it crashed) and reboots the system, since that's the only way to bring it back to a safe operational state.
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: Kleinstein on September 24, 2019, 12:33:23 pm
Catching memory errors is relatively simple and low effort. Parity/ECC adds something like 10% to the memory (e.g. 36 raw bits to hold 32 user bits). Modern RAMs use such schemes internally, invisible from the outside, to compensate for a few hardware defects, so that yield can be higher while >99% of the memory is still protected by redundancy. So it is a clear win/win.
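The overhead arithmetic works out because check bits grow only logarithmically with word size: a single-error-correcting Hamming code needs r check bits for m data bits where 2^r ≥ m + r + 1, i.e. 6 extra bits on a 32-bit word (~19%) and 7 on a 64-bit word (~11%). A generic sketch in Python:

```python
def hamming_encode(data_bits):
    """Encode data bits with single-error-correcting Hamming check bits.
    Check bits sit at power-of-two positions (1-indexed)."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:   # 2^r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)          # index 0 unused; positions are 1-indexed
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):       # not a power of two -> data position
            code[pos] = next(bits)
    for j in range(r):            # each check bit makes its group even-parity
        p = 1 << j
        code[p] = sum(code[pos] for pos in range(p, n + 1) if pos & p) % 2
    return code[1:]

def hamming_syndrome(codeword):
    """0 if the codeword is clean; otherwise the 1-indexed position
    of the single flipped bit, which can then be corrected."""
    syn = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            syn ^= pos
    return syn

word = hamming_encode([1, 0, 1, 1, 0, 0, 1, 0])  # 8 data bits -> 12 total
word[4] ^= 1                                      # flip the bit at position 5
print(hamming_syndrome(word))                     # 5: points at the error
```

(The post's 36-bits-for-32 figure corresponds to one parity bit per byte, which detects but does not correct; the Hamming construction above is the correcting variant.)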

CPU protection (other than caches) takes more effort: majority voting takes a little more than 3 times the hardware/power. The basic form of a watchdog is relatively low effort on the hardware side, but it can only catch a limited class of problems, and it still requires quite some extra software effort.
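A watchdog, in its basic form, is just a countdown the software must keep restarting; if the software hangs and the kicks stop, the timeout fires and forces recovery. A toy Python model (a real watchdog is a hardware timer that resets the whole chip, not a callback):

```python
import threading
import time

class Watchdog:
    """Toy software model of a watchdog timer. Call kick() regularly;
    if kicks stop arriving, on_timeout fires."""
    def __init__(self, timeout_s, on_timeout):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout
        self._timer = None
        self.kick()                       # arm immediately

    def kick(self):
        # Restart the countdown: a healthy main loop calls this often
        # enough that the timeout never expires.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout_s, self.on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        self._timer.cancel()

wd = Watchdog(0.05, lambda: print("watchdog expired: a real one would reset the chip"))
time.sleep(0.2)   # the "main loop" hangs: no kicks arrive, the timeout fires
wd.stop()
```

This also illustrates the limited coverage mentioned above: a watchdog only notices that the software stopped making progress, not that it computed a wrong answer.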

Quite a few OS problems are actually pure software problems. More modern languages/compilers add more automatic range checks than old-style C. However, this is still far from full protection.
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: 2N3055 on September 24, 2019, 12:58:00 pm
Google.

keywords:

Reliable Computer Systems
Fault Tolerant Systems

There are hundreds of books and papers on the topic.
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: T3sl4co1l on September 24, 2019, 01:56:51 pm
In terms of error correction on the logic operations themselves (instruction decoding, ALU, etc.): you can, yes. It's not done because very low BER (bit error rate) and very high reliability logic is still achievable.

As mentioned, simpler solutions are currently used where high reliability is required. Rather than having to figure out how to error-correct the logic, they just duplicate all of it and produce an error signal when the copies don't match up. The error can trigger an operating system routine that can see what instruction was being executed, on what data, and can correct the result, or mitigate the fault (notify the user? shut down the current program?), or simply reboot. Or the error triggers a hardware fault, causing a reset. Or anything else. I haven't looked at any examples to see how they handle it, but yeah, they're out there, commercially available; read the datasheets if you're curious!

Some day, signal levels will fall so low that probabilistic computing will be a necessity, and then logic complexity will have to increase to ensure the correctness of logic operations.  This is probably some decades out though; the kind of logic complexity we'd be talking here would likely demand multi-layer logic (stacked or otherwise), and the reason for the low signal levels is to minimize power dissipation, required to make multi-layer logic feasible.

Tim
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: djacobow on September 24, 2019, 03:14:44 pm
Back in my Intel days I distinctly remember discussions and papers about how to manage soft errors in CPUs. For any memory array, the normal approaches apply. For the state of the CPU apart from arrays (flip-flops and latches), the solution was to make the storage elements large enough that they would be immune to flipping from a single cosmic-ray event, at least to some acceptable likelihood of failure (like one error per year).

Designs were considered where each flop was doubled, with an XOR output that was then OR'd with others to indicate a fault, but I think by the time you've designed your fancy fault-detecting flop, you might as well have just made a bigger normal flop, statistically obviating the problem.

This was from the 0.28 µm era. The calculus may have changed since then.
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: RoGeorge on September 24, 2019, 03:24:58 pm
xaxaisme, apart from what was already said, you may also want to Google the term "radhard" (radiation-hardened).
Title: Re: Why there are lots of methods to correct storage error, but not CPU?
Post by: djacobow on September 24, 2019, 03:58:55 pm
I think it's also important to understand that a discussion about reliability is hard to have without first establishing what level of reliability is required.

For example, CPU manufacturers could design CPUs that are intrinsically reliable to whatever arbitrary level you specify, but doing so would require measures that increase the cost and complexity of the device, potentially to the point that it is not competitive for the majority of the market. Therefore, some moderate specification is used (like one soft fault per core annually). For safety-critical and financially-critical markets this is not going to be adequate, so they'll have to find more reliable parts (rad-hard, etc.) or figure out ways to make more reliable systems out of less reliable components (voting, etc.).

IIRC, part of the reason the 386 flew on the Space Shuttle as the main computer was that it was a BiCMOS device, using CMOS arrangements for the logic portion of a gate and bipolar transistors as the output totem pole, providing robust drive. At the time, this was more rad-hard than plain CMOS. On the flip side, I can think of some logic styles that are probably more susceptible to soft errors: we used some self-resetting zipper-style domino logic on the Pentium 4. That is something that probably does not belong in space. :-)