Author Topic: How does ECC memory work? (Read 3024 times)

Zbig · « **on:** December 09, 2021, 10:09:49 pm »

I understand the notion of parity in the context of digital data storage. I understand how RAID works and how XOR operation is used to restore a piece of binary data when a known part of it is missing or corrupted. I understand that ECC memory, in great simplification, could be called a “RAID with parity for RAM” where for each word there is a redundant bit dedicated to parity. I get how you are able to detect single-bit flips using that. What I don’t understand is how ECC DIMM is able to recover original information where the position of the offending bit is not known. RAID made of hard disks - easy: you know which HDD has gone up in smoke or reported uncorrectable read errors to the host. But RAM? “It says here there should be even number of ones and I see odd number: hmm, probably a rouge neutrino flipped a bit somewhere”. How do do you go from there and actually restore the original value? How could you be sure it’s only a single bit that’s affected? I don’t get it and after trying to find some information online I always read “by using sophisticated algorithms” that tells me the authors are none the wiser.

EDITed to add:
I mean, I’m vaguely aware of terms like Reed-Solomon encoding, Virtebi algorithm or erasure codes but fail to see how any of these could be applied using just one extra bit per byte of actual information.

ataradov · « **Reply #1 on:** December 09, 2021, 11:00:03 pm »

I'm not familiar with how modern PC SDRAM ECC works, but it is absolutely impossible to do this with one bit. Typical ECC would use something like 8 bits for the correction code per 64 bits of data.

It is possible that SDRAM only exposes one bit (error/no error) so that controller could refresh that word.

Different code types specify how many additional bits are needed to detect and correct certain number of errors. There is no way to make it work with fewer bits.

magic · « **Reply #2 on:** December 10, 2021, 06:42:16 am »

Yep, common ECC memory is still 72 bits wide instead of 64, so that's where the redundant information hides.

Daixiwen · « **Reply #3 on:** December 10, 2021, 08:17:52 am »

You are right that you can't have an error correction with just one bit of parity, you will need more than that.
What you have to take into account when using am error detection/correction code is the minimum hamming distance. This is the minimal number of bits that will be different between two valid codes.
Without any parity, the distance is 1, there is at least one bit that changes between two words. With a parity bit the distance becomes two, there will always be at least two bits that will be different between two valid codes. And other algorithms will have higher minimal hamming distance.
The minimal hamming distance will tell you how good the algorithm is to detect and correct errors. If the distance is high enough you can start correct bits, but for that you need to "sacrifice" some error detection.

With a minimal hamming distance of 3 for example, you may be able to detect up to 2 simultaneous errors, because any code with two bits flipped will not be a valid code. But if you have 3 simultaneous errors, then the new code will be detected as valid and none of the 3 errors will be detected.
Instead you could decide that you want to be able to correct one error. How the algorithm works in that case is that when an error is detected, it will "correct" the code by jumping to the nearest valid code. If you have only one error then this will work, it will always be corrected. But now if you have two errors, the algorithm will jump to the nearest valid code, by flipping only one bit, and the resulting code will be even worse than the original one, with 3 errors. You have created an algorithm that can correct one error, but for that you had to sacrifice double error detection.

Of course increasing the minimal hamming distance even further will enable you to detect or correct more errors. The general formula is h >= e + 2*c + 1 where h is the minimal hamming distance, e is the number of errors you want to detect, and c the number of errors you want to correct. The more you want to correct, the fewer you will be able to detect.

bsdphk · « **Reply #4 on:** December 10, 2021, 11:22:41 am »

Quote from: Zbig on December 09, 2021, 10:09:49 pm

I understand that ECC memory, in great simplification, could be called a “RAID with parity for RAM”

Hammings original article is very readable:

https://archive.org/details/bstj29-2-147/mode/2up

The basic idea is that you have many parity bits, and to arrange which data bits the cover such that the pattern of parity errors point out which single bit got corrupted.

Mathematically what you do is you use a wider storage word, and make sure to only use "codes" which have a hamming distance of three or more.

Here is an example of that: https://datamuseum.dk/wiki/Rational/R1000s400/Logbook#2021-11-28_7_down,_7_to_go

ejeffrey · « **Reply #5 on:** December 16, 2021, 05:40:52 pm »

Take a step back from error correction. Lets say you have a byte where only a single bit is set. i.e., it is 0000_0001, 0000_0010, 0000_0100, ... There are 8 possible values. You can encode that into only three bits whose value tell you which bit is set in the original. One way we can calculate this 3 bit representation is that each bit is an XOR of 4 input bits. The LSB is the xor of every other bit, the MSB is the XOR of the 4 most significant inputs, and the middle one takes 2 groups of 2. If I give you the three bit code you can reconstruct which original bit was set. If in addition I want to allow the case where all bits are zero, I add one more parity bit that is the XOR of all 8 input bits. The 4th bit will be '1' if there is one bit set, zero if none are set. Now I have a 4 bit "syndrome" that will distinguish these 9 states. If 2 bits are set in the input, then the output will be XXX0, where at least one of the Xs is 1. This is an invalid code.

Now we can turn that into an error correcting code. This is easy because we did everything with XOR. If I have a data byte I store DATA + SYNDROME(DATA). Just consider the case where there is an error in the DATA, not the SYNDROME. When I retrieve it, I now have DATA' + SYNDROME(DATA). I calculate SYNDROME(DATA') XOR SYNDROME(DATA). Because the XORs commute, this ends up SYNDROME(DATA XOR DATA'). That turns it into the same one-hot decoder problem I described above. If the result is all zeros there was no error. If the result is XXX1 then there was a single bit error and Xs tell me which index was flipped. Otherwise if the result is XXX0 there was an uncorrectable multi-bit error.

In reality you have to deal with the case of errors in the parity bits as well. The real hamming code takes that into account and calculates the parity slightly differently. However, the idea is quite similar.

Zbig · « **Reply #6 on:** December 21, 2021, 12:34:22 pm »

Thank you, everyone


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

EEVblog Electronics Community Forum

Author Topic: How does ECC memory work? (Read 3024 times)

Zbig

How does ECC memory work?

ataradov

Re: How does ECC memory work?

magic

Re: How does ECC memory work?

Daixiwen

Re: How does ECC memory work?

bsdphk

Re: How does ECC memory work?

ejeffrey

Re: How does ECC memory work?

Zbig

Re: How does ECC memory work?

Share me