Author Topic: How does ECC memory work?  (Read 3229 times)

0 Members and 1 Guest are viewing this topic.

Offline ZbigTopic starter

  • Frequent Contributor
  • **
  • Posts: 927
  • Country: pl
How does ECC memory work?
« on: December 09, 2021, 10:09:49 pm »
I understand the notion of parity in the context of digital data storage. I understand how RAID works and how XOR operation  is used to restore a piece of binary data when a known part of it is missing or corrupted. I understand that ECC memory, in great simplification, could be called a “RAID with parity for RAM” where for each word there is a redundant bit dedicated to parity. I get how you are able to detect single-bit flips using that. What I don’t understand is how ECC DIMM is able to recover original information where the position of the offending bit is not known. RAID made of hard disks - easy: you know which HDD has gone up in smoke or reported uncorrectable read errors to the host. But RAM? “It says here there should be even number of ones and I see odd number: hmm, probably a rouge neutrino flipped a bit somewhere”. How do do you go from there and actually restore the original value? How could you be sure it’s only a single bit that’s affected? I don’t get it and after trying to find some information online I always read “by using sophisticated algorithms” that tells me the authors are none the wiser.

EDITed to add:
I mean, I’m vaguely aware of terms like Reed-Solomon encoding, Virtebi algorithm or erasure codes but fail to see how any of these could be applied using just one extra bit per byte of actual information.
« Last Edit: December 09, 2021, 10:17:23 pm by Zbig »
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11782
  • Country: us
    • Personal site
Re: How does ECC memory work?
« Reply #1 on: December 09, 2021, 11:00:03 pm »
I'm not familiar with how modern PC SDRAM ECC works, but it is absolutely impossible to do this with one bit.  Typical ECC would use something like 8 bits for the correction code per 64 bits of data.

It is possible that SDRAM only exposes one bit (error/no error) so that controller could refresh that word.

Different code types specify how many additional bits are needed to detect and correct certain number of errors. There is no way to make it work with fewer bits.
Alex
 
The following users thanked this post: Zbig

Offline magic

  • Super Contributor
  • ***
  • Posts: 7252
  • Country: pl
Re: How does ECC memory work?
« Reply #2 on: December 10, 2021, 06:42:16 am »
Yep, common ECC memory is still 72 bits wide instead of 64, so that's where the redundant information hides.
 
The following users thanked this post: Zbig

Offline Daixiwen

  • Frequent Contributor
  • **
  • Posts: 367
  • Country: no
Re: How does ECC memory work?
« Reply #3 on: December 10, 2021, 08:17:52 am »
You are right that you can't have an error correction with just one bit of parity, you will need more than that.
What you have to take into account when using am error detection/correction code is the minimum hamming distance. This is the minimal number of bits that will be different between two valid codes.
Without any parity, the distance is 1, there is at least one bit that changes between two words. With a parity bit the distance becomes two, there will always be at least two bits that will be different between two valid codes. And other algorithms will have higher minimal hamming distance.
The minimal hamming distance will tell you how good the algorithm is to detect and correct errors. If the distance is high enough you can start correct bits, but for that you need to "sacrifice" some error detection.

With a minimal hamming distance of 3 for example, you may be able to detect up to 2 simultaneous errors, because any code with two bits flipped will not be a valid code. But if you have 3 simultaneous errors, then the new code will be detected as valid and none of the 3 errors will be detected.
Instead you could decide that you want to be able to correct one error. How the algorithm works in that case is that when an error is detected, it will "correct" the code by jumping to the nearest valid code. If you have only one error then this will work, it will always be corrected. But now if you have two errors, the algorithm will jump to the nearest valid code, by flipping only one bit, and the resulting code will be even worse than the original one, with 3 errors. You have created an algorithm that can correct one error, but for that you had to sacrifice double error detection.

Of course increasing the minimal hamming distance even further will enable you to detect or correct more errors. The general formula is h >= e + 2*c + 1 where h is the minimal hamming distance, e is the number of errors you want to detect, and c the number of errors you want to correct. The more you want to correct, the fewer you will be able to detect.
 
The following users thanked this post: Zbig

Offline bsdphk

  • Regular Contributor
  • *
  • Posts: 206
  • Country: dk
Re: How does ECC memory work?
« Reply #4 on: December 10, 2021, 11:22:41 am »
I understand that ECC memory, in great simplification, could be called a “RAID with parity for RAM”

Hammings original article is very readable:

https://archive.org/details/bstj29-2-147/mode/2up

The basic idea is that you have many parity bits, and to arrange which data bits the cover such that the pattern of parity errors point out which single bit got corrupted.

Mathematically what you do is you use a wider storage word, and make sure to only use "codes" which have a hamming distance of three or more.

Here is an example of that:  https://datamuseum.dk/wiki/Rational/R1000s400/Logbook#2021-11-28_7_down,_7_to_go

 
The following users thanked this post: Zbig

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 3939
  • Country: us
Re: How does ECC memory work?
« Reply #5 on: December 16, 2021, 05:40:52 pm »
Take a step back from error correction.  Lets say you have a byte where only a single bit is set.  i.e., it is 0000_0001, 0000_0010, 0000_0100, ...  There are 8 possible values.  You can encode that into only three bits whose value tell you which bit is set in the original.  One way we can calculate this 3 bit representation is that each bit is an XOR of 4 input bits.  The LSB is the xor of every other bit, the MSB is the XOR of the 4 most significant inputs, and the middle one takes 2 groups of 2.  If I give you the three bit code you can reconstruct which original bit was set.  If in addition I want to allow the case where all bits are zero, I add one more parity bit that is the XOR of all 8 input bits.  The 4th bit will be '1' if there is one bit set, zero if none are set.  Now I have a 4 bit "syndrome" that will distinguish these 9 states.  If 2 bits are set in the input, then the output will be XXX0, where at least one of the Xs is 1.  This is an invalid code.

Now we can turn that into an error correcting code.  This is easy because we did everything with XOR.  If I have a data byte I store DATA + SYNDROME(DATA).  Just consider the case where there is an error in the DATA, not the SYNDROME. When I retrieve it, I now have DATA' + SYNDROME(DATA).  I calculate SYNDROME(DATA') XOR SYNDROME(DATA).  Because the XORs commute, this ends up SYNDROME(DATA XOR DATA').   That turns it into the same one-hot decoder problem I described above.  If the result is all zeros there was no error.  If the result is XXX1 then there was a single bit error and Xs tell me which index was flipped.  Otherwise if the result is XXX0 there was an uncorrectable multi-bit error.

In reality you have to deal with the case of errors in the parity bits as well.  The real hamming code takes that into account and calculates the parity slightly differently.  However, the idea is quite similar.
« Last Edit: December 16, 2021, 05:58:36 pm by ejeffrey »
 
The following users thanked this post: Zbig

Offline ZbigTopic starter

  • Frequent Contributor
  • **
  • Posts: 927
  • Country: pl
Re: How does ECC memory work?
« Reply #6 on: December 21, 2021, 12:34:22 pm »
Thank you, everyone
 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf