Simple questions:
If you write to DDR3 ECC RAM and you never access that memory location again, will the system detect an error in that section of memory?
If so, how does this work and where is this documented?
I could find lots of general explanations of what ECC memory is, but no specifics of operation.
Reg
If you write to DDR3 ECC RAM and you never access that memory location again, will the system detect an error in that section of memory?
No. ECC is not calculated or verified in the RAM itself, but in the memory controller (inside the chipset and/or CPU), and the check only happens when the data is read back.
Some enterprise OSes have memory scrubbing, where idle CPU cycles are spent reading 'quiet' memory pages, so that memory errors are detected and corrected proactively.
If so, how does this work and where is this documented?
See above. What happens when an ECC error is detected and corrected is still a bit of a mystery to me. It seems to raise a Machine Check Exception that is processed by the OS. Under Linux this is visible under /sys/devices/system/edac/mc if the correct driver is loaded. The BMC may also be able to log these errors independently.
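To see what the EDAC driver has recorded, you can read the per-controller counters under that sysfs directory. The following is a minimal sketch, assuming the usual layout in which each memory controller appears as a directory mc0, mc1, ... containing ce_count (corrected errors) and ue_count (uncorrected errors). The function name read_edac_counts is made up for this example, and the exact files present depend on the kernel and driver.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    Assumes the common EDAC sysfs layout (mc0, mc1, ...). Returns an empty
    dict if the directory is missing, e.g. no EDAC driver is loaded.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file absent or unreadable on this system
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found (driver not loaded?)")
    for name, entry in result.items():
        print(name, entry)
```

A ce_count that keeps rising points at errors that were detected and corrected on reads; it says nothing about locations that have not been read since they were written.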
I could find lots of general explanations of what ECC memory is, but no specifics of operation.
"SECDED" (single error correction, double error detection) is the magic word for finding technical details in Google. Attached is an image of the matrix used by Intel for their implementation of SECDED.
A single-bit error will result in a "syndrome" (posh word for a set of parity errors) that matches exactly one of the columns of the matrix, which identifies the failing bit so it can be corrected.
The columns are chosen so that you can't XOR two single-bit error syndromes to get either zero or the syndrome of a different single-bit error. This ensures that any double-bit error will still have a non-zero syndrome and will not be mistaken for a single-bit error. However, as different double-bit errors can result in the same syndrome, a double-bit error can only be detected, not corrected.
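The syndrome logic can be tried out on a toy code. The sketch below is not Intel's matrix: it is a small (8,4) SECDED code invented for this example, whose eight columns are the eight 4-bit values with an odd number of 1s. XORing two distinct odd-weight columns always gives a non-zero even-weight value, so it can never equal another column. The names encode, decode and syndrome are made up for this example.

```python
# One 4-bit column of the parity-check matrix per codeword bit.
# Bits 0..3 carry data, bits 4..7 carry the check bits.
DATA_COLS = [0b0111, 0b1011, 0b1101, 0b1110]   # weight-3 columns
CHECK_COLS = [0b0001, 0b0010, 0b0100, 0b1000]  # weight-1 columns
COLUMNS = DATA_COLS + CHECK_COLS


def syndrome(word):
    """XOR together the columns of every bit that is set in `word`."""
    s = 0
    for pos, col in enumerate(COLUMNS):
        if (word >> pos) & 1:
            s ^= col
    return s


def encode(data):
    """Turn 4 data bits into an 8-bit codeword whose syndrome is zero."""
    data &= 0xF
    check = syndrome(data)      # only the data columns contribute here
    return data | (check << 4)  # check bit i sits at position 4 + i


def decode(word):
    """Return (status, data) as the memory controller would on a read."""
    s = syndrome(word)
    if s == 0:
        return "ok", word & 0xF
    if s in COLUMNS:            # syndrome matches one column: single-bit error
        fixed = word ^ (1 << COLUMNS.index(s))
        return "corrected", fixed & 0xF
    return "uncorrectable", None  # non-zero, matches no column: double-bit error


if __name__ == "__main__":
    cw = encode(0b1010)
    print(decode(cw))                        # clean read
    print(decode(cw ^ (1 << 2)))             # one flipped bit
    print(decode(cw ^ (1 << 0) ^ (1 << 1)))  # two flipped bits
    # Two different double-bit errors that share a syndrome, which is why
    # a double-bit error cannot be located and corrected:
    print(bin(syndrome(0b00000011)), bin(syndrome(0b11000000)))
```

Note that decode only runs when a word is read, which is the reason an error in a location that is never read again goes unnoticed unless something like scrubbing touches it.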