Author Topic: Bit fade testing: memtest86 vs memtest86+  (Read 5598 times)


Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #25 on: March 17, 2021, 12:00:49 am »
Simple questions: 

If you write  to DDR3 ECC RAM and you never access that memory location again, will the system detect an error in that section of memory? 

If so, how does this work and where is this documented?

I could find lots of general explanations of what ECC memory is, but no specifics of operation.

Reg

If you write  to DDR3 ECC RAM and you never access that memory location again, will the system detect an error in that section of memory?

No. ECC is not calculated or verified in the RAM itself, but inside the memory controller (chipset and/or CPU), and only when the location is actually read.

Some enterprise OSes have memory scrubbing, where idle CPU cycles are burnt by reading 'quiet' memory pages, allowing memory errors to be proactively detected and cleaned up.
If so, how does this work and where is this documented?

See above. What happens when an ECC error is detected and corrected is still a bit of a mystery to me. It seems to raise a Machine Check Exception that is processed by the OS. Under Linux this is visible under /sys/devices/system/edac/mc if the correct driver is loaded. It may be possible that the BMC can log these errors independently.
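
For anyone wanting to check this on a Linux box, here is a minimal Python sketch that reads the per-controller corrected/uncorrected counters from that sysfs tree. It assumes an EDAC driver for the memory controller is loaded and that each mcN directory exposes ce_count and ue_count. The function name read_edac_counts is my own invention, not part of any tool.

```python
from pathlib import Path

# Location of the EDAC memory-controller tree on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} for each mcN directory.

    Returns an empty dict if the tree is missing (no EDAC driver, or not Linux).
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers we cannot read or parse
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

An empty result only means the counters are not exposed; it says nothing about whether the hardware is doing ECC.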


I could find lots of general explanations of what ECC memory is, but no specifics of operation.

"SECDED" is the magic word for finding technical details in Google - Attached is an image of the matrix used by Intel for their implementation of SECDED.

A single-bit error results in a "syndrome" (a posh word for the set of parity errors) that matches exactly one of the columns, so it can be corrected.

You can't XOR two single-bit error syndromes to get either zero, or the syndrome of a different single-bit error. This ensures that any double-bit error will still have a non-zero syndrome, and will not be mistaken for a single-bit error. However, as different double-bit errors can result in the same syndrome, the error can only be detected, not corrected.
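
To make the syndrome idea concrete, here is a toy sketch in Python. It is NOT Intel's (72,64) matrix; it is an extended Hamming (8,4) code, which I am assuming is the smallest thing that shows the same SECDED behaviour. Bit 0 is an overall parity bit, bits 1, 2 and 4 are Hamming parity bits, and bits 3, 5, 6 and 7 carry data. The names encode and decode are just illustrative.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # codeword positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)     # Hamming parity positions (powers of two)


def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of 0/1 values)."""
    word = [0] * 8
    for k, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> k) & 1
    # XOR together the positions of all set data bits ...
    syndrome = 0
    for pos in DATA_POSITIONS:
        if word[pos]:
            syndrome ^= pos
    # ... and set parity bits so the XOR over the whole word becomes zero.
    for p in PARITY_POSITIONS:
        word[p] = 1 if syndrome & p else 0
    word[0] = sum(word[1:]) & 1  # overall parity makes the total weight even
    return word


def decode(word):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    overall = sum(word) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself was hit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = sum(word[pos] << k for k, pos in enumerate(DATA_POSITIONS))
    return nibble, status


if __name__ == "__main__":
    cw = encode(0b1011)
    cw[5] ^= 1                    # one flipped bit
    print(decode(cw))
    cw[6] ^= 1                    # a second flipped bit in the same word
    print(decode(cw))
```

The overall parity bit is what separates the two cases: one flip makes the total parity odd, two flips leave it even but with a non-zero syndrome.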
« Last Edit: March 17, 2021, 12:04:15 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3491
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #26 on: March 17, 2021, 12:43:54 am »
You seem not to have read my post just before yours.

"Enterprise OSes" is useless jargon.  Solaris used to be the "enterprise OS" , but sadly it is a zombie now.  What OS reads memory in the idle loop to look for  ECC errors?  I've used some 30+ OSes, about half *nix and the rest various.  This includes playing a bit with Plan 9.

The ECC correction process is simple.  A read is performed of a word which is longer than the data word.  A calculation is made from the data word of what the extra bits should be.  If they do not match an ECC exception is raised.  If it is correctable, the correct value is written back to memory.  If not an uncorrectable error is generated.    From there it gets rather vague.

Single bit parity systems (tape, disk and memory) threw an exception if the parity bit did not match, but could not fix it. Current ECC added  extra bits such that a single bit error could be corrected.  More than one bit and it was only able to say it's wrong.  I'd expect that 2 bits would suffice to correct a single bit error, but it's been a very long time since I read that book. 
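
The number of extra bits can be checked against the Hamming bound rather than from memory of the book: r check bits can correct a single error in m data bits only if 2^r >= m + r + 1, and SECDED needs one more bit on top of that. A small Python sketch of the arithmetic (the function names are made up for illustration):

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """SEC plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1


if __name__ == "__main__":
    for m in (1, 4, 8, 64):
        print(m, sec_check_bits(m), secded_check_bits(m))
```

By this bound 2 check bits are enough only for a single data bit. A 64-bit word needs 7 for correction and 8 for SECDED, which matches the 72-bit width of an ECC DIMM.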

In "double" parity disk ECC such as ZFS RAIDZ2, if you lose 2 of 4 disks you can recover with no data loss.  I'd expect the same to apply in memory. I've actually tested that and it works. But that's more than 2 bits. In a 3 disk "single" parity RAIDZ1 array, ~ 1/3 of the disk space is used for ECC.  In a 4 disk RAIDZ2 it's 1/2 the disk space.

Reg

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #27 on: March 17, 2021, 03:25:56 am »
You seem not to have read my post just before yours.

"Enterprise OSes" is useless jargon.  Solaris used to be the "enterprise OS" , but sadly it is a zombie now.  What OS reads memory in the idle loop to look for  ECC errors?  I've used some 30+ OSes, about half *nix and the rest various.  This includes playing a bit with Plan 9.

Sorry I triggered you with "enterprise OSes"!

A long while ago I used to work as an HP Field Engineer, going around and replacing memory that had been detected as faulty through memory scrubbing. If you want, you can have a look at the QuickSpecs for the Integrity rx2800 server, where it explicitly states:

Quote
Key features:
• DRAM ECC (Double Device Data Correction - DDDC)
• Memory Scrubbing
• Scalable Memory Interface (SMI) memory channel protection
• ECC and double chip spare to overcome single DRAM chip failures.

Or you can have a look at the Power8 systems from IBM:

Quote

The memory design also takes advantage of a hardware based memory scrubbing that allows the service processor and other system firmware to make a clear distinction between random-soft errors and solid uncorrectable errors. Random soft errors are corrected with scrubbing, without using any spare capacity or having to make predictive parts callouts.

So some systems do support memory scrubbing.
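
Why scrubbing helps can be shown with a crude simulation. This is only a sketch under my own assumptions, not how HP or IBM implement it: memory is a list of words, random upsets each flip one bit in a random word, a word with one flipped bit is correctable, and a word that collects two is lost. A scrub pass rewrites every word that has exactly one flipped bit. Two hits landing on the very same bit (which would cancel out) are ignored.

```python
import random


def simulate(words, hits, scrub_every=None, seed=1):
    """Return how many words ever accumulated two or more flipped bits.

    words       -- number of ECC words in the toy memory
    hits        -- number of random single-bit upsets to inject
    scrub_every -- run a scrub pass after this many hits (None = never)
    seed        -- fixed seed so runs with and without scrubbing see the same hits
    """
    rng = random.Random(seed)
    flips = [0] * words      # flipped-bit count per word
    lost = set()             # words that became uncorrectable
    for n in range(1, hits + 1):
        w = rng.randrange(words)
        flips[w] += 1
        if flips[w] >= 2:
            lost.add(w)
        if scrub_every and n % scrub_every == 0:
            # Scrub: read every word, write back the corrected value
            # wherever exactly one bit was wrong.
            for i in range(words):
                if flips[i] == 1:
                    flips[i] = 0
    return len(lost)


if __name__ == "__main__":
    print("no scrubbing:  ", simulate(1000, 500))
    print("scrub every 50:", simulate(1000, 500, scrub_every=50))
```

Because both runs see the same sequence of hits, the scrubbed run can never lose more words than the unscrubbed one. How much it helps depends entirely on the made-up rates.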

In "double" parity disk ECC such as ZFS RAIDZ2, if you lose 2 of 4 disks you can recover with no data loss.  I'd expect the same to apply in memory.

Sadly it doesn't. That is an Erasure Code - it can recover from lost data, but not from corrupted data - unless you include a way to identify corrupted data (like a block level checksum) so you can ignore the bad bits.
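
A small Python sketch of the difference, using plain XOR parity over three data blocks (single parity, RAID-5 style, chosen for brevity; it is not the actual RAIDZ2 maths). The block contents, the CRC32 checksums and the helper names are all just assumptions for illustration.

```python
import zlib
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild(blocks, parity, missing):
    """Recover blocks[missing] from the other blocks plus the parity block."""
    survivors = [b for i, b in enumerate(blocks) if i != missing]
    return xor_blocks(survivors + [parity])


def find_bad(blocks, checksums):
    """Return indices of blocks whose CRC32 no longer matches."""
    return [i for i, b in enumerate(blocks) if zlib.crc32(b) != checksums[i]]


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
checksums = [zlib.crc32(b) for b in data]

# Erasure: we KNOW block 1 is gone, so parity rebuilds it.
assert rebuild(data, parity, 1) == data[1]

# Silent corruption: one bit in block 1 changed ("B" became "F").
damaged = [data[0], b"BFBB", data[2]]
# Parity shows that something is wrong ...
assert xor_blocks(damaged + [parity]) != bytes(4)
# ... but not where. Guessing the wrong block turns good data into garbage.
assert rebuild(damaged, parity, 0) != data[0]

# With a per-block checksum the bad block is identified and can then be
# treated as an erasure.
bad = find_bad(damaged, checksums)
assert rebuild(damaged, parity, bad[0]) == data[1]
```

Memory ECC has no separate checksum telling it which bit to distrust, so the code itself has to locate the error. That is what the syndrome does.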
« Last Edit: March 17, 2021, 03:39:15 am by hamster_nz »
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3865
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #28 on: March 17, 2021, 04:43:44 am »
I've had verification that ECCs are *only* computed on a read.  So a function pointer table initialized at boot which faded would not generate an ECC error because it  caused a kernel panic. This is exactly what I have assumed was the case.

I booted the system, completed a scrub of all the pools with no errors.   I'm leaving the system idle and late tomorrow I'll repeat the scrub.  I expect a kernel panic and this time I'll take a look at the core dump to see if I can divine from it which DIMM is bad.  I suspect that I cannot, but I'll try.  Mostly I'm looking for a pointer dereference fault.  Dtrace might let me get more detailed, but I'll likely need some guidance on that.

I'm confused about what you are getting at.  What pointers do you think are getting corrupted?  Memory which is initialized and then degrades but is not read back will not cause ECC errors, but it also won't cause any problems.  If it does get read back it will cause ECC errors.  Is your theory that you only ever get multi-bit errors which ECC can't correct and which therefore cause a system halt?  That seems extremely unlikely.

Honestly if you have ECC memory and you aren't seeing ECC errors then what you are seeing aren't memory errors and they aren't memory fade.  There are plenty of other hardware problems you could have, but I think you are barking up the wrong tree here.
 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3491
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #29 on: March 17, 2021, 04:02:34 pm »
What I am saying is that pointers which were initialized at boot time, and which were not referenced in normal operation unless a scrub was started, faded and caused a kernel panic when they were dereferenced. At that point everything stops except the core dump operation. fmd is no longer runnable as the scheduler has stopped.  Think about what has to happen when the kernel panics: it generates an interrupt, and the service routine for that interrupt writes all the kernel memory to disk and then halts the system, at which point it may or may not reboot depending on the fault and how the system is configured.

And if the CPU has a split supply as did the MicroVAX II, if you drop one side of the supply you don't even get a core dump.  It took 7 or 8 trips by DEC before we found that.  And we only found it because it happened when we had the skins off the BA123 world box and could see the LEDs for the PSU.  After coming in to find the system hung and logging a service call many times at ~ 60 day intervals I got on the phone to the field service manager.  He blew me off until I started reading the service call notes and dates.  I had DEC FSEs camped out for a solid week before we found it.  The first one was a hoot.  Mouse grey suit, purple ruffle shirt and grey hair down to the middle of his back.  He was a biker and had started working for DEC going around with a bag of transistors and a soldering iron fixing PDP-8s.  They didn't let me have him long as he was their most senior FSE.  Boy was he good!  By the time he showed up I had been logging power conditions with the UT Austin Computer Services instrument for almost a month.  Not a single glitch.  Actual fault was a bad thermistor in the top of the cabinet.

Objects consist of structures filled with pointers.  I don't like C++, but in this case because of the versions of the pools it's a very appropriate tool for managing different pool versions on a system.

I expect I'll eventually see ECC errors now as I have a 1 in 6 chance that the DIMM is in the same slot.  It did not kernel panic on a scrub this morning, so I've started my TB+ backup again.    If the system remains stable now, I'll write a program that initializes a very large array and then scans it for bit errors over ever longer time periods.  That should generate ECC errors that will get logged.
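
A rough Python sketch of the kind of program I have in mind. The names, the 0xAA pattern and the chunk size are arbitrary choices, and it assumes the array is large enough that the reads really come from DRAM rather than cache and that the pages are not swapped out in between.

```python
import time


def fill(size, pattern=0xAA):
    """Allocate `size` bytes, each set to `pattern`."""
    return bytearray([pattern]) * size


def scan(buf, pattern=0xAA, chunk=1 << 20):
    """Return [(offset, flipped_bits_mask), ...] for every byte that changed."""
    errors = []
    good = bytes([pattern]) * chunk
    for start in range(0, len(buf), chunk):
        piece = buf[start:start + chunk]
        if piece == good[:len(piece)]:
            continue  # fast path: whole chunk still matches
        for i, b in enumerate(piece):
            if b != pattern:
                errors.append((start + i, b ^ pattern))
    return errors


def fade_test(size, waits, pattern=0xAA, sleep=time.sleep):
    """Fill once, then rescan after each wait (seconds) in `waits`."""
    buf = fill(size, pattern)
    results = []
    for wait in waits:
        sleep(wait)
        results.append((wait, scan(buf, pattern)))
    return results


if __name__ == "__main__":
    # Small numbers for a quick try; a real run would use GiB and hours.
    for wait, errors in fade_test(64 << 20, [1, 2, 4]):
        print(f"after {wait}s wait: {len(errors)} bad bytes")
```

On an ECC system a single-bit error should be corrected before the program ever sees it, so an empty scan does not mean nothing happened. The point of the reads is to force the ECC check so that anything found gets logged.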

The HP and IBM systems memory checking is quite interesting.  Is the HP system related to the Convex acquisition?  I know Convex had developed a very impressive system backplane design about that time.  I think they called it "Exemplar", but that was 25 years ago so I could easily have that wrong.  It looked like the perfect system to support thousands of X terms such as NCD sold in a uniform environment. "Thin client" before the buzzword was invented.

I'm beginning to regret not buying the 5th ed of "Computer Architecture" by Patterson and Hennessy as now that it is out of print it is a bit pricey.  I have the first 4 editions.  I also see they have an edition devoted to the RISC-V.  That should be very interesting. 

Have Fun!
Reg
 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3491
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #30 on: March 17, 2021, 09:08:13 pm »
It turns out that fmd *will* log a kernel panic.  However, it's never done that in the 10 years I've been running the Solaris 10 system and I've kernel panicked it many times.  The first was via "format -e" on an unlabeled 3 TB disk.  Panic logging via fmd may be a more recent feature.

The system locked up on me completely part way through the "zfs send".  No screen or keyboard response.  I had to force it down.  At least with zfs one doesn't risk corrupting the filesystem by doing that.  There were a large number of errors on one vdev in the scratch pool so I detached it to make the backup.

I've swapped the PSU.  Frustratingly neither of my 2 Chinese PSU testers works now.  No idea why, they both worked fine the last time I used them.  Now one just beeps without showing anything on the LCD.  The other does nothing at all.  Very annoying as it raises the question of will a replacement last until I need it.

There are days when I really don't like computers.

Reg
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3865
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #31 on: March 18, 2021, 05:37:25 am »
You could run a rowhammer test like this: https://github.com/google/rowhammer-test.  I don't know if that example will require modifications to run on solaris, but it shouldn't be too hard.

If you have 10 year old DDR3, chances are high that it is vulnerable to rowhammer regardless of any other issues.  This will likely generate a range of both single and multi-bit errors so you can verify that your ECC reporting is working and also to see what happens when you get uncorrectable errors.
 

Offline magic

  • Super Contributor
  • ***
  • Posts: 7045
  • Country: pl
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #32 on: March 18, 2021, 08:13:58 am »
There is also the most boring possibility: it's just a software bug >:D

I'm not familiar with Intel, but on AMD there is an option in the BIOS to enable background ECC RAM scrubbing by the CPU.

I would also try to provoke ECC errors (bad DIMM or overclocking if possible) and see if the OS reacts to them at all, if you aren't sure of that. I think memtest also supports reporting ECC errors on some platforms, though not on any of mine IIRC. Reducing some memory timing by a notch tended to yield plentiful correctable ECC errors for me.
« Last Edit: March 18, 2021, 08:17:26 am by magic »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #33 on: March 18, 2021, 10:21:46 am »
There is also the most boring possibility: it's just a software bug >:D

I'm not familiar with Intel, but on AMD there is an option in the BIOS to enable background ECC RAM scrubbing by the CPU.

I would also try to provoke ECC errors (bad DIMM or overclocking if possible) and see if the OS reacts to them at all, if you aren't sure of that. I think memtest also supports reporting ECC errors on some platforms, though not on any of mine IIRC. Reducing some memory timing by a notch tended to yield plentiful correctable ECC errors for me.

Be smarter than me, and never run a full fsck when you suspect you might have bad memory in your system...
 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3491
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #34 on: March 18, 2021, 03:28:01 pm »
At the moment the main issue has become a couple of failing disk drives :-(  Fortunately I am running zfs which is very robust.  This brings the failed disk total over 10 years to 4.  No lost data.

There is something else going on as now the screen won't unblank. That always worked prior to the DIMM shuffle.  RBAC settings won't let me in as root and the RAIDZ1 export pool is corrupted so I can't get in via my user account.  Another item on my "To Do" list.

So do I migrate the Windows & Debian instance off the 2 TB spare drive I "borrowed" onto a 5 TB disk? Or wait for replacement drives?

All the comments about changing memory settings and so forth don't apply to the Z400.  Not possible.  The settings are controlled by what is stored in the DIMMs.  My biggest complaint about the HP BIOS is that it won't let me select which of multiple hard drives to boot from, forcing me to use more complex means.

I've used swappable drive caddies for years to avoid the "installing B kills A" problem.  It's *really* nasty if you try to boot Windows, Linux and Solaris off the same laptop drive.  It took 12 tries to figure that one out.

Reg
 

