Author Topic: Bit fade testing: memtest86 vs memtest86+  (Read 5750 times)


Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Bit fade testing: memtest86 vs memtest86+
« on: March 14, 2021, 12:21:46 am »
Searches produced nothing recent or relevant, so I thought I'd try here.

I've got 3x 4 slot HP Z400s and 1x 6 slot Z400.  The 6 slot machine has become unstable running Solaris 10 u8 after some 10 years of operation.

I know from experience that the issue is bit fade.  But identifying the bad DIMM is proving very troublesome.

I compiled memtest86+ from source.  Both the 6 slot machine and one of the 4 slot machines report copious bit fade errors, even though the 4 slot machine has shown no problems in service.  I should note that all Z400s use ECC memory.  Whether compiled from source or booted from a binary ISO download, memtest86+ reports essentially every location as bad on both machines, yet both machines pass all the other tests in the suite with either build.

The free version of memtest86 4.3.7 from the PassMark website reports no bit fade errors on either machine, nor errors on any other test.

I have been studying the source for memtest86+ and do not see any obvious errors.  However, I had one run in which the value read was reported as being the address.  That led me to examine the source in the hope that it was simply a failure to dereference a pointer, but that proved not to be the case.

I have a general idea of the addressing mode being used in memtest86+, but not good enough to spot an error.

If you have not encountered bit fade, it is the nastiest HW fault I know of.  A freshly booted system will complete an operation such as "dd if=/etc/passwd of=/dev/tape" without a problem.  But leave the system idle for a day, repeat the same command the next evening, and the system will kernel panic.  I spent about 2 weeks learning about that on a 3/60 for which I had built a custom kernel.  I eventually simply went back to using the generic kernel, which mapped the bad RAM to a device I did not have.  What makes the problem so difficult is that it might take days for the bit to fade.

What is happening is that critical kernel data is written at boot time, but the refresh rate is not adequate to maintain the correct values.

Any illuminati in the neighborhood?  I'd like to modify memtest86+ to initialize memory and then read it back at intervals that double up to a user-specified limit, so it would locate the 2-3 day fade problems.  But I've got to get it to work at all first.

Reg
 

Offline retiredfeline

  • Frequent Contributor
  • **
  • Posts: 572
  • Country: au
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #1 on: March 14, 2021, 12:35:23 am »
Sorry, can't help you, but thanks for introducing me to a new technical term I can use.

"Sorry I forgot about that appointment, I've got bit fade."  :-DD
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #2 on: March 14, 2021, 12:43:06 am »
I just modified the source to use a millisecond sleep between writing a block of memory and reading it.  This again produced a case where the value read, aka the "error", was the address.

Unfortunately, there doesn't appear to be a functioning forum for memtest86+.  I'd *really* like to solve this as it is the sort of problem that will drive a mission critical systems admin out of his mind.

Reg
 

Offline golden_labels

  • Super Contributor
  • ***
  • Posts: 1334
  • Country: pl
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #3 on: March 14, 2021, 01:58:18 am »
This is a very old bug that appeared back in 2013. Version 4.20, if you can get it, doesn’t exhibit the problem.
People imagine AI as T1000. What we got so far is glorified T9.
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #4 on: March 14, 2021, 02:20:03 am »
Can you explain the bug?  I'd like to fix it.

Reg
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #5 on: March 14, 2021, 02:52:12 am »
Does it use ECC? If so, can you not run HP's diagnostics to read the hardware logs?
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #6 on: March 14, 2021, 03:31:22 am »
It is ECC.  What diagnostics and what logs?  I run Solaris 10 u8 on this.  My experience to date with HP diagnostics is they are a waste of time for what I am doing.

It came with a Win 7 Pro license, but I use that on a Vista licensed Z400 I bought for $100.

Were it not for my long-standing interest in the problems that bit fade causes, I'd just buy 4 GB DIMMs and do an upgrade from 12 to 24 GB.  Actually, I'm going to do that anyway.  But I do *not* lose arguments with mere machines of any type, and especially not over this issue.  Quite simply this is a "death match" and I expect to live long enough to win.

Reg
 

Offline Monkeh

  • Super Contributor
  • ***
  • Posts: 8050
  • Country: gb
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #7 on: March 14, 2021, 03:37:07 am »
If it is, as you believe, a RAM issue, you should be seeing uncorrectable errors from the memory controller. The Z400 should be recording these - accessing that I leave up to you and their documentation.

If the memory controller is not reporting errors, it seems likely that you're chasing your tail.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6830
  • Country: fi
    • My home page and email address
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #8 on: March 14, 2021, 08:49:12 am »
I compiled memtest86+ from source.
Which version? 5.31b? 5.01? 4.20? 4.10? 5.01+5.01-3.1 Debian? Or perhaps the most maintained-looking Coreboot memtest86plus fork?
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #9 on: March 14, 2021, 04:24:41 pm »
I compiled Memtest86+ 5.31b from http://www.memtest.org/

The Coreboot repository link:

https://review.coreboot.org/memtest86plus.git

returns "Not found".  I can find the individual files, but no way to download a tarball. 

Sad to say after dealing with SCCS, RCS, Subversion, Mercurial and a few other version control systems I sort of lost enthusiasm for learning new ones.  I continue to use RCS as it suits my needs.  I did set up a git repository once, but lost interest in that project and with it my knowledge of git.

Running Solaris 10 u8 on an HP Z400 is, shall we say, "no longer supported", as both are 10 years old.  Nor was it ever likely that the system would log ECC errors on Solaris.  However, thanks for the tip.  I shall look at the Z400 service manual to see if it offers any information.

The HP documentation from that era is very Windows centric, and generally of little use to me.  So I have tended not to look at it much.  Just knowing that there is a facility for logging ECC errors helps.

I discovered that there are real serial ports on the motherboard.  They require a level-shifter kit, which I found on eBay, and I ordered a pair.  Once those arrive I should be able to run memtest86+ under gdb as a remote target.  I'm sure that will be an adventure, as I've only done that a few times, many years ago, with an MSP430.

Memtest86+ 5.31b writes patterns to memory in slices.  That makes sense for a lot of the other tests, but for a bit fade that takes 1-2 days to appear it is not optimal.  So I'm going to see if I can puzzle out the code so that it writes a pattern to all memory above where the tester itself is located and then tests at ever longer intervals, as I outlined earlier.

On to the HP Z400 service manual!

Have Fun!
Reg

Edit:

Well, as I feared, according to the Z400 service manual  excess ECC errors "generate a local user alert".  So only applicable to Windows.
« Last Edit: March 14, 2021, 04:39:16 pm by rhb »
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6830
  • Country: fi
    • My home page and email address
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #10 on: March 14, 2021, 04:53:07 pm »
I can find the individual files, but no way to download a tarball. 
Run
    git clone https://review.coreboot.org/memtest86plus.git
and it'll create memtest86plus/ under the current working directory and download the tree into it.
When in the memtest86plus/ directory, git pull will check and download any changes.

I would recommend trying this one, because it is the only tree/fork that seems to be maintained.
 
The following users thanked this post: rhb

Offline Monkeh

  • Super Contributor
  • ***
  • Posts: 8050
  • Country: gb
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #11 on: March 14, 2021, 04:53:39 pm »
The Coreboot repository link:

https://review.coreboot.org/memtest86plus.git

returns "Not found".  I can find the individual files, but no way to download a tarball. 

It's git, so you need to use git.  It's not exactly hard to find instructions on this.

Well, as I feared, according to the Z400 service manual  excess ECC errors "generate a local user alert".  So only applicable to Windows.

They have remote management support with an appropriate HP NIC which should expose such errors. Solaris should be able to log ECC failures with FMA, but don't ask me how to use it.
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #12 on: March 14, 2021, 07:45:18 pm »
All the stuff I've gotten from GitHub has had a zip or tar option.  I've got the coreboot code now and will do a diff with the other version.

The Z400 has a Broadcom NIC embedded in the board.  The larger problem is what it wants to talk to and someplace to run it.

At the moment I'm trying to backup a 3 TB ZFS mirrored pool to a 12 TB USB disk which I just bought.  I'm a bit worried that the WD USB disk might be going to sleep in the middle of the write operation.  It wouldn't be the first time I ran into something that stupid.

It might not be the DIMMs.  It could be the memory controller on the MB.  Mostly I want a reliable test.

Because of the very serious problem that bit fade poses I expect this will eat a bunch of my time.  The Z400 BIOS doesn't appear to offer any control of DRAM parameters.  ECC is not foolproof.  One can have multibit errors that are not detected.  It's better than not having ECC, but by no means a complete solution to all problems.

Time to search for an HP remote support utility.

Reg
 

Offline golden_labels

  • Super Contributor
  • ***
  • Posts: 1334
  • Country: pl
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #13 on: March 15, 2021, 03:02:53 am »
Can you explain the bug?  I'd like to fix it.
Unfortunately no.
People imagine AI as T1000. What we got so far is glorified T9.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3881
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #14 on: March 15, 2021, 03:33:51 am »
It is possible to have multi-bit errors that ECC can't correct, but if that is happening there will also be lots of correctable errors.  If you aren't seeing those, then either ECC isn't enabled, you aren't looking at the log, or the problem is not with the memory itself.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #15 on: March 15, 2021, 04:02:33 am »
It is possible to have multi-bit errors that ECC can't correct, but if that is happening there will also be lots of correctable errors.  If you aren't seeing those, then either ECC isn't enabled, you aren't looking at the log, or the problem is not with the memory itself.

At best it takes 4 bit errors to silently corrupt ECC memory....
 
* A first bit flip for a single-bit error

* Another bit flip for a multi-bit error

* Another bit flip for it to look like a single-bit error, which can be ECC-corrected to the wrong value

* And a last bit flip for it to become the wrong value, with valid ECC.

That's a big ask... not saying ECC is perfect, but it is pretty effective at finding memory errors.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3881
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #16 on: March 15, 2021, 05:43:33 am »
Exactly.  Basically my point is that if you have a faulty DIMM, whatever else happens you are going to have a lot of single-bit correctable errors that should be showing up in your logs. If you don't see those your memory may or may not be bad, but at least one other thing is wrong.
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #17 on: March 15, 2021, 04:27:18 pm »
I have been unable to find any reference to a way to access a log of ECC memory errors on the Z400  despite numerous searches.  There is nothing in the BIOS setup menus.

 I went through the entire HP Z400 service and maintenance manual.  There are some POST codes with reference to memory errors listed in the manual, but I've never seen one.

I did download a couple of Softpaqs which might help.  One is for Windows and the other for Linux, but both are .exe files so I have to transfer them to a Windows system to unpack them.  At the moment I'm running Debian on that system.

At present the problem system is booted from an OI LiveImage disk and has been sending a zfs filesystem image to a USB drive for 18 hours.  Not sure how much longer that will take.

At this point few, if any, of the DIMMs are in the slots they previously occupied, and they have been inserted and removed 3-4 times.  The recent issue, where the system became too unstable to complete the transfer of this zfs pool to the USB drive, appears to have been caused by a bad connection that the multiple insertions have since corrected.  Up until the system became too unstable to finish the zfs send operation it had never had the DIMMs touched and had run close to 24x7 for 10 years.  The exceptions were instances where the heat buildup from running 3 Z400s was such that it was 80 F in my lab/office and I had to take something down.

Once I get the system back together I'll do a scrub, leave the system idle for a day or two and repeat the scrub.  My expectation is that the first scrub will go fine and the 2nd will kernel panic.

Simple questions: 

If you write  to DDR3 ECC RAM and you never access that memory location again, will the system detect an error in that section of memory? 

If so, how does this work and where is this documented?

I could find lots of general explanations of what ECC memory is, but no specifics of operation.

Reg
 


Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #19 on: March 15, 2021, 06:07:50 pm »
Oh,  so cool!  A huge thanks!!!

I made the transition to Solaris 10 at home when we were still running 8 at work.  Once they fired the really good admins I no longer had any admins to have lunch with, and thus never learned the fault management system.  One of the guys they fired made a practice of reading all the man pages once a year.  He was a real gem.

Before my introduction to Unix, I was a grad student admin for a MicroVAX II and had to hike across campus to study the "grey wall" as I did not have a manual set.

I recently loaded all the Solaris manuals  onto a 12.9" iPad Pro and think it is really neat that I have tens of thousands of pages of manuals in an 8.5" x 11" x 0.25" form factor.  More documentation than I could physically lift in paper form presented in precisely the same format.

You have really made my day!

Have Fun!
Reg
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #20 on: March 16, 2021, 04:51:21 pm »
It seems that according to the fault management system there are no memory faults at all.  In fact there are no events other than zfs events in the 10 years since I set the system up.

This leaves me rather baffled: scrubs started immediately after a reboot complete successfully, but if I start the same operation a day or two after booting, the system consistently crashes.

So time to investigate the crash dump.  It crashed and rebooted after spending 25 hours transferring a zfs pool snapshot to a 12 TB USB drive.  However, as I was running the LiveImage from DVD the crash dump didn't get written to disk.

Reg




 

Offline Monkeh

  • Super Contributor
  • ***
  • Posts: 8050
  • Country: gb
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #21 on: March 16, 2021, 04:56:20 pm »
Well, that's all under the assumption that it has a driver for that memory controller, that the driver is loaded, and that any logs are persistent - but if there is bit fade, it's statistically unlikely that you wouldn't encounter at least some correctable errors for it to log before an outright failure.

Again, it's very possible you're chasing your tail when the issue could be a bad USB controller or device, or a power supply, or something else. If the tool won't agree with your diagnosis you need to question both.
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #22 on: March 16, 2021, 05:10:00 pm »
The fact that different versions of Memtest86 produce different results, but both versions produce the same result on 2 different machines is very difficult to understand.  Unless it's a bug in the version I've compiled and neither machine has a memory fault.

The error bits matching the address with a 1 ms sleep is also weird.

Something is not right, but what is still a mystery.  Probably time to use dtrace to determine where the zfs process is in memory and periodically read the structure that gets read when you start a scrub.

Reg
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #23 on: March 16, 2021, 05:46:05 pm »
If you write to memory and don't read it again, can the system detect an ECC error?  If so, how?  Is there sufficient logic in the refresh that it can detect ECC errors? 

I looked for the DDR3 specifications but all I found was ads and "how to"s.  No technical document that describes the architecture and operation.  I learned that Samsung developed the specs, but that was all I learned.

Reg
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #24 on: March 16, 2021, 11:10:33 pm »
I've had verification that ECCs are *only* computed on a read.  So a function pointer table initialized at boot which faded would not generate an ECC error until the very read that caused the kernel panic.  This is exactly what I had assumed was the case.

I booted the system, completed a scrub of all the pools with no errors.   I'm leaving the system idle and late tomorrow I'll repeat the scrub.  I expect a kernel panic and this time I'll take a look at the core dump to see if I can divine from it which DIMM is bad.  I suspect that I cannot, but I'll try.  Mostly I'm looking for a pointer dereference fault.  Dtrace might let me get more detailed, but I'll likely need some guidance on that.

Reg
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #25 on: March 17, 2021, 12:00:49 am »
Simple questions: 

If you write  to DDR3 ECC RAM and you never access that memory location again, will the system detect an error in that section of memory? 

If so, how does this work and where is this documented?

I could find lots of general explanations of what ECC memory is, but no specifics of operation.

Reg

If you write  to DDR3 ECC RAM and you never access that memory location again, will the system detect an error in that section of memory?

No, ECC is not calculated or verified in the RAM, but inside the chipset and/or CPU.

Some enterprise OSes have memory scrubbing, where idle CPU cycles are burnt by reading 'quiet' memory pages, allowing memory errors to be proactively detected and cleaned up.
If so, how does this work and where is this documented?

See above; what happens when an ECC error is detected and corrected is still a bit of a mystery to me.  It seems to raise a Machine Check Exception that is processed by the OS.  Under Linux this is visible as /sys/devices/system/edac/mc if the correct driver is loaded.  It may be possible that the BMC can log these errors independently.


I could find lots of general explanations of what ECC memory is, but no specifics of operation.

"SECDED" is the magic word for finding technical details in Google.  Attached is an image of the matrix used by Intel for their implementation of SECDED.

A single-bit error will result in a "syndrome" (a posh word for the set of parity errors) that matches one of the columns, so it can be corrected.

You can't XOR two single-bit error syndromes to get either zero or the syndrome of a different single-bit error.  This ensures that any double-bit error will still have a non-zero syndrome and will not be mistaken for a single-bit error.  However, as different double-bit errors can result in the same syndrome, the error can only be detected, not corrected.
« Last Edit: March 17, 2021, 12:04:15 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #26 on: March 17, 2021, 12:43:54 am »
You seem not to have read my post just before yours.

"Enterprise OSes" is useless jargon.  Solaris used to be the "enterprise OS", but sadly it is a zombie now.  What OS reads memory in the idle loop to look for ECC errors?  I've used some 30+ OSes, about half *nix and the rest various.  This includes playing a bit with Plan 9.

The ECC correction process is simple.  A read is performed of a word which is longer than the data word.  A calculation is made from the data word of what the extra bits should be.  If they do not match an ECC exception is raised.  If it is correctable, the correct value is written back to memory.  If not an uncorrectable error is generated.    From there it gets rather vague.

Single-bit parity systems (tape, disk and memory) threw an exception if the parity bit did not match, but could not fix it.  Current ECC adds extra bits such that a single-bit error can be corrected.  With more than one bit in error it is only able to say the word is wrong.  I'd expect that 2 bits would suffice to correct a single-bit error, but it's been a very long time since I read that book.

In "double" parity disk ECC such as ZFS RAIDZ2, if you lose 2 of 4 disks you can recover with no data loss.  I'd expect the same to apply in memory. I've actually tested that and it works. But that's more than 2 bits. In a 3 disk "single" parity RAIDZ1 array, ~ 1/3 of the disk space is used for ECC.  In a 4 disk RAIDZ2 it's 1/2 the disk space.

Reg

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #27 on: March 17, 2021, 03:25:56 am »
You seem not to have read my post just before yours.

"Enterprise OSes" is useless jargon.  Solaris used to be the "enterprise OS", but sadly it is a zombie now.  What OS reads memory in the idle loop to look for ECC errors?  I've used some 30+ OSes, about half *nix and the rest various.  This includes playing a bit with Plan 9.

Sorry I triggered you with "enterprise OSes"!

A long while ago I used to work as an HP field engineer, going around and replacing memory whose faults were detected through memory scrubbing.  If you want, you can have a look at the QuickSpecs for the Integrity rx2800 server, where it explicitly states

Quote
Key features:
• DRAM ECC (Double Device Data Correction - DDDC)
• Memory Scrubbing
• Scalable Memory Interface (SMI) memory channel protection
• ECC and double chip spare to overcome single DRAM chip failures.

Or you can have a look at the Power8 systems from IBM:

Quote

The memory design also takes advantage of a hardware based memory scrubbing that allows the service processor and other system firmware to make a clear distinction between random-soft errors and solid uncorrectable errors. Random soft errors are corrected with scrubbing, without using any spare capacity or having to make predictive parts callouts.

So some systems do support memory scrubbing.

In "double" parity disk ECC such as ZFS RAIDZ2, if you lose 2 of 4 disks you can recover with no data loss.  I'd expect the same to apply in memory.

Sadly it doesn't.  That is an erasure code - it can recover from lost data, but not from corrupted data - unless you include a way to identify the corrupted data (like a block-level checksum) so you can ignore the bad bits.
« Last Edit: March 17, 2021, 03:39:15 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3881
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #28 on: March 17, 2021, 04:43:44 am »
I've had verification that ECCs are *only* computed on a read.  So a function pointer table initialized at boot which faded would not generate an ECC error because it  caused a kernel panic. This is exactly what I have assumed was the case.

I booted the system, completed a scrub of all the pools with no errors.   I'm leaving the system idle and late tomorrow I'll repeat the scrub.  I expect a kernel panic and this time I'll take a look at the core dump to see if I can divine from it which DIMM is bad.  I suspect that I cannot, but I'll try.  Mostly I'm looking for a pointer dereference fault.  Dtrace might let me get more detailed, but I'll likely need some guidance on that.

I'm confused what you are getting at.  What pointers do you think are getting corrupted?  Memory which is initialized and then degrades but is not read back will not cause ECC errors but also won't cause any problems.  If it does get read back it will cause ECC errors.  Is your theory that you only ever get multi-bit errors which ECC can't correct and therefore cause a system halt?  That seems extremely unlikely.

Honestly if you have ECC memory and you aren't seeing ECC errors then what you are seeing aren't memory errors and they aren't memory fade.  There are plenty of other hardware problems you could have, but I think you are barking up the wrong tree here.
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #29 on: March 17, 2021, 04:02:34 pm »
What I am saying is that pointers initialized at boot time, and not referenced in normal operation unless a scrub was started, faded and caused a kernel panic when they were dereferenced.  At that point everything stops except the core dump operation; fmd is no longer runnable, as the scheduler has stopped.  Think about what has to happen on a kernel panic: it generates an interrupt, the service routine for that interrupt writes all the kernel memory to disk, and then it halts the system, at which point the machine may or may not reboot depending on the fault and how the system is configured.

And if the CPU has a split supply as did the MicroVAX II, if you drop one side of the supply you don't even get a core dump.  It took 7 or 8 trips by DEC before we found that.  And we only found it because it happened when we had the skins off the BA123 world box and could see the LEDs for the PSU.  After coming in to find the system hung and logging a service call many times at ~ 60 day intervals I got on the phone to the field service manager.  He blew me off until I started reading the service call notes and dates.  I had DEC FSEs camped out for a solid week before we found it.  The first one was a hoot.  Mouse grey suit, purple ruffle shirt and grey hair down to the middle of his back.  He was a biker and had started working for DEC going around with a bag of transistors and a soldering iron fixing PDP-8s.  They didn't let me have him long as he was their most senior FSE.  Boy was he good!  By the time he showed up I had been logging power conditions with the UT Austin Computer Services instrument for almost a month.  Not a single glitch.  Actual fault was a bad thermistor in the top of the cabinet.

Objects consist of structures filled with pointers.  I don't like C++, but in this case it's a very appropriate tool for managing different pool versions on a system.

I expect I'll eventually see ECC errors now as I have a 1 in 6 chance that the DIMM is in the same slot.  It did not kernel panic on a scrub this morning, so I've started my TB+ backup again.    If the system remains stable now, I'll write a program that initializes a very large array and then scans it for bit errors over ever longer time periods.  That should generate ECC errors that will get logged.

The HP and IBM systems memory checking is quite interesting.  Is the HP system related to the Convex acquisition?  I know Convex had developed a very impressive system backplane design about that time.  I think they called it "Exemplar", but that was 25 years ago so I could easily have that wrong.  It looked like the perfect system to support thousands of X terms such as NCD sold in a uniform environment. "Thin client" before the buzzword was invented.

I'm beginning to regret not buying the 5th ed of "Computer Architecture" by Patterson and Hennessy as now that it is out of print it is a bit pricey.  I have the first 4 editions.  I also see they have an edition devoted to the RISC-V.  That should be very interesting. 

Have Fun!
Reg
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #30 on: March 17, 2021, 09:08:13 pm »
It turns out that fmd *will* log a kernel panic.  However, it's never done that in the 10 years I've been running the Solaris 10 system and I've kernel panicked it many times.  The first was via "format -e" on an unlabeled 3 TB disk.  Panic logging via fmd may be a more recent feature.

The system locked up on me completely part way through the "zfs send".  No screen or keyboard response.  I had to force it down.  At least with zfs one doesn't risk corrupting the filesystem by doing that.  There were a large number of errors on one vdev in the scratch pool so I detached it to make the backup.

I've swapped the PSU.  Frustratingly neither of my 2 Chinese PSU testers works now.  No idea why, they both worked fine the last time I used them.  Now one just beeps without showing anything on the LCD.  The other does nothing at all.  Very annoying as it raises the question of will a replacement last until I need it.

There are days when I really don't like computers.

Reg
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3881
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #31 on: March 18, 2021, 05:37:25 am »
You could run a rowhammer test like this: https://github.com/google/rowhammer-test.  I don't know if that example will require modifications to run on Solaris, but it shouldn't be too hard.

If you have 10 year old DDR3, chances are high that it is vulnerable to rowhammer regardless of any other issues.  This will likely generate a range of both single and multi-bit errors so you can verify that your ECC reporting is working and also to see what happens when you get uncorrectable errors.
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7154
  • Country: pl
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #32 on: March 18, 2021, 08:13:58 am »
There is the most boring possibility: it's just a software bug >:D

I'm not familiar with Intel, but on AMD there is an option in the BIOS to enable background ECC RAM scrubbing by the CPU.

I would also try to provoke ECC errors (bad DIMM or overclocking if possible) and see if the OS reacts to them at all, if you aren't sure of that. I think memtest also supports reporting ECC errors on some platforms, though not on any of mine IIRC. Reducing some memory timing by a notch tended to yield plentiful correctable ECC errors for me.
« Last Edit: March 18, 2021, 08:17:26 am by magic »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #33 on: March 18, 2021, 10:21:46 am »
There is the most boring possibility: it's just a software bug >:D

I'm not familiar with Intel, but on AMD there is an option in the BIOS to enable background ECC RAM scrubbing by the CPU.

I would also try to provoke ECC errors (bad DIMM or overclocking if possible) and see if the OS reacts to them at all, if you aren't sure of that. I think memtest also supports reporting ECC errors on some platforms, though not on any of mine IIRC. Reducing some memory timing by a notch tended to yield plentiful correctable ECC errors for me.

Be smarter than me, and never run a full fsck when you suspect you might have bad memory in your system...
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline rhbTopic starter

  • Super Contributor
  • ***
  • Posts: 3492
  • Country: us
Re: Bit fade testing: memtest86 vs memtest86+
« Reply #34 on: March 18, 2021, 03:28:01 pm »
At the moment the main issue has become a couple of failing disk drives :-(  Fortunately I am running zfs which is very robust.  This brings the failed disk total over 10 years to 4.  No lost data.

There is something else going on, as the screen now won't unblank.  That always worked prior to the DIMM shuffle.  RBAC settings won't let me in as root, and the RAIDZ1 export pool is corrupted so I can't get in via my user account.  Another item on my "To Do" list.

So do I migrate the Windows & Debian instance off the 2 TB spare drive I "borrowed" onto a 5 TB disk? Or wait for replacement drives?

All the comments about changing memory settings and so forth don't apply to the Z400.  Not possible.  The settings are controlled by what is stored in the DIMMs.  My biggest complaint about the HP BIOS is that it won't let me select which of multiple hard drives to boot, forcing me to use more complex means.

I've used swappable drive caddies for years to avoid the "installing B kills A" problem.  It's *really* nasty if you try to boot Windows, Linux and Solaris off the same laptop drive.  It took 12 tries to figure that one out.

Reg
 

