GZIP exceptions, but only on hot or rainy days
xmris:
I just found it on my news feed, may interest you guys...

https://alexyorke.github.io//2022/11/11/gzip-exceptions-but-only-on-hot-or-rainy-days/
hans:
Fun read. But it's probably nothing out of the ordinary. Computers have to work with noisy channels all around: a hard-drive is a noisy channel. The PCIe or SATA bridge to a device is. The CPU DDR bus, potentially made worse with bad DRAM refresh timings, is as well. Etc.

In theory we like to black box every device, and assume that it sufficiently checks for noise errors and can either correct or report them. E.g. the bits on a HDD platter will slowly rot away over time. If that sector is 'scrubbed' at regular intervals, the HDD still has a chance to recover the data using ECC and reallocate it if the errors exceed a certain threshold. However, if these scrubs are too infrequent then the data could be lost. That is, assuming the HDD's CRC flags the data as incorrect. If the HDD hands back a corrupted sector without reporting an error, then IMO that's a bug in the HDD given our black box model.
Errors can still originate at other interfaces though. A bad SATA cable can show the same symptoms of a HDD failure, even though a different error counter will increase in the SMART data, as the SATA protocol includes a CRC to hopefully catch those corruptions.
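
To illustrate the idea of a link-level CRC, here is a minimal sketch. It uses zlib's CRC-32 as a stand-in and is not the actual SATA framing; the function names send_frame and receive_frame are made up for this example.

```python
import zlib

def send_frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a 4-byte trailer (hypothetical framing)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "little")

def receive_frame(frame: bytes) -> bytes:
    """Recompute the CRC on the receiving side and reject the frame on mismatch."""
    payload, trailer = frame[:-4], frame[-4:]
    if zlib.crc32(payload) != int.from_bytes(trailer, "little"):
        raise ValueError("CRC mismatch: frame corrupted in transit")
    return payload

if __name__ == "__main__":
    frame = send_frame(b"sector data " * 10)
    noisy = bytearray(frame)
    noisy[3] ^= 0x10  # simulate a single bit flipped by a bad cable
    try:
        receive_frame(bytes(noisy))
    except ValueError as err:
        print(err)
```

The receiver can only report the error and ask for a retransmit. That is roughly why a bad cable shows up as an interface error counter rather than as silently wrong data.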

Probably the least protected connection in modern PCs is/was the system memory. ECC DRAM is not common, and so the CPU assumes that all data it fetches from RAM is always valid. Incorrect memory timings can make a system 'unstable' at best. There is good reason why workstations and servers use ECC RAM at only JEDEC speeds.
And also good reasons why a file system such as ZFS does regular scrubs to avoid bitrot, and has hierarchical CRC checks built into the filesystem (instead of relying on HDD CRC checks such as in hardware RAID), so that it can detect data corruption that originates from other noisy channels in a system (such as DRAM). Pulling in data, having it corrupt while processing it, and storing it back to HDD is the worst and potentially least traceable event that can happen.
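
A rough sketch of the end-to-end checksum idea, assuming the checksum is kept separately from the data it covers (as a parent block would keep it). SHA-256 is only a stand-in for whatever checksum the filesystem is configured with, and the helper names are hypothetical, not ZFS code.

```python
import hashlib

def store_block(data: bytes) -> tuple[bytes, str]:
    """Return the block plus a checksum that would live in the parent block."""
    return data, hashlib.sha256(data).hexdigest()

def verify_block(data: bytes, expected: str) -> bool:
    """Recompute the checksum on read; a mismatch means corruption somewhere on the path."""
    return hashlib.sha256(data).hexdigest() == expected

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single bit flip (disk, cable, or DRAM) at the given bit position."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

if __name__ == "__main__":
    block, checksum = store_block(b"important data" * 100)
    print(verify_block(block, checksum))                 # clean read
    print(verify_block(flip_bit(block, 42), checksum))   # corrupted read
```

Because the check runs in software after the data has crossed every bus, it does not matter which channel flipped the bit. It cannot help if the data was already corrupted in RAM before the checksum was computed, which is the worst case described above.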

I wonder if the blogposter would have found the issue with a utility like memtest in combination with the faulty environment. My gut feeling is that this is a system memory issue, however, IIRC memtest only uses ancient fixed data test patterns instead of a PN sequence.
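
For what a PN-sequence test pattern means in practice, here is a minimal sketch: fill a buffer from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11), then regenerate the same sequence from the seed and compare. The function names are made up for this example, and a Python bytearray is of course not a real DRAM test since caches and the interpreter sit in between. It only shows the fill/verify principle.

```python
def lfsr16(seed: int):
    """Yield pseudo-random bytes from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    if seed & 0xFFFF == 0:
        raise ValueError("seed must be non-zero, an all-zero LFSR never changes state")
    state = seed & 0xFFFF
    while True:
        byte = 0
        for _ in range(8):
            bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)
            byte = (byte << 1) | (state & 1)
        yield byte

def fill(buf: bytearray, seed: int) -> None:
    """Write the PN sequence into the buffer under test."""
    gen = lfsr16(seed)
    for i in range(len(buf)):
        buf[i] = next(gen)

def check(buf: bytearray, seed: int) -> list[int]:
    """Regenerate the sequence from the seed and return the offsets that differ."""
    return [i for i, (got, want) in enumerate(zip(buf, lfsr16(seed))) if got != want]

if __name__ == "__main__":
    buf = bytearray(4096)
    fill(buf, 0xACE1)
    buf[100] ^= 0x04  # simulate one flipped bit
    print(check(buf, 0xACE1))
```

Only the seed has to be remembered to verify the whole buffer, and changing the seed between passes gives every cell a different neighbourhood of data each time, which fixed patterns do not.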
tom66:
It's hard to believe the issues were caused solely by the missing ground for the laptop.  Perhaps there was some other noisy appliance on an extension cord that was causing a problem, but even that stretches credibility.

I'd definitely be more inclined to suggest bad RAM in this case like hans.  I had a problem with my desktop PC that wasn't apparent until I started doing large FPGA builds.  Vivado would die in the middle of a build with a checksum or integrity error, sometimes only after a few minutes.  I couldn't find anyone with this problem.  And a simple memory test revealed nothing, but when I ran a memory test overnight, a few errors began to appear.  One of the Kingston RAM modules was dead/dying.  My best guess is that the FPGA compiler exposed the issue due to the specific pattern of accesses across a large area of memory, perhaps a "rowhammer-by-accident" type fault that then only became apparent in memtest after some time.