Author Topic: Another SSD bites the dust  (Read 2906 times)

Offline MarkF

  • Super Contributor
  • ***
  • Posts: 2630
  • Country: us
Re: Another SSD bites the dust
« Reply #25 on: October 06, 2020, 07:38:37 am »
  Back in the early '80s, we were told that our memory failures were caused
by the software execution loops being too short.

  Maybe you are accessing your files too often?   :-//     :-DD

  Or your files are just too small?
 

Offline Wuerstchenhund

  • Super Contributor
  • ***
  • Posts: 3088
  • Country: gb
  • Able to drop by occasionally only
Re: Another SSD bites the dust
« Reply #26 on: October 06, 2020, 08:34:08 am »
Most RAID setups won't protect you from bit rot. If a file becomes corrupt, that corrupt file will just be replicated across the array.

To prevent that, you'll want to use a more robust file system like ZFS that detects and corrects that kind of corruption.

It's not so simple.

First of all, "bit rot" (soft errors, a flipped data bit) in hard drives has been *vastly* overhyped (mostly out of ignorance).

And ZFS, too, has been overhyped, again usually because of ignorance. Yes, it's a robust file system, and it has built-in checksumming so it can detect (and correct, if there's a second copy of the data) flipped data bits.
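To illustrate the "detect, and correct if there's a second copy" point, here is a minimal sketch in Python. It is not how ZFS is implemented; the function names, the SHA-256 choice and the two-copy "mirror" are all assumptions made for the example.

```python
import hashlib
from typing import List

def checksum(block: bytes) -> bytes:
    # SHA-256 is an arbitrary choice for this sketch
    return hashlib.sha256(block).digest()

def read_with_repair(copies: List[bytearray], expected: bytes) -> bytes:
    """Return the first copy matching the stored checksum and
    overwrite any copy that differs from it (hypothetical helper)."""
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        # Corruption is detected, but with no intact copy it cannot be corrected
        raise IOError("all copies fail the checksum")
    for c in copies:
        if bytes(c) != good:
            c[:] = good  # repair the bad copy in place
    return good

data = b"important payload"
stored_sum = checksum(data)                 # kept separately from the data
mirror = [bytearray(data), bytearray(data)]
mirror[0][3] ^= 0x01                        # flip one bit in the first copy
result = read_with_repair(mirror, stored_sum)
```

With only a single copy, the same function can still notice the mismatch but has nothing to repair from, which is the distinction made above.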

The thing that's usually ignored is that in a modern PC almost everything is protected by some form of ECC, and this includes data that goes across the SATA cable and also the data that is on your cheap SATA hard drive. If there's a flipped bit, it will be detected, corrected and reported by the hard drive's ECC (shown in SMART as 'recoverable error'). That's the first layer.
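For anyone who hasn't seen how an error-correcting code can both locate and fix a flipped bit, here is a toy Hamming(7,4) sketch in Python. Real drives use far stronger codes over whole sectors, so treat this purely as an illustration; the function names are made up for the example.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword (bit positions 1..7,
    parity bits at positions 1, 2 and 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit. Returns (data_bits, position),
    where position is the 1-based corrected bit, or 0 if none."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4   # the syndrome is the position of the bad bit
    if pos:
        c[pos - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], pos

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                     # simulate a single flipped bit at position 6
fixed, where = hamming74_decode(word)
```

The returned position is the kind of information a drive can log as a corrected error while still handing back good data.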

In a common server, there's also a hardware RAID controller, which, if configured in a redundancy setup, performs regular patrol scans ("scrubbing") which will find and correct "flipped" data bits if there are any. That's the second layer.
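A rough Python sketch of what a patrol scan does on a single-parity stripe might look like this. It assumes the drive itself flags the unreadable block (modelled as `None`), which is the first layer described above; the names and the stripe layout are invented for the example and do not reflect any particular controller.

```python
from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def scrub_stripe(stripe):
    """stripe: data blocks plus one parity block; None marks a block the
    drive reported as unreadable. Rebuilds it in place and returns its
    index, or None if the stripe was healthy."""
    bad = [i for i, b in enumerate(stripe) if b is None]
    if len(bad) > 1:
        raise IOError("single parity cannot rebuild more than one block")
    if bad:
        i = bad[0]
        stripe[i] = xor_blocks([b for b in stripe if b is not None])
        return i
    if any(xor_blocks(stripe)):
        # Without a drive-reported error, parity alone cannot say which block is wrong
        raise IOError("parity mismatch with no drive-reported error")
    return None

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])
stripe = [d0, None, d2, parity]      # the drive holding d1 reported a bad sector
repaired_index = scrub_stripe(stripe)
```

After the call, `stripe[1]` holds the rebuilt contents of `d1`, which a controller would then write back to the drive.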

In reality, both layers do such a great job of maintaining the integrity of the logical data presented to the file system that file system checksumming isn't needed, as real "bit rot" would already be caught at a lower level.

However, there's a caveat:

In a regular PC, the RAM usually isn't ECC protected. And while soft errors caused by cosmic radiation are usually rare, there are also other factors (like overclocking, voluntary or involuntary) which can cause sporadic memory errors. Considering that all modern OSes use unused RAM as data cache, it's much more likely that reports of "bit rot" have actually been caused by memory soft errors. The fact that there seem to be no (reliable!) reports of "bit rot" occurring in systems with ECC RAM supports that.

Now as to ZFS, it's worth remembering why it exists, and to understand that, it's necessary to go back in history. Back in the old days when there was still a Sun Microsystems building SPARC based servers running Solaris, Sun of course also offered disk storage which was based on large arrays full of SCSI disks, which normally contained some form of hardware RAID controller. Multiple arrays were then combined with an expensive 3rd party product (Veritas Volume Manager, which was also used by other UNIX vendors like HP) to create RAID setups across storage racks.

However, eventually the shared SCSI bus became a bottleneck, and Sun was looking for a standard with more room to grow, so they moved to Fibre Channel disks. However, unlike SCSI, for which there were several standard RAID solutions, there were none for FC. In addition, Sun wanted to get rid of Veritas VM, so they developed their own product: a file system which can also manage large arrays of FC disks. This became ZFS. Since ZFS had to incorporate functionality which was normally provided by a RAID controller, it also had to have some provision to check data integrity (which a RAID controller does). Which is why ZFS has checksumming.

In short, ZFS was developed for one purpose, which is to build large scale RAID setups across spinning rust without the need for a hardware RAID controller. However, ZFS can only ensure data integrity if the underlying hardware itself is ECC protected - and that includes RAM (which ZFS uses a lot of!).
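The RAM dependency can be shown with a very small Python sketch. This is an illustration under assumptions (SHA-256, made-up helper names, a simulated bit flip), not ZFS code: if the buffer is corrupted in memory before the checksum is computed, the checksum faithfully protects the wrong data.

```python
import hashlib

def write_block(buf: bytes):
    # The checksum is computed from whatever is in RAM at write time
    return buf, hashlib.sha256(buf).digest()

def verify_block(data: bytes, digest: bytes) -> bool:
    return hashlib.sha256(data).digest() == digest

original = b"payroll record 0001"
ram = bytearray(original)
ram[0] ^= 0x04                    # simulated soft error in non-ECC RAM

stored, digest = write_block(bytes(ram))
passes = verify_block(stored, digest)   # the check succeeds...
matches = stored == original            # ...but the data is not what was intended
```

Here `passes` comes out true while `matches` comes out false, so every later scrub would report the block as healthy.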

That means ZFS is great as a storage system when running multiple hard drives in a RAID setup without a RAID controller, as long as the system has plenty of ECC RAM. If not, then the checksum function becomes worthless, as data integrity can no longer be guaranteed.

For everything else, ZFS sucks. It's comparatively slow and inflexible (RAIDZ expansion is still in alpha state). If there's a hardware RAID controller or the system uses SSDs instead of spinning rust then ZFS is mostly inferior to pretty much any other modern file system out there.
« Last Edit: October 06, 2020, 12:40:59 pm by Wuerstchenhund »
 
The following users thanked this post: nugglix, bd139

