Author Topic: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors  (Read 6968 times)


Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Imagine you have two HDDs in Raid1 (mirroring), formatted as one of { ext3, ext4, xfs }. You copy a file, but one of the two disks silently has several IO errors.

What will happen?

Well, { ext3, ext4, xfs, ... } have no checksums on data, only on metadata, so ... if one of the two disks randomly fails during a copy (fail() = { data=IO_read; cache_data=data; write(data+random); }), the result will be silently corrupted.

Except:
- smartd (smartmontools) will log IO errors
- dmesg will output IO errors
- but "/bin/cp" will return without any error.

If you copied several files, e.g. with "cp -r", you don't know which file got corrupted during the copy!

Worse still, as long as the data remains in the kernel's page cache, everything seems fine: syncing just forces pending writes, it doesn't clear the kernel-side cache.

Moral of the story: as soon as you unmount and remount the disk, you have a good chance of finding a corrupt file (the copy is corrupt, the original is not, and the two md5sums do not match), both on the healthy disk and on the disk that showed IO errors  :o :o :o
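The mismatch is easy to demonstrate with a plain shell check. This is a minimal sketch, assuming GNU coreutils; the paths are hypothetical, and the drop_caches step needs root (without it the re-read may still be served from the page cache, making the check weaker):

```shell
# Copy a file, flush it to the device, then compare checksums of source
# and copy. On a healthy system the two hashes match; on a disk that
# silently garbles writes, only this explicit compare notices.
src=/tmp/original.bin
dst=/tmp/copy.bin

head -c 1M /dev/urandom > "$src"
cp "$src" "$dst"
sync                                                        # force pending writes out
{ echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true  # best effort: needs root

a=$(md5sum < "$src" | cut -d' ' -f1)
b=$(md5sum < "$dst" | cut -d' ' -f1)
[ "$a" = "$b" ] && echo "checksums match" || echo "SILENT CORRUPTION: $src vs $dst"
```

Note that "/bin/cp" exits 0 either way; the corruption only shows up in the compare.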

- - -

If you use a filesystem with checksums on both data and metadata, like { ZFS, Btrfs, ... } with at least two pooled disks (mirroring), then the whole filesystem handles this. It's not exactly automatic that you will notice corruption, but there are tools already implemented that allow you to restore the data.

ZFS immediately tells you which file has been corrupted.
Btrfs ... tells you indirectly, through dmesg, so you need to log everything and filter the output.


The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #1 on: November 21, 2023, 04:04:51 pm »
Defect?
In Raid1 with two storage devices, there is only one instance of the application, and the application is not aware of the multiple copies.

the redundancy exists in a layer transparent to the application using the storage:
  • when the application reads, the Raid layer chooses which storage device to read from
  • when a storage device fails, the Raid layer reads from the other one, without the application instance knowing about the failure
 

Online Marco

  • Super Contributor
  • ***
  • Posts: 7032
  • Country: nl
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #2 on: November 21, 2023, 04:12:46 pm »
Where is the corrupted data supposed to come from? An IO error is not interpreted as data by the RAID driver.
« Last Edit: November 21, 2023, 04:14:42 pm by Marco »
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #3 on: November 21, 2023, 04:20:54 pm »
Where is the corrupted data supposed to come from? An IO error is not interpreted as data by the RAID driver.

p(hdd write error)=~1/1M
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #4 on: November 21, 2023, 04:39:49 pm »
what happens:
  • if a write on one of the two disks is incorrect, { ext3, ext4, xfs, ... } doesn't warn you
  • you won't notice until you unmount and remount the disk
  • worse yet, when you go back into Raid1, if it notices that two blocks are different, they are discarded
in practice you also lose the correct copy, on the disk which did not show any errors
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #5 on: November 21, 2023, 04:43:48 pm »
How to check on ZFS?
Code: [Select]
zpool scrub $pool_name
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1414
  • Country: de
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #6 on: November 21, 2023, 06:29:49 pm »
If I/O to one disk of a RAID1 mirror fails, the failed disk is taken out of service and operation continues with the surviving disk. All I/O requests from the file system to the virtual RAID1 device still succeed and there is no data loss or corruption, but redundancy is lost, and a second failure (of the surviving disk) cannot be tolerated until the failed disk has been replaced and resynchronized.

To the file system and application, the RAID device behaves as if it were a single (virtual) disk with higher availability. However, it is important to monitor the health of the RAID and replace a failed disk ASAP. If a disk failure is ignored, then sooner or later the 2nd disk will fail as well, and a (non-recoverable) double failure will occur.

There is also no redundancy while the mirror is being resynchronized (IOW, the source disk must not fail during the resynchronization). Note that resynchronization occurs not only after replacing a failed disk, but also when a RAID device w/o failed disks is reactivated after it was not shut down cleanly (e.g. after a system crash). EDIT: However, dirty region logging can speed-up the resynchronization after a system crash significantly, reducing the time window where the mirror runs w/o redundancy.
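On Linux md specifically, the health checks and scrubs described above are driven through sysfs (see man 4 md). A minimal monitoring sketch, assuming nothing about which arrays exist; writing sync_action needs root, hence the best-effort guards:

```shell
# Walk every md array exposed in sysfs, request a background scrub
# ("check"), and record each array's degraded flag and mismatch count.
report=/tmp/md-health.txt
: > "$report"
for md in /sys/block/md*/md; do
  [ -d "$md" ] || continue
  { echo check > "$md/sync_action"; } 2>/dev/null || true   # start a scrub (root only)
  echo "$md degraded=$(cat "$md/degraded" 2>/dev/null)" \
       "mismatch_cnt=$(cat "$md/mismatch_cnt" 2>/dev/null)" >> "$report"
done
echo "checked $(wc -l < "$report") md array(s)"
```

A nonzero mismatch_cnt after a check run is exactly the "both copies differ" situation discussed in this thread.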
« Last Edit: November 21, 2023, 06:44:49 pm by gf »
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7435
  • Country: pl
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #7 on: November 21, 2023, 06:50:18 pm »
Silent data corruption is always problematic, nothing to do with RAID1.

That being said, I'm not aware of any evidence that it occurs like you describe, with the disk receiving correct data and writing garbage to the platter. Platter contents are ECC protected and any irreparable corruption downstream of the HDD controller chip should be caught by it, resulting in properly reported I/O error and retry from another disk in a RAID environment. What could theoretically fail is the chip or its cache memory, if ECC isn't used there, or something along the way from the CPU to the disk (but note that SATA links have CRC and I believe PCIe has some error detection too).

What certainly does fail silently in the real world is garbage USB-SATA bridges. After one too many "adventures", I recently started transitioning all my USB disks to btrfs.
« Last Edit: November 21, 2023, 06:55:19 pm by magic »
 
The following users thanked this post: Ed.Kloonk

Offline gf

  • Super Contributor
  • ***
  • Posts: 1414
  • Country: de
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #8 on: November 21, 2023, 06:59:00 pm »
Silent data corruption is always problematic, even without RAID1.

Of course.
And RAID still relies on the disk's and the bus' ECC to detect uncorrectable errors.
It just allows you to continue working with the other disk if such an error is detected.
Zfs and btrfs add an additional checksum layer.

Edit: Bit errors can also happen in main memory, and if they are not detected and you write garbled data to the disk, then the data on the disk are also corrupted.
« Last Edit: November 21, 2023, 07:02:25 pm by gf »
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #9 on: November 21, 2023, 07:01:17 pm »
If I/O to one disk of a RAID1 mirror fails, the failed disk is taken out of service

Only on severe failure; otherwise, since there is no write-readback, the error happens silently.
I verified this in person!

 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #10 on: November 21, 2023, 07:06:55 pm »
I'm not aware of any evidence that it occurs like you describe

you can easily test this by modeling the behavior of two virtual disks. For one disk you model perfect behavior; for the other you assign a probability of failure every p(write_error) operations. Then put them in Raid1, format as ext4, and look at what happens when you try to copy files.
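The core of the failure model can even be sketched with a file-backed "disk" and no special hardware. The paths and the corruption offset below are arbitrary; the point is that the write path reports success, and only an explicit byte-for-byte compare sees the damage:

```shell
# Model a silent write error: copy a file, then corrupt one byte of the
# copy (incremented mod 256, so it is guaranteed to differ), the way a
# bad write would. cp already returned success; only cmp notices.
src=/tmp/model-src.bin
bad=/tmp/model-dst.bin
off=12345

head -c 64K /dev/urandom > "$src"
cp "$src" "$bad"
b=$(dd if="$bad" bs=1 skip=$off count=1 status=none | od -An -tu1 | tr -d ' ')
printf "$(printf '\\%03o' $(( (b + 1) % 256 )))" \
    | dd of="$bad" bs=1 seek=$off count=1 conv=notrunc status=none

cmp -s "$src" "$bad" && echo "copies match" || echo "silent corruption: only the compare noticed"
```

A fuller model (nbd backend, or dm-flakey-style error injection) just does this at the block layer instead of on a plain file.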

 

Online magic

  • Super Contributor
  • ***
  • Posts: 7435
  • Country: pl
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #11 on: November 21, 2023, 07:12:20 pm »
I question your proposed mechanism of silent data corruption, not the obvious fact that Linux RAID1 doesn't attempt to detect it.

You must have some shitty hardware if you see 1ppm silent corruption rate, and it's likely not the disk.
« Last Edit: November 21, 2023, 07:14:14 pm by magic »
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #12 on: November 21, 2023, 07:22:47 pm »
Silent data corruption is always problematic, nothing to do with RAID1.

The point is that there are ways to notice, while it is happening(1), that the copy of a file is not the same as the original!

{ ZFS, Btrfs, ... } offer a way to notice this;
{ xfs, ext3, ext4, ... } do not, and you have to develop your own solutions.

For my NAS I wrote a special version of "cp" that verifies that the copy has the same md5 as the source. It doesn't help to restore anything (I use backups for that), but it immediately tells you that a file has been corrupted, which is better than nothing!


My version of "cp" does two things:
  • bypasses the kernel disk cache for reading after writing, so it actually reads data blocks from the disk, and not from RAM
  • reads n blocks { 8 KByte, 16 KByte, ... } from the source, writes and re-reads them, and iteratively verifies, block after block, that the source and destination have the same checksum
It is much slower than "cp", about 3x slower, but if the checksums do not match it returns an error, which can be logged.
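As a rough shell approximation of such a tool (the function name vcp is made up; GNU coreutils assumed, and dd's iflag=nocache is only an advisory cache drop, weaker than the true cache bypass described above):

```shell
# Verifying copy: copy, flush, then checksum both sides, asking the
# kernel (advisorily) not to serve the destination from the page cache.
vcp() {
  src=$1 dst=$2
  cp -- "$src" "$dst" || return 1
  sync "$dst"                                         # flush the copy to the device
  s=$(md5sum < "$src" | cut -d' ' -f1)
  d=$(dd if="$dst" bs=1M iflag=nocache status=none | md5sum | cut -d' ' -f1)
  if [ "$s" != "$d" ]; then
    echo "vcp: checksum mismatch: $src -> $dst" >&2   # loggable error
    return 1
  fi
}
```

Usage is the same as cp for the single-file case: vcp /path/src /path/dst, with a nonzero exit status on mismatch.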


(1) the file may also be written correctly and become corrupted later due to deterioration of the storage device; this case is not covered
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #13 on: November 21, 2023, 07:29:45 pm »
I question your proposed mechanism of silent data corruption, not the obvious fact that Linux RAID1 doesn't attempt to detect it.

You must have some shitty hardware if you see 1ppm silent corruption rate, and it's likely not the disk.

yes, using "nbd" + a low-level backend written in C to model the behavior of a disk that has write errors can be... a stretch.

Surely the real hw behaves better, and above all SMART reports the symptoms and takes them into account, which I cannot model with a virtual disk.

But hey, at least this is how Raid1 behaves in the worst case (= shitty hdds)  :D
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1414
  • Country: de
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #14 on: November 21, 2023, 08:02:09 pm »
If I/O to one disk of a RAID1 mirror fails, the failed disk is taken out of service

Only on severe failure, otherwise, since there is no write-readback the error silently happens.
I verified this in person!

As magic said, the same also happens with a plain disk if the disk does not recognize the error and does not report a write error back to the driver.

If the badly written block is detected by the disk when you try to read it back later, then the RAID driver has at least the chance to retry the read from the other disk (with a single plain disk, you don't have this chance).

However, RAID1 is not expected to detect data corruption which is not detected and reported by the disk itself or the disk driver.  Who promised it would? The data associated with the read and write I/O request is simply passed through transparently as long as the underlying disk driver does not abort any I/O request with error.
« Last Edit: November 21, 2023, 08:44:14 pm by gf »
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #15 on: November 21, 2023, 09:06:35 pm »
How to check on Btrfs?
Code: [Select]
# journalctl -k | grep "checksum error" | cut -d: -f6 | cut -d')' -f1 | sort | uniq
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7435
  • Country: pl
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #16 on: November 21, 2023, 09:08:14 pm »
I'm sure you can get total corruption event count from btrfs somehow. It is printed when the FS is mounted.

To find all currently bad files run btrfs scrub.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #17 on: November 21, 2023, 09:10:02 pm »
Linus Tech Tips fails at using ZFS properly, and loses data; read here  :o :o :o
 

Offline Foxxz

  • Regular Contributor
  • *
  • Posts: 129
  • Country: us
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #18 on: November 22, 2023, 04:34:41 am »
It's a different layer's responsibility to ensure that the data the filesystem is getting is correct. Some filesystems have just added additional redundancy. Hardware RAID has been a thing forever and has worked pretty dang well.

You could make the same argument about ECC memory vs non-ECC memory. The OS isn't doing any additional checks to ensure RAM contents are correct.
 

Offline David Hess

  • Super Contributor
  • ***
  • Posts: 17407
  • Country: us
  • DavidH
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #19 on: November 22, 2023, 05:46:01 am »
On my Areca RAID controllers, if the disk reports an error, then the data is read from the redundant disk, and steps are taken to make the disks consistent again.  There is also another mode where the data is read from all disks and checked for consistency before being forwarded to the driver, which would at least catch disks silently corrupting data.

My preference would be that the filesystem also at least detect data and metadata corruption.
« Last Edit: November 22, 2023, 05:50:42 am by David Hess »
 

Online Marco

  • Super Contributor
  • ***
  • Posts: 7032
  • Country: nl
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #20 on: November 22, 2023, 06:30:48 am »
Where is the corrupted data supposed to come from? An IO error is not interpreted as data by the RAID driver.

p(hdd write error)=~1/1M
If it's during the physical write the on disk CRC/ECC will be faulty, so it will return a read error when read. It can get corrupted during processing by some cosmic ray fluke, then again so can the data in the CPU before it computes a filesystem CRC.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #21 on: November 22, 2023, 08:17:13 am »
If it's during the physical write the on disk CRC/ECC will be faulty, so it will return a read error when read. It can get corrupted during processing by some cosmic ray fluke, then again so can the data in the CPU before it computes a filesystem CRC.

The ECC mechanism used with HDDs has a limited fault-tolerance, hence a probability of silent error.
It is on this that I modeled the behavior of the virtual disks on which I carried out the analysis.

In other words: suppose that a disk sucks so badly, or is so unlucky, that this probability is amplified(1):
  • How does the filesystem react and behave?
  • Do the kernel and the userspace tools { cp, ... } at least tell me that something went wrong?
These are the points of experimental investigation!

(1) this conjecture makes sense if we note that this probability can be modeled with roughly the same reasoning used when choosing cryptographic hash functions for strong collision resistance: the larger the disk, the more I/Os you are likely to make, and the more I/Os you make, the more you amplify the probability that a nefarious event (a silent I/O error) happens.
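The amplification in the footnote is just P = 1 - (1 - p)^N, the probability of at least one silent error in N writes. A quick sketch using the p ~ 1e-6 figure mentioned earlier in the thread (the numbers are illustrative, not a claim about any real disk):

```shell
# Probability of at least one silent error in N writes, p = 1e-6 per write.
awk 'BEGIN {
  p = 1e-6
  for (N = 1e6; N <= 1e8; N *= 10)
    printf "N=%.0e  P(at least one silent error)=%.4f\n", N, 1 - (1 - p)^N
}'
```

Even a tiny per-write probability becomes near-certainty once the I/O count gets into the tens of millions, which is the whole point of the conjecture.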
 

Online Marco

  • Super Contributor
  • ***
  • Posts: 7032
  • Country: nl
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #22 on: November 22, 2023, 08:38:45 am »
The ECC mechanism used with HDDs has a limited fault-tolerance, hence a probability of silent error.
It has a checksum too.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #23 on: November 22, 2023, 08:41:54 am »
There is also another mode where the data is read from all disks and checked for consistency before being forwarded to the driver, which would at least catch disks silently corrupting data.

yup, in this case hw Raid1 with proprietary extensions for enterprise services is better than soft-Raid1, precisely because the controller CPU on the PCI board checks for consistency before the data is forwarded to the kernel, and in case of mismatch propagates an I/O error.

Unfortunately this approach is not portable to non-x86 architectures for commercial reasons, and the implementations I have seen are strongly PCI-bus-master oriented, so I cannot consider it.

I'm only using soft-Raid1 implementations, even with very dumb cards that offer only two SATA or SCSI channels, leaving all the work to the kernel: this is portable, but slower and weaker.
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1414
  • Country: de
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #24 on: November 22, 2023, 09:43:43 am »
There is also another mode where the data is read from all disks and checked for consistency before being forwarded to the driver, which would at least catch disks silently corrupting data.
yup, in this case hw Raid1 with proprietary extensions for enterprise services is better than soft-Raid1, precisely because the controller CPU on the PCI board checks for consistency before the data is forwarded to the kernel, and in case of mismatch propagates an I/O error.

"Better"? Not in terms of performance, because reading both copies and comparing them prevents striping of concurrent read requests across disks, and the latter can almost double throughput for some read-intensive workloads.

Quote
Unfortunately this approach is not portable to non-x86 architectures for commercial reasons, and the implementations I have seen are strongly PCI-bus master, so I cannot consider it.

I'm only using softRaid1 implementations, even with very dummy-cards that offer only two sATA or SCSI channels, leaving all the work to the kernel: this is portable, but slower and weaker.

Why do you think that reading and comparing both copies could not be implemented in software RAID, too? Of course there is no free lunch -- it costs CPU power and memory bandwidth.

[ If you use ZFS, you don't need a separate software RAID anyway, since ZFS has RAID and volume manager functionality built-in. I do not know if the same applies to BTRFS -- I am not familiar with it. ]
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #25 on: November 22, 2023, 10:09:40 am »
"Better"? Not in terms of performance, because reading both copies and comparing them prevents striping of concurrent read requests across disks, and the latter can almost double throughput for some read-intensive workloads.

we're talking about "reliability".

Why do you think that reading and comparing both copies could not be implemented in software RAID, too? Of course there is not free lunch -- it costs CPU power and memory bandwidth.

Well, possible; it's just that no one to date has implemented it, probably because softRAID already costs CPU power and memory bandwidth, and usually people - as you just pointed out yourself - care more about taking less time to copy things than about actually being sure that things were copied correctly.

 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 7156
  • Country: fi
    • My home page and email address
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #26 on: November 22, 2023, 03:49:16 pm »
we're talking about "reliability".
That should probably exclude RAID0 and RAID1, then.  They provide zero protection against silent corruption, which means RAID1 actually doubles the probability of silent corruption.

Traditionally in Linux, with all types of software-RAID, smartmontools are used to track the actual storage device error logs and statistics.  Background scrubbing –– necessary to detect data degradation –– is controlled by the system administrator (by writing check or repair to the md/sync_action sysfs pseudofile; see man 4 md Scrubbing and mismatches chapter).

I'm sure you can see how it is quite logical for this kind of stack (per Unix philosophy!) that the topmost filesystem (ext2/3/4) does not have any file integrity checks, either: each layer relies on the function of the lower layer, and each has tools and policies one can apply to detect problems.

It is good that we do not all agree, though, and that there are competing filesystems and approaches that do include file integrity checks.
Knowing the real-world probabilities, I'm not really interested in having those on my own workstations (even though I do like to use RAID-0 and RAID-1) because I prefer the higher throughput over integrity checks there; but might choose differently for certain servers and appliances.



Do note it is not a matter of not caring about integrity: it is about having other means to achieve sufficient practical probabilities.
(You [plural!] may have noticed that I myself often rail against programmers who assume syscalls succeed, and ignore "rare" errors because they feel they are so rare they don't need to be cared about.  I do want my tools to report any errors when those are reported by the kernel or hardware, but I still do not expect them to be perfect.  It is having the information about an error occurring available and ignoring it that really bothers me; and not that the tool is imperfect and sometimes may garble my precious data.)

Currently, with so few Intel/AMD desktop and laptop processors supporting ECC RAM, I do believe silent RAM bit-flips may occur more often in practice, for example during copying of files.  As the data in RAM is then not protected by any checksum or error correction code, nothing can detect the corruption either, and e.g. on RAID-1, both copies will inherit the changed data without any error.  (Very few file copy utilities actually read back the copied file, to verify the original and new data match.)

This is one reason why I like to use tar archives for backing up my important files: it adds the checksum verification.  (Yes, you could achieve similar by using e.g. ZFS or some other filesystem for the backups that has file integrity checks.  I find the tar archives suitable for my needs, that's all.)

If I ever suspect any kind of silent file corruption –– be it from hardware or from software (like kernel driver) ––, the following bash-find stanza can be very useful:

    find root -type f -printf '%s %T+ %p\0' | while read -d "" size modified path ; do csum="$(sha256sum -b "$path")" ; csum="${csum%% *}" ; printf '%s %12s %s %s\n' "$modified" "$size" "$csum" "$path" ; done

which generates a listing of all files under root with their last modification time (in YYYY-MM-DD+hh:mm:ss.nnnnnnnnn in local time, sorting correctly), size in bytes, SHA256 checksum, and full path.  (If there are multiple file owners/groups or access modes, they're easily added to the find print pattern, the while read ... list, and the final printf output.)
You could also use a SHA512 checksum (both are part of coreutils, and thus available on all Linux distributions), but the output then becomes too wide for my liking.  If you have files with strange file names, I recommend you reset the locale (export LANG=C LC_ALL=C) first, and use NUL (\0) instead of newline for the record separator.  You can then use tr '\0' '\n' < filename | less to view it.

Redirecting or tee'ing that to a file lets one easily verify the files later, using either diff -Nabur or similar, or a simple awk scriptlet (with size, checksum, and modification timestamps in separate arrays keyed by the path; reporting only if conflicting information is read).  If you use NUL instead of a newline, you can start with the following gawk/mawk snippet:

    awk -v RS='\0' '{ modified=$1; size=$2; csum=$3; path=$0; sub(/^[^ ]+ +[^ ]+ +[^ ]+ */, "", path);
                      if (path in fdate) {
                          if (fdate[path] == modified && fcsum[path] != csum) {
                              # report error
                          }
                      } else {
                          fdate[path] = modified;
                          fcsum[path] = csum;
                          fsize[path] = size;
                      }
                    }' files...

which correctly extracts the path part even when it contains spaces in it.
The combination of scanning and comparison can then easily be wrapped in a script that one triggers from crontab or similar, running only when the machine and storage devices are idle (i.e., nice -n +20 ionice -c 3 script...).

DiTBho has a good point in that this is a completely different approach to verifying and reporting the error on a per-file basis, as soon as it is noticed.  One reason I personally prefer this opposite/offline method, is that many current programs don't handle those error reports well, aborting and/or producing garbage: having a logically separate scrubber, or method of verification, allows those programs to still work, but also tell me whenever corruption or problems have occurred.  It is not optimal, but given the current tools at hand, no optimal approach exists.

On server-class hardware, I do prefer to use proper hardware RAID (RAID-6 for example), and ECC RAM, so that the hardware layer does the monitoring for me.  (My own data – hobby projects and such – don't currently warrant the cost, that's all.)  There, too, it is important to ensure the hardware reports are monitored and any issues are quickly reviewed by a human; just having the hardware do monitoring is not sufficient.
 
The following users thanked this post: SiliconWizard, DiTBho

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #27 on: November 22, 2023, 04:22:07 pm »
This is one reason why I like to use tar archives for backing up my important files: it adds the checksum verification.

This is the old but good trick I also use with { DDS4-tape, DVD-RAMs, iRev } backups.
Formatted with the simplest filesystem possible, files added to tar archives, without compression :D
 
The following users thanked this post: Nominal Animal

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #28 on: November 22, 2023, 06:58:34 pm »


Boom, makes good points  :o :o :o
 

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 6115
  • Country: au
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #29 on: November 26, 2023, 01:45:13 am »
Linus Tech Tips fails at using ZFS properly, and loses data; read here  :o :o :o

One of the comments in that thread is spot on: "Linus Sebastian, the epitome of knowing just enough to be dangerous".

Linus Tech Tips isn't a technical channel, it's merely there for entertainment.
 
The following users thanked this post: DiTBho

Online Marco

  • Super Contributor
  • ***
  • Posts: 7032
  • Country: nl
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #30 on: November 26, 2023, 03:56:33 pm »
Boom, makes good points  :o :o :o
Software RAID is pretty dead too at all but the smallest scale.

Everything big is cluster, redundant array of file servers. Redundancy at the lowest level is just needless cost and complexity at that point.
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15752
  • Country: fr
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #31 on: November 26, 2023, 09:50:00 pm »
Boom, makes good points  :o :o :o
Software RAID is pretty dead too at all but the smallest scale.

Everything big is cluster, redundant array of file servers. Redundancy at the lowest level is just needless cost and complexity at that point.

Really?
 

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 6115
  • Country: au
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #32 on: November 26, 2023, 11:39:57 pm »
Boom, makes good points  :o :o :o
Software RAID is pretty dead too at all but the smallest scale.

Everything big is cluster, redundant array of file servers. Redundancy at the lowest level is just needless cost and complexity at that point.

I'd actually disagree with that. I think software "RAID" is more relevant now than it has been in the past. With technologies like ZFS, they have benefits over your more traditional hardware RAID setups.

Software RAID got a pretty bad reputation in the consumer market, when Windows Dynamic Disks and half-baked software RAID built-in to consumer boards started to become a thing. Back then, proper hardware RAIDs were superior, but those were mostly reserved for the enterprise market (and for those who knew what they were doing).
« Last Edit: November 27, 2023, 02:07:06 am by Halcyon »
 
The following users thanked this post: DiTBho

Offline David Hess

  • Super Contributor
  • ***
  • Posts: 17407
  • Country: us
  • DavidH
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #33 on: November 27, 2023, 12:06:48 am »
Software RAID got a pretty bad reputation in the consumer market when Windows Dynamic Disks and the half-baked software RAID built into consumer boards started to become a thing. Back then, proper hardware RAIDs were superior, but those were mostly reserved for the enterprise market (and for those who knew what they were doing).

I am actually pretty impressed with the Windows replacement for Dynamic Disks.  Storage Spaces operates a lot like ZFS, except that disks can be added and removed as needed.  Unfortunately, Microsoft removed the ability to create volumes with their new ZFS-style file system (ReFS) from most editions, leaving only NTFS.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 7156
  • Country: fi
    • My home page and email address
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #34 on: November 27, 2023, 02:13:33 am »
Software developers working with large code bases on Linux, and others working with large datasets on Linux, really should consider using software RAID-0 and/or RAID-1.

RAID-0 (striping) increases large-file copy bandwidth, whereas RAID-1 reduces small-block random read latencies.  Until you've tried it, it is hard to appreciate how large a difference it can make in practice.  In Linux, it is completely okay to mix the two on a per-partition basis.  I recommend you use LVM; see man 7 lvmraid for details; you then get all the LVM bonuses like snapshots and dynamic resizing for "free".
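For what it's worth, the lvmraid route is only a couple of commands. A minimal sketch, assuming a volume group named vg0 spanning two physical disks (the names and sizes here are made up, adjust to taste):

```shell
# RAID-1 mirror for source trees (better small-block random reads):
lvcreate --type raid1 -m 1 -L 200G -n src vg0

# RAID-0 stripe across both disks for bulk scratch data
# (better sequential throughput, no redundancy):
lvcreate --type raid0 -i 2 -L 500G -n scratch vg0
```

Afterwards, lvs -a -o name,segtype,devices vg0 shows how each logical volume maps onto the underlying disks.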

Neither RAID-0 nor RAID-1 provides any protection against silent data corruption.  RAID-1 only protects against the complete failure of one drive (which is basically the protection I want; the rest depends on a good daily/weekly/monthly backup policy), so combined with the increased small-block random read rates, it's pretty nice for a Linux development machine.  Just make sure you use HDDs or SSDs of the same make, model, and firmware.
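Since cp reports success either way, a cheap habit on these filesystems is to read the copy back and compare it yourself. A self-contained sketch (the /tmp paths are made up for the demo):

```shell
src=/tmp/copytest_src.bin
dst=/tmp/copytest_dst.bin
head -c 1048576 /dev/urandom > "$src"   # 1 MiB of sample payload
cp "$src" "$dst"
# cmp reads both files back byte by byte; cp alone never does this
if cmp -s "$src" "$dst"; then echo "copy verified"; else echo "MISMATCH"; fi
```

Caveat, per this very thread: if both reads are satisfied from the page cache, on-disk corruption is still masked, so for a real check you may need to drop caches or remount first.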
« Last Edit: November 27, 2023, 02:15:38 am by Nominal Animal »
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7435
  • Country: pl
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #35 on: November 27, 2023, 07:52:31 am »
RAID-0 (striping) increases large-file copy bandwidth, whereas RAID-1 reduces small-block random read latencies.
Do you still see a meaningful latency difference on SSD, though?
(Sequential throughput - yes, obviously.)


Software RAID got a pretty bad reputation in the consumer market when Windows Dynamic Disks and the half-baked software RAID built into consumer boards started to become a thing. Back then, proper hardware RAIDs were superior, but those were mostly reserved for the enterprise market (and for those who knew what they were doing).
You sound like a Windows user ;)

Software RAID has a well-deserved bad reputation, always has and always will, because it sucks. RAID 0/1 are OK if you accept their compromises (lack of redundancy / large storage overhead). But RAID 5/6 are difficult to get right in the presence of power failures (and, with software RAID, also OS crashes ::)) during partial stripe writes. And it loads the CPU, though that is perhaps less of a problem with modern hardware (CPUs far faster than spinning rust).
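The partial-stripe problem is easier to see once you recall how RAID-5 parity works: the parity block is just the XOR of the data blocks in a stripe, so a data update and its matching parity update must land together. A toy illustration, with small integers standing in for blocks:

```shell
d0=37; d1=122; d2=8               # three "data blocks" in one stripe
parity=$(( d0 ^ d1 ^ d2 ))        # the parity block for that stripe

# disk holding d1 dies; rebuild it from the survivors plus parity:
d1_rebuilt=$(( d0 ^ d2 ^ parity ))
echo "rebuilt d1 = $d1_rebuilt"   # prints 122

# the "write hole": if a new d1 reaches disk but power fails before the
# matching new parity does, this same reconstruction silently yields garbage.
```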

It supposedly works better in ZFS, because ZFS is aware of what it is doing and which data belongs where, and it is allowed to do COW. (AFAIK the better software RAIDs also effectively did COW, by means of journaling to an extra disk, but after committing they had to flush the data back to main storage.)



It is true that large storage solutions these days are moving redundancy higher up the stack, to maintain availability when whole servers (if not data centers) go down. If you have the data replicated (or RAID5-striped) across multiple machines, maybe across continents, you no longer need traditional RAID to ensure smooth operation when a disk fails somewhere. The machine simply reports the data as gone forever, you fetch it from elsewhere, and rebuild the missing replica (also elsewhere). That being said, striped RAID on storage nodes may still offer a throughput advantage without the hassle of downloading separate fragments from many different machines.
« Last Edit: November 27, 2023, 07:58:39 am by magic »
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: gb
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #36 on: November 27, 2023, 11:17:07 am »
Linus Tech Tips fails at using ZFS properly, and loses data; read here  :o :o :o

One of the comments in that thread is spot on: "Linus Sebastian, the epitome of knowing just enough to be dangerous".

Linus Tech Tips isn't a technical channel, it's merely there for entertainment.

Yup, it's there for entertainment; I just have to admit that I made the same mistake with ZFS, because I thought it would immediately re-read what it wrote.

Not by default! Instead, the user must periodically force a re-read (a scrub)!
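For reference, that forced re-read is a scrub. Assuming a pool named tank (substitute your own pool name), it looks like this:

```shell
# Walk the whole pool and verify every block against its checksum:
zpool scrub tank

# Show progress, error counts, and the names of any files found corrupted:
zpool status -v tank
```

Many distributions ship a periodic scrub timer or cron job out of the box; if yours does not, scheduling one weekly or monthly is the usual advice.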

Compared to LTT, I was only working with 20GB of synthetic data, that is, data created specifically in the laboratory just to test ZFS (kernel-integrated). LTT, on the other hand, lost YouTube videos and projects because their snapshots also became corrupted.

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 6115
  • Country: au
Re: Raid1 + {ext3, ext4, xfs} is problematic if a disk silently has IO errors
« Reply #37 on: November 27, 2023, 11:53:17 pm »
You sound like a Windows user ;)

I used to be... Then Windows 8/10/11 became a thing. These days I use macOS as my daily driver and Linux for everything else.
 

