we're talking about "reliability".
That should probably exclude RAID0 and RAID1, then. They provide zero protection against silent corruption; in fact, since RAID1 keeps two copies and serves reads from either mirror without comparing them, it roughly doubles the probability of encountering silent corruption.
Traditionally in Linux, with all types of software RAID, smartmontools is used to track the storage devices' own error logs and statistics. Background scrubbing, which is necessary to detect data degradation, is controlled by the system administrator, by writing check or repair to the md/sync_action sysfs pseudo-file; see the "Scrubbing and mismatches" section in man 4 md.
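As a minimal sketch (assuming the array is md0, and run as root):

echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt

The first command starts a read-only scrub, the second shows its progress, and the third reports the number of mismatched sectors found once the scrub completes.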
I'm sure you can see how it is quite logical for this kind of stack (per the Unix philosophy!) that the topmost filesystem (ext2/3/4) does not do any file integrity checking either: each layer relies on the layer below it functioning correctly, and each layer has its own tools and policies for detecting problems.
It is good that we do not all agree, though, and that there are competing filesystems and approaches that do include file integrity checks.
Knowing the real-world probabilities, I'm not really interested in having those on my own workstations (even though I do like to use RAID-0 and RAID-1), because there I prefer the higher throughput over integrity checks; but I might choose differently for certain servers and appliances.
Do note it is not a matter of not caring about integrity: it is about having other means to achieve sufficient practical probabilities.
(You [plural!] may have noticed that I often rail against programmers who assume syscalls succeed, and who ignore "rare" errors because they feel those are too rare to care about. I do want my tools to report any errors the kernel or hardware reports, but I still do not expect the tools to be perfect. What really bothers me is having the information about an error available and ignoring it; not that a tool is imperfect and may sometimes garble my precious data.)
Currently, with so few Intel/AMD desktop and laptop processors supporting ECC RAM, I do believe silent RAM bit-flips may in practice occur more often than storage-level corruption, for example during file copies. Since the data in RAM is then not protected by any checksum or error-correcting code, nothing can detect the corruption either, and e.g. on RAID-1, both copies will inherit the changed data without any error. (Very few file copy utilities actually read back the copied file to verify that the original and the new data match.)
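As a minimal sketch of what such read-back verification could look like (the file names here are just placeholders):

cp -a precious.dat /backup/precious.dat
cmp precious.dat /backup/precious.dat && echo "copy verified"

Do note that cmp may well be served from the page cache rather than the physical media, so this catches in-flight corruption but not necessarily a bad write that only exists on the platters.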
This is one reason why I like to use tar archives for backing up my important files: it adds a layer of checksum verification (tar checksums each file header, and a compression layer like gzip or xz adds a CRC over the entire stream). (Yes, you could achieve something similar by using e.g. ZFS or another filesystem with file integrity checks for the backups. I find tar archives suitable for my needs, that's all.)
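As a sketch of the kind of verification I mean (the archive name is just a placeholder; the options are GNU tar's):

gzip -t backup.tar.gz && echo "stream CRC OK"
tar --compare --gzip --file=backup.tar.gz

The first command verifies the gzip CRC-32 over the entire compressed stream; the second compares the archive members against the current files on disk and reports any differences.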
If I ever suspect any kind of silent file corruption, be it from hardware or from software (like a kernel driver), the following bash-find stanza can be very useful:
find root -type f -printf '%s %T+ %p\0' | while read -r -d "" size modified path ; do csum="$(sha256sum -b "$path")" ; csum="${csum%% *}" ; printf '%s %12s %s %s\n' "$modified" "$size" "$csum" "$path" ; done
which generates a listing of all files under root, with their last modification time (as YYYY-MM-DD+hh:mm:ss.nnnnnnnnn in local time, which sorts correctly), size in bytes, SHA-256 checksum, and full path. (If there are multiple file owners/groups or access modes, they're easily added to the find print pattern, the while read ... list, and the final printf output, as shown below.)
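For example, a sketch of adding the owner, group, and octal access mode (using GNU find's %u, %g, and %#m directives):

find root -type f -printf '%s %T+ %u %g %#m %p\0' | while read -r -d "" size modified user group mode path ; do csum="$(sha256sum -b "$path")" ; csum="${csum%% *}" ; printf '%s %12s %s %s %s %s %s\n' "$modified" "$size" "$user" "$group" "$mode" "$csum" "$path" ; done

(This assumes user and group names contain no spaces, which holds on practically all systems.)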
You could also use a SHA-512 checksum (both sha256sum and sha512sum are part of GNU coreutils, and thus available on all Linux distributions), but the output then becomes too wide for my liking. If you have files with strange file names, I recommend you reset the locale first (export LANG=C LC_ALL=C), and use NUL (\0) instead of newline as the record separator. You can then use tr '\0' '\n' < filename | less to view the listing.
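In the stanza above, only the final printf needs to change for NUL-separated output records, here redirected to a hypothetical files.list:

find root -type f -printf '%s %T+ %p\0' | while read -r -d "" size modified path ; do csum="$(sha256sum -b "$path")" ; csum="${csum%% *}" ; printf '%s %12s %s %s\0' "$modified" "$size" "$csum" "$path" ; done > files.list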
Redirecting or tee'ing that to a file lets one easily verify the files later, using either diff -Nabur or similar, or a simple awk scriptlet (with size, checksum, and modification timestamp kept in separate arrays keyed by the path, reporting only when conflicting information is read). If you use NUL instead of newline as the record separator, you can start with the following gawk/mawk snippet:
awk -v RS='\0' '{ modified=$1; size=$2; csum=$3; path=$0; sub(/^[^ ]+ +[^ ]+ +[^ ]+ */, "", path); if (path in fdate) { if (fdate[path] == modified && fcsum[path] != csum) printf "CORRUPTED? %s\n", path } else { fdate[path] = modified; fcsum[path] = csum; fsize[path] = size } }' files...
which correctly extracts the path part even when it contains spaces.
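To actually compare two listings, one can save that program into a file (say, a hypothetical compare.awk) and feed it the old listing first, so that its values populate the arrays, and the new listing second, to be checked against them:

awk -v RS='\0' -f compare.awk old.list new.list

Note the logic: a changed checksum with an unchanged modification time is exactly the signature of silent corruption, because legitimate edits also update the mtime.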
The combination of scanning and comparison can then easily be wrapped in a script that one triggers from crontab or similar, running only when the machine and storage devices are otherwise idle (e.g. nice -n 19 ionice -c 3 script...).
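A hypothetical crontab entry for such a script (the name and schedule are just placeholders):

# Weekly scan on Sunday at 03:15, at lowest CPU and I/O priority:
15 3 * * 0  nice -n 19 ionice -c 3 /usr/local/sbin/verify-files.sh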
DiTBho has a good point in that this is a completely different approach compared to verifying and reporting errors on a per-file basis as soon as they are noticed. One reason I personally prefer this opposite/offline method is that many current programs don't handle those error reports well, aborting and/or producing garbage: having a logically separate scrubber, or method of verification, lets those programs keep working, while still telling me whenever corruption or problems have occurred. It is not optimal, but given the current tools at hand, no optimal approach exists.
On server-class hardware, I do prefer to use proper hardware RAID (RAID-6, for example) and ECC RAM, so that the hardware layer does the monitoring for me. (My own data, hobby projects and such, don't currently warrant the cost, that's all.) There, too, it is important to ensure the hardware reports are monitored and any issues are quickly reviewed by a human; just having the hardware do the monitoring is not sufficient.