Author Topic: Checksums on Linux for file repair


Offline InfravioletTopic starter

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: gb
Checksums on Linux for file repair
« on: May 25, 2024, 07:46:46 pm »
Ok, so one can take checksums of files on Linux by doing:
sha256sum *
in a folder to get the terminal to calculate a sha256 string for each file in the folder.
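
One can also save those sums and re-check files against them later (a small sketch; the SHA256SUMS file name is just a common convention):

sha256sum * > SHA256SUMS    # save one hash line per file
sha256sum -c SHA256SUMS     # later: verify each file against the saved list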

But SHA-256 is designed for security. Its main property is that it is extremely hard to reverse: an attacker has very little hope of modifying a file in such a fashion that it still gives the same checksum after the changes. So, by its very nature, one doesn't have much hope of taking a SHA-256 checksum and a slightly corrupted file and telling, from the checksum of the corrupted file versus the SHA-256 of the file before corruption, what the corruption was and which bits could be flipped to reverse it.

Is there a better checksum type available in the Linux terminal for this sort of thing? That is, a checksum which could be transferred along with big files (hundreds of MB to a few GB), so that if the recalculated checksum at the other end doesn't match, you'd have a better chance, for a small number of corrupted bits, of using the checksum as a hint for where to reverse the corruption.

I know a checksum string roughly the length of a SHA-256 sum isn't going to be great for fixing corruption, but I bet there's a more appropriate terminal tool for this than sha256sum.

One can also consider situations where one is trying to reconstruct a large file from two copies of that file, each with tiny amounts of corruption in different places, plus the checksum which either copy would have had were it not corrupted at all.

Any ideas of what would be a good checksum to use?

I'm not trying to reconstruct a corrupted file here, rather to know what checksums to take, beforehand, for files so I could do that as an element within future projects relating to transferring files.

Thanks
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 28433
  • Country: nl
    • NCT Developments
Re: Checksums on Linux for file repair
« Reply #1 on: May 25, 2024, 07:55:59 pm »
Some of the older checksum algorithms like MD5 and SHA-1 are 'broken' and can, in some cases, be reversed by lookup. Look here for a start: https://en.wikipedia.org/wiki/Rainbow_table
I just don't know whether this is doable for a large file.

Then again, it would be better to use an error correction algorithm and transfer its output along with the file. Then you'll have a chance of correcting errors as well: https://humphryscomputing.com/Notes/Networks/data.error.html
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online radiolistener

  • Super Contributor
  • ***
  • Posts: 4136
  • Country: 00
Re: Checksums on Linux for file repair
« Reply #2 on: May 25, 2024, 08:52:42 pm »
I'm using MD5; it is more lightweight, so it should run faster, but I haven't tested how much faster MD5 is than SHA-256 on an RPi4.

Regarding error correction: checksum algorithms are not intended for error correction.
If you want error correction, you need to encode the file with error-correcting codes, such as Reed–Solomon codes.

As I remember, the RAR archive format supports error-correction data if you enable it, but note that it will increase the archive size, because the redundant information needed for recovery has to be stored.
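
For instance, something like this should work with the rar command line tool (a sketch; the 5% recovery record size and the file names are arbitrary, and the exact -rr syntax may differ between rar versions):

rar a -rr5p archive.rar bigfile.bin    # create an archive with a 5% recovery record
rar r archive.rar                      # attempt repair if the archive is later damaged
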
« Last Edit: May 25, 2024, 09:00:20 pm by radiolistener »
 

Offline shapirus

  • Super Contributor
  • ***
  • Posts: 1725
  • Country: ua
Re: Checksums on Linux for file repair
« Reply #3 on: May 25, 2024, 09:01:17 pm »
Quote from: Infraviolet
Is there a better checksum type available in the Linux terminal for this sort of thing?

Better in terms of?

Quote from: Infraviolet
That is, a checksum which could be transferred along with big files (hundreds of MB to a few GB), so that if the recalculated checksum at the other end doesn't match, you'd have a better chance, for a small number of corrupted bits, of using the checksum as a hint for where to reverse the corruption.

What protocol will be used to transfer the file? Or, rather, what is the application?

Reasonable checksumming for error detection and retransmission is normally implemented in both hardware and software layers when we're talking about generic TCP/IP transmissions. That means there's little chance of receiving a corrupted file, so it's fine (in terms of resource usage) to calculate even a SHA-256 sum of the entire file as a final validation.

If for some reason you need more fine-grained checksumming, then, depending on the application, you can calculate checksums block by block as data is transmitted and received, or block by block for the file stored on disk. In the latter case, I don't know of any ready-made tools, but it is fairly easy to accomplish with some scripting (can be pure shell), dd, and sha256sum; see the sketch below. The smaller the block size, the finer the resolution it will have in telling you the location of the first mismatching block.
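
Something along these lines, for example (a pure-shell sketch, assuming GNU stat and an arbitrarily chosen 1 MiB block size):

#!/bin/sh
# Print "blocknumber sha256" for each 1 MiB block of the file given as $1.
f="$1"
bs=$((1024 * 1024))
size=$(stat -c %s "$f")
blocks=$(( (size + bs - 1) / bs ))
i=0
while [ "$i" -lt "$blocks" ]; do
    # dd extracts exactly one block; sha256sum hashes it; cut keeps the hex digest.
    sum=$(dd if="$f" bs="$bs" skip="$i" count=1 2>/dev/null | sha256sum | cut -d' ' -f1)
    printf '%d %s\n' "$i" "$sum"
    i=$((i + 1))
done

Run it on both ends and diff the two outputs: the first differing line tells you which block to re-request.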
 

Offline janoc

  • Super Contributor
  • ***
  • Posts: 3925
  • Country: de
Re: Checksums on Linux for file repair
« Reply #4 on: May 25, 2024, 09:08:36 pm »
Hashes like MD5 or SHA-256 are not designed for this use. You can't "reverse" the hash and find from it which part of the file has been corrupted - that's not possible by the mathematical definition of a hash; that information is lost.

Rainbow tables won't help for arbitrary files. They work by precomputing all possible hashes for a given size of the hashed data and storing them, so that the original data can simply be looked up when one sees a hash. That's fine for something like finding an encryption key (e.g. 128 bits of data), but not for recovering the content of a file the hash was computed from - the rainbow table would be intractably huge.

The reason stuff like SHA-1 or MD5 is not recommended is not rainbow tables, but that it is possible to produce hash collisions - to create a file that has the same hash as another, different file. Which is bad juju when you are using the hash as part of some cryptography setup to ensure that something hasn't been tampered with. For a basic file integrity check, e.g. when transferring files over the network, they are still fine.

If you want to not only detect the corruption but also correct it (up to some amount of corruption) without retransmission (retransmission often being the simpler way of dealing with the problem), you need to use error correction codes, e.g. some form of Hamming code. See here:

https://en.wikipedia.org/wiki/Error_correction_code

The idea is that you encode the data you are transmitting in some way and include a certain amount of redundant information. When decoding the data, you can then both discover that it has been corrupted and correct a certain number of errors using that redundant information. The disadvantage is that you are trading space/bandwidth for this safety, because you must transmit/store the extra information. As always, there is no free lunch.
« Last Edit: May 25, 2024, 09:15:40 pm by janoc »
 

Offline golden_labels

  • Super Contributor
  • ***
  • Posts: 1480
  • Country: pl
Re: Checksums on Linux for file repair
« Reply #5 on: May 25, 2024, 11:01:55 pm »
Infraviolet:
Parchive is what you are asking for.
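
For example, with the par2cmdline tool (a sketch; the 10% redundancy level and the file name are arbitrary choices):

par2 create -r10 bigfile.zip.par2 bigfile.zip   # generate recovery files, 10% redundancy
par2 verify bigfile.zip.par2                    # later: check the file's integrity
par2 repair bigfile.zip.par2                    # reconstruct damaged blocks, if possible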

Also, forget all the talk about rainbow tables etc. It is irrelevant to your problem. Even if a hash function no longer offers preimage resistance,(1) or tradeoffs may be made for repeated attacks,(2) your goal is not finding a matching value, but the original value. Preimage attacks only do the first.


(1) So far all popular hash functions — MD5, SHA1, SHA2 and SHA3 — have no practical preimage attacks against them.
(2) As in the rainbow tables approach, where space is traded for speed for repeated attacks.

People imagine AI as T1000. What we got so far is glorified T9.
 
The following users thanked this post: magic

Offline m k

  • Super Contributor
  • ***
  • Posts: 2662
  • Country: fi
Re: Checksums on Linux for file repair
« Reply #6 on: May 26, 2024, 11:48:16 am »
Quote from: Infraviolet
I'm not trying to reconstruct a corrupted file here, rather to know what checksums to take, beforehand, for files so I could do that as an element within future projects relating to transferring files.

Think of the fault tolerance of RAID.
RAID 6 tolerates 2 failed drives out of 6; RAID 1 tolerates 1 out of its minimum of 2.

So multiple copies are better against everything in that category.
They may still need checksums, but those can be pretty simple.

Without checksums it's a voting system.
Not very good if throughput is very uncertain.

For transfer here we use TCP/IP.
Back in the day, MTU had nothing to do with quality of any kind.
It was the size that was guaranteed to go through; no idea how it is today.
Advance-Aneng-Appa-AVO-Beckman-Danbridge-Data Tech-Fluke-General Radio-H. W. Sullivan-Heathkit-HP-Kaise-Kyoritsu-Leeds & Northrup-Mastech-OR-X-REO-Simpson-Sinclair-Tektronix-Tokyo Rikosha-Topward-Triplett-Tritron-YFE
(plus lesser brands from the work shop of the world)
 

Offline janoc

  • Super Contributor
  • ***
  • Posts: 3925
  • Country: de
Re: Checksums on Linux for file repair
« Reply #7 on: May 26, 2024, 03:49:08 pm »
Quote from: m k
For transfer here we use TCP/IP.
Back in the day, MTU had nothing to do with quality of any kind.
It was the size that was guaranteed to go through; no idea how it is today.


TCP/IP only ensures a checksum per packet, but that doesn't prevent corruption of a file being transferred. The data will get split into packets and then recombined at the receiving end. If that data gets corrupted once it is out of the TCP/IP stack - because of bad memory on some router somewhere, being recombined wrongly due to a firmware bug, etc. - the TCP checksums don't help you at all, because the individual packets were transferred fine.

And that assumes you are using TCP (e.g. the HTTP or FTP protocol). If the protocol is something custom running over UDP, then all bets are off and you are on your own - you could have lost packets, duplicated packets, out-of-order delivery, etc. Hopefully such a system implements its own integrity checks.

It is very easy to get a corrupted transfer with larger files. That's why most file distribution schemes publish something like MD5, SHA-1 or SHA-256 hashes to detect that the file has changed during transfer and the download is likely bad, regardless of TCP having its own checksums on the individual packets. Data doesn't get corrupted only "on the wire".

Concerning MTU - that is not the size that is "guaranteed to go through". It is the maximum data size that can be transferred over the link without splitting it (fragmentation) into multiple packets. "Can", because it is not guaranteed that even a smaller buffer won't be fragmented - that's completely up to the network stack. MTU only guarantees that anything larger than this value will be fragmented.

« Last Edit: May 26, 2024, 03:54:14 pm by janoc »
 

Offline m k

  • Super Contributor
  • ***
  • Posts: 2662
  • Country: fi
Re: Checksums on Linux for file repair
« Reply #8 on: May 26, 2024, 05:42:12 pm »
Yes, "guaranteed to go through" for MTU was inaccurate.
Back in the day, parts of the network didn't guarantee the integrity of fragments.

I'd say the first thing to figure out is whether partial data is acceptable or not.
Then how much time or space can be wasted.

If partial data is not accepted, then multiple copies with simple checksums is how backups do it.
There, time is limited but not critical, and size is more or less unlimited.
Backup here is a concept, not a single act.

If partial data is accepted and the completed transfer is stored, then some checksums are again needed, or the storage must be completely trusted.
Computer mass storage is not without checks, much the same as working memory.
Advance-Aneng-Appa-AVO-Beckman-Danbridge-Data Tech-Fluke-General Radio-H. W. Sullivan-Heathkit-HP-Kaise-Kyoritsu-Leeds & Northrup-Mastech-OR-X-REO-Simpson-Sinclair-Tektronix-Tokyo Rikosha-Topward-Triplett-Tritron-YFE
(plus lesser brands from the work shop of the world)
 

Online radiolistener

  • Super Contributor
  • ***
  • Posts: 4136
  • Country: 00
Re: Checksums on Linux for file repair
« Reply #9 on: May 26, 2024, 05:57:39 pm »
Quote from: janoc
The reason stuff like SHA-1 or MD5 is not recommended is not rainbow tables, but that it is possible to produce hash collisions - to create a file that has the same hash as another, different file.

If you don't do it intentionally, using special methods and weak points of the algorithm, the probability that it happens is almost zero. You can use MD5 and SHA-1 with no issues for file integrity checks. You can even use CRC32; it is good enough for file integrity checks.

They should be avoided if you're using them for cryptography, like electronic signature verification, because there are known vulnerabilities there. But SHA-256 is also vulnerable.
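
For a quick CRC on Linux, for example, coreutils ships the POSIX cksum tool (the file name here is just an example):

cksum bigfile.bin    # prints a CRC, the size in bytes, and the file name
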
« Last Edit: May 26, 2024, 06:00:22 pm by radiolistener »
 

Offline soldar

  • Super Contributor
  • ***
  • Posts: 3595
  • Country: es
Re: Checksums on Linux for file repair
« Reply #10 on: May 26, 2024, 10:04:58 pm »
Checksums and hashes are only intended to verify the integrity of data, not to provide error correction.

If you want to be able to correct errors without retransmitting the entire file, then the obvious thing to do is to break it up into chunks, send each chunk with its hash, and then reassemble the file at the other end.

Have a look here
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction
All my posts are made with 100% recycled electrons and bare traces of grey matter.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 28433
  • Country: nl
    • NCT Developments
« Last Edit: May 26, 2024, 10:25:49 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Foxxz

  • Regular Contributor
  • *
  • Posts: 129
  • Country: us
Re: Checksums on Linux for file repair
« Reply #12 on: May 27, 2024, 02:36:49 am »
You are looking for forward error correction (FEC). Some types, such as Reed–Solomon, have been suggested; Hamming codes are another: https://en.wikipedia.org/wiki/Hamming_code
Such codes are used in things like audio CDs, to correct audio data during playback and lessen the impact of scratches and dirt on the disc: https://en.wikipedia.org/wiki/Cross-interleaved_Reed%E2%80%93Solomon_coding
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 7212
  • Country: fi
    • My home page and email address
Re: Checksums on Linux for file repair
« Reply #13 on: May 27, 2024, 09:24:23 am »
For detection of file integrity, with no error correction support, I recommend using b2sum (part of coreutils, so if you have sha256sum, you have b2sum too), which calculates a 512-bit checksum using the BLAKE2b hash function. On typical hardware it is faster than md5sum or sha1sum, and much (3×) faster than sha224sum/sha256sum etc.

For filesystem integrity checks and change detection, running "ionice -c 3 nice -n 20 find root -type f -print0 | ionice -c 3 nice -n 20 xargs -0 b2sum -bz > list" generates a list of paths and checksums that "ionice -c 3 nice -n 20 b2sum -cz list" can check (b2sum does not read file names from standard input, hence the xargs -0). Both commands run in the background at idle priority, and should not significantly slow down anything else running at the same time. Note that the list file uses NUL ('\0') as the separator instead of newline, to support all possible file names, so you need to use e.g. tr '\0' '\n' < list | less to view it.

For error correction, as mentioned by golden_labels, Parchive (PAR2 format) is what comes to my mind as closest existing common tool.

I've personally found that fixing corruption using such a scheme is not worthwhile compared to storing multiple copies on physically separate media. I do use checksums to detect file corruption, but instead of trying to fix the corruption, I distribute copies physically in the hope that at least one of them maintains integrity. I've found correctable errors (only a tiny part of the file corrupted) to be rather rare compared to losing entire media, especially when using Flash (cards, USB sticks) for backups.

There are many filesystems with error detection and correction built in, so one option is to use a filesystem image (via a loopback device) to store the data. Essentially, you use losetup to set up a block device backed by your filesystem image. Initially you format it with the desired filesystem, and then mount the loopback device, at which point it becomes accessible. Afterwards, you unmount the filesystem, fsync to ensure the image is synced to the storage media, and detach the loop device. Of course, if you use dedicated media like Flash or spinny-rust drives, it'd make sense to format the device with that filesystem in the first place.
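
A minimal sketch of that approach, assuming Btrfs, a 1 GiB image, and a /mnt/backup mount point (all arbitrary choices):

truncate -s 1G backup.img                      # create a sparse 1 GiB image file
dev=$(sudo losetup --find --show backup.img)   # attach it; prints e.g. /dev/loop0
sudo mkfs.btrfs "$dev"                         # Btrfs checksums data and metadata
sudo mount "$dev" /mnt/backup
# ... copy files into /mnt/backup ...
sudo umount /mnt/backup
sync                                           # ensure the image is flushed to media
sudo losetup -d "$dev"                         # detach the loop device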

If there were an actual use case for individual bit flips (as opposed to entire sectors lost), one could use e.g. libcorrect to write a utility that optionally compresses the input file and then generates an output file with Reed–Solomon codes, and another that decodes such files and optionally decompresses the result. libcorrect is often used for Software Defined Radio (SDR). And regardless of the recent infrastructure attack on xz-utils, I would use xz as the optional precompressor/postdecompressor.
 
The following users thanked this post: soldar

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 4379
  • Country: gb
Re: Checksums on Linux for file repair
« Reply #14 on: May 28, 2024, 12:23:06 am »
Quote from: Nominal Animal
Software Defined Radio (SDR)

I think a further example is P2P
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline InfravioletTopic starter

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: gb
Re: Checksums on Linux for file repair
« Reply #15 on: December 16, 2024, 08:19:27 pm »
I know this is a bit of an old thread, but I've just been trying to understand a bit more about the Parchive PAR2 format which was mentioned. Does it actually have any resilience advantage over simply having two copies of the same file on the same medium (obviously having copies on separate media is a massive advantage over anything a format alone can do)? Or is it just potentially a bit more space efficient? If you had, say, a 1 GB zip file you wanted to protect, and your transfer rate or storage medium was fast/large enough that 2 GB could be processed almost as easily as 1 GB, would 2 copies of the same file always be better than Parchive when speed/space permits? Or does Parchive have some feature which makes it more effective for repairing corruption than two copies of the same file can be?
Thanks
« Last Edit: December 16, 2024, 08:24:30 pm by Infraviolet »
 

Offline golden_labels

  • Super Contributor
  • ***
  • Posts: 1480
  • Country: pl
Re: Checksums on Linux for file repair
« Reply #16 on: December 17, 2024, 02:00:18 am »
With nothing but two copies, one of them corrupted, you can't tell which one is valid. It's a mathematical impossibility: either may be the corrupted one, and it's not possible to distinguish between them. Worse: strictly speaking, you can't even tell whether both are damaged.

This may seem a bit unexpected, given that many backup setups only make copies. The critical distinction is in the phrase "nothing but [two copies]." Deployments based on just copies assume there is a "something." The something is your knowledge that the original has been damaged. The knowledge comes from the storage being obviously damaged (SSD/HDD malfunction, theft, data erasure, security breach) or specific content no longer being usable (video glitches in a movie, an obvious size difference, unreadable files). The premise is that you yourself detect the damage and replace the corrupted content with hopefully undamaged data. This is a perfectly valid backup strategy for many scenarios, but it fails in others: large numbers of files, bit rot, damage in the backup copies, missing fragments, etc.

This is where proper error detection and correction comes into play. This is why RAID levels 2 and above always include parity, and why Btrfs and ZFS have checksums. Parchive provides the same kind of properties. For relatively little additional space, you get a clear indication of whether the data is corrupted and, unless the damage is very extensive, the ability to recover the missing data.
People imagine AI as T1000. What we got so far is glorified T9.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 7212
  • Country: fi
    • My home page and email address
Re: Checksums on Linux for file repair
« Reply #17 on: December 17, 2024, 11:04:58 am »
Quote from: Infraviolet
Does [PAR2] actually have any resilience advantage over simply having two copies of the same file on the same medium (obviously having copies on separate media is a massive advantage over anything a format alone can do)?

Yes, because even a single copy can detect errors in itself and, if there are not too many, fix them.

PAR2 uses a subset of forward error correction codes called erasure codes.

Quote from: Infraviolet
If you had, say, a 1 GB zip file you wanted to protect, and your transfer rate or storage medium was fast/large enough that 2 GB could be processed almost as easily as 1 GB, would 2 copies of the same file always be better than Parchive when speed/space permits?

No, because a copy alone cannot attest to its own integrity.

Even if you stored the sha256sum of the file elsewhere, you could only detect whether one or both copies were intact, not repair them.

Of course, two PAR2 archives on completely separate storage would be utterly superior, because each copy can tell whether it is damaged; and if one suffers catastrophic damage (complete loss of the storage, or a large enough loss that correction is no longer possible), the other one may save the day.

Quote from: Infraviolet
Or does Parchive have some feature which makes it more effective for repairing corruption than two copies of the same file can be?

Yes: the erasure/forward error correction codes. When one reads the data, one can tell whether the bit stream is correct or not; and if sufficiently few bits are incorrect, the correct values can be recovered.

CDs use a different error coding algorithm, cross-interleaved Reed–Solomon coding, but with roughly similar properties: a scratch on the disc surface will damage the bit stream, but if it is not too large/dense, readers can recover the original data without any loss.
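
A quick way to see it in action with par2cmdline (a sketch; the redundancy level and byte offset are arbitrary, and big.zip stands in for whatever file you protect):

par2 create -r10 big.zip.par2 big.zip                        # 10% recovery data
printf '\377' | dd of=big.zip bs=1 seek=12345 conv=notrunc   # overwrite one byte to simulate corruption
par2 verify big.zip.par2                                     # reports the damaged block
par2 repair big.zip.par2                                     # restores the original contents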
 

