
Checksums on Linux for file repair


Infraviolet:
Ok, so one can take checksums of files on Linux by doing:
sha256sum *
in a folder to get the terminal to calculate a sha256 string for each file in the folder.

But SHA-256 is designed for security. Its main property is that it is extremely hard to reverse: an attacker has very little hope of modifying a file in such a fashion that it still gives the same checksum despite the changes made. So by its very nature, one doesn't have much hope of taking the sha256 checksum of a slightly corrupted file and the sha256 of the file before corruption, and telling from them what the corruption was and which bits, if flipped, would reverse it.

Is there a better checksum type available in the linux terminal for this sort of thing? I mean a checksum which could be transferred along with big files (hundreds of MB to some GB), so that if the checksum calculated at the other end doesn't match, you'd have a better chance, for a small number of corrupt bits, of being able to use the checksum as a hint for where to reverse the corruption.

I know a checksum string length broadly approximating a sha256 sum's length isn't going to be great for fixing corruption, but I bet there's a more appropriate terminal tool available for this than sha256sum.

One can also consider situations where one is trying to reconstruct a large file from two copies of that file, each with tiny amounts of corruption in different places, plus the checksum which either would have were it not corrupted at all.

Any ideas of what would be a good checksum to use?

I'm not trying to reconstruct a corrupted file here, rather to know what checksums to take, beforehand, for files so I could do that as an element within future projects relating to transferring files.

Thanks

nctnico:
Some of the older checksum algorithms like MD5 and SHA1 are 'broken' and can be reversed. Look here for a start: https://en.wikipedia.org/wiki/Rainbow_table
I just don't know whether this is doable for a large file.

Then again, it would be better to use an error correction algorithm and transfer that along with the file. Then you'll have a chance of correcting errors as well: https://humphryscomputing.com/Notes/Networks/data.error.html

radiolistener:
I'm using md5; it is more lightweight, so it should run faster, but I haven't tested how much faster md5 is than sha256 on an RPi 4.

Regarding error correction, checksum algorithms are not intended to do error correction.
If you want error correction, you need to encode the file with an error-correcting code such as Reed–Solomon.

As I remember, RAR archives support error-correction codes if you enable them, but note that this increases the archive size, because the archive has to store the redundant information needed for recovery.

shapirus:

--- Quote from: Infraviolet on May 25, 2024, 07:46:46 pm ---Is there a better checksum type available in the linux terminal for this sort of thing?

--- End quote ---
Better in terms of?


--- Quote from: Infraviolet on May 25, 2024, 07:46:46 pm ---I mean a checksum which could be transferred along with big files (hundreds of MB to some GB), so that if the checksum calculated at the other end doesn't match, you'd have a better chance, for a small number of corrupt bits, of being able to use the checksum as a hint for where to reverse the corruption.

--- End quote ---
What protocol will be used to transfer the file? Or, rather, what is the application?

Reasonable checksumming for error detection and retransmission is normally implemented in both hardware and software layers when we're talking about generic TCP/IP transmissions. That means there's little chance of receiving a corrupted file, so it's fine (in terms of resource usage) to calculate even a sha256 sum for the entire file as a final validation.

If for some reason you need more fine-grained checksumming, then, depending on application, you can calculate checksums block by block as data is transmitted and received, or block by block for the file stored on disk. In the latter case, I don't know of any ready made tools, but it is fairly easy to accomplish with some scripting (can be pure shell), dd, and sha256sum. The smaller the block size, the finer resolution it will have in telling the location of the first mismatching block.
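
A rough sketch of the block-by-block approach described above, assuming a POSIX shell with dd, wc, cut and GNU sha256sum available. The function name block_sums and the 1 MiB default block size are illustrative choices, not an existing tool:

```shell
# block_sums FILE [BLOCK_SIZE]: print "index  sha256" for each block of FILE.
# block_sums is a hypothetical helper name, not a standard utility.
block_sums() (
    file=$1
    bs=${2:-1048576}                      # block size in bytes (default 1 MiB)
    size=$(wc -c < "$file")
    blocks=$(( (size + bs - 1) / bs ))    # round up so a partial last block counts
    i=0
    while [ "$i" -lt "$blocks" ]; do
        # dd extracts block number i; sha256sum hashes it from stdin
        sum=$(dd if="$file" bs="$bs" skip="$i" count=1 2>/dev/null | sha256sum | cut -d' ' -f1)
        printf '%d  %s\n' "$i" "$sum"
        i=$((i + 1))
    done
)
```

Running block_sums bigfile > bigfile.blocks before the transfer, and block_sums bigfile | diff bigfile.blocks - afterwards, should list only the blocks whose hashes differ. This locates the damage to within one block but does not repair it; the bad blocks still have to be re-fetched or taken from another copy.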

janoc:
Hashes like MD5 or SHA256 are not designed for this use. You can't "reverse" the hash and find from it which part of the file has been corrupted. That's not possible by the mathematical definition of the hash; that information is lost.

Rainbow tables won't help for arbitrary files. They work by precomputing all possible hashes for a given size of the hashed data and storing them, so that the original data can simply be looked up when one sees a hash. That's fine for something like finding an encryption key (e.g. 128 bits of data) but not for recovering the content of a file the hash was computed from - the rainbow table would be intractably huge.

The reason stuff like SHA1 or MD5 is not recommended for use is not rainbow tables but that it is possible to produce hash collisions - to create a file that has the same hash as another, different file. Which is bad juju when you are using the hash as part of some cryptography setup to ensure that something hasn't been tampered with. For a basic file integrity check, e.g. when transferring files over the network, they are still fine.

If you want to not only detect the corruption but also to correct it (up to some amount of corruption) without retransmission (which is often the simpler way of dealing with the problem), you need to use error correction codes, e.g. some form of Hamming code. See here:

https://en.wikipedia.org/wiki/Error_correction_code

The idea is that you encode the data you are transmitting in some way and include a certain amount of redundant information. Then, when decoding the data, you can both discover that it has been corrupted and correct a certain number of errors using that redundant information. The disadvantage is that you pay for that protection with space/bandwidth, because you must transmit/store the extra information. As always, there is no free lunch.
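
To make the redundancy idea concrete, here is a minimal sketch of a Hamming(7,4) code in POSIX shell arithmetic: 4 data bits become a 7-bit codeword, and any single flipped bit can be located and corrected. The function names and the bit layout are assumptions made for illustration; practical tools typically use stronger codes over much larger blocks.

```shell
# Hamming(7,4) on one 4-bit value (0..15). Function names are illustrative only.
# Codeword positions 1..7 hold p1 p2 d1 p3 d2 d3 d4; position k is stored in bit k-1.
hamming_encode() (
    n=$1
    d1=$(( (n >> 3) & 1 )); d2=$(( (n >> 2) & 1 ))
    d3=$(( (n >> 1) & 1 )); d4=$(( n & 1 ))
    p1=$(( d1 ^ d2 ^ d4 ))   # covers positions 1,3,5,7
    p2=$(( d1 ^ d3 ^ d4 ))   # covers positions 2,3,6,7
    p3=$(( d2 ^ d3 ^ d4 ))   # covers positions 4,5,6,7
    echo $(( p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6) ))
)

hamming_decode() (
    cw=$1
    c1=$(( cw & 1 ));        c2=$(( (cw >> 1) & 1 )); c3=$(( (cw >> 2) & 1 ))
    c4=$(( (cw >> 3) & 1 )); c5=$(( (cw >> 4) & 1 )); c6=$(( (cw >> 5) & 1 ))
    c7=$(( (cw >> 6) & 1 ))
    # The three parity checks, read as a binary number, give the position
    # of a single flipped bit (0 means no error was detected).
    pos=$(( (c1 ^ c3 ^ c5 ^ c7) + 2 * (c2 ^ c3 ^ c6 ^ c7) + 4 * (c4 ^ c5 ^ c6 ^ c7) ))
    if [ "$pos" -ne 0 ]; then
        cw=$(( cw ^ (1 << (pos - 1)) ))   # flip the faulty bit back
    fi
    # Reassemble the data bits d1 d2 d3 d4 from positions 3, 5, 6, 7
    echo $(( (((cw >> 2) & 1) << 3) | (((cw >> 4) & 1) << 2) | (((cw >> 5) & 1) << 1) | ((cw >> 6) & 1) ))
)
```

For example, hamming_decode $(( $(hamming_encode 11) ^ 4 )) flips one bit of the codeword and should still print 11. The cost is visible here too: 7 bits are sent for every 4 bits of data, and two flipped bits in the same codeword would be miscorrected.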
