Author Topic: Annoying discrepancy in filesize definitions Windows vs. Linux! Argh!!!  (Read 1399 times)


Offline edyTopic starter

  • Super Contributor
  • ***
  • Posts: 2385
  • Country: ca
    • DevHackMod Channel
Hi folks,

I'm in the process of copying a bunch of old CD-Rs to HD storage, verifying file sizes along the way to make sure everything gets copied. Due to read errors I've had to try copying files on both Windows and my Linux machine, using special recovery utilities. In the process I noticed something very annoying (fortunately they are mostly MP3s, so even the odd blip is not bothersome).

When checking sizes of files and folders, different file size definitions are used on my Win machine vs. Linux. For example, on my Windows machine a folder is listed as 264 MB (276,893,957 bytes), yet on my Linux machine the same folder shows as 276.9 MB. Obviously one is using the definition 1 MB = 1024x1024 bytes (1,048,576) (which is *wrong*) and the other is simply dividing by 1,000,000 (they should not both be labelled with the same unit "MB"!). Perhaps this is down to my particular file manager in Ubuntu Studio, and other file managers or Linux distros would show the same as Windows. Very annoying! I see the same thing when buying hard drives and SSDs these days... they all use the 1,000,000 definition, when historically MB meant 2^20 (not 10^6).
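Just to make the two conventions concrete, here is the arithmetic on that folder's byte count (a quick shell sketch, nothing more):

```shell
# One byte count, two unit conventions:
bytes=276893957
echo "$(( bytes / (1024 * 1024) )) MiB"   # 264  (what Windows labels "MB")
echo "$(( bytes / 1000000 )) MB"          # 276  (what the Linux file manager rounds to 276.9 MB)
```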

Now I know there is a whole thing about mebibyte vs. megabyte, abbreviated MiB and MB respectively. But when I look at a file size in Windows it says MB next to the value, not MiB... even though it is actually displaying the MiB value. So is Windows erroneous, while Linux displays the actual value in MB according to the *newish* definition of megabyte? Then I went into a Linux shell and used "du -b" to get the folder size in bytes directly from the CD-R, and it showed yet another value (276,926,725) for the above folder. Not sure if that is an issue with block size on HD vs. CD-R and what exactly it is displaying (it could also be read errors that caused file size discrepancies I am not aware of). NOTE that the same CD-R reads better on one machine than the other... sometimes I have better luck on my HP desktop, sometimes on my ASUS laptop. I figure this is down to the drives, not the operating system, plus these particularly crappy CD-Rs (almost all of them Mitsumi) that everything has trouble with.  |O

I'm using "ddrescue" on Linux for the most part to recover files, as I've had difficulty with Roadkil's Unstoppable Copier on Windows (it takes forever and doesn't seem to work quite as well). Again, this could be down to the drives, or just the CD-R being such garbage that nothing will work well on it. Learned my lesson with CD-R and DVD-R, I guess... good while it lasted.
« Last Edit: December 09, 2021, 03:37:02 pm by gnif »
YouTube: www.devhackmod.com LBRY: https://lbry.tv/@winegaming:b Bandcamp Music Link
"Ye cannae change the laws of physics, captain" - Scotty
 

Online RoGeorge

  • Super Contributor
  • ***
  • Posts: 6202
  • Country: ro
Re: Annoying discrepancy in filesize definitions Windows vs. Linux! Argh!!!
« Reply #1 on: December 09, 2021, 03:47:56 pm »
By definition, 1 kilobyte (kB) is 1000 bytes, while 1 kibibyte (KiB) is 1024 bytes.  If you see KiB values labelled as kB, then file a bug report; the KiB and kB notations are not interchangeable.

My Dolphin file browser (Kubuntu) uses KiB, MiB, GiB by default.
On the same PC, Nautilus (now called "Files") uses kB, MB, GB by default.

Regarding file size, there is yet another difference: between the size of the file and the size on disk.  Then there are hidden and system files, which may or may not be copied, depending on what tool is used to make the copies.

Size is a good indicator to check that a copy has completed, but not very good for detecting errors.  A checksum like 'md5' or 'sha' could detect errors better, and checksums can be run over a whole tree of files at once.

If you want to be sure about the content, then don't bother checking the size; use a checksum instead.
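A minimal sketch of that workflow (the mount point and destination paths are hypothetical): record checksums relative to the source root, copy, then verify from inside the copy:

```shell
SRC=/media/cdrom          # hypothetical mount point of the source disc
DST=/backup/cdrom-copy    # hypothetical destination of the copy

# Record checksums with relative paths, so they match on the copy...
( cd "$SRC" && find . -type f -exec sha256sum {} + ) > /tmp/manifest.sha256
# ...then, after copying, verify the whole tree at once:
( cd "$DST" && sha256sum -c /tmp/manifest.sha256 )
```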

Some CD/DVD copy protections include intentionally burnt errors; those cannot be ripped by direct sector reads or by disk rescue tools.
« Last Edit: December 09, 2021, 04:18:03 pm by RoGeorge »
 
The following users thanked this post: edy

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: Annoying discrepancy in filesize definitions Windows vs. Linux! Argh!!!
« Reply #2 on: December 09, 2021, 04:00:29 pm »
du -b is shorthand for du --apparent-size --block-size=1, so it reports the apparent size in bytes.  Note that for directories, du counts the directory entries themselves as well, so the total can differ slightly from the plain sum of the file sizes.

Use du -h if you want human-readable sizes with k=1024, M=1048576, G=1073741824.
Use du -h --si if you want human-readable sizes with k=1000, M=1000000, G=1000000000.
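For example, with a file of known size (a 3 MiB test file created here just for illustration), the flavours report:

```shell
dd if=/dev/zero of=sample.bin bs=1M count=3 status=none  # exactly 3 MiB = 3,145,728 bytes
du -b sample.bin         # 3145728  (apparent size in bytes)
du -h sample.bin         # allocated size, binary units (typically 3.0M here)
du -h --si sample.bin    # allocated size, decimal units (typically 3.2M, rounded up)
```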

Whenever I burn files to DVD-R's for backup, I always include the output of
    find . -type f -print0 | xargs -r0 sha256sum | sort -k 2
for later verification, and optionally
    LANG=C LC_ALL=C find . -type f -ls
or
    LANG=C LC_ALL=C find . -type f -printf '%TY-%Tm-%Td %TT %-4TZ %12s %p\n' | sort -k 4
as an index.  The listed size is in bytes.

If you want a single index, with SHA256SUM, size in bytes, latest modification date and time, and path and file name, use for example
    find . -type f -print0 | while read -rd '' File ; do Sum="$(sha256sum "$File")"; printf '%s ' "${Sum%% *}" ; LANG=C LC_ALL=C stat "${File#./}" -c '%12s %y %n' ; done | sort -k 6
which gives you a line of form
    fb7ecca1350a0e11069dd3eed930e70280a704ef5ee2f875813b0be2cf59e07c         7536 2021-11-16 20:53:11.911512992 +0200 069-config-snippets/libsnippet.so
for each file in the current directory and in subdirectories if any.
 
The following users thanked this post: edy

Offline mariush

  • Super Contributor
  • ***
  • Posts: 5022
  • Country: ro
  • .
Re: Annoying discrepancy in filesize definitions Windows vs. Linux! Argh!!!
« Reply #3 on: December 09, 2021, 04:40:25 pm »
Windows kept using powers of two for backwards compatibility.
In the DOS days you had very little memory to work with and few CPU cycles ... every little trick helped.

Because of this, everywhere they could they used powers of two ... the hard drive was formatted in 512-byte sectors. Even now, mechanical hard drives still use 512-byte sectors, though 4096-byte sectors are also common.

1 KiB is 2^10, so if you wanted to convert a file size to KiB to show it in a nicer way, or just to calculate how many sectors it would use on disk, you simply shifted the number 10 bits to the right ... e.g. 9000 bytes is b0010 0011 0010 1000, and shifting it 10 bits to the right (the low bits 11 0010 1000 fall off) leaves b1000 = 8 KiB. If you want MiB, you shift 20 bits to the right.

Shifting 10 bits to the right was much, much faster than dividing a number by 1000, and you don't have to deal with floating point numbers.
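Shell arithmetic shows the same trick, using the sizes already mentioned in this thread:

```shell
echo $(( 9000 >> 10 ))        # 8    -> whole KiB in 9000 bytes
echo $(( 276893957 >> 20 ))   # 264  -> whole MiB via a 20-bit shift
```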

CD-ROM uses 2352-byte raw sectors: 2048 bytes of data plus sector headers, footers, and error correction info, see https://en.wikipedia.org/wiki/CD-ROM#Sector_structure

Note that it's kinda misleading ...  files physically use multiples of the allocation unit, so with 512-byte clusters a 700-byte file will occupy 1024 bytes on your drive.
But below some threshold (on NTFS, up to a few hundred bytes), Windows will not reserve clusters for the file, but rather store the content directly in the area reserved in the file system for file metadata (file name, creation time, last modified, attributes, etc.).  If/when the file grows past that, Windows "upgrades" it and reserves actual space on the drive.
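The same file-size vs. size-on-disk split is visible on Linux with stat (a sketch; the exact allocated figure depends on the filesystem's allocation unit):

```shell
printf 'x' > tiny.txt                                 # 1 byte of content
stat -c 'apparent size: %s bytes' tiny.txt            # 1
stat -c 'allocated: %b blocks of %B bytes' tiny.txt   # e.g. 8 x 512 = 4096 on ext4
```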

 
« Last Edit: December 09, 2021, 04:43:44 pm by mariush »
 
The following users thanked this post: edy

Offline edyTopic starter

  • Super Contributor
  • ***
  • Posts: 2385
  • Country: ca
    • DevHackMod Channel
Re: Annoying discrepancy in filesize definitions Windows vs. Linux! Argh!!!
« Reply #4 on: December 09, 2021, 06:01:35 pm »
Thanks for the explanations and suggestions. If I ever burn a CD-R or DVD-R again, I will make sure to add checksum files and buy the best possible media. Seeing that the computer industry seems to be moving away from CD/DVD-RW anyway, and that everything is either solid-state memory or HD platters (now cheap enough for archival purposes to be more competitive than removable media), this will probably be the last time I ever try to read these discs. These last few are taking hours just to recover a small file; it's just not worth it. For example, I'm running ddrescue now on a 100 MB file and it's telling me it will need 3 days to recover it (based on the speed it is working at).  :-DD The few that won't read are not worth my time (usually some downloaded music-mix MP3 files that I rarely listen to anyway). Most of the errors seem to happen around the edge of the disc; it usually errors out on the last few files to be copied.

I see blank CD-Rs at about $50 for a 100-pack, so 700 MB x 100 gives roughly 70,000 MB, or 70 GB of storage per spindle... Compare that to a 120 GB SSD for $25 (half the cost!). Same thing with DVD-R: Amazon has a 50-pack of 4.7 GB discs for $19, which gives you about 235 GB total... If I wanted close to 1 TB I'd have to buy 4 (235 GB x 4 = 940 GB), which would cost me almost $80. Compare that with a 1 TB HDD for $50-70. I know we are comparing apples vs. oranges, but it seems removable R/W media is dead for the average consumer who wants backup options... and while *most* of my data has been readable (going back almost 20 years), I'm left with a bad taste from some of these bad batches I've now discovered are unreadable (they were totally fine for the first few years).

[EDIT:  I know 20 years ago there was a huge difference in prices for CD/DVD-RW and hard drives, but now it is not even a question which gives you better value.]
« Last Edit: December 09, 2021, 07:05:11 pm by edy »
YouTube: www.devhackmod.com LBRY: https://lbry.tv/@winegaming:b Bandcamp Music Link
"Ye cannae change the laws of physics, captain" - Scotty
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14466
  • Country: fr
Re: Annoying discrepancy in filesize definitions Windows vs. Linux! Argh!!!
« Reply #5 on: December 09, 2021, 06:38:26 pm »
Uh.
First, if you want to compare sizes, just look at the size in *bytes*. All OSs can show you this. Also make sure you get the actual file size, and not the size on disk. Windows, for instance, can show you both.
(Yes, size multipliers - K, M, G, ... - are still a huge PITA as far as standardization goes, and despite the standardization of binary multipliers - KiB, MiB... - it looks like many people stubbornly don't want to use them, so it'll be back to square one for still a long time. Quite a few threads about this point already on this forum...)

Size on disk is almost always larger, since most filesystems out there use some kind of "clustering" with "clusters" of a fixed size.
But the file size itself, as long as it properly reflects the data contained in the file, should be identical if the copy is lossless. No data should be added or removed. (Exceptions can occur with some copy tools on text files, due to them potentially transforming line endings... but low-level file copies should NOT do this.) So in the end it all comes down to retrieving the *file size*, which depends on how you choose to display it...

 
The following users thanked this post: newbrain

Offline x86guru

  • Regular Contributor
  • *
  • Posts: 51
  • Country: us
Re: Annoying discrepancy in filesize definitions Windows vs. Linux! Argh!!!
« Reply #6 on: December 09, 2021, 07:16:25 pm »
Quote from: edy
When checking sizes of files and folders, different file size definitions are used on my Win machine vs. Linux.

The size of individual files should be identical when compared with a POSIX stat(). The size of directories can differ. In Linux, the size of a directory is the sum of the sizes of its dentries (directory entries: metadata for each file within the directory, including previously deleted/zeroed entries).

So ignore the size of directories and only focus on the size of the files, or the sum of the file sizes within a directory.
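A quick way to do that comparison on either side (a sketch; GNU find is assumed, and the path is hypothetical):

```shell
# Sum the apparent sizes of all regular files under a tree, ignoring directory sizes:
find /path/to/tree -type f -printf '%s\n' | awk '{ total += $1 } END { print total + 0 }'
```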
 

