Author Topic: Disk usage discrepancy - Linux  (Read 6029 times)


Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 457
  • Country: be
Re: Disk usage discrepancy - Linux
« Reply #25 on: November 07, 2023, 05:43:13 pm »
OTOH, I don't think the discrepancy can be completely (or even mostly) explained by the aforementioned issue of 4KB block allocation, because it would take a huge lot of small files to waste so much space in this manner.

You don't need files; directories are equally wasteful. Here is an example -- five directories, where the innermost contains a file of 1 byte.

tree -h reports 1 byte for the file and 4K for each directory:

Code: [Select]
work> tree -h one
one
└── [4.0K]  two
    └── [4.0K]  three
        └── [4.0K]  four
            └── [4.0K]  five
                └── [   1]  one-byte.txt

4 directories, 1 file


du -sh reports 24K allocated for the whole thing (five 4K directory blocks plus one 4K data block for the 1-byte file):

Code: [Select]
work> du -sh one
24K one


Xfce file manager (thunar) says "6 items, totalling 1 byte"
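The same thing can be checked per path with stat(2): st_size is the apparent size that tree -h prints, while st_blocks (in 512-byte units) is the allocated space that du adds up. A minimal sketch:

Code: [Select]
/* Print apparent size vs allocated space for one path. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;
    if (argc < 2 || lstat(argv[1], &st) != 0) {
        perror("lstat");
        return 1;
    }
    printf("%s: apparent %lld bytes, allocated %lld bytes\n",
           argv[1], (long long)st.st_size, (long long)st.st_blocks * 512LL);
    return 0;
}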
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7529
  • Country: pl
Re: Disk usage discrepancy - Linux
« Reply #26 on: November 07, 2023, 06:40:37 pm »
It would also take a huge lot of directories to waste 450GB with only 130GB of file data ;)

But good to know that XFCE also sucks.
That's why I generally distrust such things.
 

Online DavidAlfa

  • Super Contributor
  • ***
  • Posts: 6438
  • Country: es
Re: Disk usage discrepancy - Linux
« Reply #27 on: November 07, 2023, 07:39:52 pm »
For some strange reason I still can't understand, some retard decided that 2^10 = 1024 was somehow discriminating against the International System of Units, or something like that.
The long-standing megabyte (2^20 = 1,048,576 bytes) became 10^6 = 1,000,000 bytes, and the old unit was renamed to mebibyte (MiB).
Some file explorers report MiB, others MB, so the same file might show as 12GB in one and 11.17GiB in another. It's a *** mess.
This might have something to do with storage manufacturers, who (to the confusion of consumers) decided to sell devices using the decimal system.
So you buy "1TB" (1000*1000*1000*1000 bytes) but get 931GiB. Still, 448GB would only be 417GiB; the difference here is too big for that.
Try:
Code: [Select]
df -h
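
For reference, the conversion is just 10^9 vs 2^30 bytes per "G"; a trivial C check of the numbers above (purely illustrative):

Code: [Select]
/* Decimal GB vs binary GiB, using the figures mentioned in this post. */
#include <stdio.h>

int main(void)
{
    const double GiB = 1024.0 * 1024.0 * 1024.0;   /* 2^30 bytes */

    printf("1 TB   = %.0f GiB\n", 1e12  / GiB);    /* ~931 GiB */
    printf("448 GB = %.0f GiB\n", 448e9 / GiB);    /* ~417 GiB */
    return 0;
}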


« Last Edit: November 07, 2023, 07:46:22 pm by DavidAlfa »
Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

Offline golden_labels

  • Super Contributor
  • ***
  • Posts: 1507
  • Country: pl
Re: Disk usage discrepancy - Linux
« Reply #28 on: November 07, 2023, 11:09:31 pm »
A debate about prefixes, while the difference is 75%. Yes, that must be this newfangled Fibonacci unit. Fibobyte = 1650 bytes, fimobyte = 1650² = 2,722,500 bytes, figobyte = 1650³ = 4,492,125,000 bytes; 130.5 figobytes = 586,222,312,500 bytes ≈ 586 GB!

Circlotron: your file manager reports the total size of files by counting them one by one. With insufficient access rights it can't read the sizes of all files, and those files are not counted towards the total. I reckon this is what's happening: you are seeing an incomplete value from that count, while the filesystem usage is read directly from the filesystem's own statistics and therefore reflects the actual usage.

For some strange reason I still can't understand, some retard decided 2^10=1024 was discriminating the International System of Units or something like that.
How binary prefixes came into existence would be a good subject for psychological studies. Starting from how an informal notation, used to indicate an order of magnitude, became part of nerd lore, through how it turned into a collective false memory and invaded professional circles, finishing with the fanatical reactions from people facing a mere suggestion that the lore is false. And all this in just 30–40 years.
« Last Edit: November 07, 2023, 11:11:27 pm by golden_labels »
People imagine AI as T1000. What we got so far is glorified T9.
 
The following users thanked this post: SiliconWizard

Online magic

  • Super Contributor
  • ***
  • Posts: 7529
  • Country: pl
Re: Disk usage discrepancy - Linux
« Reply #29 on: November 08, 2023, 07:06:16 am »
If you bothered to read the thread you are responding to, you would understand that prefixes are responsible for one of the issues brought up by the OP.

As for the rest of your post, aren't you some goddamn millennial kid anyway? :P
 

Offline CirclotronTopic starter

  • Super Contributor
  • ***
  • Posts: 3373
  • Country: au
Re: Disk usage discrepancy - Linux
« Reply #30 on: November 09, 2023, 01:16:04 am »
Because it (hopefully) isn't the source of OP's confusion,
This is the source of the confusion, the difference between these two figures.
You have 130.5GB in the root directory of the ONE_TB volume, and a total of 586.3GB overall in that volume.

Let's say you have ONE_TB mounted in the typical location in Linux,
    /media/circlotron/ONE_TB
and in it,
    /media/circlotron/ONE_TB/movie1.mkv
    :
    /media/circlotron/ONE_TB/movieN.mkv
that total to 130.5GB.  Then, you have a subdirectory in it,
    /media/circlotron/ONE_TB/not-porn/movie1.mkv
    :
    /media/circlotron/ONE_TB/not-porn/movieN.mkv
that total to 455.8GB.

Many Linux file managers only show you the folder statistics when you right-click and select properties.  The summary is then of only that folder, not including any subfolders.  Above, you have 586.3GB of stuff on a volume named ONE_TB, about 130.5GB in the topmost directory, the rest, about 455.8GB, somewhere in subdirectories.

Me, I use Nemo right now, and that starts scanning the subdirectories interactively when you look at the properties of a folder, updating the total in real time as you wait.  I don't know which file manager or desktop environment Circlotron is using, and there are several that can be configured to look like that, so it could also be a funky bug in one instead.
I'm using Mint Mate with Caja file manager.
The disk has no files in the root directory. No hidden files either. Everything is in folders.
For comparison, here is a disk on my work pc. The numbers are exactly as I would have expected, not like the other one.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 7314
  • Country: fi
    • My home page and email address
Re: Disk usage discrepancy - Linux
« Reply #31 on: November 09, 2023, 02:14:08 am »
Because it (hopefully) isn't the source of OP's confusion,
This is the source of the confusion, the difference between these two figures.
You have 130.5GB in the root directory of the ONE_TB volume, and a total of 586.3GB overall in that volume.
I'm using Mint Mate with Caja file manager.
The disk has no files in the root directory. No hidden files either. Everything is in folders.
For comparison, here is a disk on my work pc. The numbers are exactly as I would have expected, not like the other one.
Ah, my mistake.  The two differ even in the size of the device!  Are you sure they mount the same partition?  One definitely uses a standard ext3/ext4 mount, whereas the work pc mounts it using FUSE.

Could you run and report the outputs of
    sudo fdisk -l /dev/sdb
which describes the partition table on the device itself (feel free to obfuscate disk model and identifier),
    df -H /media/$(id -un)/*
which reports the space on any mounted external filesystems (using powers of ten kilo/mega/giga/tera suffixes), and
    sudo tune2fs -l /dev/sdb1
which describes the properties of the first partition (assuming ext2/ext3/ext4 format)?  All three commands are non-destructive.  Of the last one, we're interested in "Filesystem state", "Inode count", "Block count", "Reserved block count", "Free blocks", "Free inodes", "Block size", and "Last checked".

You cannot run a filesystem check while it is mounted, but if you unmount the volume first, you can then run
    sudo e2fsck -f -n /dev/sdb1
which checks the filesystem without making any changes; if it finds any issues, consider re-running it without the -n option, in which case it asks whether you want it to try and fix issues it finds.
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7529
  • Country: pl
Re: Disk usage discrepancy - Linux
« Reply #32 on: November 09, 2023, 06:57:35 am »
I suppose these are different disks ;)
 

Offline CirclotronTopic starter

  • Super Contributor
  • ***
  • Posts: 3373
  • Country: au
Re: Disk usage discrepancy - Linux
« Reply #33 on: November 10, 2023, 02:57:56 am »
I suppose these are different disks ;)
And that is the point.
On the second disk the numbers make sense, on the first disk they are confusing.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4062
  • Country: us
Re: Disk usage discrepancy - Linux
« Reply #34 on: November 10, 2023, 03:41:40 am »
In this case, it was (and is) universally accepted in computer science that 1 KB = 1024 bytes, 1 MB = 1024 * 1024 bytes, and so on. The ISO "standard" has created ambiguity by leaving it uncertain whether 1 MB is 1024x1024 bytes, or 1000x1000 bytes. Standards are supposed to remove ambiguity, not create it.

It really wasn't.  Despite frequent claims to the contrary and a notable lawsuit, manufacturers *never* consistently reported hard drive size with 1 megabyte = 1024*1024.  Prior to the early '90s, manufacturers just did whatever they felt like; later they all adopted base-10 terminology.  Most notably in the consumer space, the ST-506, *the* original consumer-level hard drive, came in 5 MB and 10 MB (formatted) sizes: 5,013,504 and 10,027,008 bytes.  Mainframe hard drives from the '70s were exclusively power-of-10.
And other uses were exclusively base 10, particularly communication. For instance, 10 megabit Ethernet was always 10·10^6 bits/second; nobody ever considered that it should be anything else.

There were like two years in the late '80s when it was pretty common for *consumer* hard drives to use power-of-two sizes in their marketing materials.  That just happens to correspond to when a lot of nerds of a certain age got their first hard drive, and led them to mistakenly believe that there was some universal standard until the marketing department figured they could use the confusion to inflate their product copy.  But that's really not how it happened.

It's also a bit of a silly convention, at least for hard drives.  Once you get past the sector size of 512–4096 bytes, there is no physical or logical reason to keep using binary powers.  Drive platter counts and cylinder counts have no reason to be power-of-two aligned, and in fact usually aren't; nor are sector counts, which aren't even constant across the disk.
 
The following users thanked this post: golden_labels

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 7314
  • Country: fi
    • My home page and email address
Re: Disk usage discrepancy - Linux
« Reply #35 on: November 10, 2023, 04:33:51 am »
Because it (hopefully) isn't the source of OP's confusion,
This is the source of the confusion, the difference between these two figures.
Ah.  Me fail. :palm: I misunderstood.  I do believe magic is right here.

Caja obtains the amount of space used for the textual summary by traversing the filesystem (using nftw() with a callback that sums the used space), but it gets the total space used and free for the pie chart from a statvfs() call (which does no traversal at all, it just peeks at the filesystem statistics).  The significant difference between the two is caused by Caja not having read access to all directories during the traversal.

Note that traversal ('x') access to all directories is required but not sufficient; Caja needs read ('r') access also, so that it can get the file information.  It does not need to have read access to the files, though.
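
To illustrate the difference, here is a rough sketch of the two mechanisms (not Caja's actual code, just the general idea; error handling trimmed):

Code: [Select]
#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

static uint64_t traversal_total = 0;

/* nftw() callback: sum allocated space (st_blocks is in 512-byte units).
 * FTW_DNR marks a directory we could not read; its contents are simply
 * missing from the total, which is exactly the Caja/du limitation. */
static int sum_fn(const char *path, const struct stat *st, int type, struct FTW *ftw)
{
    (void)ftw;
    if (type == FTW_DNR)
        fprintf(stderr, "unreadable, not counted: %s\n", path);
    else
        traversal_total += (uint64_t)st->st_blocks * 512;
    return 0;  /* keep walking regardless */
}

int main(int argc, char *argv[])
{
    const char *root = (argc > 1) ? argv[1] : ".";

    /* 1. Traversal, like the textual summary. */
    if (nftw(root, sum_fn, 64, FTW_PHYS) == -1)
        perror("nftw");
    printf("traversal: %llu bytes used (readable part only)\n",
           (unsigned long long)traversal_total);

    /* 2. Filesystem statistics, like the pie chart: no traversal at all. */
    struct statvfs vfs;
    if (statvfs(root, &vfs) == 0) {
        uint64_t total = (uint64_t)vfs.f_blocks * vfs.f_frsize;
        uint64_t avail = (uint64_t)vfs.f_bfree  * vfs.f_frsize;
        printf("statvfs:   %llu bytes used (whole filesystem)\n",
               (unsigned long long)(total - avail));
    }
    return 0;
}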

Because the du command does the same traversal, it suffers from the same limitation unless you temporarily elevate its privileges.  You can verify this is indeed the case by simply running
    du -h /media/$USER/ONE_TB >/dev/null
as it will tell you which directories it cannot read, and whose contents (including subdirectories) are therefore not counted in its summary.
Caja's textual summary is limited in exactly the same manner; it just doesn't visibly complain.
« Last Edit: November 10, 2023, 04:36:04 am by Nominal Animal »
 

Offline CirclotronTopic starter

  • Super Contributor
  • ***
  • Posts: 3373
  • Country: au
Re: Disk usage discrepancy - Linux
« Reply #36 on: November 10, 2023, 05:47:29 am »
Because the du command does the same traversal, it suffers from the same limitation unless you temporarily elevate its privileges.  You can verify this is indeed the case by simply running
    du -h /media/$USER/ONE_TB >/dev/null
as it will tell you which directories it cannot read, and whose contents (including subdirectories) are therefore not counted in its summary.
I ran that command and it said
Code: [Select]
du: cannot read directory '/media/myname/ONE_TB/.Trash-0': Permission denied
du: cannot read directory '/media/myname/ONE_TB/lost+found': Permission denied
Turns out in .Trash-0/expunged/ there was a folder with 53.7GB of files from an old disk image. They were meant to have been put in the trash, but they don't show up there. The file owner is root, not me. That still doesn't account for the whole discrepancy, though.

Edit: I opened Caja as root and looked in the trash folder on that disk, and there are six folders with a total of 352GB of disk image files. So it seems the program I used to make the disk backups, Redo Rescue, had root permissions. Never thought of that. Oh well, I think that kind of solves it.  :)
« Last Edit: November 10, 2023, 05:54:07 am by Circlotron »
 

Offline CirclotronTopic starter

  • Super Contributor
  • ***
  • Posts: 3373
  • Country: au
Re: Disk usage discrepancy - Linux
« Reply #37 on: November 10, 2023, 05:58:17 am »
Not quite...
If I run Caja as root and try to delete the files, I get the following error message in the console.

** (caja:5902): WARNING **: 16:56:17.701: Could not inhibit power management: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Name "org.gnome.SessionManager" does not exist
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 7314
  • Country: fi
    • My home page and email address
Re: Disk usage discrepancy - Linux
« Reply #38 on: November 10, 2023, 06:36:10 am »
Not quite...
If I run Caja as root and try to delete the files, I get the following error message in the console.

** (caja:5902): WARNING **: 16:56:17.701: Could not inhibit power management: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Name "org.gnome.SessionManager" does not exist
You can ignore that error.  It occurs exactly because you're running Caja as root, as it cannot connect to your session manager and user dbus agent.  All that warning means is that Caja could not connect to power management.  (It only does so to ensure your machine does not go to sleep or hibernate while doing significant work via Caja, like copying files around.)

If you want, you can run
    cd /media/yourself/ONE_TB
    sudo rm -rf .Trash-0
    sudo rm -rf lost+found
    cd
instead.  Obviously, sudo rm -rf is the nuclear weapon among Linux commands, allowing you to render your system completely inoperable, but the above pattern (changing the working directory to the one containing the offending items, and then running the delete command with a single file or directory name only: no paths, no slashes!) makes it much safer.  Assuming the sudo rm commands are written correctly, the worst thing that can happen is that you completely delete a similarly named tree elsewhere (if and only if the cd command failed or had a typo); in this case, these two are safe to run.  lost+found is a directory where orphaned inodes and similar oddities are put when the filesystem is checked (using e2fsck/fsck).  Just be careful, and re-check the command before hitting Enter.

The last "cd" alone switches back to your home directory, as you cannot unmount the device if you have a shell open with its working directory on that device (making such a mount "active").  In Linux, and more generally in BSDs and Unix, the current working directory for each process is not based on the path, but the actual directory itself.  The kernel itself basically keeps a file description open on the directory.  This means that if you in one shell run "cd ; mkdir example ; cd example" to create a subdirectory named "example" under your home directory, and then change the name of that subdirectory in another shell (cd ; mv example foobar), the first shell still works just fine and does not notice its actual path has changed, and is able to access everything in the renamed directory.  The only thing that will fail is "cd $CWD" and similar commands that use the path string.  This is also why mounts stay active (and not unmountable) as long as you have a shell open in there.  It also means that with the above sudo rm pattern, if you check you are in the correct directory by using e.g. "ls -laF" (to list the contents of the current working directory), nobody can do hidden tricks to cause you to delete the wrong things by the immediately following sudo rm commands.  In some other operating systems, a nefarious user also logged in might be able to do that by renaming the directory you have a shell open, replacing it with a symlink to some other directory containing the thing they want you to delete, because they track paths using the string, and not the actual inodes.  This kind of race window is not possible when using file description based approaches, only when paths are used.

And that last one is the reason why I keep telling people that if they write code to traverse directories using opendir()/readdir()/closedir() instead of nftw() or the BSD FTS family of functions, they will almost certainly open up bugs that allow similar bait-and-switch path-based trickery to work.  The underlying file-description-based machinery grew the entire ATFILE family, including functions like openat(), linkat(), unlinkat() (the actual modern syscall used to delete files and directories), and even execveat(), to protect against path manipulation during operations.  Simply put, these, like the current working directory, don't care if the name used to access the directory or any of its parent directories changes; they have a robust, name-bypassing "hook" into the actual directory or file instead.

If you do write your own directory traversal code, then you should definitely be using openat() and fstatat(), or your code IS vulnerable to path bait-and-switch attacks and bugs, including simple file renames within the same directory.  (The trick with opendir() in Linux is to use /proc/self/fd/N, where N is a read-only descriptor obtained via openat() on the desired subdirectory.)  A well-implemented directory traverser maintains a global set of (device, inode) tuples identifying each directory it has already traversed, and, while traversing each directory, a (device, inode) tuple for each file.  It can still miss files renamed during the traversal (though it usually does not, because renames in place tend not to reorder the directory contents), but that is acceptable; it will be able to deal with directories being renamed during traversal, and will not report both pre-rename and post-rename names and statistics for files that are renamed or moved during traversal.  As you can see, that takes quite a lot of code, so it is better to rely on the nftw() provided by the standard C library (a POSIX feature) instead.
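
To make that concrete, here is a bare-bones sketch of descriptor-based traversal (just an illustration, deliberately without the device/inode bookkeeping described above, so it is not a complete traverser):

Code: [Select]
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static uint64_t total_bytes = 0;

/* 'name' is a single component, resolved relative to the open descriptor
 * 'atfd' -- never a full path string, so renaming an ancestor directory
 * mid-walk cannot redirect the walk anywhere else. */
static void walk(int atfd, const char *name)
{
    int fd = openat(atfd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
    if (fd == -1)
        return;                         /* unreadable, or not a directory */

    DIR *dir = fdopendir(fd);           /* DIR stream takes ownership of fd */
    if (!dir) {
        close(fd);
        return;
    }
    const int dfd = dirfd(dir);

    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;

        struct stat st;
        if (fstatat(dfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW) == -1)
            continue;

        total_bytes += (uint64_t)st.st_blocks * 512;   /* allocated space */
        if (S_ISDIR(st.st_mode))
            walk(dfd, de->d_name);                     /* descend by descriptor */
    }
    closedir(dir);                      /* closes the descriptor too */
}

int main(int argc, char *argv[])
{
    walk(AT_FDCWD, (argc > 1) ? argv[1] : ".");
    printf("%llu bytes allocated\n", (unsigned long long)total_bytes);
    return 0;
}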

Apologies for the bit of a rant, but it is pertinent to the discussion at hand.  I bet not even DiTBho has done the traversal correctly in their "lsprettysize" program, and I just hate shoddily done, known-vulnerable base utilities.
 

Offline IanB

  • Super Contributor
  • ***
  • Posts: 12570
  • Country: us
Re: Disk usage discrepancy - Linux
« Reply #39 on: November 10, 2023, 08:24:47 am »
It really wasn't.  Despite frequent claims to the contrary and a notable lawsuit, manufacturers *never* consistently reported hard drive size with 1 megabyte = 1024*1024.  Prior to the early '90s, manufacturers just did whatever they felt like; later they all adopted base-10 terminology.  Most notably in the consumer space, the ST-506, *the* original consumer-level hard drive, came in 5 MB and 10 MB (formatted) sizes: 5,013,504 and 10,027,008 bytes.  Mainframe hard drives from the '70s were exclusively power-of-10.

I do not concur with this, since total hard drive size is not really of particular significance. What is of much more relevance is operating systems, software, memory management, file management systems, and the size of files stored on the disk. In that area, memory is usually organized in pages sized in powers of two, and disk storage is measured in blocks which are also sized in powers of two for convenience of memory mapping and paging. This has been true from early on right up to today.

For instance, minicomputers like the DEC PDP-11 or VAX machines measured disk storage in blocks of 512 bytes. The Multics mainframe system measured disk storage in pages of 1024 words, and segments of up to 64 pages. This goes back to the '70s, long before the '90s came around.
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7529
  • Country: pl
Re: Disk usage discrepancy - Linux
« Reply #40 on: November 10, 2023, 08:46:49 am »
There were like two years in the late '80s when it was pretty common for *consumer* hard drives to use power-of-two sizes in their marketing materials.  That just happens to correspond to when a lot of nerds of a certain age got their first hard drive, and led them to mistakenly believe that there was some universal standard until the marketing department figured they could use the confusion to inflate their product copy.
Some of those nerds still have their first disks.

My "1 gigabyte" Seagate ST51080A is labeled "2,116,800 sectors, 1083 Mbytes", which is indeed 1083·10⁶ bytes, but it notably isn't 10⁹ bytes and rather happens to be 1.009GiB. They were also quite careful to avoid saying "megabytes" or "MB". This disk was made in 1996 and things stayed that way through the '90s.

As for universal standards, let's just agree that the universal standard of 1990s consumer computing was Microsoft Windows. In Windows, 1MB stood for 1024KB and 1GB for 1024MB, and that's why disks were made this way. And I wouldn't be surprised if Windows wasn't the only major OS to do so. Linux, for one, also used the same convention.

So maybe it's more fair to talk about a battle between marketing and consumer expectations which lasted for two decades before the marketers ultimately won, but you can't say that nothing happened.
« Last Edit: November 10, 2023, 08:55:44 am by magic »
 

Offline golden_labels

  • Super Contributor
  • ***
  • Posts: 1507
  • Country: pl
Re: Disk usage discrepancy - Linux
« Reply #41 on: November 11, 2023, 07:15:06 pm »
Circlotron: did you read my response?

It really wasn't.  Despite frequent claims to the contrary and a notable lawsuit, manufacturers *never* consistently reported hard drive size with 1 megabyte = 1024*1024. (…)
The even funnier part is that RAM and RAM stick manufacturers also use 10-based prefixes for everything except capacity. A 19200 MB/s DDR4 stick is 19.2·10⁹ B/s, not 20,132,659,200 B/s (19200 MiB/s).
People imagine AI as T1000. What we got so far is glorified T9.
 

Offline CirclotronTopic starter

  • Super Contributor
  • ***
  • Posts: 3373
  • Country: au
Re: Disk usage discrepancy - Linux
« Reply #42 on: November 12, 2023, 10:14:14 am »
If you want, you can run
    cd /media/yourself/ONE_TB
    sudo rm -rf .Trash-0
    sudo rm -rf lost+found
    cd
Okay then. That got things moving a bit! There is still a remaining difference that I can't account for, though. Maybe I'll fire up the Redo Rescue CD and see if some extra files become visible.

 

Online magic

  • Super Contributor
  • ***
  • Posts: 7529
  • Country: pl
Re: Disk usage discrepancy - Linux
« Reply #43 on: November 12, 2023, 11:10:25 am »
If you haven't already, run df -h to get reliable usage statistics. Compare with what you see above.
Run du on the mount point directory again to get a reliable count of total disk usage. If it still reports inaccessible files, delete them too or run du as root.

The resulting numbers should be close to each other. Then you can try to guess what your file manager is doing.
 

Offline CirclotronTopic starter

  • Super Contributor
  • ***
  • Posts: 3373
  • Country: au
Re: Disk usage discrepancy - Linux
« Reply #44 on: November 12, 2023, 12:13:14 pm »
df -h says  917G size 122G used 749G free

du says 127445556 right at the bottom, so 127G more or less.

Those two figures are near enough to 130.5GB, so why do we have "180.5GB used" ???

And why would the pic above now say "free space unknown" ?
« Last Edit: November 12, 2023, 12:18:33 pm by Circlotron »
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7529
  • Country: pl
Re: Disk usage discrepancy - Linux
« Reply #45 on: November 12, 2023, 12:50:23 pm »
These numbers check out because du reports in 1KiB blocks by default (I forgot that it too has an -h option which switches to MiB/GiB automatically) and 127445556KiB is 121.5GiB.

This is exactly the 130.5GB shown by your file manager, which appears to be using the decimal convention.
Similarly, 984GB = 917GiB, so total capacity matches too.
Free space also matches - 804GB = 749GiB.

At this point we must notice that "used"+"free" doesn't add up to "size" in df output. I strongly suspect that most of the difference is the reserved space I mentioned in the first reply (see Nominal's reply as to why the command I posted didn't work, and what to do instead). Some of the difference may also be filesystem metadata.
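
For what it's worth, the gap is visible directly via statvfs(), which is essentially where df gets its numbers: Size is f_blocks, Used is f_blocks - f_bfree, and Avail is f_bavail, so Size - Used - Avail is the root-reserved part (f_bfree - f_bavail). With your figures, 917 - 122 - 749 = 46GiB, which is right around the default 5% reserve on ext4. A small illustrative sketch (not what df itself does internally, but equivalent):

Code: [Select]
/* Illustrative only: where df's Size/Used/Avail come from, and the gap.
 * Pass the mount point as the argument. */
#include <stdint.h>
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char *argv[])
{
    struct statvfs v;
    if (argc < 2 || statvfs(argv[1], &v) != 0) {
        perror("statvfs");
        return 1;
    }

    const double GiB = 1024.0 * 1024.0 * 1024.0;
    uint64_t bs = v.f_frsize;                                 /* block size in bytes */

    uint64_t size  = (uint64_t)v.f_blocks * bs;               /* df "Size"  */
    uint64_t used  = (uint64_t)(v.f_blocks - v.f_bfree) * bs; /* df "Used"  */
    uint64_t avail = (uint64_t)v.f_bavail * bs;               /* df "Avail" */

    printf("size %.1f GiB, used %.1f GiB, avail %.1f GiB, reserved %.1f GiB\n",
           size / GiB, used / GiB, avail / GiB, (size - used - avail) / GiB);
    return 0;
}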
 

