Author Topic: Borg backup - deduplication and cache location  (Read 1734 times)

Offline RoGeorgeTopic starter

  • Super Contributor
  • ***
  • Posts: 6630
  • Country: ro
Borg backup - deduplication and cache location
« on: June 20, 2022, 10:28:23 am »
Last weekend I tried Borg backup on my main desktop (Kubuntu 20.04 LTS).
https://www.borgbackup.org/
https://borgbackup.readthedocs.io/en/stable/
https://github.com/borgbackup

Borg's features seem impressive: data deduplication, compression, encryption, mounting, pruning, and it can back up whole partitions (by piping the STDOUT of dd to borg), etc.  The backups are filesystem agnostic: Borg only deals with files, and implements its own filesystem representation for saved backups (just plain files, no hardlinks or other features that might be filesystem specific).  The drawback is that the backups are not directly usable (like a mirror disk would be); Borg must be installed before the backups can be read.  Still, backups can be moved or copied just like any file.

Deduplication is done before compression and encryption, and a cache of file chunks and their checksum IDs is created locally.

Q1.  Does anybody know why the cache is kept by default in ~/.cache/borg/ and not on the same disk where the source files are?
Q2.  There is an environment variable, BORG_CACHE_DIR, that can be set to change the location of the cache.  I tried to move the cache with it and the setting was ignored.  Is setting it like this enough, or must the variable be set AND exported?
Code: [Select]
BORG_CACHE_DIR=/new/location/for/borg_cache/;  \
borg blablabla_commands_and_parameters

The backups are saved on a slow 10MB/s NAS, and are a few TB in total, so the first backup will be very time consuming.

Any other advice about borgbackup, or info I should know, or how to split/organize the backup repositories?

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6762
  • Country: fi
    • My home page and email address
Re: Borg backup - deduplication and cache location
« Reply #1 on: June 20, 2022, 06:56:34 pm »
Q2.  There is an environment variable, BORG_CACHE_DIR, that can be set to change the location of the cache.  I tried to move the cache with it and the setting was ignored.  Is setting it like this enough, or must the variable be set AND exported?
Code: [Select]
BORG_CACHE_DIR=/new/location/for/borg_cache/;  \
borg blablabla_commands_and_parameters
The syntax is
    BORG_CACHE_DIR=/new/location/for/borg_cache/ borg args...
or
    export BORG_CACHE_DIR=/new/location/for/borg_cache/ ; borg args...

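The mixed version you used (assignment, then ';', then the command) sets the variable only in the shell itself; because it is not exported, it is not passed to the command.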

To see this for yourself, try:
    VAL=x ; sh -c 'echo $VAL'
    (outputs an empty line)

    VAL=x sh -c 'echo $VAL'
    (outputs x)

    export VAL=x ; sh -c 'echo $VAL'
    (outputs x)
 
The following users thanked this post: RoGeorge

Offline RoGeorgeTopic starter

  • Super Contributor
  • ***
  • Posts: 6630
  • Country: ro
Re: Borg backup - deduplication and cache location
« Reply #2 on: June 20, 2022, 08:02:54 pm »
I've used export, and it works as expected, thank you.

Bash has always surprised me in how it processes the command line; I was never able to correctly predict how it works, and never took the time to read the docs.

For example
Code: [Select]
aaa@zub:~$ X1=present echo $X1

aaa@zub:~$ echo $X1

aaa@zub:~$ X1=present; echo $X1
present
aaa@zub:~$ echo $X1
present
aaa@zub:~$ unset X1
aaa@zub:~$ echo $X1

aaa@zub:~$ X1=present; \
> echo $X1
present
aaa@zub:~$

The last format is what I've used with borg, but it didn't work, maybe borg spawns some other hidden instance of bash, and that's why the variable is not seen, IDK.  With export it works as expected.

For now I think I'll stay with Borg for backup.  It's a pity to use ZFS on the desktop and yet another tool for backup, but the NAS I have is a dedicated RAID 5 box with an ARM CPU and some frozen-in-time Linux.  The NAS knows Samba 1.0 and NFS, and it is rather slow, only 10MB/s max, though the LAN link is 1Gbps.  I start it only rarely, for manual backups a few times a year.  So far it has been very reliable (for ~10 years), and I'm afraid to mess with it by trying to upgrade it or reformat it as ZFS.

Since we are on the subject of Borg backup: there is a GUI for it, called Vorta, but I've decided to use the command line and craft my own scripts for manual backup.  Regarding Vorta, I noticed that if the password manager is disabled (so no keyring), then Vorta saves the password in clear text in a table in its own database, and the table remains as a leftover even after uninstalling Vorta.  So I filed a security bug for Vorta and uninstalled it.

Borg is quite fast at incremental backups.  For example, a few hundred GB that took many hours for the first backup finish in less than 15 minutes when doing incremental backups.  Before borg I was doing a manual copy/paste and then deleting the old version, which usually took a whole day, if not a whole weekend, to complete!  ;D

Offline ve7xen

  • Super Contributor
  • ***
  • Posts: 1194
  • Country: ca
    • VE7XEN Blog
Re: Borg backup - deduplication and cache location
« Reply #3 on: June 20, 2022, 08:20:29 pm »
The last format is what I've used with borg, but it didn't work, maybe borg spawns some other hidden instance of bash, and that's why the variable is not seen, IDK.  With export it works as expected.

The issue with your examples that might be leading to confusion is that bash's internal state is used to expand the variables *before* the command is executed. So 'echo' is not getting passed the variable name, but the expanded text, and if the variable doesn't exist in the shell, the empty string is substituted, even if the variable gets passed in the command's environment. 'echo' doesn't even do its own variable expansion; if '$X1' were passed to it, it would echo '$X1' verbatim.
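
The expansion timing described here can be seen in a short sketch. It assumes a POSIX shell; the labels `outer` and `inner` are only for illustration. Double quotes let the current shell expand `$X1` before the command starts, while single quotes hand the literal text to a child `sh`, which expands it from its own environment.

```shell
unset X1

# "$X1" is expanded by the current shell, where X1 is not set,
# so echo only ever receives the text "outer: []".
outer=$(X1=present echo "outer: [$X1]")

# Single quotes defer expansion to the child sh, which does have
# X1 in its environment thanks to the prefix assignment.
inner=$(X1=present sh -c 'echo "inner: [$X1]"')

echo "$outer"   # prints: outer: []
echo "$inner"   # prints: inner: [present]
```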

So in the end there are two behaviours here. In the form:

Code: [Select]
var=value command
The environment variable 'var' is set in the *command's* environment, not in the shell itself, so it disappears once the command is complete. The command obviously needs to know what to do with that environment variable.

Code: [Select]
var=value ; command
This sets the variable 'var' in the *shell* itself, since it's not associated with a command (the command is separate due to the ';'). Since it hasn't been exported, it (importantly, here) *doesn't* get added to the command's environment.

When you export the variable, it will automatically be added to any command/subprocess' environment until you unset it or start a new shell instance (that isn't a child of one where it's been exported, anyway).

If you want to experiment, the 'env' command is probably a better way to help understand what's being passed, because of the variable expansion issue described above. For example:
Code: [Select]
$ X1=present env | grep X1
X1=present
$ X1=present ; env | grep X1
$ export X1=present
$ env | grep X1
X1=present
« Last Edit: June 20, 2022, 08:26:32 pm by ve7xen »
73 de VE7XEN
He/Him
 
The following users thanked this post: RoGeorge, Nominal Animal

Offline Foxxz

  • Regular Contributor
  • *
  • Posts: 124
  • Country: us
Re: Borg backup - deduplication and cache location
« Reply #4 on: June 21, 2022, 01:25:14 am »
Q1.  Does anybody know why the cache is kept by default in ~/.cache/borg/ and not on the same disk where the source files are?

Any other advice about borgbackup, or info I should know, or how to split/organize the backup repositories?

The user cache directory is the best default place to store the file checksums. There's no guarantee that just because the user can read files at some particular location, that location can also be written to. Plus you'd have cache files all over the place.

Other advice:
You can have multiple backups and multiple machines back up to the same borg repo. The de-duplication is repo-wide and can save a lot of disk space if the same files appear on different machines sending data to the same repo. The downside is the repo can only be in use/locked by a single running instance of borg, which can be a pain. The cache for the repo will be copied to any machine/user accessing the repo.

You can mount borg backups as a fuse filesystem which is pretty awesome but slow. Unmount with "fusermount -u <mountpoint>"
 
The following users thanked this post: RoGeorge

Offline RoGeorgeTopic starter

  • Super Contributor
  • ***
  • Posts: 6630
  • Country: ro
Re: Borg backup - deduplication and cache location
« Reply #5 on: June 21, 2022, 07:25:10 am »
You can have multiple backups and multiple machines back up to the same borg repo. The de-duplication is repo-wide and can save a lot of disk space if the same files appear on different machines sending data to the same repo. The downside is the repo can only be in use/locked by a single running instance of borg, which can be a pain. The cache for the repo will be copied to any machine/user accessing the repo.

The number of borg instances doesn't bother me, since in the current setup the NAS (and borg repos) are for single user, single computer making backups at a given time.  It's a setup to backup the home desktop, personal photos, hobby projects, etc.

So far I've mounted and unmounted from borg itself, without using fusermount, like this:
Code: [Select]
aaa@zub:~$ borg mount ~/hdd/borg/w_relocated_cache/ ~/smb4k/
aaa@zub:~$ ls ~/smb4k/
_Linux            _Linux_2022-06-20_15-43  _Linux_2022-06-20_16:43
_Linux2022-06-20  _Linux_2022-06-20_15:44
aaa@zub:~$ borg umount ~/smb4k/
aaa@zub:~$ ls ~/smb4k/
aaa@zub:~$

The wish to have the cache with the indexed files on the source disk came from the fact that sometimes I mount the same disks under different OSs.  For example:
- mount an external disk wd8TB on the main desktop, save some files to wd8TB/important/ and do a borg backup of wd8TB/important/.  The backup will be saved on a remote and slow NAS.
- then shut everything down, start a laptop with another OS, move the external wd8TB to the laptop, and save some more files from the laptop into wd8TB/important/
- now, backing up the wd8TB/important/ files from the laptop's borg to the NAS repo would mean rebuilding the index cache, wouldn't it?

My hope was that the laptop would reuse (and update) the same cache that was built on the desktop, a cache that I would store in wd8TB/borg_cache/ by setting BORG_CACHE_DIR to point to wd8TB/borg_cache/ on both the laptop and the desktop.

Not sure if this will work, or if borg _must_ have a distinct cache for each machine that will mount the external drive wd8TB.
« Last Edit: June 21, 2022, 07:30:44 am by RoGeorge »
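
A minimal sketch of how a backup script on either machine might select the shared cache, assuming a POSIX shell. The helper name `use_shared_cache` is hypothetical, and a temporary directory stands in for wd8TB/borg_cache so the sketch runs anywhere. It only exports BORG_CACHE_DIR when the directory exists and is writable (for example, when the external disk is actually mounted); otherwise it unsets the variable so borg uses its default location.

```shell
# Hypothetical helper: export BORG_CACHE_DIR only if the directory is usable.
use_shared_cache() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        export BORG_CACHE_DIR="$1"
    else
        echo "cache dir $1 not usable; keeping borg's default" >&2
        unset BORG_CACHE_DIR
        return 1
    fi
}

# Demo: a temporary directory stands in for wd8TB/borg_cache.
demo=$(mktemp -d)

use_shared_cache "$demo" && first=$BORG_CACHE_DIR
echo "usable dir:  BORG_CACHE_DIR=$first"

# A missing directory (e.g. disk not mounted) leaves the variable unset.
use_shared_cache "$demo/missing" 2>/dev/null || second=${BORG_CACHE_DIR-unset}
echo "missing dir: BORG_CACHE_DIR is $second"

# The actual borg command would follow here; it is omitted because the
# repository location and source paths are site-specific.
```

Because the variable is exported inside the function, any borg command started later from the same script inherits it.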
 

Offline Foxxz

  • Regular Contributor
  • *
  • Posts: 124
  • Country: us
Re: Borg backup - deduplication and cache location
« Reply #6 on: June 22, 2022, 01:47:00 am »
I see what you are going for now. You should be able to have the cache directory on a remote filesystem. Instead of the environment variable consider just making ~/.cache/borg a symlink to the cache directory on your mountpoint.
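
A minimal sketch of the symlink approach, assuming a POSIX shell. Temporary directories stand in for the home directory and for the mount point of the external disk (both are placeholders, not real paths from this thread), so the commands can be tried without touching an existing cache.

```shell
# Placeholders: on a real system these would be $HOME and the mount point
# of the external disk.
demo_home=$(mktemp -d)
demo_disk=$(mktemp -d)

mkdir -p "$demo_disk/borg_cache" "$demo_home/.cache"

# If a cache directory already exists, move it aside first; otherwise
# ln -s would create the link inside the existing directory.
if [ -e "$demo_home/.cache/borg" ]; then
    mv "$demo_home/.cache/borg" "$demo_home/.cache/borg.old"
fi

# Make the default cache location point at the directory on the disk.
ln -s "$demo_disk/borg_cache" "$demo_home/.cache/borg"
ls -ld "$demo_home/.cache/borg"
```

One thing to keep in mind with this design: the link dangles whenever the disk is not mounted, so a script relying on it should check for that case before starting a backup.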
 
The following users thanked this post: RoGeorge

Offline RoGeorgeTopic starter

  • Super Contributor
  • ***
  • Posts: 6630
  • Country: ro
Re: Borg backup - deduplication and cache location
« Reply #7 on: June 22, 2022, 03:51:44 pm »
Made a few more tests:
- booted into Kubuntu and backed up 300k files totaling 590GB from wd8TB to the NAS, with the cache sitting on wd8TB; it took 5 hours and 40 minutes
- booted into FreeBSD and backed up the same (unchanged) 300k files totaling 590GB from wd8TB to the NAS, using the same cache sitting on wd8TB; it took 1 hour and 30 minutes, and the cache size grew from 60 to 90MB, so it clearly rebuilt the cache (there were no new files to back up)
- once that was done, no matter which OS the backup is started from, Kubuntu or FreeBSD, while pointing to the same cache on wd8TB, it takes less than 3 minutes to parse the cache and finish the backup (when nothing changed).

The Borg on Kubuntu was v1.15, while the Borg on FreeBSD was v1.17.  I don't know why the index cache was rebuilt when switching the OS; it might have been because of the higher version number, or because of the different OS/machine.

If it upgraded the cache because Borg v1.17 uses a different format, then it would be risky to use the same cache with a lower version of Borg.  Reusing the same cache index across different machines might be risky for data integrity, especially when the cache is reused by different versions of Borg.
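
One way to sidestep the version concern would be to give each Borg version its own cache directory on the shared disk. This is only a sketch under assumptions: the helper name `cache_dir_for` is hypothetical, the base path is a placeholder, and it assumes `borg --version` prints a line of the form `borg X.Y.Z`. Whether the cache format really differs between the two versions is not confirmed in this thread, and the price is one cache rebuild per version.

```shell
# Hypothetical helper: derive a per-version cache directory name.
#   $1 = base directory for caches
#   $2 = version line, assumed to look like "borg X.Y.Z"
cache_dir_for() {
    version=${2#borg }                 # strip the leading "borg "
    printf '%s/borg-%s\n' "$1" "$version"
}

# Example with an arbitrary literal version string instead of calling borg:
example=$(cache_dir_for /mnt/wd8TB/borg_cache "borg 1.2.0")
echo "$example"   # prints: /mnt/wd8TB/borg_cache/borg-1.2.0

# Real use (assumes borg is installed and the base path exists):
#   export BORG_CACHE_DIR=$(cache_dir_for /mnt/wd8TB/borg_cache "$(borg --version)")
```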


This was for the first backup:
Code: [Select]
        Duration: 5 hours 41 minutes 33.67 seconds
        Number of files: 351433
        Utilization of max. archive size: 0%
        ------------------------------------------------------------------------------
                            Original size      Compressed size    Deduplicated size
        This archive:              590.64 GB            289.55 GB            213.55 GB
        All archives:              870.12 GB            332.20 GB            213.55 GB

                            Unique chunks         Total chunks
        Chunk index:                  327021               559884
        ------------------------------------------------------------------------------


This is for a no-changes backup:
Code: [Select]
        Duration: 2 minutes 41.56 seconds
        Number of files: 351434
        Utilization of max. archive size: 0%
        ------------------------------------------------------------------------------
                            Original size      Compressed size    Deduplicated size
        This archive:              590.90 GB            289.57 GB             33.88 MB
        All archives:                2.89 TB              1.23 TB            213.66 GB

                            Unique chunks         Total chunks
        Chunk index:                  330365              2130538
        ------------------------------------------------------------------------------
« Last Edit: June 22, 2022, 03:58:57 pm by RoGeorge »
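
As a rough cross-check of the first-backup figures, the arithmetic can be done with awk. The sketch assumes that the deduplicated size approximates the data actually written to the NAS (repository overhead is ignored) and that GB and MB are decimal units.

```shell
# Figures copied from the first-backup statistics:
#   duration 5 h 41 min 33.67 s, original 590.64 GB, deduplicated 213.55 GB.
# LC_ALL=C keeps awk's decimal separator a '.' regardless of locale.

# Average write rate to the NAS in MB/s (213.55 GB = 213550 MB).
rate=$(LC_ALL=C awk 'BEGIN { printf "%.2f", 213550 / (5*3600 + 41*60 + 33.67) }')

# Deduplicated size as a percentage of the original size.
share=$(LC_ALL=C awk 'BEGIN { printf "%.1f", 100 * 213.55 / 590.64 }')

echo "written to NAS: $rate MB/s"       # prints: written to NAS: 10.42 MB/s
echo "stored fraction: $share percent"  # prints: stored fraction: 36.2 percent
```

Under those assumptions the first run averaged about 10 MB/s, which matches the NAS's stated maximum, so the initial backup appears to have been limited by the NAS rather than by Borg.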
 

Offline Foxxz

  • Regular Contributor
  • *
  • Posts: 124
  • Country: us
Re: Borg backup - deduplication and cache location
« Reply #8 on: June 22, 2022, 06:03:32 pm »
As I was reading your post I thought "I wonder what borg version each of the OSes is using".
I don't think the cache and backup formats have solidified yet; they have been evolving between releases. The changelogs I've read in the past seem to indicate that, while there is some compatibility, certain options and formats are not backwards compatible.

I've run into this myself. It usually comes into play when using packaged versions of borg on different distributions.
 

