Create and update duplicate archives #199


Open
Tracked by #197
tasket opened this issue May 22, 2024 · 9 comments
Labels: enhancement (New feature or request)
Milestone: v0.9

Comments


tasket commented May 22, 2024

Add an archive duplication feature to create backups of archives in another location.

Problem

Although commands like rsync may be considered sufficient for making intact duplicates, there are several drawbacks:

  1. Directory scans (and possibly data scans) must be performed repeatedly, which makes the process less efficient than what Wyng itself could achieve.
  2. With rsync, cp, etc., extra care must be taken to avoid incomplete transfers resulting in a corrupt copy.
  3. Traditional tools offer no way to select which volumes or sessions within an archive to duplicate. Users may want to prioritize only certain volumes or sessions for their backup-of-a-backup.

Solution

A Wyng duplication function could make and refresh duplicate archives using the same safety patterns (data first, metadata last) employed when creating original archives. It could rely on its knowledge from archive metadata, avoiding costly dir and data scanning. It would also be possible to add some level of selectivity (per volume, etc.) at some point.

The main task is to open the source and destination (copy) archives in tandem and then (see the sketch after this list):

  1. Find volumes that are not in both archive copies, and sync or delete as needed.
  2. For volumes common to both, find the newest session that is common to both and copy over any newer sessions to the destination.
  3. Sessions that are absent in the source can be pruned from the destination (this might come before step 2).
  4. Sync the 'updated_at' and encryption counter values.
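
A rough sketch of that tandem walk in Python, working on plain dicts of volume -> ordered session lists; the action names here (delete_volume, prune_session, copy_session, sync_metadata) are hypothetical placeholders, not Wyng's real internals:

    # Illustrative only: build a work plan for syncing Dest from Src.
    def plan_duplicate(src_vols, dest_vols):
        """src_vols / dest_vols map volume name -> list of session names,
        each list already in correct (oldest-to-newest) order."""
        plan = []
        for vol in set(src_vols) | set(dest_vols):
            if vol not in src_vols:
                plan.append(("delete_volume", vol))            # step 1
                continue
            src_s, dest_s = src_vols[vol], dest_vols.get(vol, [])
            for s in dest_s:                                   # step 3 first,
                if s not in src_s:                             # as noted above
                    plan.append(("prune_session", vol, s))
            common = [s for s in src_s if s in dest_s]
            start = src_s.index(common[-1]) + 1 if common else 0
            for s in src_s[start:]:                            # step 2,
                plan.append(("copy_session", vol, s))          # data first...
        plan.append(("sync_metadata",))                        # step 4, metadata last
        return plan

    # Example: Dest has a session that was pruned from Src; Src has two newer sessions.
    print(plan_duplicate({"a12345": ["S_20250104", "S_20250123", "S_20250201"]},
                         {"a12345": ["S_20250104", "S_20250112"]}))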

It could further help if the duplicate archive were marked as having a special status, and possibly a different uuid; this would avoid the temptation of absentmindedly backing up to two or more copies while thinking they are the same archive.

External duplication/updater scripts:

A simple rsync-based script can duplicate archives and also update them:

      # Rename the existing copy first so an interrupted transfer can never be
      # mistaken for a complete archive
      mv destpath destpath-incomplete
      rsync -uaHW --delete --no-compress sourcepath/. destpath-incomplete
      # Rename back only after rsync has finished successfully
      mv destpath-incomplete destpath

Also see a working example of a more efficient script that is based on rsync but merges pruned session dirs ahead of time to avoid unnecessary data transfers by rsync:
https://gist.github.com/tasket/08f38279d8702c7defcb62cb4afdae7a

Notes

  • Encryption keys could be the same between two archives. If they are not, then duplication would involve a re-encryption process.
  • The duplication function would have to hold two unique, coexisting instances of the Destination class and probably ArchiveSet as well. Some currently 'independent' helper functions may have to be moved into those classes to raise their effective encapsulation.
  • send_volume() may have to be used to ensure that the archive copy receives a comparable level of deduplication (if dedup is enabled). Alternatively, dedup functions can be abstracted or relocated.

Related

#140
#175
#184

@tasket tasket added the enhancement New feature or request label May 22, 2024
@tasket tasket added this to the v0.9 milestone May 22, 2024

tasket commented Feb 8, 2025

Some observations from another issue:

  • When updating a duplicate archive, rsync doesn't handle pruning changes efficiently
  • It can be helped by performing a raw data-level merge of the pruned session dirs (see concept below)
  • This suggests that efficient updates to duplicate archives can be done without authentication

Concept:

With the following difference between Src archive and Dest...

Src Vol_a12345/             Dest Vol_a12345/
   S_20250104-000001/           S_20250104-000001/
                                S_20250112-000001/
                                S_20250115-000001/
   S_20250123-000001/           S_20250123-000001/

...on Dest do something like:

cd Vol_a12345
# Merge each session dir that was pruned on Src into the nearest surviving
# (older) session on Dest, so the chunk layout matches the source before rsync runs
cp -al S_20250112-000001/* S_20250104-000001
rm -r S_20250112-000001
cp -al S_20250115-000001/* S_20250104-000001
rm -r S_20250115-000001

Follow up with the usual raw sync of the archive dir using rsync or similar. I'm not sure cp -al would be appropriate, but I think you get the idea. Python might be a better way to script it, since you could then use mv/rename commands without creating links.
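
Here is a minimal Python sketch of that idea, using the same example session names as above; merge_into() and the rename-based approach are just one possible way to do the "move into" merge:

    # Sketch only: fold a session dir pruned on Src into the surviving older
    # session on Dest, moving chunk files with renames instead of hardlinks.
    import os, shutil

    def merge_into(pruned_dir, keep_dir):
        """Move every file out of pruned_dir into keep_dir, replacing any
        same-named chunk already there, then remove the emptied dir."""
        for root, _dirs, files in os.walk(pruned_dir):
            target = os.path.join(keep_dir, os.path.relpath(root, pruned_dir))
            os.makedirs(target, exist_ok=True)
            for name in files:
                os.replace(os.path.join(root, name), os.path.join(target, name))
        shutil.rmtree(pruned_dir)

    os.chdir("Vol_a12345")
    merge_into("S_20250112-000001", "S_20250104-000001")   # oldest pruned session first
    merge_into("S_20250115-000001", "S_20250104-000001")   # newer one last, so its chunks win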


tasket commented Feb 26, 2025

I think a Python script using the above idea, showing how to make rsync updates more efficient, would be good to try.



tasket commented Mar 11, 2025

Performing independent Wyng backups to the additional "copy" archive is probably the most efficient general workaround for now. For this to work correctly, the additional archive cannot be a simple clone; the two archives must have different UUIDs so that snapshots are managed correctly. (Otherwise, Wyng will repeatedly discard the local snapshots and send the volumes in slower "full scan" mode.) If you already have a cloned copy and want to make its UUID unique, run wyng arch-check --change-uuid on the copy.

However, if your server and offsite backup both use Btrfs or ZFS, then you could efficiently btrfs-send the archive updates, for example.

Notes on external updater approach:

  • The updater is probably worthwhile in a lot of corner cases because when the oldest sessions (which cover all parts of the volume) are pruned on the source, almost all of the data will appear to rsync to be in the wrong dir. Getting this one thing aligned between src and dest before running rsync (or similar) should have a big impact for anyone who prunes regularly.
  • rsync and most other sync tools may never be able to account for deduplication hardlinks that already exist in the duplicate, at least not very efficiently. A lot depends on whether an updater like rsync will remember the local inode numbers of older files that it has skipped over, and use that info to create new links on the remote (see the sketch after this list).
  • Timezone changes on the client side can produce incorrect ordering of sessions, causing a script (like the above) to move data to the wrong paths. In these specific (rare?) instances, an rsync process will simply have more work to do. (Wyng internally knows the correct order.)
  • rsync compression should probably be turned off.
  • It's unclear to me whether adding --update would help (or whether that is effectively the default).
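
As one illustration of what that could look like (not something rsync offers), here is a rough sketch that re-creates the source's dedup hardlink groups on a locally mounted copy; hardlink_groups() and relink() are my own names, and a real tool would also verify contents and handle remote paths:

    # Illustrative only: group source files by inode, then link the matching
    # destination files together so dedup'd chunks don't occupy extra space.
    import os
    from collections import defaultdict

    def hardlink_groups(top):
        """Map (st_dev, st_ino) -> relative paths of files sharing that inode."""
        groups = defaultdict(list)
        for root, _dirs, files in os.walk(top):
            for name in files:
                path = os.path.join(root, name)
                st = os.lstat(path)
                if st.st_nlink > 1:
                    groups[st.st_dev, st.st_ino].append(os.path.relpath(path, top))
        return groups

    def relink(src_top, dest_top):
        for paths in hardlink_groups(src_top).values():
            # Use a replica that already exists on the destination as the anchor.
            existing = [p for p in paths if os.path.exists(os.path.join(dest_top, p))]
            if not existing:
                continue
            anchor = os.path.join(dest_top, existing[0])
            for p in paths:
                dest = os.path.join(dest_top, p)
                if dest == anchor or (os.path.exists(dest) and
                                      os.path.samestat(os.stat(dest), os.stat(anchor))):
                    continue                        # already the same inode
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                if os.path.exists(dest):
                    os.remove(dest)                 # drop the separate copy
                os.link(anchor, dest)               # re-create the dedup link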


tasket commented Mar 13, 2025

Updated PoC script. It's closer to a usable solution and just needs SSH URL parsing (and to send the commands over ssh in that case). I used cp -l hardlink copying because Unix has no "move into" directory merge that I'm aware of, while cp -l can "copy into" pretty efficiently. Also, dir_scan() needs to be updated to work from remote scans, i.e. via the find command.
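
For the SSH case, the remote scan might look roughly like this (the local dir_scan() is in the gist; this remote variant, its output parsing, and the example host/path are only assumptions):

    # Sketch of a remote dir scan: run find over SSH and return Vol_*/S_*
    # directory paths relative to the archive dir.
    import subprocess

    def dir_scan_remote(host, archive_path, depth=2):
        cmd = ["ssh", host, "find", archive_path,
               "-mindepth", "1", "-maxdepth", str(depth), "-type", "d"]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        prefix = archive_path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in out.splitlines() if p.startswith(prefix))

    # print(dir_scan_remote("backup@example.net", "/mnt/backups/laptop.backup"))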


tasket commented Mar 23, 2025

@t-gh-ctrl I created a gist of an updater script that I tested today:

https://gist.github.com/tasket/08f38279d8702c7defcb62cb4afdae7a

Feel free to suggest changes incl. different rsync options.

@t-gh-ctrl

> I created a gist of an updater script that I tested today:

Awesome! :)

It might take a bit of time until I report back because I won't have access to my remote backup server for the next month, so I'll have to do some tests with a test remote host when time permits.


tasket commented Mar 28, 2025

The script still doesn't account for the case where a timezone change shifts the session name (local time) backward (such as when backups were done just before traveling, then again from a laptop that has just flown west). But I'm adding a small bit of plaintext info to archives that will show the correct session order; the script can be updated to take advantage of that.

@tasket
Copy link
Owner Author

tasket commented Apr 7, 2025

Update: The 08wip branch now saves a simple in-order JSON list of sessions in the archive dir. These "sessions" files can be used by a script to avoid the chunkfile timezone misplacement issue.
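
A script might then take its ordering from that file instead of sorting dir names; the file name, location, and exact layout below are my assumptions based on the description above, so check the 08wip branch for the real format:

    # Sketch: read Wyng's saved session order rather than sorting S_* names,
    # so a timezone shift on the client can't reorder the sessions.
    import json, os

    def session_order(dir_with_sessions_file):
        with open(os.path.join(dir_with_sessions_file, "sessions")) as f:
            return json.load(f)        # assumed: a JSON array, oldest first

    # for s in session_order("/mnt/backups/laptop.backup/Vol_a12345"):
    #     ...sync or merge this session next...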
