Create and update duplicate archives #199


Open
Tracked by #197
tasket opened this issue May 22, 2024 · 9 comments
Labels: enhancement (New feature or request)
Milestone: v0.9

Comments


tasket commented May 22, 2024

Add an archive duplication feature to create backups of archives in another location.

Problem

Although commands like rsync may be considered sufficient for making intact duplicates, there are several drawbacks:

  1. Directory scans (and possibly data scans) must be performed repeatedly, which makes the process less efficient than what Wyng itself could achieve.
  2. With rsync, cp, etc., extra care must be taken to avoid incomplete transfers resulting in a corrupt copy.
  3. Traditional tools offer no way to select which volumes or sessions within an archive to duplicate. Users may want to prioritize only certain volumes or sessions for their backup-of-a-backup.

Solution

A Wyng duplication function could make and refresh duplicate archives using the same safety patterns (data first, metadata last) employed when creating original archives. It could rely on its knowledge from archive metadata, avoiding costly dir and data scanning. It would also be possible to add some level of selectivity (per volume, etc.) at some point.

The main task is to open the source and destination (copy) archives in tandem and then (see the sketch after this list):

  1. Find volumes that are not in both archive copies, and sync or delete as needed.
  2. For volumes common to both, find the newest session that is common to both and copy over any newer sessions to the destination.
  3. Sessions that are absent in the source can be pruned from the destination (this might come before step 2).
  4. Sync the 'updated_at' and encryption counter values.
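
A rough sketch of that tandem walk in Python, working on plain dicts of volume -> ordered session lists; the action names here (delete_volume, prune_session, copy_session, sync_metadata) are hypothetical placeholders, not Wyng's real internals:

    # Illustrative only: build a work plan for syncing Dest from Src.
    def plan_duplicate(src_vols, dest_vols):
        """src_vols / dest_vols map volume name -> list of session names,
        each list already in correct (oldest-to-newest) order."""
        plan = []
        for vol in set(src_vols) | set(dest_vols):
            if vol not in src_vols:
                plan.append(("delete_volume", vol))            # step 1
                continue
            src_s, dest_s = src_vols[vol], dest_vols.get(vol, [])
            for s in dest_s:                                   # step 3 first,
                if s not in src_s:                             # as noted above
                    plan.append(("prune_session", vol, s))
            common = [s for s in src_s if s in dest_s]
            start = src_s.index(common[-1]) + 1 if common else 0
            for s in src_s[start:]:                            # step 2,
                plan.append(("copy_session", vol, s))          # data first...
        plan.append(("sync_metadata",))                        # step 4, metadata last
        return plan

    # Example: Dest has a session that was pruned from Src; Src has two newer sessions.
    print(plan_duplicate({"a12345": ["S_20250104", "S_20250123", "S_20250201"]},
                         {"a12345": ["S_20250104", "S_20250112"]}))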

It could further help if the duplicate archive were marked as having a special status, and possibly a different uuid; this would avoid the temptation of absentmindedly backing up to two or more copies while thinking they are the same archive.

External duplication/updater scripts:

A simple rsync-based script can duplicate archives and also update them:

      # Rename the existing copy first so an interrupted transfer can never be
      # mistaken for a complete archive
      mv destpath destpath-incomplete
      rsync -uaHW --delete --no-compress sourcepath/. destpath-incomplete
      # Rename back only after rsync has finished successfully
      mv destpath-incomplete destpath

Also see a working example of a more efficient script that is based on rsync but merges pruned session dirs ahead of time to avoid unnecessary data transfers by rsync:
https://gist.github.com/tasket/08f38279d8702c7defcb62cb4afdae7a

Notes

  • Encryption keys could be the same between two archives. If they are not, then duplication would involve a re-encryption process.
  • The duplication function would have to hold two unique, coexisting instances of the Destination class and probably ArchiveSet as well. Some currently 'independent' helper functions may have to be moved into those classes to raise their effective encapsulation.
  • send_volume() may have to be used to ensure that the archive copy receives a comparable level of deduplication (if dedup is enabled). Alternatively, dedup functions can be abstracted or relocated.

Related

#140
#175
#184

@tasket tasket added the enhancement New feature or request label May 22, 2024
@tasket tasket added this to the v0.9 milestone May 22, 2024

tasket commented Feb 8, 2025

Some observations from another issue:

  • When updating a duplicate archive, rsync doesn't handle pruning changes efficiently
  • It can be helped by performing a raw data-level merge of the pruned session dirs (see concept below)
  • This suggests that efficient updates to duplicate archives can be done without authentication

Concept:

With the following difference between Src archive and Dest...

Src Vol_a12345/             Dest Vol_a12345/
   S_20250104-000001/           S_20250104-000001/
                                S_20250112-000001/
                                S_20250115-000001/
   S_20250123-000001/           S_20250123-000001/

...on Dest do something like:

cd Vol_a12345
# Merge each session dir that was pruned on Src into the nearest surviving
# (older) session on Dest, so the chunk layout matches the source before rsync runs
cp -al S_20250112-000001/* S_20250104-000001
rm -r S_20250112-000001
cp -al S_20250115-000001/* S_20250104-000001
rm -r S_20250115-000001

Follow up with the usual raw sync of the archive dir using rsync or similar. I'm not sure cp -al would be appropriate, but I think you get the idea. Python might be a better way to script it, since you could then use mv/rename commands without creating links.
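
Here is a minimal Python sketch of that idea, using the same example session names as above; merge_into() and the rename-based approach are just one possible way to do the "move into" merge:

    # Sketch only: fold a session dir pruned on Src into the surviving older
    # session on Dest, moving chunk files with renames instead of hardlinks.
    import os, shutil

    def merge_into(pruned_dir, keep_dir):
        """Move every file out of pruned_dir into keep_dir, replacing any
        same-named chunk already there, then remove the emptied dir."""
        for root, _dirs, files in os.walk(pruned_dir):
            target = os.path.join(keep_dir, os.path.relpath(root, pruned_dir))
            os.makedirs(target, exist_ok=True)
            for name in files:
                os.replace(os.path.join(root, name), os.path.join(target, name))
        shutil.rmtree(pruned_dir)

    os.chdir("Vol_a12345")
    merge_into("S_20250112-000001", "S_20250104-000001")   # oldest pruned session first
    merge_into("S_20250115-000001", "S_20250104-000001")   # newer one last, so its chunks win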


tasket commented Feb 26, 2025

I think a Python script using the above idea, showing how to make rsync updates more efficient, would be good to try.



tasket commented Mar 11, 2025

Performing independent Wyng backups to the additional "copy" archive is probably the most efficient general workaround for now. For this to work correctly, the additional archive cannot be a simple clone; the two archives must have different UUIDs so that snapshots are managed correctly. (Otherwise, Wyng will repeatedly discard the local snapshots and send the volumes in slower "full scan" mode.) If you already have a cloned copy and want to make its UUID unique, run wyng arch-check --change-uuid on the copy.

However, if your server and offsite backup both use Btrfs or ZFS, then you could efficiently btrfs-send the archive updates, for example.

Notes on external updater approach:

  • The updater is probably worthwhile in a lot of corner cases because when the oldest sessions (which cover all parts of the volume) are pruned on the source, almost all of the data will appear to rsync to be in the wrong dir. Getting this one thing aligned between src and dest before running rsync (or similar) should have a big impact for anyone who prunes regularly.
  • rsync and most other sync tools may never be able to account for deduplication hardlinks that already exist in the duplicate, at least not very efficiently. A lot depends on whether an updater like rsync will remember the local inode numbers of older files that it has skipped over, and use that info to create new links on the remote (see the sketch after this list).
  • Timezone changes on the client side can produce incorrect ordering of sessions, causing a script (like the above) to move data to the wrong paths. In these specific (rare?) instances, an rsync process will simply have more work to do. (Wyng internally knows the correct order.)
  • rsync compression should probably be turned off.
  • It's unclear to me whether adding --update would help (or whether that is effectively the default).
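
As one illustration of what that could look like (not something rsync offers), here is a rough sketch that re-creates the source's dedup hardlink groups on a locally mounted copy; hardlink_groups() and relink() are my own names, and a real tool would also verify contents and handle remote paths:

    # Illustrative only: group source files by inode, then link the matching
    # destination files together so dedup'd chunks don't occupy extra space.
    import os
    from collections import defaultdict

    def hardlink_groups(top):
        """Map (st_dev, st_ino) -> relative paths of files sharing that inode."""
        groups = defaultdict(list)
        for root, _dirs, files in os.walk(top):
            for name in files:
                path = os.path.join(root, name)
                st = os.lstat(path)
                if st.st_nlink > 1:
                    groups[st.st_dev, st.st_ino].append(os.path.relpath(path, top))
        return groups

    def relink(src_top, dest_top):
        for paths in hardlink_groups(src_top).values():
            # Use a replica that already exists on the destination as the anchor.
            existing = [p for p in paths if os.path.exists(os.path.join(dest_top, p))]
            if not existing:
                continue
            anchor = os.path.join(dest_top, existing[0])
            for p in paths:
                dest = os.path.join(dest_top, p)
                if dest == anchor or (os.path.exists(dest) and
                                      os.path.samestat(os.stat(dest), os.stat(anchor))):
                    continue                        # already the same inode
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                if os.path.exists(dest):
                    os.remove(dest)                 # drop the separate copy
                os.link(anchor, dest)               # re-create the dedup link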


tasket commented Mar 13, 2025

Updated PoC script. It's closer to a usable solution and just needs SSH URL parsing (and to send the commands over ssh in that case). I used cp -l hardlink copying because Unix has no "move into" directory merge that I'm aware of, while cp -l can "copy into" pretty efficiently. Also, dir_scan() needs to be updated to work from remote scans, i.e. via the find command.
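
For the SSH case, the remote scan might look roughly like this (the local dir_scan() is in the gist; this remote variant, its output parsing, and the example host/path are only assumptions):

    # Sketch of a remote dir scan: run find over SSH and return Vol_*/S_*
    # directory paths relative to the archive dir.
    import subprocess

    def dir_scan_remote(host, archive_path, depth=2):
        cmd = ["ssh", host, "find", archive_path,
               "-mindepth", "1", "-maxdepth", str(depth), "-type", "d"]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        prefix = archive_path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in out.splitlines() if p.startswith(prefix))

    # print(dir_scan_remote("backup@example.net", "/mnt/backups/laptop.backup"))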


tasket commented Mar 23, 2025

@t-gh-ctrl I created a gist of an updater script that I tested today:

https://gist.github.com/tasket/08f38279d8702c7defcb62cb4afdae7a

Feel free to suggest changes incl. different rsync options.

@t-gh-ctrl

> I created a gist of an updater script that I tested today:

Awesome! :)

It might take a bit of time until I report back because I won't have access to my remote backup server for the next month, so I'll have to do some tests with a test remote host when time permits.


tasket commented Mar 28, 2025

The script still doesn't account for the case where a timezone change shifts the session name (local time) backward (such as when backups were done just before traveling, then again from a laptop that has just flown west). But I'm adding a small bit of plaintext info to archives that will show the correct session order; the script can be updated to take advantage of that.

@tasket
Copy link
Owner Author

tasket commented Apr 7, 2025

Update: The 08wip branch now saves a simple in-order JSON list of sessions in the archive dir. These "sessions" files can be used by a script to avoid the chunkfile timezone misplacement issue.
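
A script might then take its ordering from that file instead of sorting dir names; the file name, location, and exact layout below are my assumptions based on the description above, so check the 08wip branch for the real format:

    # Sketch: read Wyng's saved session order rather than sorting S_* names,
    # so a timezone shift on the client can't reorder the sessions.
    import json, os

    def session_order(dir_with_sessions_file):
        with open(os.path.join(dir_with_sessions_file, "sessions")) as f:
            return json.load(f)        # assumed: a JSON array, oldest first

    # for s in session_order("/mnt/backups/laptop.backup/Vol_a12345"):
    #     ...sync or merge this session next...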
