Optimization - Archive space efficiency #238

Open
tasket opened this issue Feb 22, 2025 · 0 comments

tasket commented Feb 22, 2025

Notes and exploration of disk space usage by the Wyng archive format

Possible sub-topics:

  • Impact of dest filesystem metadata (incl. chunk sizes, links and end-tying)
  • Deduplication & Compression effectiveness
  • Pruning parameters and possible tweaks
  • Archive metadata size
  • etc.

Initial observations:

There is a 3-way tradeoff between the impacts of chunk size, compression and deduplication. The Wyng defaults try to strike a balance for typical use cases. For example, a smaller chunk size allows more data to be deduplicated, but it increases dest filesystem (and internal archive) metadata usage; it also makes compression slightly less effective.
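
As a rough illustration of the metadata side of that tradeoff, the sketch below estimates how chunk count (and therefore per-chunk overhead) scales with chunk size for a 10GB volume. The per-chunk byte costs are made-up placeholders, not Wyng's actual metadata sizes; only the scaling relationship is the point.

```python
# Hypothetical back-of-envelope: per-chunk overhead vs. chunk size.
# The per_chunk_* byte costs are placeholders, not Wyng's real figures.
GIB = 1024**3

def chunk_overhead(volume_bytes, chunk_size, per_chunk_meta=64, per_chunk_fs=256):
    """Return (chunk count, est. archive metadata bytes, est. dest-fs metadata bytes)."""
    chunks = -(-volume_bytes // chunk_size)   # ceiling division
    return chunks, chunks * per_chunk_meta, chunks * per_chunk_fs

for size_kb in (64, 128, 256, 512):
    n, meta, fs_meta = chunk_overhead(10 * GIB, size_kb * 1024)
    print(f"{size_kb:>3}KB chunks: {n:>7} chunks, "
          f"~{meta / 2**20:.1f} MiB archive meta, ~{fs_meta / 2**20:.1f} MiB dest-fs meta")
```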

Dedup anecdote: The default 128KB chunk size can yield great dedup results for distantly-related volumes. In a test send performed today, a pair of Qubes template root imgs (one basic Debian img and a fancy, large KDE variant which diverged years ago and has since been upgraded twice) enjoyed a 21% dedup savings when the basic/small img was already in the archive and the large KDE img was then added to it. The raw on-disk usage of these imgs is 5.4GB and 10.8GB, respectively, which means that a very large portion of the small volume was utilized in the Wyng dedup process. This should be representative, and it's worth noting that these two volumes have never been internally defragged or otherwise re-packed or re-organized, so they are about as randomly arrayed as one could expect for an Ext4 root fs.
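
One way to read those numbers, assuming the 21% savings is measured against the newly added 10.8GB KDE img (an interpretation, not something reported by Wyng itself):

```python
# Rough interpretation of the anecdote above. Assumes the 21% savings is
# relative to the newly added 10.8GB KDE img; this is a reading of the
# numbers, not output from Wyng.
small_img, large_img, savings = 5.4, 10.8, 0.21

deduped = large_img * savings      # KDE img data matched against existing chunks
print(f"~{deduped:.1f} GB of the KDE img deduplicated")
print(f"~{deduped / small_img:.0%} of the small img's data matched")
```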

Even without --dedup, the incremental send mode functions like a very simple dedup. Interestingly, this form of dedup incurs zero metadata overhead, both within the archive and on the dest fs.
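
A minimal sketch of that idea: if the snapshot layer reports which byte ranges changed since the last session, only the chunks overlapping those ranges need to be stored, and unchanged chunks are carried forward with no new data or per-chunk metadata. This illustrates the concept only; it is not Wyng's actual delta code.

```python
# Illustration: map changed byte ranges to the chunk indexes that must be
# re-sent; every other chunk is implicitly reused from the prior session.
CHUNK = 128 * 1024   # default 128KB chunk size

def chunks_to_send(changed_ranges, chunk_size=CHUNK):
    """Return sorted chunk indexes overlapping any (start, length) changed range."""
    send = set()
    for start, length in changed_ranges:
        first = start // chunk_size
        last = (start + length - 1) // chunk_size
        send.update(range(first, last + 1))
    return sorted(send)

# e.g. two small edits touching four 128KB chunks out of the whole volume:
print(chunks_to_send([(0, 4096), (1_000_000, 200_000)]))   # -> [0, 7, 8, 9]
```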

Changes in the compressor implementation (such as upgrading the compression library to a newer version) can result in a large reduction in dedup effectiveness, since chunks are compressed before being hashed.
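
A small illustration of why that happens, with zlib standing in for whatever compressor and hash Wyng actually uses: the stored hash fingerprints the compressed bytes, so any change in compressed output breaks the dedup match even though the source data is identical.

```python
# zlib/sha256 stand in here for Wyng's actual compressor and hash.
import hashlib, zlib

chunk = b"identical source data " * 5000

old = zlib.compress(chunk, 6)   # "old" compressor behavior
new = zlib.compress(chunk, 9)   # stand-in for an upgraded/changed implementation

print(hashlib.sha256(old).digest() == hashlib.sha256(new).digest())  # False: hashes differ, no dedup match
print(zlib.decompress(old) == zlib.decompress(new))                  # True: same source data either way
```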
