
Difference between COW filesystems and non-COW when using file based dedupe #310


Open · KaibutsuX opened this issue Apr 18, 2025 · 12 comments

@KaibutsuX

Can you offer a short summary of the implementation differences when using a file-based dedupe tool like jdupes/duperemove on a COW filesystem vs something like ext4?

When I first came to btrfs I was expecting in-band dedupe. So when I realized it was out of band, I couldn't understand what benefits a COW fs offered if you still have to use a userspace utility to link/delete dupes after the fact. After all, I was using fdupes on standard ext2/3/4 filesystems with essentially the same results.

So what advantages (at least specifically for dedupe/COW) does btrfs provide over legacy filesystems?

@kakra
Contributor

kakra commented Apr 18, 2025

CoW is not for dedupe. Rather, it allows for snapshots and safe file updates: data is never written in place, so if a file is updated and the system crashes, you get either all of the new data or none of it; the old version of the file is kept intact until all new data has been successfully written. This is different from ext4, which writes data in place (it has a data-journal mode, but that cuts write speeds in half).

Having something like dedupe is a side effect of CoW: blocks of different files can become shared in the same physical on-disk block. If one of the files is updated, a copy of that block is made (copy on write), keeping the other file's data intact.

ext2/3/4 dedupe works differently: it creates hardlinks of whole files. Firstly, the granularity is coarser: you either dedupe the whole file or nothing. Secondly, both file names of a hardlink reference the same data on disk: modify one file's content and the other file will change, too. This can come as a surprise: hardlinks are not dedupes.

CoW file systems just work for dedupe without such surprises or limitations.
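
A minimal way to see the difference in practice, assuming a btrfs (or other reflink-capable) mount and GNU coreutils:

# hardlink: two names for the same file object; writing through either
# name changes what both names show
echo original > a.txt
ln a.txt a-link.txt
echo changed > a.txt
cat a-link.txt            # prints "changed"

# reflink: two independent files that share on-disk blocks until modified
echo original > b.txt
cp --reflink=always b.txt b-copy.txt
echo changed > b.txt
cat b-copy.txt            # still prints "original"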

xfs supports something similar, but it isn't really CoW: it can share file blocks for dedupe just like btrfs, and it will unshare these blocks if modified, just like CoW. But it is still a filesystem which changes file blocks in place like ext4.

zfs is also a CoW filesystem which supports snapshots, redundancy, self-healing, pooling and virtual block devices.

ReFS in Windows is probably a lot more like zfs or btrfs: it supports CoW, snapshots, redundancy, self-healing, pooling and tiering.

NTFS in Windows supports snapshots and probably some sort of tiering. But it cannot dedupe, even though it obviously uses some simple CoW features for snapshots. Thus, it's not a CoW filesystem despite supporting snapshots.

btrfs covers much of that area, too: snapshots, simple self-healing, redundancy (currently just mirroring as the stable option) and pooling.

ext4 offers none of these options. So btrfs is a lot more than just a single-device filesystem.

There are probably more filesystems that have one or another of such features:

  • pooling: combine multiple devices/disks into a single filesystem
  • tiering: support migration of data between slow/cold and fast/hot storage members
  • self-healing: support automatic replacement of broken data blocks with good copies using redundancy and checksums
  • redundancy: support storing multiple copies of the same data or parity/error recovery data
  • snapshots: support fast snapshots of files/directories/subvolumes without creating a full copy, meaning it will copy blocks if written to (copy on write)

In theory, you can pool any filesystem in Linux through lvm/md (and RAID, too), but this is implemented at a separate layer. zfs and btrfs support native pooling/redundancy without an intermediate layer.
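
For illustration, native pooling and mirroring in btrfs looks roughly like this (device names are placeholders; this is a sketch, not a full setup guide):

# create one filesystem spanning two devices, with mirrored data and metadata
mkfs.btrfs -d raid1 -m raid1 /dev/sdX /dev/sdY
mount /dev/sdX /mnt/pool

# grow the pool later without lvm/md: add a device and rebalance
btrfs device add /dev/sdZ /mnt/pool
btrfs balance start /mnt/pool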

@KaibutsuX
Author

This can come as a surprise: hardlinks are not dedupes.

Maybe I'm thinking of dedupes in a different way then? I understand the file/block level granularity, but if a hard link is not de-duping an identical file, then what is it?

If the implementation of de-duping at the file-level in a btrfs system using something like duperemove or fdupes uses something available only to a btrfs filesystem, then if I were to rsync an entire btrfs filesystem (with maximum deduped file data) to an ext4 filesystem, I would expect the resulting ext4 system to be larger since it doesn't support the dedupe features that btrfs does.

However, if I had a btrfs filesystem and just ran a de-duping userspace tool that de-duped by creating hard links, I could rsync that btrfs filesystem to an ext4 filesystem and the resulting size would be exactly the same, no?

@kakra
Contributor

kakra commented Apr 18, 2025

but if a hard link is not de-duping an identical file, then what is it?

It creates a second file name for the same file object.

then if I were to rsync an entire btrfs filesystem (with maximum deduped file data) to an ext4 filesystem, I would expect the resulting ext4 system to be larger since it doesn't support the dedupe features that btrfs does.

In your thinking, yes. But rsync is not aware of shared file blocks; it only knows hardlinks. So it would unshare all the files anyway, even if the target is a btrfs.

You need to think of "files" as two distinct things: a file consists of its file name and its file contents. The name is only a pointer. Symlinks and hardlinks can point to the same file data.

Btrfs goes a bit further: it supports symlinks and hardlinks the same way as other filesystems. But - as with other filesystems - the file data is a list of extents. In reality, a file name points to this list of extents. Btrfs can reference such an extent from multiple files. But since it never overwrites an extent, any write creates a new extent and swaps the file's pointer over to it. So even if two files shared an extent, you end up with two extents: one in the original file, and one in the modified file.
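
One way to watch this from userspace, assuming a btrfs mount and the filefrag tool from e2fsprogs (exact output varies):

cp --reflink=always big.file big.copy
sync
filefrag -v big.file big.copy    # both files list the same physical extents,
                                 # flagged "shared"

# overwrite one block of the copy; only that range gets a new extent
dd if=/dev/urandom of=big.copy bs=4096 count=1 conv=notrunc
sync
filefrag -v big.copy             # the modified range now points elsewhere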

I could rsync that btrfs filesystem to an ext4 filesystem and the resulting size would be exactly the same, no?

Yes, if you tell rsync to detect and recreate hardlinks. However, "deduping" via hardlink is usually not what you want: If you modify such a file, it modifies the other hardlinks, too.

Think of it like this:

Imagine you have a complex project and want to make complicated or experimental modifications. In case something breaks, you create a backup copy first.

  1. If you do that with hardlinks, any file modification in the project would affect its copy.
  2. So you probably rather create a full copy, which takes space.
  3. In btrfs you can create a snapshot. That takes one second. And you can modify any file without affecting the backup copy.

With both filesystems, you could create a full backup copy of the project. But if you want to save space later:

  1. A hardlink "dedupe" will bring you back to situation 1 and make your copy unusable.
  2. A reflink dedupe (as supported by btrfs) will bring you to situation 3, where all files stay individual but share data blocks (a quick sketch follows below).
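
As a sketch of that second point, a full copy made earlier can be re-shared block by block afterwards, e.g. with duperemove (assuming it is installed; the flags shown are the common ones):

cp -a project project.bak             # full, independent copy (uses space)
duperemove -rdh project project.bak   # find identical blocks and re-share them
                                      # via the kernel dedupe ioctl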

@Zygo
Owner

Zygo commented Apr 19, 2025

if I were to rsync an entire btrfs filesystem (with maximum deduped file data) to an ext4 filesystem, I would expect the resulting ext4 system to be larger since it doesn't support the dedupe features that btrfs does.

Yes, that’s exactly what happens. CoW filesystems like Btrfs (and to some extent XFS) can share physical blocks across different files even when their logical block structures are distinct. Tools like rsync operate purely at the file level and don't understand the distinction between logical and physical blocks—so when you copy a Btrfs filesystem with deduplicated blocks to ext4, rsync will write each shared block separately. The result is a larger destination filesystem, because ext4 lacks the underlying block-sharing mechanism.

The key difference is in how deduplication is supported. ext4 only supports file-level deduplication via hardlinks—if two files are identical, they can be hardlinked, and that’s it. There's no concept of sharing data at a finer granularity. Filesystems like XFS (with reflink) and Btrfs allow block-level deduplication, where logically distinct files or blocks can reference the same underlying data without needing to be identical across the entire file.

This also affects tool behavior. Tools like fdupes or fclones work on any filesystem that supports hardlinks because they operate at the file level—they only look for whole-file duplicates. Their behavior is mostly independent of the filesystem in use.

In contrast, duperemove and bees operate at the block level. They can find and deduplicate partial matches between files—something that’s only meaningful on filesystems that support block-level sharing. fclones in clone mode can replace an entire duplicate file with reflinks to another, but it still works at the file granularity.

What bees brings to the table is continuous, incremental, block-level deduplication at the lowest layers of the filesystem. It’s not the same as running duperemove in a cron job—it’s closer in spirit to ZFS’s approach: you write your data and forget about it, and deduplication happens automatically in the background. The implementation is completely different, but the user experience of “just write the data and let the system optimize it later” is quite similar.
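
As a rough operational sketch of that difference (the beesd unit name and paths are assumptions based on how bees is commonly packaged; adjust for your distribution):

# batch-style dedupe: scan a tree on a schedule (e.g. from cron)
duperemove -rdh /srv/backups

# continuous dedupe: point bees at the whole filesystem once, then let it
# dedupe new data in the background as it is written
UUID=$(blkid -s UUID -o value /dev/sdX1)
systemctl enable --now beesd@"$UUID".service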

@KaibutsuX
Author

Thanks for the explanations, both of you.

So for a casual filesystem user like me who doesn't care about the more "enterprise" features like snapshotting, and who was only interested in the dedupe features with an expectation of storage savings for things like rsync'ed backups, it seems like the way I was expecting it to work is definitely not going to give me the space savings I was hoping for.

Not to diminish the features of btrfs or say that it's overblown, but it does seem like some of that might be more marketing speak with respect to dedupe support. I get the in-band/out-of-band dedupe distinction, but to claim that btrfs "supports" deduping with external tools like bees is kind of a given. Ext4 also supports deduping with the help of external tools like ln. While the dedupe support in btrfs does allow much greater granularity at the block level, which could create more space savings than an ext system which only supports it at the file level, it seems that any filesystem which allows you to create links technically "supports" deduping. Would that be an accurate statement?

With respect to how btrfs implements the block-level dedupe, I assume that the filesystem simply provides more granular metadata that userspace utils can then utilize. Would that mean, then, that at some point in the future tools like rsync could implement features to support an approximation of "in-band" dedupe? Something like rsync src/ dest && rsync src/ dest2 where rsync could talk to btrfs and detect that every block in dest2 has already been written, or would something like that still require bees because the dedupe must be run after the blocks are already written?

@Zygo
Owner

Zygo commented Apr 19, 2025

it seems that any filesystem which allows you to create links technically "supports" deduping. Would that be an accurate statement?

That would be a misleading statement. "Deduplication" as a term of art generally refers to a filesystem feature that separates logical and physical storage layers—so that modifying one logical copy of shared data doesn't affect the others.

Hardlinks, by contrast, are a legacy of the original Unix design (dating back over 50 years), where a file could have multiple names but only one physical instance. Writing to any hardlink modifies the shared file. Hardlink-based deduplication tools have existed since the late 1980s, but they typically rely on users keeping files read-only afterward to avoid unintended data corruption. There's no built-in mechanism to preserve the independence of logically identical files.

By contrast, Btrfs supports true deduplication through three core mechanisms:

  1. Immutable shared extents at the filesystem level, with automatic copy-on-write (CoW) when a modification is attempted. (Copy-on-write is a misleading name--new writes are redirected to new locations, old data is never copied).
  2. An ioctl for cloning file data: it creates a logical copy that shares physical storage without duplicating blocks.
  3. An ioctl for comparing and deduplicating data ranges after the fact: if two regions are identical, one is replaced by a reference to the other.

Component (1) is what Btrfs, XFS (with reflink), ZFS, and bcachefs have. ext4 does not.
Component (2) is used by tools like cp, mv, and cat (on reflink-aware filesystems) to avoid redundant writes.
Component (3) is leveraged by deduplication tools to retroactively consolidate identical data blocks.
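
For illustration, components (2) and (3) can be exercised from the shell without writing C, e.g. via coreutils and xfs_io (assuming xfsprogs is installed; the file names, offsets and length below are made up):

# component (2): clone ioctl -- a logical copy that shares physical extents
cp --reflink=always data.bin data.clone

# component (3): dedupe ioctl -- compare two ranges and share them only if
# they are byte-for-byte identical (xfs_io's "dedupe" command wraps FIDEDUPERANGE)
xfs_io -c "dedupe data.bin 0 0 1048576" data.copy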

Something like rsync src/ dest && rsync src/ dest2 where rsync could talk to btrfs and detect that every block in dest2 has already been written...

That’s certainly possible, and there have been efforts to add support for copy_file_range() or reflink into rsync. But none have made it into the mainline, and the existing patches are poorly maintained.

It’s also worth distinguishing rsync-then-bees from rsync-with-reflink:

  • rsync-with-reflink would avoid writing duplicate data to disk in the first place, whereas rsync-then-bees deduplicates after all writes are complete.
  • However, rsync operates on one file at a time, so rsync-with-reflink would only create clones within a single file, not across different files.

By comparison, rsync-then-bees can deduplicate across files, detect renamed or copied content on the sending side, and dedupe partial overlaps. bees doesn't guarantee full deduplication—its sliding hash table window means some potential matches may be missed as older hashes are evicted—but in practice, the losses from that are usually far outweighed by the gains from being able to use any data in the filesystem as a reflink source.

@KaibutsuX
Author

Thank you so much for the in-depth explanation, I appreciate it.

I have one more question/thought experiment related to another aspect of deduping.

Let's say I have 1TB of data that I want to back up nightly to a 3TB btrfs partition, into a directory named from the current timestamp (so essentially every night I would rsync src "$(date)"/).

In the current state of btrfs/bees, I would literally be transferring 1TB of data every single night, and then bees would consolidate those duplicated blocks to conserve storage space; but even so, my nightly transfer and disk write activity is going to be the full 1TB. So essentially, my destination btrfs partition would always need to be as large as the full, un-deduped data set, because I have to write all of the data first before it can be deduped?

The alternative is using rsync-with-reflink or some other yet-to-be-created userspace tool which could essentially dedupe at write time (but still only at the file, not block, level), which would mean I could theoretically back up 5 nights (5TB) of data to a 3TB drive (assuming the data doesn't change enough to actually need 5 full TB to represent the single 1TB data set).

I also assume that this is essentially what zfs provides through in-band dedupe (whose system requirements are beyond reasonable for my simple scenarios).

@Zygo
Owner

Zygo commented Apr 19, 2025

I handle this by using btrfs snapshots to avoid extra copies:

# one-time setup:
btrfs sub create staging

# repeatable backup step:
rsync -aHS --del src staging && btrfs sub snap -r staging $(date)

btrfs sub snap creates a snapshot, which is a lazy reflink of the entire subvolume. Each time the rsync command runs, it updates staging incrementally, and then creates a full logical snapshot to a date-stamped directory. This avoids multiple full copies while ensuring each snapshot reflects a complete and consistent backup.

For added safety during unreliable transfers:

if rsync -aHS --del src staging; then
  btrfs sub snap -r staging $(date)-good
else
  btrfs sub snap -r staging $(date)-bad
fi

This way, you retain both successful and failed transfers, clearly labeled—useful if something goes wrong and you need to inspect partial results.

If your source is ext4, xfs, or any filesystem without data checksums, you can add -c to the rsync command line to force block-by-block verification instead of relying on timestamps. And if you're extra cautious, you can periodically delete staging (e.g. monthly) and let rsync rebuild it from scratch, to catch any long-term bitrot or drift.
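
Sketched out (same caveats as the commands above; the schedule and flags are placeholders):

# checksum-based transfer: slower, but catches changes that timestamp/size
# comparison would miss on a source without data checksums
rsync -aHSc --del src staging

# occasional full rebuild of staging to catch long-term drift or bitrot
btrfs sub delete staging
btrfs sub create staging
rsync -aHSc --del src staging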

One nice property here is that bees can run concurrently with rsync. So even while data is being written, bees can start deduping it. In the worst case (i.e. without using any snapshots), the second rsync might appear to write another 1 TiB, but deduplication will often reclaim space before long—you just need a bit of extra space while bees catches up.

@kakra
Contributor

kakra commented Apr 19, 2025

This avoids multiple full copies while ensuring each snapshot reflects a complete and consistent backup.

I did something similar but my staging area was called scratch... ;-)

About using rsync together with bees: I can confirm that bees is extremely fast in keeping pace with rsync.

If using a scratch area with rsync, it may actually be better to use --no-whole-file --inplace on btrfs, because we have previous snapshots of the scratch area - the partial file updates which rsync normally tries to avoid aren't a problem here. This way, rsync will not recreate the complete file, which should be more efficient for CoW and when used with bees.
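
For example, building on the staging commands above (a sketch; whether this helps depends on how much of each file actually changes):

# delta-transfer into the existing staging copy: only changed blocks are
# rewritten, so unchanged blocks stay shared with earlier snapshots
rsync -aHS --del --no-whole-file --inplace src staging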

@Zygo
Owner

Zygo commented Apr 20, 2025

it may actually be better to use --no-whole-file --inplace on btrfs

This reminds me to add a disclaimer, especially for anybody who lands here from a search:

Please don't cut+paste the commands as I have written them--$(date) is a terrible filename, especially without surrounding quotes, and real backup use cases will require many more rsync flags (--numeric-ids, -aHSXX, timeouts if you're doing remote transfers, --inplace if you don't have sparse files, etc) and a snapshot on the sending side as well. The shell fragments are meant to highlight the key steps in a complete backup process.
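
For example, one hedged variant with a quoted, sortable timestamp and a few of the extra flags mentioned above (still only a sketch of the key steps, not a complete backup script):

snap="backup-$(date +%Y-%m-%dT%H%M%S)"    # e.g. backup-2025-04-20T031500
if rsync -aHSXX --numeric-ids --del src staging; then
  btrfs sub snap -r staging "$snap-good"
else
  btrfs sub snap -r staging "$snap-bad"
fi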

@lilydjwg

the more "enterprise" features like snapshotting

No. Snapshotting isn't an enterprise feature. It's a feature that helps a lot of ordinary users to recover data from failed upgrades, mistyped deletions, unexpected command results, etc.

Ext4 also supports deduping with the support of external tools like ln.

No. Thinking of hard links as deduping is harmful and may cause you data loss---unless your filesystem is squashfs or iso9660. What do you expect to happen when you modify one of the hard links? If you expect all of them to update, note that some programs update a file by writing a new file and replacing the old one via rename, so the other links keep the old content. If you expect only the modified path to update, other programs write directly into the existing file, so all links change. For files managed by humans (instead of, say, package managers or git), hard links are too easy to handle in the wrong way.

@Zygo
Owner

Zygo commented Apr 20, 2025

Snapshotting isn't an enterprise feature.

Indeed. Snapshotting in btrfs is merely a faster way to write cp -a --reflink=always src dst on specific src directories, but atomically. If you ignore it, you're only depriving yourself of a very useful tool.
