Skip to content

unixfs support for large directories #1720

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
willglynn opened this issue Sep 17, 2015 · 4 comments
Closed

unixfs support for large directories #1720

willglynn opened this issue Sep 17, 2015 · 4 comments

Comments

@willglynn
Copy link

This might be an extreme case, but consider someone wishing to make Wikipedia available on IPFS:

$ zcat dumps.wikimedia.org/enwiki/20150901/enwiki-20150901-all-titles-in-ns0.gz | wc -l
 11944438

It seems like a bad idea to have a merkledag node with 12M links, but that would be the representation using unixfs Data_Directory. It seems like a similarly bad idea to have a merkledag node past even 1k links, and directories with a thousand files occur more commonly in practice.

Alternate directory representations (Data_PrefixTreeDirectory and/or Data_HashTableDirectory) might be a solution, using intermediate merkledag nodes to ultimately reference all of a large directory's children in a way that permits efficient traversal and selective retrieval. The distinction between directory representations could be transparent to users, with implementations free to choose whichever data structure it deems suitable.

Going back to the example: the list of Wikipedia pages is 67 MB gzipped, before adding hashes. A user shouldn't have to download ~400 MB of protobufs just to find the hash of a single article page.

What's a sensible upper limit for a merkledag node's link count or encoded size in bytes? Is there precedent for reducing merkledag fan out? What other components would need to know about a new unixfs DataType?

@whyrusleeping
Copy link
Member

We've been thinking about doing fanout on unixfs directories for some time now. But the 'best' way to do it isnt immediately apparent. There are a few different strategies,

the first is to just add all entries past the first N to a specially named link. And go down that way until we have stored all the entries. This is really convenient for listing directories, as you can just keep loading more until you hit the end, providing output the whole way. The downside is that searching for a certain directory becomes.... hard.

The second way is to shard them git style, and have K links at each level, each with log(k) characters in their name, and recurse that downward until each node has a sane number of links. This method works a lot better for lookups on large amounts of links, but kinda sucks for listing directories.

1k links is totally fine, our max block size is 256k (soft limit), and links are 47 bytes, with that we can store around 5500 links in each node and still be under 256k.

@willglynn
Copy link
Author

COW filesystems are also worth exploring.

ZFS implements POSIX directories on top of "ZAP", the ZFS Attribute Processor, which provides a generalized key-value data structure on top of the underlying COW storage layer. ZAP objects, and thus POSIX directories, are encoded into COW objects for storage using two distinct formats.

The smaller format ("microzap") lumps an entire directory together into a single COW object containing a big boring hash table. It's used if the resulting object would fit into a single block (<= 128 KB) and the keys/filenames would all fit into fixed-sized structs (64 bytes each, so filenames must be <= 50 bytes). In practice this means most directories under 2000 files can be stored trivially, and every directory modification requires writing an entire replacement directory object to the storage layer. This seems analogous to Data_Directory, though the microzap structure uses defined sorting to optimize lookups.

The larger format ("fatzap") is a hash table guaranteed to span at least two objects. The top object is a pointer table, which references 2^N leaf objects, each acting as a hash table bucket. Leaf objects contain a header describing their contents ("I store objects whose hash starts with 10111") and the corresponding hash table, which in the case of a POSIX directory includes filenames pointing to file objects. There are a couple other less common bits of indirection too (leaf chaining for pointer table collisions and external pointer tables past 2^13 leaves), but those are details. The key feature is locality: ZFS can find what it's looking for, either for retrieval or modification, without touching too many blocks or even too many bits inside a block.

ZFS directories work well in practice having these two formats. POSIX readdir() does not guarantee that its results have any particular sorting, and hash ordering on-disk has better properties for lookups than lexical ordering, so that's what ZFS uses.

One other design note: ZAP objects/ZFS directories use a salted hash, meaning that multiple ZFS directories containing the same files will be stored and returned in different orders. I don't know the exact rationale behind this ZFS feature, but Go maps were changed from "undefined" to random iteration because an undefined yet stable ordering allowed people to depend on it.

@jbenet
Copy link
Member

jbenet commented Sep 17, 2015

Thanks @willglynn these are great suggestions!

Maybe we should probably move this discussion into the https://github.com/ipfs/specs repo? it may be good to move discussion about general system changes to that repo. I'm sorry, i know it's very far from complete, but maybe with some help we can get it there!


there's one tricky thing to solve around this: how to discern whether a request should access the underlying ipfs objects, or get the "virtual" unioned directory instead. today we achieve this by separating ipfs {add, cat, ls} (unixfs) from ipfs object {get, put, data, links} (merkledag). in the future, i'd like to find some resolution to this in a way that doesn't create N+1 interfaces (where N is the number of "datastructures layered on ipfs" or "applications"), and instead just 2. (( tricky indeed )).

@willglynn
Copy link
Author

Closing in favor of ipfs/specs#32.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants