-
-
Notifications
You must be signed in to change notification settings - Fork 3.1k
unixfs support for large directories #1720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
We've been thinking about doing fanout on unixfs directories for some time now. But the 'best' way to do it isnt immediately apparent. There are a few different strategies, the first is to just add all entries past the first N to a specially named link. And go down that way until we have stored all the entries. This is really convenient for listing directories, as you can just keep loading more until you hit the end, providing output the whole way. The downside is that searching for a certain directory becomes.... hard. The second way is to shard them git style, and have K links at each level, each with log(k) characters in their name, and recurse that downward until each node has a sane number of links. This method works a lot better for lookups on large amounts of links, but kinda sucks for listing directories. 1k links is totally fine, our max block size is 256k (soft limit), and links are 47 bytes, with that we can store around 5500 links in each node and still be under 256k. |
COW filesystems are also worth exploring. ZFS implements POSIX directories on top of "ZAP", the ZFS Attribute Processor, which provides a generalized key-value data structure on top of the underlying COW storage layer. ZAP objects, and thus POSIX directories, are encoded into COW objects for storage using two distinct formats. The smaller format ("microzap") lumps an entire directory together into a single COW object containing a big boring hash table. It's used if the resulting object would fit into a single block (<= 128 KB) and the keys/filenames would all fit into fixed-sized structs (64 bytes each, so filenames must be <= 50 bytes). In practice this means most directories under 2000 files can be stored trivially, and every directory modification requires writing an entire replacement directory object to the storage layer. This seems analogous to The larger format ("fatzap") is a hash table guaranteed to span at least two objects. The top object is a pointer table, which references 2^N leaf objects, each acting as a hash table bucket. Leaf objects contain a header describing their contents ("I store objects whose hash starts with 10111") and the corresponding hash table, which in the case of a POSIX directory includes filenames pointing to file objects. There are a couple other less common bits of indirection too (leaf chaining for pointer table collisions and external pointer tables past 2^13 leaves), but those are details. The key feature is locality: ZFS can find what it's looking for, either for retrieval or modification, without touching too many blocks or even too many bits inside a block. ZFS directories work well in practice having these two formats. POSIX One other design note: ZAP objects/ZFS directories use a salted hash, meaning that multiple ZFS directories containing the same files will be stored and returned in different orders. I don't know the exact rationale behind this ZFS feature, but Go maps were changed from "undefined" to random iteration because an undefined yet stable ordering allowed people to depend on it. |
Thanks @willglynn these are great suggestions! Maybe we should probably move this discussion into the https://github.com/ipfs/specs repo? it may be good to move discussion about general system changes to that repo. I'm sorry, i know it's very far from complete, but maybe with some help we can get it there! there's one tricky thing to solve around this: how to discern whether a request should access the underlying ipfs objects, or get the "virtual" unioned directory instead. today we achieve this by separating |
Closing in favor of ipfs/specs#32. |
This might be an extreme case, but consider someone wishing to make Wikipedia available on IPFS:
It seems like a bad idea to have a merkledag node with 12M links, but that would be the representation using unixfs
Data_Directory
. It seems like a similarly bad idea to have a merkledag node past even 1k links, and directories with a thousand files occur more commonly in practice.Alternate directory representations (
Data_PrefixTreeDirectory
and/orData_HashTableDirectory
) might be a solution, using intermediate merkledag nodes to ultimately reference all of a large directory's children in a way that permits efficient traversal and selective retrieval. The distinction between directory representations could be transparent to users, with implementations free to choose whichever data structure it deems suitable.Going back to the example: the list of Wikipedia pages is 67 MB gzipped, before adding hashes. A user shouldn't have to download ~400 MB of protobufs just to find the hash of a single article page.
What's a sensible upper limit for a merkledag node's link count or encoded size in bytes? Is there precedent for reducing merkledag fan out? What other components would need to know about a new unixfs
DataType
?The text was updated successfully, but these errors were encountered: