-
Notifications
You must be signed in to change notification settings - Fork 41
Docs and API follow-ups to #601 #619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
docs/custom_readers.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know why git mv
didn't understand that I was just renaming this file. (It has a bunch of other changes too.)
def custom_parser(file_url: str, object_store: ObjectStore) -> ManifestStore: | ||
# access the file's contents, e.g. using the ObjectStore instance | ||
readable_file = obstore.open_reader(object_store, file_url) | ||
|
||
# parse the file contents to extract its metadata | ||
# this is generally where the format-specific logic lives | ||
manifestgroup: ManifestGroup = extract_metadata(readable_file) | ||
|
||
# optionally create an object store registry, used to actually load chunk data from file later | ||
registry = ObjectStoreRegistry({store_prefix: object_store}) | ||
|
||
# construct the Manifeststore from the parsed metadata and the object store registry | ||
return ManifestStore(group=manifestgroup, store_registry=registry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing this out made me realize it's a bit weird that exactly one ObjectStore
is required by the call signature, but not actually technically needed by the code...
IMO more targeted PRs are preferable because they are simpler to review, faster to merge, and keep the git log more descriptive. This number of files touched by this PR motivated my request for a faster turnaround for #615 in #615 (comment). |
That's fair - I can definitely split out the changes to where the |
Okay I've split that out in #621, which should be merged first. The rest of this PR basically does a few (related) things:
Those could be separated further if it would help, but it's now already (almost) down to being a pure docs (including docs examples) PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great documentation! This really helped me start to wrap my head around things.
Most of my suggestions are minor format/syntax/grammar suggestions, but there are also a few regarding use of context managers in examples, and a naming question (which would be best addressed in a separate PR, if is makes sense).
vds = vz.open_virtual_dataset( | ||
file_url, | ||
object_store=object_store, | ||
parser=custom_parser, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we use context managers in examples to show recommended usage to ensure resources are properly managed to avoid leaks:
vds = vz.open_virtual_dataset( | |
file_url, | |
object_store=object_store, | |
parser=custom_parser, | |
) | |
with vz.open_virtual_dataset( | |
file_url, | |
object_store=object_store, | |
parser=custom_parser, | |
) as vds: | |
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tangentially, can we rename the object_store
parameter to simply store
? That would be consistent with the names store_prefix
and store_registry
elsewhere.
However, would that then cause potential confusion with zarr.abc.Store
? If so, then wouldn't store_prefix
and store_registry
also cause confusion about what type of store they are related to (obstore or zarr)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On context managers: Do we really need to? It makes all the examples more complex to read...
On renaming: I agree this is potentially confusing. I think I would prefer everything be object_store
, but then on the other hand we do have type hints to help disambiguate... Doesn't help that zarr.storage.ObjectStore
is a zarr.abc.Store
that wraps an obstore.Store
🙃 Is the word object
redundant in any way? Might we want to generalize that later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding context managers would certainly add a minor amount of complexity to the examples, but my fear is that most readers of any code examples (regardless of library) tend to repeat the same patterns, even if those patterns are likely not ideal for production code. How many context managers have I already had to add to the codebase itself to resolve problems (both in main code and test code)?
At the very least, I recommend a very obvious, bold warning in at least one place in the docs (ideally somewhere most readers are likely to see) that very clearly indicates that use of context managers is recommended for production code, but for brevity, code examples will not use them. And the callout should show an explicit example of the recommended practice, so that the syntax is visually imprinted in the reader's mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is to make repeated use of context managers throughout the examples, so that the repetition is imprinted in the reader's mind, and will be the syntax they repeat, rather than repeatedly not using context managers.
Even with a big, bold warning somewhere in the docs, I suspect the reader will repeat what they see, not what the warning says, because that's what they would repeatedly see in the examples. I recommend repeating the recommended practice, not repeating the "poor" practice simply for saving a modicum of keystrokes/simplification.
Of course, if I'm outvoted, I won't block things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's totally reasonable. My only remaining concern is that it's tricker to do that in narrative documentation than in real code, because I need text between opening the virtual dataset and using the virtual dataset. But this isn't going to work if users copy it verbatim:
with open_virtual_dataset() as vds:
...
some explanatory text
vds.virtualize.to_kerchunk()
In the docs I can't really wrap all later uses of vds
inside the context manager, unless I keep opening it again and again, which also wouldn't be very clear. It feels like a compromise either way.
FWIW all your arguments could apply to the xarray documentation too, but they don't use context managers there either
https://docs.xarray.dev/en/stable/user-guide/io.html#reading-and-writing-files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point about interleaving prose with code. Perhaps we can at least find a good place to put a callout explaining that use of context managers is strongly recommended to prevent memory/resource leaks in critical code (along with a code example), but that for convenience throughout the docs, context managers might be dropped.
manifestgroup: ManifestGroup = extract_metadata(readable_file) | ||
|
||
# optionally create an object store registry, used to actually load chunk data from file later | ||
registry = ObjectStoreRegistry({store_prefix: object_store}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
store_prefix
is undefined. Do we want to add a line or 2 of code/comment about it, or at least a comment referring users to a section of the docs covering registries?
|
||
vds = open_virtual_dataset('air.nc') | ||
vds = open_virtual_dataset('air.nc', object_store=LocalStore, parser=HDFParser()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
context manager?
also, a few lines above, I suggest using a context manager for opening the air_temperature tutorial dataset
vds = open_virtual_dataset( | ||
'relative_refs.json', | ||
filetype='kerchunk', | ||
virtual_backend_kwargs={'fs_root': 'file:///some_directory/'} | ||
object_store=LocalStore, | ||
parser=KerchunkJSONParser( | ||
fs_root='file:///data_directory/', | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
context manager?
for more information, see https://pre-commit.ci
Co-authored-by: Chuck Daniels <[email protected]>
This started out as a targeted PR to address #616 and ended up as an attempt to address all the uncheck bullets from #601 (i.e. everything in the docs that touches the concept of parsers).
Tests addeddocs/releases.rst
api.rst
fyi @sharkinsspatial @maxrjones @chuckwondo