Docs and API follow-ups to #601 (#619)

Open: wants to merge 18 commits into base: develop

TomNicholas
Member

@TomNicholas TomNicholas commented Jun 17, 2025

This started out as a targeted PR to address #616 and ended up as an attempt to address all the unchecked bullets from #601 (i.e. everything in the docs that touches the concept of parsers).

fyi @sharkinsspatial @maxrjones @chuckwondo

@TomNicholas TomNicholas added the documentation label (Improvements or additions to documentation) Jun 17, 2025
Member Author

Don't know why git mv didn't understand that I was just renaming this file. (It has a bunch of other changes too.)

Comment on lines +20 to +32
def custom_parser(file_url: str, object_store: ObjectStore) -> ManifestStore:
    # access the file's contents, e.g. using the ObjectStore instance
    readable_file = obstore.open_reader(object_store, file_url)

    # parse the file contents to extract its metadata
    # this is generally where the format-specific logic lives
    manifestgroup: ManifestGroup = extract_metadata(readable_file)

    # optionally create an object store registry, used to actually load chunk data from file later
    registry = ObjectStoreRegistry({store_prefix: object_store})

    # construct the ManifestStore from the parsed metadata and the object store registry
    return ManifestStore(group=manifestgroup, store_registry=registry)
Member Author

@TomNicholas TomNicholas Jun 17, 2025

Writing this out made me realize it's a bit weird that exactly one ObjectStore is required by the call signature, but not actually technically needed by the code...
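
The oddity noted here can be illustrated with a minimal sketch. It assumes stand-in types only: `FakeObjectStore`, `FakeManifestStore`, `ParserLike`, and `in_memory_parser` are hypothetical names for illustration, not VirtualiZarr's or obstore's API. The sketch shows a parser whose signature demands exactly one store even though its body never touches it.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class FakeObjectStore:
    """Hypothetical stand-in for an obstore-style store."""
    root: str


@dataclass
class FakeManifestStore:
    """Hypothetical stand-in for the parser's return type."""
    source_url: str


class ParserLike(Protocol):
    """Hypothetical callable shape: every parser must accept exactly one store."""

    def __call__(
        self, file_url: str, object_store: FakeObjectStore
    ) -> FakeManifestStore: ...


def in_memory_parser(
    file_url: str, object_store: FakeObjectStore
) -> FakeManifestStore:
    # The signature requires object_store, but this parser never reads from it.
    return FakeManifestStore(source_url=file_url)


# The function satisfies the protocol purely by its call signature.
parser: ParserLike = in_memory_parser
result = parser("file:///data/air.nc", FakeObjectStore(root="file:///data"))
```

Whether the real signature should make the store optional is a design question this sketch does not settle.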

@maxrjones
Member

This started out as a targeted PR to address #616 and ended up as an attempt to address all the unchecked bullets from #601 (i.e. everything in the docs that touches the concept of parsers).

IMO more targeted PRs are preferable because they are simpler to review, faster to merge, and keep the git log more descriptive. The number of files touched by this PR motivated my request for a faster turnaround on #615 in #615 (comment).

@TomNicholas
Member Author

TomNicholas commented Jun 17, 2025

That's fair - I can definitely split out the changes to where the Parser is defined from the rest of the changes. But the docs do just need altering on almost every page.

@TomNicholas
Member Author

TomNicholas commented Jun 18, 2025

Okay I've split that out in #621, which should be merged first. The rest of this PR basically does a few (related) things:

  1. grep for "reader" and replace with "parser"
  2. modify any language referring to parsers / ManifestStore to be up to date with the changes in #601 ("Refactor codebase to support a new simplified Parser->ManifestStore model")
  3. update code examples to use parser and obstore instead of reader

Those could be separated further if it would help, but it's now already (almost) down to being a pure docs (including docs examples) PR.

Collaborator

@chuckwondo chuckwondo left a comment

Great documentation! This really helped me start to wrap my head around things.

Most of my suggestions are minor format/syntax/grammar suggestions, but there are also a few regarding the use of context managers in examples, and a naming question (which would be best addressed in a separate PR, if it makes sense).

Comment on lines +35 to +39
vds = vz.open_virtual_dataset(
    file_url,
    object_store=object_store,
    parser=custom_parser,
)
Collaborator

I suggest we use context managers in examples to show recommended usage to ensure resources are properly managed to avoid leaks:

Suggested change
- vds = vz.open_virtual_dataset(
-     file_url,
-     object_store=object_store,
-     parser=custom_parser,
- )
+ with vz.open_virtual_dataset(
+     file_url,
+     object_store=object_store,
+     parser=custom_parser,
+ ) as vds:
+     ...

Collaborator

Tangentially, can we rename the object_store parameter to simply store? That would be consistent with the names store_prefix and store_registry elsewhere.

However, would that then cause potential confusion with zarr.abc.Store? If so, then wouldn't store_prefix and store_registry also cause confusion about what type of store they are related to (obstore or zarr)?

Member Author

On context managers: Do we really need to? It makes all the examples more complex to read...

On renaming: I agree this is potentially confusing. I think I would prefer everything be object_store, but then on the other hand we do have type hints to help disambiguate... Doesn't help that zarr.storage.ObjectStore is a zarr.abc.Store that wraps an obstore.Store 🙃 Is the word object redundant in any way? Might we want to generalize that later?

Collaborator

Adding context managers would certainly add a minor amount of complexity to the examples, but my fear is that most readers of any code examples (regardless of library) tend to repeat the same patterns, even if those patterns are likely not ideal for production code. How many context managers have I already had to add to the codebase itself to resolve problems (both in main code and test code)?

At the very least, I recommend a very obvious, bold warning in at least one place in the docs (ideally somewhere most readers are likely to see) that very clearly indicates that use of context managers is recommended for production code, but for brevity, code examples will not use them. And the callout should show an explicit example of the recommended practice, so that the syntax is visually imprinted in the reader's mind.

Collaborator

@chuckwondo chuckwondo Jun 18, 2025

My preference is to make repeated use of context managers throughout the examples, so that the repetition is imprinted in the reader's mind, and will be the syntax they repeat, rather than repeatedly not using context managers.

Even with a big, bold warning somewhere in the docs, I suspect the reader will repeat what they see, not what the warning says, because that's what they would repeatedly see in the examples. I recommend repeating the recommended practice, not repeating the "poor" practice simply for saving a modicum of keystrokes/simplification.

Of course, if I'm outvoted, I won't block things.

Member Author

That's totally reasonable. My only remaining concern is that it's trickier to do that in narrative documentation than in real code, because I need text between opening the virtual dataset and using the virtual dataset. But this isn't going to work if users copy it verbatim:

with open_virtual_dataset() as vds:
    ...

some explanatory text

vds.virtualize.to_kerchunk()

In the docs I can't really wrap all later uses of vds inside the context manager, unless I keep opening it again and again, which also wouldn't be very clear. It feels like a compromise either way.

FWIW all your arguments could apply to the xarray documentation too, but they don't use context managers there either

https://docs.xarray.dev/en/stable/user-guide/io.html#reading-and-writing-files
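
The copy-paste problem described in this comment can be reproduced with a generic, self-contained sketch. It uses `io.StringIO` as a stand-in for a closable handle; `open_resource` is a hypothetical helper, and whether a closed virtual dataset fails in the same way is an assumption, not something stated in this thread. It also shows one possible compromise for narrative docs: `contextlib.ExitStack`, which keeps a handle open across several separate steps and releases it once at the end.

```python
import io
from contextlib import ExitStack


def open_resource() -> io.StringIO:
    """Hypothetical stand-in for an opener whose result is closed on __exit__."""
    return io.StringIO("chunk manifest")


# Pattern from the comment above: the handle is used after the block has ended.
with open_resource() as vds:
    pass

try:
    vds.read()
    used_after_close = "ok"
except ValueError:
    # io.StringIO raises ValueError once it has been closed.
    used_after_close = "closed"

# One alternative: register the handle on an ExitStack, use it across
# several steps (with prose in between), then close everything once.
stack = ExitStack()
vds = stack.enter_context(open_resource())
first_step = vds.read()
stack.close()  # releases every context registered on the stack
```

The trade-off is that `ExitStack` adds its own boilerplate, so it may not read any more simply than reopening the dataset.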

Collaborator

Fair point about interleaving prose with code. Perhaps we can at least find a good place to put a callout explaining that use of context managers is strongly recommended to prevent memory/resource leaks in critical code (along with a code example), but that for convenience throughout the docs, context managers might be dropped.
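
A callout of the kind proposed here might carry an example along the following lines. This is only a sketch under stated assumptions: `FakeDataset` and `open_fake_dataset` are hypothetical stand-ins, used so the example runs without any files or third-party packages. It shows that the `with` form releases the resource both on normal exit and when the body raises.

```python
class FakeDataset:
    """Hypothetical stand-in for a dataset object that holds open resources."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "FakeDataset":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and when the body raises; returning None
        # lets any exception propagate to the caller.
        self.close()


def open_fake_dataset() -> FakeDataset:
    """Hypothetical opener, standing in for a real open function."""
    return FakeDataset()


# Recommended shape: the resource is released when the block ends.
with open_fake_dataset() as ds:
    closed_inside_block = ds.closed
closed_after_block = ds.closed

# The resource is also released if processing fails partway through.
failing = open_fake_dataset()
try:
    with failing:
        raise RuntimeError("simulated failure mid-processing")
except RuntimeError:
    pass
closed_after_error = failing.closed
```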

    manifestgroup: ManifestGroup = extract_metadata(readable_file)

    # optionally create an object store registry, used to actually load chunk data from file later
    registry = ObjectStoreRegistry({store_prefix: object_store})
Collaborator

store_prefix is undefined. Do we want to add a line or 2 of code/comment about it, or at least a comment referring users to a section of the docs covering registries?
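
One way the missing definition could be filled in is sketched below. This is an assumption made for illustration: `guess_store_prefix` is a hypothetical helper, and the form of prefix that `ObjectStoreRegistry` actually expects is not stated in this thread. The sketch simply takes the scheme and bucket (or host) portion of the file URL using the standard library.

```python
from urllib.parse import urlparse


def guess_store_prefix(file_url: str) -> str:
    """Hypothetical helper: return the scheme and bucket/host part of a URL."""
    parsed = urlparse(file_url)
    return f"{parsed.scheme}://{parsed.netloc}"


file_url = "s3://my-bucket/data/air.nc"  # example URL, chosen for illustration
store_prefix = guess_store_prefix(file_url)

# Placeholder mapping only; a real registry would map the prefix to a store instance.
registry_mapping = {store_prefix: "<object store instance goes here>"}
```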


- vds = open_virtual_dataset('air.nc')
+ vds = open_virtual_dataset('air.nc', object_store=LocalStore, parser=HDFParser())
Collaborator

context manager?

also, a few lines above, I suggest using a context manager for opening the air_temperature tutorial dataset

Comment on lines 397 to 403
  vds = open_virtual_dataset(
      'relative_refs.json',
-     filetype='kerchunk',
-     virtual_backend_kwargs={'fs_root': 'file:///some_directory/'}
+     object_store=LocalStore,
+     parser=KerchunkJSONParser(
+         fs_root='file:///data_directory/',
+     )
  )
Collaborator

context manager?

Co-authored-by: Chuck Daniels <[email protected]>