get dataset coordinates from describe parts instead of only at service variables #277
Hello Jelmer, it looks like there has been some progress on this front. In version 2.1.0, a new method called `get_coordinates` has been added to the parts, which allows you to retrieve coordinates. You can find the documentation here: If you're looking to extract the minimum and maximum values from the datasets in your example, you can now use the following approach:
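Roughly like this minimal sketch, assuming copernicusmarine 2.1.0 or later; the dataset id is only an example:

```python
# Minimal sketch, assuming copernicusmarine >= 2.1.0.
# The dataset id is an example; substitute your own.
import copernicusmarine

data = copernicusmarine.describe(
    dataset_id="cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m",
)

# get_coordinates() is available on each part of a dataset version
part = data.products[0].datasets[0].versions[0].parts[0]
coordinates = part.get_coordinates()

# the 'time' entry carries the extents of the time coordinate
time_coordinate = coordinates["time"][0]
print(time_coordinate.minimum_value, time_coordinate.maximum_value)
```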
This skips some verification steps of your code, as they were not central to the demonstration. The new method makes the following steps unnecessary:
It is not at the service level yet, but we feel it is already a step forward. Do you have any thoughts on it?
@gkoenig-mercator this seems like an improvement indeed, nice! I have some questions, mostly related to indexing the first element of `products`/`datasets`/`versions`/`parts`. There are most likely multiple products, datasets, versions and parts, so is it always safe to take the first one?

When reading about parts, I see this is for instance relevant for insitu data. I guess the temporal extents are equal for all parts of a dataset version? It might also be sensible to get the coordinates from the dataset version, so one level up.

Furthermore, I noticed an unexpected format for some of the values accessed from the metadata. I ran this code in a clean environment with only copernicusmarine and its dependencies installed.

Lastly, in my example I use a private xarray function to convert the timestamps. Private functions have the risk of being phased out some day. Is this the best method, or are there better alternatives? Maybe there is even a method in the toolbox itself?
Hello @veenstrajelmer, I am not entirely sure I understood the question. To clarify: if you use the describe command with a product id, for example, you will retrieve many datasets and will need to parse the data a bit to access a precise dataset, as shown in the following example:
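Something like this sketch, using the `contains` filter; the product and dataset ids are examples, and the exact filtering arguments may differ between toolbox versions:

```python
# Hedged sketch: narrow the catalogue with `contains`, then walk the
# products and datasets to find the one you are after.
import copernicusmarine

data = copernicusmarine.describe(
    contains=["GLOBAL_ANALYSISFORECAST_PHY_001_024"],  # example filter
)

target_id = "cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m"
for product in data.products:
    for dataset in product.datasets:
        if dataset.dataset_id == target_id:
            print(product.product_id, dataset.dataset_id)
```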
I am unsure about the second question; maybe @renaudjester could answer it better. I am also very unsure about the third question and am waiting for someone with better knowledge to answer. Yes, we could discuss a change of format. As for the timestamps, we are discussing some changes internally, thanks for bringing that up :)
Yes :D In 99.999999% of the cases. In very rare cases, a datasetId can be in several products (usually during a transition period around a release). If you want to be 100% sure, you can also add the productId in the call.
Yes! There is a whole logic in the toolbox to order the versions based on release date and retirement date, so the first one will be the most recent released version.
Unfortunately, coordinates can be very different in different parts! Take the dataset:
Could be considered, indeed, but not sure how it would work. That might introduce some circular references (for example,
Not sure why it is like this in the metadata; the toolbox retrieves this information and puts it in the describe output as-is.
Nothing in the toolbox, no 🤔 In v1, we used
@renaudjester thanks for this elaborate response.
I am a bit hesitant to get the extents via this method if the returned values vary between parts, although I could loop over all parts and take the overall minimum and maximum extents.
That makes sense. Follow-up question: are the time extents equal between these variables and services? In other words, is it safe to only access the first one?

From an xarray point of view (which might not be applicable for all datasets in copernicusmarine), the time coordinate is attached to each of the time-varying variables, but it is also attached to the dataset. For my use case, it would be amazing to be able to get the time extents directly from the dataset instead. This saves a lot of looping or assumptions. I cannot make any assumptions, since they would probably break one day in the future. I would also like to minimize the required code. The current method (via xarray) is quite slow, but it is at least very robust, which is essential in my view.
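For reference, a sketch of that slow-but-robust xarray route, here converting with pandas rather than a private xarray function; it assumes valid Copernicus Marine credentials are configured, and the dataset id is an example:

```python
# Hedged sketch: open the dataset lazily via xarray and read the
# extents from the time coordinate. Requires login credentials.
import copernicusmarine
import pandas as pd

ds = copernicusmarine.open_dataset(
    dataset_id="cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m",
)
time_min = pd.Timestamp(ds["time"].values.min())
time_max = pd.Timestamp(ds["time"].values.max())
print(time_min, time_max)
```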
Thanks, this is helpful and straightforward. It might also resolve the parsing of the time units.
@veenstrajelmer Maybe it can help you if you consider which data you will access later on! You should know which part you are accessing:
I think there is a misunderstanding here: You can safely assume that inside a part, the extents are the same! Since you mention
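For example, a sketch that selects a part by name instead of blindly taking the first one; the dataset id and the part name are just examples:

```python
# Hedged sketch: pick the part you will actually use later by name,
# rather than relying on parts[0]. Ids and part name are examples.
import copernicusmarine

data = copernicusmarine.describe(
    dataset_id="cmems_obs-ins_arc_phybgcwav_mynrt_na_irr",
)
parts = data.products[0].datasets[0].versions[0].parts
history_part = next(part for part in parts if part.name == "history")
coordinates = history_part.get_coordinates()
time_coordinate = coordinates["time"][0]
print(time_coordinate.minimum_value, time_coordinate.maximum_value)
```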
Thanks both! I needed some time to try some things:

```python
import copernicusmarine
import pandas as pd
import cftime
dataset_id = 'cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy-so_anfc_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy_anfc_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy_myint_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy_my_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_bgc-bio_anfc_0.25deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_bgc_myint_0.25deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_bgc_my_0.25deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_bgc_my_0.25deg_P1M-m' # parts: ['default'] # time_coordinate.minimum_value and time_coordinate.maximum_value are None
# dataset_id='cmems_obs-ins_arc_phybgcwav_mynrt_na_irr' # parts: ['latest', 'history', 'monthly']
# dataset_id='cmems_obs-ins_glo_phy-ssh_my_na_PT1H' # parts: ['history']
data = copernicusmarine.describe(
    dataset_id=dataset_id,
    disable_progress_bar=True,
)

def convert_time(time_raw, cftime_unit: str):
    # the metadata stores the extents either as ISO8601 strings or as
    # numeric values with cftime units, so branch on the unit string
    if cftime_unit == 'ISO8601':
        return pd.to_datetime(time_raw)
    return cftime.num2date(time_raw, cftime_unit)

print("parts:", [x.name for x in data.products[0].datasets[0].versions[0].parts])
coordinates = data.products[0].datasets[0].versions[0].parts[0].get_coordinates()
time_coordinate = coordinates['time'][0]
time_units = time_coordinate.coordinate_unit
time_min_raw = time_coordinate.minimum_value
time_max_raw = time_coordinate.maximum_value
time_min = convert_time(time_min_raw, time_units)
time_max = convert_time(time_max_raw, time_units)
print(f"Minimum Value: {time_min}")
print(f"Maximum Value: {time_max}") Some points of attention that might already be known:
As for your comments:
Unfortunately, this does not work: describe has no such argument.
Sorry, you are right, I was confused indeed.
@veenstrajelmer, thanks for the investigation!
To be checked by us!
Yep! That's how the data comes in the metadata, and we don't convert it. (and not sure it's planned right now)
What I meant is: in your workflow, let's say you want to subset some data. Then, regarding the part, there are two possibilities:
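If you already know the part, a sketch of passing it explicitly when subsetting could look like this; all ids and values below are examples only:

```python
# Hedged sketch: pass the part explicitly when subsetting, so the
# extents you looked up apply to the data you actually download.
# All values are examples; 'uo' is the eastward velocity variable.
import copernicusmarine

copernicusmarine.subset(
    dataset_id="cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m",
    dataset_part="default",
    variables=["uo"],
    start_datetime="2024-01-01",
    end_datetime="2024-01-02",
    minimum_longitude=-10,
    maximum_longitude=0,
    minimum_latitude=40,
    maximum_latitude=50,
)
```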
As originally described in Deltares/dfm_tools#1082, it would be useful to easily get the coordinates from a part that is returned by `describe`. Or maybe even from the dataset version, so one level up. For instance, getting the time extents is currently a bit cumbersome and not so robust, since they are only available at the variables of the service:
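A sketch of that service-level route; the attribute names follow the describe objects used elsewhere in this thread and should be treated as an approximation, with the dataset id as an example:

```python
# Hedged sketch of the current, cumbersome route: dig the time extents
# out of the coordinates attached to one variable of one service.
import copernicusmarine

data = copernicusmarine.describe(
    dataset_id="cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m",
)
part = data.products[0].datasets[0].versions[0].parts[0]

# assumption: services expose variables, and each variable carries its
# own coordinate metadata including the extents
variable = part.services[0].variables[0]
for coordinate in variable.coordinates:
    if coordinate.coordinate_id == "time":
        print(coordinate.minimum_value, coordinate.maximum_value)
```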