get dataset coordinates from describe parts instead of only at service variables #277

Open
veenstrajelmer opened this issue Jan 27, 2025 · 8 comments
Labels
enhancement New feature or request

Comments

@veenstrajelmer

veenstrajelmer commented Jan 27, 2025

As originally described in Deltares/dfm_tools#1082, it would be useful to be able to easily get the coordinates from a part returned by describe, or maybe even from the dataset version, one level up. For instance, getting the time extents is currently a bit cumbersome and not very robust, since they are only available on the variables of each service:

import copernicusmarine
import pandas as pd
# importing a private xarray function here
from xarray.coding.times import decode_cf_datetime

data = copernicusmarine.describe(
    dataset_id='cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m',
    disable_progress_bar=True,
    )

def convert_time(time_raw, time_units):
    time_np = decode_cf_datetime(num_dates=[time_raw], units=time_units)
    time_pd = pd.Timestamp(time_np[0])
    return time_pd

# check if there is indeed only one of products/datasets/versions/parts
assert len(data.products) == 1
assert len(data.products[0].datasets) == 1
assert len(data.products[0].datasets[0].versions) == 1
assert len(data.products[0].datasets[0].versions[0].parts) == 1

# there are four services, but only geoseries and timeseries contain coordinates in their data variables.
# I expect that in the service "original-files" the time values 
# would be 12 hours different from those in geoseries/timeseries (netcdf vs ARCO)
part = data.products[0].datasets[0].versions[0].parts[0]
service_arco_geo_series = part.get_service_by_service_name(service_name="arco-geo-series")

# Therefore, we get the coordinates from the first data variable in the service.
# We assume for now that all data variables contain the same coordinates, which might be tricky.
# It would be useful to attach the coordinates to the service itself as well,
# since I expect they are valid there. This would avoid some nesting.
var0 = service_arco_geo_series.variables[0]
var0_coordinates = var0.coordinates

# get time coordinate by searching for coordinate_id="time"
coordinate_ids = [x.coordinate_id for x in var0_coordinates]
timecoord_idx = coordinate_ids.index("time")
time_coord = var0_coordinates[timecoord_idx]

# the time extents are raw numbers w.r.t. a reference date
time_units = time_coord.coordinate_unit
time_min_raw = time_coord.minimum_value
time_max_raw = time_coord.maximum_value

# convert to pandas timestamps
time_min = convert_time(time_min_raw, time_units)
time_max = convert_time(time_max_raw, time_units)
print(f"Minimum Value: {time_min}")
print(f"Maximum Value: {time_max}")
@veenstrajelmer veenstrajelmer changed the title add dataset coordinates also to describe part instead of only at service variables get dataset coordinates from describe parts instead of only at service variables Jan 27, 2025
@gkoenig-mercator
Collaborator

gkoenig-mercator commented Apr 15, 2025

Hello Jelmer,

It looks like there has been some progress on this front. In version 2.1.0, a new method called get_coordinates has been added to the parts, which now allows you to retrieve coordinates. You can find the documentation here:

https://toolbox-docs.marine.copernicus.eu/en/v2.1.0b1/response-types.html#copernicusmarine.CopernicusMarinePart.get_coordinates

If you're looking to extract the minimum and maximum values from the datasets in your example, you can now use the following approach:


import copernicusmarine
import pandas as pd
# importing a private xarray function here
from xarray.coding.times import decode_cf_datetime

data = copernicusmarine.describe(
    dataset_id='cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m',
    disable_progress_bar=True,
    )

def convert_time(time_raw, time_units):
    time_np = decode_cf_datetime(num_dates=[time_raw], units=time_units)
    time_pd = pd.Timestamp(time_np[0])
    return time_pd

coordinates = data.products[0].datasets[0].versions[0].parts[0].get_coordinates()
time_coordinate = coordinates['time'][0]


time_units = time_coordinate.coordinate_unit
time_min_raw = time_coordinate.minimum_value
time_max_raw = time_coordinate.maximum_value

time_min = convert_time(time_min_raw, time_units)
time_max = convert_time(time_max_raw, time_units)
print(f"Minimum Value: {time_min}")
print(f"Maximum Value: {time_max}")

This skips some of the verification steps in your code, as they were not central to the demonstration.

The new method makes the following steps unnecessary:

service_arco_geo_series = part.get_service_by_service_name(service_name="arco-geo-series")

var0 = service_arco_geo_series.variables[0]
var0_coordinates = var0.coordinates

coordinate_ids = [x.coordinate_id for x in var0_coordinates]
timecoord_idx = coordinate_ids.index("time")
time_coord = var0_coordinates[timecoord_idx]

It is not at the service level yet, but we feel it is already a step forward. Do you have any thoughts on it?

@veenstrajelmer
Author

@gkoenig-mercator this seems like an improvement indeed, nice! I have some questions, mostly related to indexing the first element of products/datasets/versions/parts.

The data object in the example is the result of a query with a dataset_id as input, so I guess there is only one product and only one dataset. Is this indeed always the case?

There are most likely multiple versions for each dataset. Is the latest version always the first in the list? That would prevent having to loop over all versions.

When reading about parts, I see this is for instance relevant for insitu data. I guess the temporal extents are equal for all parts of a dataset version? It might also be sensible to get the coordinates from the dataset version instead (so one level up), but that might defeat the purpose (since the spatial coordinates are different). Either way, is it guaranteed that the time extents are equal for all parts of the latest version of a dataset? It would be great not to have to loop over the list of parts.

Furthermore, I noticed that you access coordinates['time'][0]. Upon running your code (with copernicusmarine 2.1.0b1), I see that coordinates['time'] is a tuple consisting of: (1) a CopernicusMarineCoordinate, (2) ['uo', 'vo', 'uo', 'vo'] and (3) ['arco-geo-series', 'arco-geo-series', 'arco-time-series', 'arco-time-series']. It would make more sense to me if the lists of variables and services were attributes of the CopernicusMarineCoordinate instead, and if coordinates['time'] were not a tuple but a CopernicusMarineCoordinate. Could this be considered?

I ran this code in a clean environment with only copernicusmarine installed. This gives me an "ImportError: The cftime package is required for working with non-standard calendars but could not be imported. Please install it with your package manager (e.g. conda or pip)." This happens because time_coordinate.coordinate_unit has a non-standard format ("milliseconds since 1970-01-01 00:00:00Z (no leap seconds)"). When setting this to "milliseconds since 1970-01-01" or "milliseconds since 1970-01-01 00:00:00Z", the conversion of dates works just fine. It would be great if the unit followed some widely used standard.
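
A possible workaround until the unit string is standardized (a sketch; it assumes the suffix always has the exact form shown above) is to strip the trailing annotation before decoding:

import re

def normalize_time_units(units: str) -> str:
    # drop the non-CF " (no leap seconds)" suffix observed in the metadata
    return re.sub(r"\s*\(no leap seconds\)\s*$", "", units)

normalize_time_units("milliseconds since 1970-01-01 00:00:00Z (no leap seconds)")
# returns "milliseconds since 1970-01-01 00:00:00Z", which decodes fine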

Lastly, in my example I use a private xarray function to convert the timestamps. Private functions run the risk of being phased out some day. Is this the best method, or are there better alternatives? Maybe there is even a method in the copernicusmarine package?

@gkoenig-mercator
Collaborator

Hello @veenstrajelmer ,

I am not entirely sure I understood the question. To clarify: if you use the describe command with a product id, for example, you will retrieve many datasets and will need to parse the data a bit to access a specific dataset, as shown in the following example:

data = copernicusmarine.describe(product_id="GLOBAL_ANALYSISFORECAST_PHY_001_024")
data.products[0].datasets[0]

I am unsure about the second question. Maybe @renaudjester could answer it better.

I am also unsure about the third question; I am waiting for someone with better knowledge to answer.

Yes, we could discuss a change of format.

For the timestamps, we are discussing some changes internally; thanks for bringing that up :)

@renaudjester
Collaborator

renaudjester commented Apr 25, 2025

The data object in the example is the result of a query with a dataset_id as input, so I guess that there is only one products and only one datasets. Is this indeed always the case?

Yes :D In 99.999999% of the cases. In very rare cases, a datasetId can be in several products (usually during a transition period around a release). If you want to be 100% sure, you can also add the productId in the call.
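
For example (a sketch; both keyword arguments are shown separately elsewhere in this thread):

data = copernicusmarine.describe(
    product_id="GLOBAL_ANALYSISFORECAST_PHY_001_024",
    dataset_id="cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m",
)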

There are most likely multiple versions for each dataset. Is the latest version always the first in the list? That would prevent having to loop over all versions.

Yes! There is a whole logic in the toolbox to order the versions based on release date and retirement date, so the first one will be the most recent version that has been released.

Either way, is it guaranteed that the time extents are equal for all parts of the latest version of a dataset?

Unfortunately, coordinates can be very different in different parts! Take the dataset cmems_obs-ins_arc_phybgcwav_mynrt_na_irr: its parts can have different time extents, as you say. Also, some datasets have an originalGrid part, which means that their geographical coordinates are different (name, extent, units).

It would make more sense to me as the lists of variables and the services are objects of the CopernicusMarineCoordinate instead, and that coordinates['time'] is not a tuple but a CopernicusMarineCoordinate.

It could be considered, indeed, but I am not sure how it would work. That might introduce some circular references (for example, CopernicusMarineVariable itself contains CopernicusMarineCoordinate), so we would need to introduce new types, I think. A sort of reversed describe! I am not sure it is in scope for the 2.1.0 release, at least.

This happens because time_coordinate.coordinate_unit is a non-standard format ("milliseconds since 1970-01-01 00:00:00Z (no leap seconds)").

Not sure why it is like this in the metadata; the toolbox gets this information and puts it in the describe output as is.

Lastly, in my example I use a private xarray function to convert the timestamps. Private functions have the risk of being phased out some day. Is this the best method or are there better alternatives? Maybe there is even a method in the copernicusmarine package?

Nothing in the toolbox, no 🤔 In v1, we used cftime if you want to check how it was done.
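
For reference, a minimal cftime-based conversion could look like this (a sketch, not the actual v1 code; it assumes a numeric raw value and a CF-compliant unit string):

import cftime

time_units = "milliseconds since 1970-01-01"
time_raw = 1_000_000_000_000.0
print(cftime.num2date(time_raw, time_units))  # 2001-09-09 01:46:40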

@veenstrajelmer
Author

@renaudjester thanks for this elaborate response.

Unfortunately, coordinates can be very different in different parts! Take the dataset: cmems_obs-ins_arc_phybgcwav_mynrt_na_irr, it can have different time extents, as you say. Also, some datasets have an originalGrid part, which means that their geographical coordinates are different (name, extent, units).

I am a bit hesitant to get the extents via this method if the returned values vary, although I could loop over all parts and take the min and max extents over all of them.
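
For illustration, such a loop could look like this (a sketch reusing data and convert_time from the earlier snippets; it assumes every part reports a decodable "time" coordinate, which is not guaranteed):

# take the widest time extent over all parts of the first version
version = data.products[0].datasets[0].versions[0]
time_mins, time_maxs = [], []
for part in version.parts:
    coordinates = part.get_coordinates()
    if "time" not in coordinates:
        continue  # a part might lack a time coordinate
    time_coordinate = coordinates["time"][0]
    time_mins.append(convert_time(time_coordinate.minimum_value, time_coordinate.coordinate_unit))
    time_maxs.append(convert_time(time_coordinate.maximum_value, time_coordinate.coordinate_unit))
time_min, time_max = min(time_mins), max(time_maxs)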

Could be considered, indeed, but not sure how it would work. That might introduce some circular references (for example, CopernicusMarineVariable contains itself CopernicusMarineCoordinate. So we would need to introduce new types, I think. A sort of describe reversed! Not sure it is in the scope of the 2.1.0 release at least.

That makes sense. Follow-up question: are the time extents equal across these variables and services? In other words, is it safe to only access the first via coordinates['time'][0] as suggested in #277 (comment), or should I also loop over all of them? And, while we are at it, I guess it cannot be guaranteed that all of these variables have a time coordinate, right? Then such a loop also requires an exception.

From an xarray point of view (which might not be applicable to all datasets in copernicusmarine), the time coordinate is attached to each of the time-varying variables, but it is also attached to the dataset. For my use case, it would be amazing to be able to get the time extents directly from the dataset instead. This saves a lot of looping or assumptions. I cannot make any assumptions, since they would probably break one day in the future. I would also like to minimize the required code. The current method (via xarray) is quite slow, but it is at least very robust, which is essential in my view.

Nothing in the toolbox, no 🤔 In v1, we used cftime if you want to check how it was done.

Thanks, this is helpful and straightforward. It might also resolve the parsing of the time units.

@renaudjester
Collaborator

renaudjester commented Apr 29, 2025

@veenstrajelmer Maybe it can help you if you consider which data you will access later on! You should know which part you are accessing:

  • If you use the dataset_part option, you know the name, so you can just select that one.
  • If you don't use the dataset_part option, then you can safely assume that the toolbox will use the first part (as I said, thanks to the sorting): parts[0]

In other words, is it safe to only access the first via coordinates['time'][0]

I think there is a misunderstanding here: coordinates['time'] is not a list, it's a tuple, so coordinates['time'][0] is just taking the coordinate. Then, if you want to know whether a variable has this coordinate, you can check coordinates['time'][1]. See here. So I am not sure you need the loop you mention.
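
To illustrate the layout (a sketch; part is assumed to be a CopernicusMarinePart as in the earlier snippets, and the example values are taken from an earlier comment in this thread):

coordinates = part.get_coordinates()
time_coordinate, variable_ids, service_names = coordinates["time"]
print(variable_ids)   # e.g. ['uo', 'vo', 'uo', 'vo']
print(service_names)  # e.g. ['arco-geo-series', 'arco-geo-series', 'arco-time-series', 'arco-time-series']
print("uo" in variable_ids)  # check whether a given variable carries this coordinate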

You can safely assume that inside a part, the extents are the same! Since you mention xarray: an xarray dataset in Copernicus Marine is basically what you have at "part" level. Two "parts" are two different arrays. "Services" (the lower level) are different ways to represent the data, but the data should be the same, or at least the extents, variables etc. (I am simplifying a bit here...)

@renaudjester renaudjester added the enhancement New feature or request label May 7, 2025
@veenstrajelmer
Author

Thanks both! I needed some time to try some things:

import copernicusmarine
import pandas as pd
import cftime

dataset_id = 'cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy-so_anfc_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy_anfc_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy_myint_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_phy_my_0.083deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_bgc-bio_anfc_0.25deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_bgc_myint_0.25deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_bgc_my_0.25deg_P1D-m' # parts: ['default']
# dataset_id = 'cmems_mod_glo_bgc_my_0.25deg_P1M-m' # parts: ['default'] # time_coordinate.minimum_value and time_coordinate.maximum_value are None
# dataset_id='cmems_obs-ins_arc_phybgcwav_mynrt_na_irr' # parts: ['latest', 'history', 'monthly']
# dataset_id='cmems_obs-ins_glo_phy-ssh_my_na_PT1H' # parts: ['history']

data = copernicusmarine.describe(
    dataset_id=dataset_id,
    disable_progress_bar=True,
    )

def convert_time(time_raw, time_units: str):
    # some coordinates report ISO8601 datestrings instead of numeric offsets
    if time_units == 'ISO8601':
        return pd.to_datetime(time_raw)
    return cftime.num2date(time_raw, time_units)

print("parts:", [x.name for x in data.products[0].datasets[0].versions[0].parts])

coordinates = data.products[0].datasets[0].versions[0].parts[0].get_coordinates()
time_coordinate = coordinates['time'][0]

time_units = time_coordinate.coordinate_unit
time_min_raw = time_coordinate.minimum_value
time_max_raw = time_coordinate.maximum_value

time_min = convert_time(time_min_raw, time_units)
time_max = convert_time(time_max_raw, time_units)
print(f"Minimum Value: {time_min}")
print(f"Maximum Value: {time_max}")

Some points of attention that might already be known (a guarded conversion sketch follows this list):

  • time_coordinate.minimum_value and time_coordinate.maximum_value for dataset_id = 'cmems_mod_glo_bgc_my_0.25deg_P1M-m' are None
  • time_coordinate.minimum_value and time_coordinate.maximum_value for dataset_id = 'cmems_obs-ins_arc_phybgcwav_mynrt_na_irr' are datestrings instead of numpy floats.
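
A guarded variant of convert_time that tolerates both cases could look like this (a sketch; the None passthrough and the string check are assumptions based on the two observations above):

def convert_time_guarded(time_raw, time_units):
    if time_raw is None:
        return None  # e.g. cmems_mod_glo_bgc_my_0.25deg_P1M-m reports no extents
    if isinstance(time_raw, str) or time_units == 'ISO8601':
        return pd.to_datetime(time_raw)  # e.g. the insitu dataset returns datestrings
    return cftime.num2date(time_raw, time_units)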

As for your comments:

If you use the dataset_part option, you know the name, so you can just select this one.

Unfortunately, this does not work: describe has no such argument.

I think there is a misunderstanding here: coordinates['time'] is not a list, it's a tuple, so coordinates['time'][0] is just taking the coordinate.

Sorry, you are right, I was confused indeed.

@renaudjester
Collaborator

@veenstrajelmer, thanks for the investigation!

time_coordinate.minimum_value and time_coordinate.maximum_value for dataset_id = 'cmems_mod_glo_bgc_my_0.25deg_P1M-m' are None

To be checked by us!

time_coordinate.minimum_value and time_coordinate.maximum_value for dataset_id = 'cmems_obs-ins_arc_phybgcwav_mynrt_na_irr' are datestrings instead of numpy floats.

Yep! That's how the data comes in the metadata, and we don't convert it (and I am not sure a change is planned right now).

If you use the dataset_part option, you know the name, so you can just select this one.
This does not work unfortunately, describe has no such argument.

What I meant is: in your workflow, let's say you want to subset some data. Then, regarding the part, there are two possibilities:

  • you don't know the part, so you don't set it in your subset command: then you are taking the default part and can access it with data.products[0].datasets[0].versions[0].parts[0]
  • you know the part you want, "history" for example: then you can select it with [part for part in data.products[0].datasets[0].versions[0].parts if part.name == "history"][0]
