Ab/refactor + replicating results #36

Draft · wants to merge 6 commits into `main`

47 changes: 46 additions & 1 deletion README.md
@@ -1,12 +1,50 @@
# H5Cloud - Cloud-Optimized Access of Hierarchical ICESat-2 Photon Data

## Quickstart

Launch on [CryoCloud](https://cryointhecloud.com), hosted in the AWS us-west-2 region.
Requires a [CryoCloud user account](https://book.cryointhecloud.com/content/Getting_Started.html).

[![CryoCloud JupyterHub](https://img.shields.io/badge/launch-CryoCloud-lightblue?logo=jupyter)](https://hub.cryointhecloud.com/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.jpy.wang%2FICESAT-2HackWeek%2Fh5cloud&urlpath=lab%2Ftree%2Fh5cloud%2Fnotebooks%2Fformat-preprocessing-times.ipynb&branch=main)

## How to use

Create a config (or use an existing one) in `h5cloud/file_configs`. Each config defines a set of files that belong to the same collection but were produced by different processing pipelines.

```yaml
collection: "atl03"
group: "/gt1l/heights"
variable: "h_ph"
lat_group: "lat_ph"
lon_group: "lon_ph"
# create
files:
  original:
    link: "s3://is2-cloud-experiments/h5cloud/atl03/average/original/ATL03_20191225111315_13680501_006_01.h5"
    processing: null
  repack_page_4gb:
    link: "s3://is2-cloud-experiments/h5cloud/atl03/average/repacked/ATL03_20191225111315_13680501_006_01.h5"
    processing: "h5repack -S PAGE -G 4000000"
```

Then run a test:

```python
from h5cloud import TestConfig
from h5cloud.tests import H5pyArrSubsetMean

tc = TestConfig.init(
    yaml_file="h5cloud/file_configs/atl03.yml",
    bucket="eodc-scratch",
    directory="h5cloud-tests/test-results",
)
h5py_arr_subset_test = H5pyArrSubsetMean(tc)
og_result = h5py_arr_subset_test.run('original')
repacked_result = h5py_arr_subset_test.run('repack_page_4gb')
print(f"Original: {og_result}")
print(f"Repacked: {repacked_result}")
```

Results will also be stored in the bucket and directory passed in the `TestConfig` initialization.
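
As a rough sketch of where to look afterwards (assuming the results are written as objects under the bucket and prefix passed above; the exact file names and layout are not specified here), the stored results can be listed with `s3fs`:

```python
import s3fs

# Illustrative only: list whatever result objects the test run wrote.
# The bucket and prefix are the ones passed to TestConfig in the example above.
fs = s3fs.S3FileSystem()
for path in fs.ls("eodc-scratch/h5cloud-tests/test-results"):
    print(path)
```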

## Collaborators

Aimee Barciauskas, Development Seed <br>
Andy Barrett, NSIDC <br>
Wei Ji, Development Seed <br>
@@ -19,12 +57,14 @@
Rachel Wegener, University of Maryland, College Park <br>
JP Swinski, NASA/GSFC <br>

## The Problem

ICESat-2 photon data is formatted as HDF5 files, which provide many advantages for scientific applications, including being self-describing and able to store heterogeneous data.
However, ICESat-2 granules frequently cover a larger spatial extent than a workflow needs, so users must read in the full ATL03 HDF5 file to geolocate the data and then subset it to a given area of interest. Applications like NASA Earthdata and the NSIDC data portals have simplified this process by letting users provide a bounding box and returning only the subset data. Still, because HDF5 files are serialized, the original ATL03 HDF5 file must be read fully into memory by the cloud provider.

This is in contrast to raster data, where cloud-optimized GeoTIFFs are organized internally such that it is easy to access only a specific subset of the total area using HTTP GET range requests. A similar capability for ICESat-2 along-track photon data would provide measurable read performance improvements for cloud data providers and local data users alike. The current aims of this Hackweek group are to benchmark current methods of accessing/subsetting ATL03 data from a public cloud data source (AWS S3), investigate methods of repacking photon data, and determine how libraries like [kerchunk](https://fsspec.github.io/kerchunk/) can be used for more efficient requesting of data from specific along-track locations.
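
As a minimal illustration of the kind of partial read a range request enables (the object key below is hypothetical, and anonymous access is assumed):

```python
import s3fs

# Sketch of an HTTP GET range request against S3: read only the first 16 KiB
# of an object instead of downloading the whole file.
fs = s3fs.S3FileSystem(anon=True)
with fs.open("is2-cloud-experiments/some-object.h5", "rb") as f:
    header_bytes = f.read(16 * 1024)  # fsspec issues a ranged GET under the hood
print(len(header_bytes))
```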

## Sample Data

ATL03 files used for testing were selected to maximize the baseline ATL03 file size and are available at `s3://is2-cloud-experiments`.

## Benchmark table
@@ -36,16 +76,18 @@
Notebooks exist in the [`notebooks/`](./notebooks/) folder for generating those benchmark results.
| **Library/File Format** | **Original HDF5** | **h5repack** | **kerchunk original** | **kerchunk repacked** | **GeoParquet** | **Flatgeobuf** |
| -------------------------------------- | ----------------- | ------------ | --------------------- | --------------------- | -------------- | -------------- |
| 1a - h5py | | | | n/a | n/a | n/a |
| 1b - gedi_subsetter H5DataFrame | | | | n/a | n/a | n/a |
| 2 - xarray via h5netcdf engine | | | | | n/a | n/a |
| 3 - h5coro | | | | n/a | n/a | n/a |
| 4a - geopandas via pyogrio/GDAL driver | | n/a | n/a | n/a | | |
| 4b - geopandas via parquet driver | | n/a | n/a | n/a | | n/a |

Key:

- n/a = Not applicable because the file format is not supported by the library.

## Resources

- [ICESat-2 Hackweek Data Access Tutorials](https://icesat-2-2023.hackweek.io/tutorials/data-access-and-format/index.html)
- [SlideRule](https://github.com/ICESat2-SlideRule)
- [gedi-subsetter](https://github.com/MAAP-Project/gedi-subsetter)
@@ -59,13 +101,16 @@
## Folders

### `contributors`

Each team member has their own folder under `contributors`, where they can
work on their contribution. Having a dedicated folder for oneself helps to
prevent conflicts when merging with the main branch.

### `notebooks`

Notebooks that are considered delivered results for the project should go in
here.

### `scripts`

Helper utilities that are shared with the team.
32 changes: 32 additions & 0 deletions h5cloud/file_configs/ATL03_006.yaml
@@ -0,0 +1,32 @@
collection: ATL03
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/kerchunk___ATL03_20181014000347_02350101_006_02.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/kerchunk_repacked_page_4mb___ATL03_20181014000347_02350101_006_02.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/original/ATL03_20181014000347_02350101_006_02.h5
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/repacked_page_4mb___ATL03_20181014000347_02350101_006_02.h5
    processing: h5repack -S PAGE -G 4000000
group: /gt1r/heights
variable: h_ph
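
For context, a kerchunk reference JSON like the ones linked above would typically be opened along these lines (a sketch only; credentials for the `nasa-veda-scratch` bucket, group handling, and the exact backend options may differ):

```python
import xarray as xr

# Open the kerchunk reference as a virtual Zarr store; reads are then served
# by ranged requests against the original HDF5 file on S3.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/kerchunk___ATL03_20181014000347_02350101_006_02.json",
            "remote_protocol": "s3",
        },
    },
)
print(ds)
```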
32 changes: 32 additions & 0 deletions h5cloud/file_configs/ATL08_006.yaml
@@ -0,0 +1,32 @@
collection: ATL08
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL08___006/kerchunk___ATL08_20181014001049_02350102_006_02.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL08___006/kerchunk_repacked_page_4mb___ATL08_20181014001049_02350102_006_02.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL08___006/original/ATL08_20181014001049_02350102_006_02.h5
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL08___006/repacked_page_4mb___ATL08_20181014001049_02350102_006_02.h5
    processing: h5repack -S PAGE -G 4000000
group: /gt1l/land_segments
variable: dem_h
32 changes: 32 additions & 0 deletions h5cloud/file_configs/GPM_3IMERGHHE_06.yaml
@@ -0,0 +1,32 @@
collection: GPM_3IMERGHHE
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/GPM_3IMERGHHE___06/kerchunk___3B-HHR-E.MS.MRG.3IMERG.20000601-S000000-E002959.0000.V06B.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/GPM_3IMERGHHE___06/kerchunk_repacked_page_4mb___3B-HHR-E.MS.MRG.3IMERG.20000601-S000000-E002959.0000.V06B.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/GPM_3IMERGHHE___06/original/3B-HHR-E.MS.MRG.3IMERG.20000601-S000000-E002959.0000.V06B.HDF5
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/GPM_3IMERGHHE___06/repacked_page_4mb___3B-HHR-E.MS.MRG.3IMERG.20000601-S000000-E002959.0000.V06B.HDF5
    processing: h5repack -S PAGE -G 4000000
group: Grid
variable: precipitationCal
32 changes: 32 additions & 0 deletions h5cloud/file_configs/OMVFPITMET_003.yaml
@@ -0,0 +1,32 @@
collection: OMVFPITMET
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/OMVFPITMET___003/kerchunk___OMI-Aura_ANC-OMVFPITMET_2004m1001t000305-o01132_v003-2018m1015t140716.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/OMVFPITMET___003/kerchunk_repacked_page_4mb___OMI-Aura_ANC-OMVFPITMET_2004m1001t000305-o01132_v003-2018m1015t140716.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/OMVFPITMET___003/original/OMI-Aura_ANC-OMVFPITMET_2004m1001t000305-o01132_v003-2018m1015t140716.nc4
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/OMVFPITMET___003/repacked_page_4mb___OMI-Aura_ANC-OMVFPITMET_2004m1001t000305-o01132_v003-2018m1015t140716.nc4
    processing: h5repack -S PAGE -G 4000000
group: /
variable: PS
32 changes: 32 additions & 0 deletions h5cloud/file_configs/SNDRSNIML2CCPRETN_2.yaml
@@ -0,0 +1,32 @@
collection: SNDRSNIML2CCPRETN
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/SNDRSNIML2CCPRETN___2/kerchunk___SNDR.SNPP.CRIMSS.20120120T1254.m06.g130.L2_CLIMCAPS_RET_NSR.std.v02_28.G.200409132855.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/SNDRSNIML2CCPRETN___2/kerchunk_repacked_page_4mb___SNDR.SNPP.CRIMSS.20120120T1254.m06.g130.L2_CLIMCAPS_RET_NSR.std.v02_28.G.200409132855.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/SNDRSNIML2CCPRETN___2/original/SNDR.SNPP.CRIMSS.20120120T1254.m06.g130.L2_CLIMCAPS_RET_NSR.std.v02_28.G.200409132855.nc
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/SNDRSNIML2CCPRETN___2/repacked_page_4mb___SNDR.SNPP.CRIMSS.20120120T1254.m06.g130.L2_CLIMCAPS_RET_NSR.std.v02_28.G.200409132855.nc
    processing: h5repack -S PAGE -G 4000000
group: /
variable: surf_temp
13 changes: 13 additions & 0 deletions h5cloud/file_configs/atl03.yml
@@ -0,0 +1,13 @@
collection: "atl03__006"
group: "/gt1l/heights"
variable: "h_ph"
lat_group: "lat_ph"
lon_group: "lon_ph"
# create
files:
  original:
    link: "s3://is2-cloud-experiments/h5cloud/atl03/average/original/ATL03_20191225111315_13680501_006_01.h5"
    processing: null
  repack_page_4gb:
    link: "s3://is2-cloud-experiments/h5cloud/atl03/average/repacked/ATL03_20191225111315_13680501_006_01.h5"
    processing: "h5repack -S PAGE -G 4000000"
34 changes: 34 additions & 0 deletions h5cloud/test_config.py
@@ -0,0 +1,34 @@
import yaml
from pydantic import BaseModel
from typing import Dict, Optional


class FileConfig(BaseModel):
    link: str
    processing: Optional[str] = None


class TestConfig(BaseModel):
    results_bucket: str
    results_directory: str
    collection: str
    group: str
    variable: str
    lat_group: Optional[str] = None
    lon_group: Optional[str] = None
    files: Dict[str, FileConfig]


def load_from_yaml(yaml_file: str, results_bucket: str, results_directory: str):
    """
    Load the YAML configuration from the specified file and create a TestConfig object.
    """
    try:
        with open(yaml_file, 'r') as file:
            data = yaml.safe_load(file)
        return TestConfig(results_bucket=results_bucket, results_directory=results_directory, **data)
    except FileNotFoundError:
        print(f"Error: The file {yaml_file} was not found.")
    except yaml.YAMLError as exc:
        print(f"Error in YAML formatting: {exc}")
    except Exception as e:
        print(f"An error occurred: {e}")

    return None
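
For context, a minimal sketch of how this loader is meant to be exercised (the bucket and results directory below are placeholders, not values from this PR):

```python
from h5cloud.test_config import load_from_yaml

# Hypothetical usage: load one of the configs added in this PR and inspect it.
tc = load_from_yaml(
    yaml_file="h5cloud/file_configs/atl03.yml",
    results_bucket="my-results-bucket",
    results_directory="h5cloud-tests/test-results",
)
if tc is not None:
    print(tc.collection, list(tc.files))
```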
File renamed without changes.
24 changes: 24 additions & 0 deletions h5cloud/tests/h5coro_arr_mean.py
@@ -0,0 +1,24 @@
from .h5test import H5Test, timer_decorator
import subprocess

try:
    import h5coro
except ImportError:
    # h5coro is not in the default environment; install it on the fly.
    subprocess.run(['mamba', 'install', '-c', 'conda-forge', 'h5coro', '--yes'])
    import h5coro

from h5coro import h5coro, s3driver, filedriver
h5coro.config(errorChecking=True, verbose=False, enableAttributes=False)


class H5CoroArrMean(H5Test):
    @timer_decorator
    def run(self, file_format: str):
        tc = self.test_config
        file = tc.files[file_format]
        # h5coro takes the bucket/key without the s3:// scheme prefix.
        h5obj = h5coro.H5Coro(file.link.replace("s3://", ""), s3driver.S3Driver)
        h5obj.readDatasets(datasets=[f'{tc.group}/{tc.variable}'], block=True)
        return h5obj[f'{tc.group}/{tc.variable}'].values.mean()
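
A rough usage sketch, assuming a `TestConfig` (here `tc`) has already been loaded as above; the file keys come from the YAML config:

```python
from h5cloud.tests.h5coro_arr_mean import H5CoroArrMean

# Hypothetical: compare the original file against its repacked counterpart.
test = H5CoroArrMean(tc)
mean_original = test.run("original")
mean_repacked = test.run("repacked_page_4mb")
print(mean_original, mean_repacked)
```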
12 changes: 12 additions & 0 deletions h5cloud/tests/h5py_arr_mean.py
@@ -0,0 +1,12 @@
from .h5test import H5Test, timer_decorator
import h5py


class H5pyArrMean(H5Test):
    @timer_decorator
    def run(self, file_format: str, io_params: dict = {}):
        tc = self.test_config
        file = tc.files[file_format]
        # Open the remote file via fsspec/s3fs, forwarding any fsspec- and
        # h5py-specific options, then compute the mean of the target variable.
        with h5py.File(
            self.s3_fs.open(file.link, 'rb', **io_params.get('fsspec_params', {})),
            'r',
            **io_params.get('h5py_params', {}),
        ) as h5obj:
            data = h5obj[f'{tc.group}/{tc.variable}'][:].flatten()
        return data.mean()
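
A sketch of how the `io_params` pass-through might be exercised; the cache settings below are illustrative, not values prescribed by this PR:

```python
# Hypothetical invocation: tune fsspec block caching and the h5py chunk cache
# through the io_params pass-through, using a previously loaded TestConfig `tc`.
io_params = {
    "fsspec_params": {"cache_type": "blockcache", "block_size": 8 * 1024 * 1024},
    "h5py_params": {"rdcc_nbytes": 4 * 1024 * 1024},
}
result = H5pyArrMean(tc).run("original", io_params=io_params)
print(result)
```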