Ab/refactor + replicating results #36

Draft · wants to merge 6 commits into `main`

47 changes: 46 additions & 1 deletion README.md
@@ -1,12 +1,50 @@
# H5Cloud - Cloud-Optimized Access of Hierarchical ICESat-2 Photon Data

## Quickstart

Launch on [CryoCloud](https://cryointhecloud.com), hosted in the AWS us-west-2 region.
Requires a [CryoCloud user account](https://book.cryointhecloud.com/content/Getting_Started.html).

[![CryoCloud JupyterHub](https://img.shields.io/badge/launch-CryoCloud-lightblue?logo=jupyter)](https://hub.cryointhecloud.com/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.jpy.wang%2FICESAT-2HackWeek%2Fh5cloud&urlpath=lab%2Ftree%2Fh5cloud%2Fnotebooks%2Fformat-preprocessing-times.ipynb&branch=main)

## How to use

Create a config (or use an existing one) in `h5cloud/file_configs`. Each config defines a set of files that belong to the same collection but were produced by different processing pipelines.

```yaml
collection: "atl03"
group: "/gt1l/heights"
variable: "h_ph"
lat_group: "lat_ph"
lon_group: "lon_ph"
# create
files:
  original:
    link: "s3://is2-cloud-experiments/h5cloud/atl03/average/original/ATL03_20191225111315_13680501_006_01.h5"
    processing: null
  repack_page_4gb:
    link: "s3://is2-cloud-experiments/h5cloud/atl03/average/repacked/ATL03_20191225111315_13680501_006_01.h5"
    processing: "h5repack -S PAGE -G 4000000"
```

Then run a test:

```python
from h5cloud import TestConfig
from h5cloud.tests import H5pyArrSubsetMean

tc = TestConfig.init(
    yaml_file="h5cloud/file_configs/atl03.yml",
    bucket="eodc-scratch",
    directory="h5cloud-tests/test-results",
)
h5py_arr_subset_test = H5pyArrSubsetMean(tc)
og_result = h5py_arr_subset_test.run('original')
repacked_result = h5py_arr_subset_test.run('repack_page_4gb')
print(f"Original: {og_result}")
print(f"Repacked: {repacked_result}")
```

Results will also be stored in the bucket and directory passed in the `TestConfig` initialization.
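
As a rough sketch of where to look afterwards (assuming the results are written as objects under the bucket and prefix passed above; the exact file names and layout are not specified here), the stored results can be listed with `s3fs`:

```python
import s3fs

# Illustrative only: list whatever result objects the test run wrote.
# The bucket and prefix are the ones passed to TestConfig in the example above.
fs = s3fs.S3FileSystem()
for path in fs.ls("eodc-scratch/h5cloud-tests/test-results"):
    print(path)
```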

## Collaborators

Aimee Barciauskas, Development Seed <br>
Andy Barrett, NSIDC <br>
Wei Ji, Development Seed <br>
@@ -19,12 +57,14 @@
Rachel Wegener, University of Maryland, College Park <br>
JP Swinski, NASA/GSFC <br>

## The Problem

ICESat-2 photon data is formatted as HDF5 files, which provide many advantages for scientific applications, including being self-describing and able to store heterogeneous data.
However, ICESat-2 granules frequently cover a larger spatial extent than a workflow needs, so users must read in the full ATL03 HDF5 file to geolocate the data and then subset it to a given area of interest. Applications like NASA Earthdata and the NSIDC data portals have simplified this process by letting users provide a bounding box and returning only the subset data. Still, because HDF5 files are serialized, the original ATL03 HDF5 file must be read fully into memory by the cloud provider.

This is in contrast to raster data, where cloud-optimized GeoTIFFs are organized internally such that it is easy to access only a specific subset of the total area using HTTP GET range requests. A similar capability for ICESat-2 along-track photon data would provide measurable read performance improvements for cloud data providers and local data users alike. The current aims of this Hackweek group are to benchmark current methods of accessing/subsetting ATL03 data from a public cloud data source (AWS S3), investigate methods of repacking photon data, and determine how libraries like [kerchunk](https://fsspec.github.io/kerchunk/) can be used for more efficient requesting of data from specific along-track locations.
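
As a minimal illustration of the kind of partial read a range request enables (the object key below is hypothetical, and anonymous access is assumed):

```python
import s3fs

# Sketch of an HTTP GET range request against S3: read only the first 16 KiB
# of an object instead of downloading the whole file.
fs = s3fs.S3FileSystem(anon=True)
with fs.open("is2-cloud-experiments/some-object.h5", "rb") as f:
    header_bytes = f.read(16 * 1024)  # fsspec issues a ranged GET under the hood
print(len(header_bytes))
```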

## Sample Data

ATL03 files used for testing were selected to maximize the baseline ATL03 file size and are available at `s3://is2-cloud-experiments`.

## Benchmark table
@@ -36,16 +76,18 @@
Notebooks exist in the [`notebooks/`](./notebooks/) folder for generating those benchmark results.
| **Library/File Format** | **Original HDF5** | **h5repack** | **kerchunk original** | **kerchunk repacked** | **GeoParquet** | **Flatgeobuf** |
| -------------------------------------- | ----------------- | ------------ | --------------------- | --------------------- | -------------- | -------------- |
| 1a - h5py | | | | n/a | n/a | n/a |
| 1b - gedi_subsetter H5DataFrame | | | | n/a | n/a | n/a |
| 2 - xarray via h5netcdf engine | | | | | n/a | n/a |
| 3 - h5coro | | | | n/a | n/a | n/a |
| 4a - geopandas via pyogrio/GDAL driver | | n/a | n/a | n/a | | |
| 4b - geopandas via parquet driver | | n/a | n/a | n/a | | n/a |

Key:

- n/a = Not applicable because the file format is not supported by the library.

## Resources

- [ICESat-2 Hackweek Data Access Tutorials](https://icesat-2-2023.hackweek.io/tutorials/data-access-and-format/index.html)
- [SlideRule](https://github.com/ICESat2-SlideRule)
- [gedi-subsetter](https://github.com/MAAP-Project/gedi-subsetter)
@@ -59,13 +101,16 @@
## Folders

### `contributors`

Each team member has their own folder under `contributors`, where they can
work on their contribution. Having a dedicated folder for oneself helps to
prevent conflicts when merging with the main branch.

### `notebooks`

Notebooks that are considered delivered results for the project should go in
here.

### `scripts`

Helper utilities that are shared with the team.
32 changes: 32 additions & 0 deletions h5cloud/file_configs/ATL03_006.yaml
@@ -0,0 +1,32 @@
collection: ATL03
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/kerchunk___ATL03_20181014000347_02350101_006_02.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/kerchunk_repacked_page_4mb___ATL03_20181014000347_02350101_006_02.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/original/ATL03_20181014000347_02350101_006_02.h5
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/repacked_page_4mb___ATL03_20181014000347_02350101_006_02.h5
    processing: h5repack -S PAGE -G 4000000
group: /gt1r/heights
variable: h_ph
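
For context, a kerchunk reference JSON like the ones linked above would typically be opened along these lines (a sketch only; credentials for the `nasa-veda-scratch` bucket, group handling, and the exact backend options may differ):

```python
import xarray as xr

# Open the kerchunk reference as a virtual Zarr store; reads are then served
# by ranged requests against the original HDF5 file on S3.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL03___006/kerchunk___ATL03_20181014000347_02350101_006_02.json",
            "remote_protocol": "s3",
        },
    },
)
print(ds)
```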
32 changes: 32 additions & 0 deletions h5cloud/file_configs/ATL08_006.yaml
@@ -0,0 +1,32 @@
collection: ATL08
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL08___006/kerchunk___ATL08_20181014001049_02350102_006_02.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL08___006/kerchunk_repacked_page_4mb___ATL08_20181014001049_02350102_006_02.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL08___006/original/ATL08_20181014001049_02350102_006_02.h5
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/ATL08___006/repacked_page_4mb___ATL08_20181014001049_02350102_006_02.h5
    processing: h5repack -S PAGE -G 4000000
group: /gt1l/land_segments
variable: dem_h
32 changes: 32 additions & 0 deletions h5cloud/file_configs/GPM_3IMERGHHE_06.yaml
@@ -0,0 +1,32 @@
collection: GPM_3IMERGHHE
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/GPM_3IMERGHHE___06/kerchunk___3B-HHR-E.MS.MRG.3IMERG.20000601-S000000-E002959.0000.V06B.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/GPM_3IMERGHHE___06/kerchunk_repacked_page_4mb___3B-HHR-E.MS.MRG.3IMERG.20000601-S000000-E002959.0000.V06B.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/GPM_3IMERGHHE___06/original/3B-HHR-E.MS.MRG.3IMERG.20000601-S000000-E002959.0000.V06B.HDF5
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/GPM_3IMERGHHE___06/repacked_page_4mb___3B-HHR-E.MS.MRG.3IMERG.20000601-S000000-E002959.0000.V06B.HDF5
    processing: h5repack -S PAGE -G 4000000
group: Grid
variable: precipitationCal
32 changes: 32 additions & 0 deletions h5cloud/file_configs/OMVFPITMET_003.yaml
@@ -0,0 +1,32 @@
collection: OMVFPITMET
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/OMVFPITMET___003/kerchunk___OMI-Aura_ANC-OMVFPITMET_2004m1001t000305-o01132_v003-2018m1015t140716.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/OMVFPITMET___003/kerchunk_repacked_page_4mb___OMI-Aura_ANC-OMVFPITMET_2004m1001t000305-o01132_v003-2018m1015t140716.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/OMVFPITMET___003/original/OMI-Aura_ANC-OMVFPITMET_2004m1001t000305-o01132_v003-2018m1015t140716.nc4
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/OMVFPITMET___003/repacked_page_4mb___OMI-Aura_ANC-OMVFPITMET_2004m1001t000305-o01132_v003-2018m1015t140716.nc4
    processing: h5repack -S PAGE -G 4000000
group: /
variable: PS
32 changes: 32 additions & 0 deletions h5cloud/file_configs/SNDRSNIML2CCPRETN_2.yaml
@@ -0,0 +1,32 @@
collection: SNDRSNIML2CCPRETN
files:
  kerchunk:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/SNDRSNIML2CCPRETN___2/kerchunk___SNDR.SNPP.CRIMSS.20120120T1254.m06.g130.L2_CLIMCAPS_RET_NSR.std.v02_28.G.200409132855.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  kerchunk_repacked_page_4mb:
    input_file: repacked_page_4mb
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/SNDRSNIML2CCPRETN___2/kerchunk_repacked_page_4mb___SNDR.SNPP.CRIMSS.20120120T1254.m06.g130.L2_CLIMCAPS_RET_NSR.std.v02_28.G.200409132855.json
    processing: |
      def generate_json_reference(in_filename, out_filename):
          so = dict(mode="rb", default_fill_cache=False, default_cache_type="first")
          with fs_read.open(in_filename, **so) as infile:
              h5chunks = SingleHdf5ToZarr(in_filename)
          suffix = out_filename.split(".")[-1]
          out_filename = out_filename.replace(suffix, 'json')
          with open(out_filename, 'w') as outfile:
              outfile.write(json.dumps(h5chunks.translate()))
          return out_filename
  original:
    input_file: null
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/SNDRSNIML2CCPRETN___2/original/SNDR.SNPP.CRIMSS.20120120T1254.m06.g130.L2_CLIMCAPS_RET_NSR.std.v02_28.G.200409132855.nc
    processing: null
  repacked_page_4mb:
    input_file: original
    link: s3://nasa-veda-scratch/eodc_hdf5_experiments/SNDRSNIML2CCPRETN___2/repacked_page_4mb___SNDR.SNPP.CRIMSS.20120120T1254.m06.g130.L2_CLIMCAPS_RET_NSR.std.v02_28.G.200409132855.nc
    processing: h5repack -S PAGE -G 4000000
group: /
variable: surf_temp
13 changes: 13 additions & 0 deletions h5cloud/file_configs/atl03.yml
@@ -0,0 +1,13 @@
collection: "atl03__006"
group: "/gt1l/heights"
variable: "h_ph"
lat_group: "lat_ph"
lon_group: "lon_ph"
# create
files:
  original:
    link: "s3://is2-cloud-experiments/h5cloud/atl03/average/original/ATL03_20191225111315_13680501_006_01.h5"
    processing: null
  repack_page_4gb:
    link: "s3://is2-cloud-experiments/h5cloud/atl03/average/repacked/ATL03_20191225111315_13680501_006_01.h5"
    processing: "h5repack -S PAGE -G 4000000"
34 changes: 34 additions & 0 deletions h5cloud/test_config.py
@@ -0,0 +1,34 @@
import yaml
from pydantic import BaseModel
from typing import Dict, Optional


class FileConfig(BaseModel):
    link: str
    processing: Optional[str] = None


class TestConfig(BaseModel):
    results_bucket: str
    results_directory: str
    collection: str
    group: str
    variable: str
    lat_group: Optional[str] = None
    lon_group: Optional[str] = None
    files: Dict[str, FileConfig]


def load_from_yaml(yaml_file: str, results_bucket: str, results_directory: str):
    """
    Load the YAML configuration from the specified file and create a TestConfig object.
    """
    try:
        with open(yaml_file, 'r') as file:
            data = yaml.safe_load(file)
        return TestConfig(results_bucket=results_bucket, results_directory=results_directory, **data)
    except FileNotFoundError:
        print(f"Error: The file {yaml_file} was not found.")
    except yaml.YAMLError as exc:
        print(f"Error in YAML formatting: {exc}")
    except Exception as e:
        print(f"An error occurred: {e}")

    return None
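
For context, a minimal sketch of how this loader is meant to be exercised (the bucket and results directory below are placeholders, not values from this PR):

```python
from h5cloud.test_config import load_from_yaml

# Hypothetical usage: load one of the configs added in this PR and inspect it.
tc = load_from_yaml(
    yaml_file="h5cloud/file_configs/atl03.yml",
    results_bucket="my-results-bucket",
    results_directory="h5cloud-tests/test-results",
)
if tc is not None:
    print(tc.collection, list(tc.files))
```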
File renamed without changes.
24 changes: 24 additions & 0 deletions h5cloud/tests/h5coro_arr_mean.py
@@ -0,0 +1,24 @@
from .h5test import H5Test, timer_decorator
import subprocess

try:
    import h5coro
except ImportError:
    # h5coro is not in the default environment; install it on the fly.
    subprocess.run(['mamba', 'install', '-c', 'conda-forge', 'h5coro', '--yes'])
    import h5coro

from h5coro import h5coro, s3driver, filedriver
h5coro.config(errorChecking=True, verbose=False, enableAttributes=False)


class H5CoroArrMean(H5Test):
    @timer_decorator
    def run(self, file_format: str):
        tc = self.test_config
        file = tc.files[file_format]
        # h5coro takes the bucket/key without the s3:// scheme prefix.
        h5obj = h5coro.H5Coro(file.link.replace("s3://", ""), s3driver.S3Driver)
        h5obj.readDatasets(datasets=[f'{tc.group}/{tc.variable}'], block=True)
        return h5obj[f'{tc.group}/{tc.variable}'].values.mean()
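
A rough usage sketch, assuming a `TestConfig` (here `tc`) has already been loaded as above; the file keys come from the YAML config:

```python
from h5cloud.tests.h5coro_arr_mean import H5CoroArrMean

# Hypothetical: compare the original file against its repacked counterpart.
test = H5CoroArrMean(tc)
mean_original = test.run("original")
mean_repacked = test.run("repacked_page_4mb")
print(mean_original, mean_repacked)
```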
12 changes: 12 additions & 0 deletions h5cloud/tests/h5py_arr_mean.py
@@ -0,0 +1,12 @@
from .h5test import H5Test, timer_decorator
import h5py


class H5pyArrMean(H5Test):
    @timer_decorator
    def run(self, file_format: str, io_params: dict = {}):
        tc = self.test_config
        file = tc.files[file_format]
        # Open the remote file via fsspec/s3fs, forwarding any fsspec- and
        # h5py-specific options, then compute the mean of the target variable.
        with h5py.File(
            self.s3_fs.open(file.link, 'rb', **io_params.get('fsspec_params', {})),
            'r',
            **io_params.get('h5py_params', {}),
        ) as h5obj:
            data = h5obj[f'{tc.group}/{tc.variable}'][:].flatten()
        return data.mean()
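
A sketch of how the `io_params` pass-through might be exercised; the cache settings below are illustrative, not values prescribed by this PR:

```python
# Hypothetical invocation: tune fsspec block caching and the h5py chunk cache
# through the io_params pass-through, using a previously loaded TestConfig `tc`.
io_params = {
    "fsspec_params": {"cache_type": "blockcache", "block_size": 8 * 1024 * 1024},
    "h5py_params": {"rdcc_nbytes": 4 * 1024 * 1024},
}
result = H5pyArrMean(tc).run("original", io_params=io_params)
print(result)
```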