SIParCS 2021 - Lucas Sterzinger


Lucas Sterzinger, University of California, Davis

Fake it till you make it - Zarr-like access of existing netCDF4 datasets

With the rise of low-cost cloud object storage, many organizations are moving their data archives to the cloud. This switch raises an important question: what data format is best for cloud-based object storage? The new, cloud-optimized Zarr format has emerged as one of the best storage formats for this move. However, a problem is growing: data continues to be uploaded in existing formats (often NetCDF4/HDF5) while a suitable alternative is discussed. NetCDF4 is still the standard for many applications and is widely supported, which makes it difficult to replace existing data archives with more cloud-friendly formats. Adopting such a format instead requires full data duplication, which is time-consuming and expensive.

In this presentation, a potential solution to this problem is introduced. ReferenceMaker, a new part of the Intake group’s fsspec project, can provide some of the cloud-performant features of Zarr while keeping the original NetCDF4 file intact. ReferenceMaker uses the Zarr metadata specification, but instead of pointing to individual chunk files, it points to byte offsets in the binary NetCDF4 file itself, so no data duplication is required.
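The byte-offset idea can be illustrated with a minimal sketch in plain Python. This is not ReferenceMaker's actual code or on-disk format. The `read_chunk` helper, the `temp/0` style keys, and the small binary file standing in for a NetCDF4 file are all hypothetical, chosen only to show how a chunk key can resolve to a byte range inside one existing file instead of to a separate chunk file.

```python
import json
import os
import tempfile


def read_chunk(references, key):
    """Resolve one Zarr-style key to bytes (hypothetical helper)."""
    entry = references[key]
    if isinstance(entry, str):
        # Small metadata documents are stored inline in the reference set.
        return entry.encode("utf-8")
    # Data chunks are stored as [path, byte offset, byte length].
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an existing NetCDF4/HDF5 file: a header followed by two chunks.
header = b"HEADER--"
chunk0 = b"\x01\x02\x03\x04"
chunk1 = b"\x05\x06\x07\x08"

with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(header + chunk0 + chunk1)
    data_path = tmp.name

# The reference set: metadata inline, chunks as byte ranges in the original file.
references = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "temp/0": [data_path, len(header), len(chunk0)],
    "temp/1": [data_path, len(header) + len(chunk0), len(chunk1)],
}

first = read_chunk(references, "temp/0")
second = read_chunk(references, "temp/1")
os.remove(data_path)  # clean up the stand-in file
```

In this sketch the original file is never rewritten or copied. Only the small reference mapping is new, which is why the approach needs far less storage than a full conversion to Zarr.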

We run a test cloud-based scientific workflow across NetCDF, Zarr, and ReferenceMaker data access methods and show that ReferenceMaker performs at near-Zarr speeds while requiring a small fraction of the storage space.

Mentors: Julia Kent, Kevin Paul, & Chelle Gentemann (Farallon Institute)

Slides and poster