Cloud-Native Benchmarking of Geospatial Time Series Array Storages for System Optimized Efficient Data Retrieval and Extractions

  • Type: Master Thesis
  • Status: Completed
  • ID: 2021-014
  • Student: Lianne Kirsten Bonita Visperas

Gridded hydrological geospatial datasets are growing in volume, and current storage solutions are becoming I/O bound and perform poorly at the system level because storage choices and access patterns vary widely across applications. The problem depends heavily on how hydrological geospatial time-series arrays are managed, which covers storage, data extent, chunking, data compression and the querying patterns specific to this data, especially geometry, time and space-time extraction. This research focuses on developing a benchmarking suite for query extractions that takes chunk structure and data compression into account. Parametrized benchmark tests have been formulated for multi-dimensional array technologies such as NetCDF4, Zarr, and Dask on file-based, cloud object-based and database-backed storage. The storage benchmarking framework uses the National Water Model (NWM) dataset along with a randomized Xarray dataset, with the following parameters: data format, chunking use cases for extraction, data extent, compression, storage backend (POSIX, S3 and HBase), and number of parallel or concurrent workers. The benchmark use cases, such as point extractions, all time stamps, single to multiple time stamps, geospatial bounding-box chunks and aggregate extractions, were formulated in consultation with hydrology experts from KISTERS AG, Business Unit Water. The benchmarking environment was based on Array Storage, KISTERS’ gridded big-data hydrological solution, which handles the extract, transform and load (ETL) of the NWM dataset as well as the storage back-end APIs and connectors. This research also aims to capture the efficiency of storage and retrieval patterns in order to identify those that provide the best query extraction performance (I/O throughput and computation) and system performance (disk and memory).
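
To illustrate why chunking interacts so strongly with the extraction use cases above, the following minimal sketch counts how many chunks a query has to read under two chunk layouts. It is not taken from the thesis or from Array Storage: the array shape, the two layouts, the query ranges and the helper name `chunks_touched` are all hypothetical and chosen only for illustration.

```python
from itertools import product


def chunks_touched(selection, chunk_shape):
    """Count the chunks intersected by a hyper-rectangular selection.

    selection   -- one half-open (start, stop) index range per dimension
    chunk_shape -- chunk length along each dimension
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size          # index of the first chunk touched
        last = (stop - 1) // size      # index of the last chunk touched
        count *= last - first + 1
    return count


# Hypothetical layouts for a (time, y, x) array of shape (720, 1000, 1000).
layouts = {
    "time-sliced": (1, 1000, 1000),   # one chunk per time stamp
    "time-series": (720, 10, 10),     # full time axis, small spatial tiles
}

# Hypothetical queries modeled on the use cases named in the abstract.
queries = {
    "point, all time stamps": ((0, 720), (500, 501), (500, 501)),
    "bounding box, one time stamp": ((0, 1), (100, 200), (100, 200)),
}

# Evaluate every layout against every query.
results = {
    (layout, query): chunks_touched(sel, shape)
    for (layout, shape), (query, sel) in product(layouts.items(), queries.items())
}

for (layout, query), n in results.items():
    print(f"{layout:12s} | {query:30s} | {n} chunk(s)")
```

Under these assumed shapes, the point query over all time stamps touches 720 chunks in the time-sliced layout but only 1 in the time-series layout, while the bounding-box query at a single time stamp reverses the picture (1 chunk versus 100). A layout tuned for one extraction pattern can therefore be costly for another, which is why chunking appears as a benchmark parameter alongside format, compression and backend.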