One of the known limitations of HDF5 is its lack of support for multi-threaded access to data. This creates performance, deployment, and adoption barriers for multi-threaded applications. To date, no effort has been made to overcome this restriction, mostly because of the difficulty of retrofitting a large code base with thread concurrency. Recent developments in the HDF5 architecture allow us to propose a strategy for retrofitting thread safety incrementally, without affecting other development going on in HDF5 and without disrupting the applications that use it. We will retrofit a few HDF5 packages and provide a special external HDF5 VOL connector to binary storage, which will let applications benefit from multi-threaded access to data in a relatively short time. We plan to contribute the enhanced HDF5 components back to open-source HDF5, thus enabling other multi-threaded VOL connectors created by the community, e.g., the Caching VOL by Argonne National Laboratory. In the future, we plan to deliver a multi-threaded version of the HDF5 library to the community.
Sparse data is common in many scientific disciplines. Examples include large-scale simulations of physical phenomena, High Energy Physics experiments, machine learning applications, and many more. The acquired data is often stored in the HDF5 format. As the amount of data in HDF5 continues to grow due to higher instrument and detector resolution, higher sampling rates, and similar factors, there is a clear demand for efficient management of sparse data. This need is accompanied by a growing demand in the experimental sciences to perform data analysis in high-performance computing environments, where HDF5 is widely used. Adding support for sparse data will eliminate the need for custom data processing software and will reduce both the storage requirements and the memory usage of applications that work with sparse data.
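To make the storage argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not HDF5 code and says nothing about the planned file format: the function names, the simple coordinate-list encoding, the 8-byte index size, and the detector frame dimensions are all assumptions chosen only for illustration.

```python
from math import prod


def dense_bytes(shape, itemsize):
    """Bytes needed to store every element of an array with this shape."""
    return prod(shape) * itemsize


def coordinate_bytes(shape, itemsize, n_defined, index_size=8):
    """Bytes needed to store only the defined elements with their coordinates.

    Assumption: one index of `index_size` bytes per dimension is stored
    alongside each defined element (a plain coordinate list).
    """
    return n_defined * (itemsize + len(shape) * index_size)


# Hypothetical detector frame: 4096 x 4096 float32 pixels, about 1% defined.
shape = (4096, 4096)
itemsize = 4
n_defined = prod(shape) // 100

dense = dense_bytes(shape, itemsize)
sparse = coordinate_bytes(shape, itemsize, n_defined)

print(f"dense:  {dense} bytes")
print(f"sparse: {sparse} bytes")
print(f"ratio:  {dense / sparse:.1f}x")
```

Under these assumptions, the coordinate encoding is smaller whenever the fraction of defined elements is below `itemsize / (itemsize + ndim * index_size)`, which is 4 / 20 = 20% for this example. Above that fraction, dense storage takes less space.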
We will implement sparse data storage in HDF5 and contribute the changes to the open-source HDF5 software. Access to HDF5 sparse storage will be transparent to applications and will not require special sparse data encoding or additional coding effort. We will use existing elements of the software to implement the sparse data feature, while enhancing some HDF5 components and documentation. We’ve already outlined the required file format changes and suggested new APIs. If you are interested, please check the open discussion item in the HDF5 GitHub repository.
You can find more information about our work and follow our progress by visiting the Lifeboat GitHub repository. The source code and documentation we create are open to the community. Your feedback would be highly appreciated.