SC22 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Emergency Backup for Scientific Applications


Workshop: Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)

Authors: Aniello Esposito, Christopher Haine, and Ali Mohammed (HPE HPC/AI EMEA Research Lab (ERL), Switzerland)


Abstract: A framework for efficient in-network data transfer between a parallel application and an independent storage server is proposed. The case of an unexpected and unrecoverable interruption of the application is considered, in which the server acts as an emergency backup service that prevents the unnecessary loss of valuable information. The framework makes full use of cleanup time buffers by relying on RDMA transport and on redistribution of data through the Maestro middleware. Experiments are performed on an HPE/Cray EX system to construct a heuristic for the amount of data that can realistically be backed up within a given time buffer. With a single server node, the method already proves faster than VELOC and plain MPI-IO for up to a hundred user ranks, and the in-network approach, as opposed to filesystem transport, promises better scalability in the long run.
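The idea of a heuristic relating a time buffer to a realistic backup volume can be illustrated with a minimal sketch. It assumes a simple linear model (usable time multiplied by sustained bandwidth), which is not taken from the paper; the function names, the setup-overhead parameter, and all numbers are hypothetical and chosen only for illustration.

```python
def backup_budget_bytes(time_buffer_s, bandwidth_bytes_per_s, setup_overhead_s=0.0):
    """Estimate how many bytes fit into a cleanup time buffer.

    Hypothetical linear model: a fixed setup overhead is subtracted from the
    time buffer, and the remainder is multiplied by a sustained bandwidth.
    """
    usable_s = max(0.0, time_buffer_s - setup_overhead_s)
    return usable_s * bandwidth_bytes_per_s


def fits_in_buffer(data_bytes, time_buffer_s, bandwidth_bytes_per_s, setup_overhead_s=0.0):
    """Return True if data_bytes can be backed up within the time buffer."""
    budget = backup_budget_bytes(time_buffer_s, bandwidth_bytes_per_s, setup_overhead_s)
    return data_bytes <= budget


if __name__ == "__main__":
    # Illustrative values only: 30 s buffer, 10 GB/s sustained, 2 s setup.
    budget = backup_budget_bytes(30.0, 10e9, setup_overhead_s=2.0)
    print(f"Estimated backup budget: {budget / 1e9:.0f} GB")
    print("100 GB fits:", fits_in_buffer(100e9, 30.0, 10e9, setup_overhead_s=2.0))
```

In practice the sustained bandwidth would have to be measured on the target system and would likely depend on the number of user ranks and server nodes, so a single constant is only a first approximation.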




