Emergency Backup for Scientific Applications
Description
A framework is proposed for efficient in-network data transfer between a parallel application and an independent storage server. It targets the case of an unexpected and unrecoverable interruption of the application, in which the server acts as an emergency backup service that prevents the unnecessary loss of valuable information. The framework can optimally exploit cleanup time buffers by using RDMA transport and redistributing data by means of the Maestro middleware. Experiments on an HPE/Cray EX system are used to construct a heuristic for the amount of data that can realistically be backed up within a given time buffer. Even with a single server node, the method proves faster than VELOC and plain MPI-IO for up to a hundred user ranks, and its in-network approach, as opposed to filesystem transport, promises better scalability in the long run.
Event Type
Workshop
Time
Monday, 14 November 2022, 11:25am - 11:45am CST
Location
C143-149
Registration Categories
W
Tags
Reliability and Resiliency
Session Formats
Recorded