ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms

SC22 Proceedings

Workshop: 12th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2022)

Authors: Lukas Hübner (Karlsruhe Institute of Technology, Heidelberg Institute for Theoretical Studies); Demian Hespe and Peter Sanders (Karlsruhe Institute of Technology); and Alexandros Stamatakis (Karlsruhe Institute of Technology, Heidelberg Institute for Theoretical Studies)

Abstract: Fault-tolerant applications need to recover data lost after process failures. It is typically impractical to request replacement resources after a failure. Therefore, applications have to continue with the remaining resources. This requires redistributing the workload. We present an algorithmic framework and its C++ implementation ReStore that enables recovery of data after process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. Our experiments show loading times of lost input data in the range of milliseconds on up to 24,576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.
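
The idea of in-memory replication with shrinking recovery can be illustrated with a small, single-process simulation. This is only a sketch under stated assumptions and not ReStore's actual API or data distribution: the placement rule (copy k of block b lives on rank (b + k * numRanks / replicas) mod numRanks), the round-robin reassignment to survivors, and all names (`holdersOf`, `findSource`, `planShrinkingRecovery`, `Transfer`) are hypothetical and chosen for illustration. It requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical placement rule (an assumption, not taken from ReStore):
// copy k of a block lives on rank (block + k * (numRanks / replicas)) % numRanks.
std::vector<int> holdersOf(std::size_t block, int numRanks, int replicas) {
    assert(replicas >= 1 && replicas <= numRanks);
    const std::size_t shift = static_cast<std::size_t>(numRanks / replicas);
    std::vector<int> holders;
    for (int k = 0; k < replicas; ++k) {
        const std::size_t rank =
            (block + static_cast<std::size_t>(k) * shift) % static_cast<std::size_t>(numRanks);
        holders.push_back(static_cast<int>(rank));
    }
    return holders;
}

// Returns a surviving rank that still holds a copy of the block in memory,
// or std::nullopt if every copy was on a failed rank.
std::optional<int> findSource(std::size_t block, int numRanks, int replicas,
                              const std::set<int>& failed) {
    for (int rank : holdersOf(block, numRanks, replicas)) {
        if (failed.count(rank) == 0) {
            return rank;
        }
    }
    return std::nullopt;
}

// One planned copy: send `block` from `source` to `destination`.
struct Transfer {
    std::size_t block;
    int source;
    int destination;
};

// Shrinking recovery: blocks whose first copy (k = 0) was on a failed rank are
// reassigned round-robin to the surviving ranks; no replacement ranks are used.
// Returns std::nullopt if no rank survives or some block lost all of its copies.
std::optional<std::vector<Transfer>> planShrinkingRecovery(std::size_t numBlocks, int numRanks,
                                                           int replicas,
                                                           const std::set<int>& failed) {
    std::vector<int> survivors;
    for (int rank = 0; rank < numRanks; ++rank) {
        if (failed.count(rank) == 0) {
            survivors.push_back(rank);
        }
    }
    if (survivors.empty()) {
        return std::nullopt;
    }

    std::vector<Transfer> plan;
    std::size_t next = 0;  // round-robin cursor over the survivors
    for (std::size_t block = 0; block < numBlocks; ++block) {
        const int owner = static_cast<int>(block % static_cast<std::size_t>(numRanks));
        if (failed.count(owner) == 0) {
            continue;  // the working copy is still in place
        }
        const std::optional<int> source = findSource(block, numRanks, replicas, failed);
        if (!source) {
            return std::nullopt;  // all replicas lost: in-memory recovery is impossible
        }
        plan.push_back(Transfer{block, *source, survivors[next % survivors.size()]});
        ++next;
    }
    return plan;
}
```

For example, calling `planShrinkingRecovery(16, 8, 2, {1})` yields a plan that fetches the two blocks formerly worked on by rank 1 from the rank holding their second copy, whereas failing both ranks that hold a block's copies makes the function return `std::nullopt`. In a real distributed setting the planned transfers would be carried out with message passing between the surviving processes rather than computed in one place.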
