ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms
Description
Fault-tolerant applications need to recover data lost after process failures. It is typically impractical to request replacement resources after a failure. Therefore, applications have to continue with the remaining resources. This requires redistributing the workload. We present an algorithmic framework and its C++ implementation ReStore that enables recovery of data after process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. Our experiments show loading times of lost input data in the range of milliseconds on up to 24,576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.
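The replication idea in the description can be illustrated with a small sketch. This is not ReStore's API or its actual data distribution: the function names `replicaHolders` and `findLiveHolder` and the strided placement rule are assumptions made up for illustration. The sketch only shows how a surviving process could locate an in-memory copy of a block after some ranks have failed, without touching a file system.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical placement rule (an assumption, not taken from ReStore):
// copy k of block b is held by rank (b + k * stride) % numRanks,
// where stride = numRanks / replicas spreads the copies across the machine.
std::vector<int> replicaHolders(std::size_t block, int numRanks, int replicas) {
    assert(replicas >= 1 && replicas <= numRanks);
    const std::size_t ranks = static_cast<std::size_t>(numRanks);
    const std::size_t stride = ranks / static_cast<std::size_t>(replicas);
    std::vector<int> holders;
    for (int k = 0; k < replicas; ++k) {
        const std::size_t rank = (block + static_cast<std::size_t>(k) * stride) % ranks;
        holders.push_back(static_cast<int>(rank));
    }
    return holders;
}

// Returns a surviving rank that still holds the block in memory,
// or std::nullopt if every copy of the block was lost.
std::optional<int> findLiveHolder(std::size_t block, int numRanks, int replicas,
                                  const std::set<int>& failed) {
    for (int rank : replicaHolders(block, numRanks, replicas)) {
        if (failed.count(rank) == 0) {
            return rank;
        }
    }
    return std::nullopt;
}
```

With 8 ranks and 4 replicas, this rule places block 3 on ranks 3, 5, 7, and 1. If ranks 3 and 5 fail, `findLiveHolder` returns rank 7, and the remaining ranks can load the block from there and continue with fewer processes (shrinking recovery). The block is unrecoverable only if all four holders fail.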
Time
Monday, 14 November 2022, 4:09pm - 4:34pm CST