Recovery of Distributed Iterative Solvers for Linear Systems Using Non-Volatile RAM
Description
HPC systems are a critical resource for scientific research. The increased demand for computational power and memory ushers in the exascale era, in which complex supercomputers consist of numerous compute nodes and are consequently expected to experience frequent faults and crashes.
Exact state reconstruction (ESR) was proposed as an alternative mechanism to alleviate the impact of frequent failures on long-running computations. ESR has been shown to provide exact reconstruction of the state of iterative solvers while avoiding the need for costly checkpointing. However, ESR currently relies on volatile memory for fault tolerance, and must therefore maintain redundancies in the RAM of multiple nodes, incurring high memory and network overheads.
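To make the idea of exact reconstruction concrete, the following is a toy, single-process sketch. It assumes an unpreconditioned conjugate gradient (CG) solver, which the abstract does not specify, and it simulates a node failure by discarding a few entries of the solver state. The recovery uses two CG identities, p = r + beta * p_prev and r = b - A x, together with retained copies of the lost entries of the two most recent search directions. All function and variable names are hypothetical, and this is an illustration of the general principle rather than the paper's implementation.

```python
# Toy illustration of exact state reconstruction for plain CG (assumed solver).
# Everything runs in one process; a "failed node" is simulated by erasing entries.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def cg_steps(A, b, steps):
    """Run `steps` CG iterations from x = 0; return x, r, p, previous p, last beta."""
    x = [0.0] * len(b)
    r = b[:]
    p = r[:]
    p_prev, beta = None, None
    for _ in range(steps):
        Ap = matvec(A, p)
        rr = dot(r, r)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r, r) / rr
        p_prev = p
        p = [ri + beta * pi for ri, pi in zip(r, p)]
    return x, r, p, p_prev, beta

def reconstruct(A, b, lost, x_damaged, p_lost, p_prev_lost, beta):
    """Rebuild the lost entries of r and x from surviving data and retained p copies."""
    # From p = r + beta * p_prev:  r_F = p_F - beta * p_prev_F
    r_lost = [p - beta * q for p, q in zip(p_lost, p_prev_lost)]
    surv = [i for i in range(len(b)) if i not in lost]
    # From r = b - A x:  A_FF x_F = b_F - r_F - A_FS x_S
    rhs = [b[i] - r_lost[k] - sum(A[i][j] * x_damaged[j] for j in surv)
           for k, i in enumerate(lost)]
    A_ff = [[A[i][j] for j in lost] for i in lost]
    return r_lost, solve_dense(A_ff, rhs)

# Small symmetric positive definite test problem (tridiagonal).
n = 8
A = [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [float(i + 1) for i in range(n)]

x, r, p, p_prev, beta = cg_steps(A, b, steps=3)

# Simulate losing the entries owned by one "node".
lost = [2, 3]
x_damaged = [None if i in lost else v for i, v in enumerate(x)]
r_rec, x_rec = reconstruct(A, b, lost, x_damaged,
                           [p[i] for i in lost], [p_prev[i] for i in lost], beta)

max_err_x = max(abs(x_rec[k] - x[i]) for k, i in enumerate(lost))
max_err_r = max(abs(r_rec[k] - r[i]) for k, i in enumerate(lost))
print("max reconstruction error in x:", max_err_x)
print("max reconstruction error in r:", max_err_r)
```

In this sketch the only redundant data are the lost entries of the two most recent search directions and the scalar beta. In a distributed setting, copies like these would have to be held somewhere that survives the failure, which is the redundancy cost the abstract refers to.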
Recent supercomputer designs feature emerging non-volatile RAM (NVRAM) technology. This paper investigates how NVRAM can be utilized to devise an enhanced ESR-based recovery mechanism that is more efficient and provides full resilience, based on a novel MPI implementation of One-Sided Communication (OSC) over RDMA.
Time
Monday, 14 November 2022, 3:29pm - 3:54pm CST