HPC Checkpoint-Restart Strategy Using NVRAM
Description
The recent entrance of the High-Performance Computing (HPC) world into the exascale era challenges how vast amounts of data are analyzed, manipulated, and stored. At the same time, the already substantial performance gap between computing, memory, and storage is expanding rapidly as distributed large-scale applications run on new-generation supercomputers. The widest gap of all, between memory and storage, is still 2-3 orders of magnitude. As a result, these applications struggle with two main storage-oriented tasks – diagnostics and checkpointing – in which data must be persisted during runtime for later use. Recently, two interdependent additions were made to the storage stack: non-volatile RAM (NVRAM) hardware and persistent memory file systems (PMFSs). Both are planned to be integrated into the upcoming Aurora exascale system. Fridman et al. (FTXS@SC’21) benchmarked the diagnostics (FIO, BT-IO) and checkpointing (SCR, DMTCP) use cases as they occur on supercomputers, using NVRAM and several PMFSs and excluding block-oriented non-volatile devices. Instead, this strategy relies solely on RAM-NVRAM and even pure-NVRAM memory-storage configurations. We review these results and show how NVRAM can be used not only for checkpoint-restart (C/R) mechanisms and diagnostics via PMFSs, but also for Algorithm-Based Fault Tolerance (ABFT), using the PMDK library and MPI one-sided communication directly to byte-addressable NVRAM. We focus specifically on Exact State Reconstruction of iterative linear solvers. We show that this strategy uses the hardware properly and reliably, achieving the best-known performance for those use cases and, as such, suggesting a new approach to devising recoverable HPC algorithms.
Event Type
Workshop
Time
Monday, 14 November 2022, 11:45am - 11:50am CST
Location
C143-149
Reliability and Resiliency
Recorded