Extending MPI API Support in MANA
Description
MANA is an MPI-Agnostic, Network-Agnostic transparent checkpointing tool for MPI applications and a recent breakthrough in transparent checkpointing. NERSC has been collaborating with the MANA team at Northeastern University and MemVerge, Inc. to enable MANA for NERSC’s top applications, so that lower-priority jobs can be checkpointed and resumed later in support of real-time workloads from DOE’s experimental facilities. MANA employs a novel split-process approach and works by intercepting MPI APIs, both to ensure that transparent checkpointing occurs at a consistent state across MPI processes and to achieve network agnosticism. Thus, writing proper wrapper functions for MPI APIs is critical for MANA to checkpoint and restart MPI applications correctly and efficiently. While it is straightforward to implement a wrapper function for most MPI APIs, correctly intercepting some of them is not trivial, and the major challenge is to ensure that the application sees the same behavior after interception. In this lightning talk, we will review the current status of MPI API support in MANA and present the challenges in supporting various MPI APIs, including communicators, objects, data types, environments, etc., as well as the roadmap for extending MPI API support in the current and future versions of the MPI standard. What we learned from supporting MPI APIs in MANA will be helpful to similar approaches that intercept MPI APIs.
MANA uses DMTCP as its checkpointing tool, and is implemented in the DMTCP framework as a plugin. MANA is an open source project.
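The interception idea in the description can be pictured with a small toy model. The sketch below is purely illustrative and does not use real MPI or MANA code: the classes `ToyLowerHalf` and `WrapperLayer`, the `restart` method, and the idea of handing the application stable virtual ids are all assumptions made for this example. It shows one way a wrapper layer could keep application-visible behavior unchanged when the underlying library is replaced after a restart.

```python
"""Toy model of wrapper-based interception; no real MPI is involved."""


class ToyLowerHalf:
    """Hypothetical stand-in for an MPI library whose handles differ per launch."""

    def __init__(self, handle_base):
        self._next = handle_base
        self.sent = []  # records (real_handle, dest, payload) for inspection

    def comm_create(self):
        handle = self._next
        self._next += 1
        return handle

    def send(self, real_comm, dest, payload):
        self.sent.append((real_comm, dest, payload))


class WrapperLayer:
    """Hypothetical wrapper layer: the application only ever sees virtual ids."""

    def __init__(self, lower):
        self.lower = lower
        self.virt_to_real = {}   # virtual id -> handle in the current library
        self.creation_log = []   # virtual ids in creation order, for replay
        self._next_virtual = 1

    def comm_create(self):
        virtual = self._next_virtual
        self._next_virtual += 1
        self.virt_to_real[virtual] = self.lower.comm_create()
        self.creation_log.append(virtual)
        return virtual

    def send(self, virtual_comm, dest, payload):
        # Translate the virtual id before forwarding the call.
        self.lower.send(self.virt_to_real[virtual_comm], dest, payload)

    def restart(self, new_lower):
        # Replay creations against the fresh library and rebind each virtual id.
        self.lower = new_lower
        for virtual in self.creation_log:
            self.virt_to_real[virtual] = new_lower.comm_create()


first = ToyLowerHalf(handle_base=1000)
layer = WrapperLayer(first)
comm = layer.comm_create()
layer.send(comm, dest=1, payload="before")

second = ToyLowerHalf(handle_base=5000)  # simulated library after a restart
layer.restart(second)
layer.send(comm, dest=1, payload="after")
```

In this toy, the application keeps using the same value of `comm` across the simulated restart even though the underlying handle changes. Real MPI wrappers would be written against the MPI C or Fortran bindings and would have to cover many more object kinds and calls than this sketch does.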
Event Type
Workshop
Time
Monday, 14 November 2022, 10:30am - 10:35am CST
Location
C143-149
Reliability and Resiliency
Recorded