Authors: Zhengji Zhao (Lawrence Berkeley National Laboratory (LBNL)), Rebecca Hartman-Baker (Lawrence Berkeley National Laboratory (LBNL)), Gene Cooperman (Northeastern University), Bogdan Nicolae (Argonne National Laboratory (ANL)), Kapil Arya (Azure Systems Research), Donglai Dai (X-ScaleSolutions)
Abstract: As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of HPC communities. To help the community develop portable C/R codes and harness C/R benefits that go far beyond resilience, the C/R Standard Forum will release the first version of the C/R interface standard at SC22. In this session, the Forum will present this first release of the C/R interface standard specification and invite feedback from the HPC community on both the features included in the specification and the roadmap for future efforts.
Long Description: As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of HPC communities. It has seen increasing adoption in many additional scenarios, including suspend-resume, process migration, and replay debugging. More recently, with the convergence of HPC, big data analytics, and machine learning, checkpointing is becoming an essential pattern that allows applications to make progress in their computations. Because software and hardware are evolving rapidly and becoming more complex and heterogeneous, the development cycles for C/R tools will lengthen as the tools try to support new hardware and the workloads on it, leaving C/R tools chasing ever-newer hardware without ever quite catching up. This cycle has kept HPC communities from reaping the benefits of C/R.
To help the HPC community develop more portable C/R codes and harness C/R benefits that go far beyond resilience, the C/R Standard Forum will release the first version of the C/R interface standard, a standard that requires all parties in HPC to work together to achieve portability of C/R codes. In this BoF meeting, the C/R Standard Forum will present the C/R interface standard specification and gather feedback from SC22 attendees (HPC hardware/software vendors, system software developers, C/R tool and library developers, developers of applications and other tools and libraries, application end users, and HPC practitioners) on the features included in the specification and on the roadmap for future efforts, to help guide future extensions and modifications.
The C/R Standard Forum was formed in January 2022 and began its work by gathering requirements for the C/R standard through a requirements-gathering workshop and biweekly meetings. This is the Forum's first BoF proposal. The session will be led by Zhengji Zhao, the primary organizer of the C/R Standard Forum, with the help of experts on both transparent and application-initiated checkpointing who are active in the Forum. The speakers will be Zhengji Zhao (NERSC), Gene Cooperman (Northeastern Univ.), Bogdan Nicolae (ANL), and Rebecca Hartman-Baker (LBNL). The discussions in this BoF meeting will be summarized and published on the C/R Standard Forum website, and will inform the future activities of the C/R Standard Forum working groups, e.g., developing the next version of the C/R interface standard.
The C/R Standard Forum is open to anyone interested; new members are always welcome. We hope to use this BoF to make the Forum better known and more approachable in the HPC community, to inform attendees about ongoing activities and future directions, and to encourage participation from a broader and more diverse group of interested people. Larger community participation in this important standard, which impacts a wide range of HPC communities, is intended to expand its adoption and deployment.