Workshop: Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)
Event Type: Workshop
Reliability and Resiliency
Time: Monday, 14 November 2022, 8:30am - 12pm CST
Description: As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of HPC communities. While there has been much C/R research and tools development, continued C/R research is indispensable to keep pace with ever-changing HPC architectures, technologies, and workloads. More effort is also needed to narrow the gap between proof-of-concept C/R research codes and production-quality codes capable of deployment in real-world workloads. In this workshop, we will bring together C/R researchers and tools developers, practitioners, application developers, and end users to focus on C/R research and successes in production use, motivating the development of usable C/R tools, the closing of the gap between state-of-the-art research and production, and the harnessing of the full benefits of C/R for the HPC community.

8:30am - 8:35am CST SuperCheck – Opening Remarks
8:35am - 9:15am CST Featured Talk: DAOS – Nextgen Storage Stack for HPC and AI
9:15am - 9:35am CST Spot-On: A Checkpointing Framework for Fault-Tolerant Long-Running Workloads on Cloud Spot Instances
9:35am - 9:55am CST Analyzing the Energy Consumption of Synchronous and Asynchronous Checkpointing Strategies
9:55am - 10:00am CST pyDMTCP: Python Interface to DMTCP via SLURM
10:00am - 10:30am CST SuperCheck – Morning Break
10:30am - 10:35am CST Extending MPI API Support in MANA
10:35am - 10:40am CST ReMPI: A Record-and-Replay Tool for Debugging Non-Deterministic MPI Applications
10:40am - 10:45am CST Q&A Session for Lightning Talks
10:45am - 11:05am CST Debugging MPI Implementations via Reduction-to-Primitives
11:05am - 11:25am CST Designing an Adaptive Application-Level Checkpoint Management System for Malleable MPI Applications
11:25am - 11:45am CST Emergency Backup for Scientific Applications
11:45am - 11:50am CST HPC Checkpoint-Restart Strategy Using NVRAM
11:50am - 11:55am CST CXL, PMEM, and Checkpointing – The Path Forward
11:55am - 12:00pm CST Q&A Session for Lightning Talks
