pyDMTCP: Python Interface to DMTCP via SLURM
Description
Supercomputers have become increasingly important due to the growing demand for computational power and the growing amount of available data. As supercomputing systems become larger and serve many users simultaneously, the costs of building and maintaining them rise, and so does the probability of faults. Efficiency and resilience are therefore essential for both providers and users. One primary tool for system resilience is DMTCP, a system-level Checkpoint/Restart (C/R) library that performs C/R operations seamlessly, without any source code modifications. Meanwhile, Python has become one of the major languages for application programming, so providing it with C/R capabilities is desirable on many systems. Accordingly, previous work brought C/R to Python by supporting DMTCP C/R programmatically from within a Python program. Nevertheless, a particular class of Python codes is not self-contained; these codes are designed to support other applications by scheduling and managing them and analyzing their results, for example execution wrappers, pipelines, and parameter sweeps. Users of all types run this class of Python codes widely on HPC systems that use the SLURM job scheduler. In this work, we extend the previous integration of DMTCP with Python programs and introduce pyDMTCP, a Python module that enables Python wrappers of scientific applications to easily utilize DMTCP checkpointing, both through a Python interface and externally to the applications via SLURM. The interface also maps the entire HPC system according to several main parameters to allow fault-free and optimized C/R executions between different nodes.
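
To make the wrapper scenario concrete, the following is a minimal sketch of how a Python wrapper might place an external application under DMTCP inside a SLURM batch job. It is not the pyDMTCP API, which is not shown in this abstract; the function name `build_sbatch_script`, its parameters, and the example paths are hypothetical. It assumes only the standard `dmtcp_launch` options `--ckptdir` and `--interval` and ordinary `#SBATCH` directives.

```python
import shlex


def build_sbatch_script(command, ckpt_dir, interval_s=3600, job_name="dmtcp-job"):
    """Return the text of a SLURM batch script that runs `command` under DMTCP.

    Hypothetical helper for illustration only; not part of pyDMTCP.
    `command` is a list of argv strings, `ckpt_dir` is where checkpoint
    images are written, and `interval_s` is the checkpoint period in seconds.
    """
    # Quote every argument so the generated shell line is safe to execute.
    wrapped = " ".join(shlex.quote(part) for part in command)
    quoted_dir = shlex.quote(ckpt_dir)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",
        # Make sure the checkpoint directory exists before launching.
        f"mkdir -p {quoted_dir}",
        # dmtcp_launch starts the application under checkpoint control.
        f"dmtcp_launch --ckptdir {quoted_dir} --interval {interval_s} {wrapped}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: wrap a hypothetical simulation script, checkpointing every 10 minutes.
    print(build_sbatch_script(["python3", "sim.py", "--steps", "10"],
                              "/tmp/ckpt", interval_s=600))
```

The generated text could then be written to a file and submitted with `sbatch`. A restart job would typically invoke `dmtcp_restart` on the checkpoint images found in the checkpoint directory, possibly on a different node, which is where a mapping of the system's nodes becomes relevant.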

The source code of pyDMTCP will be available at
Time
Monday, 14 November 2022
9:55am - 10am CST
Registration Categories
Reliability and Resiliency