Author: Avinash Maurya (Rochester Institute of Technology)
Advisor: M. Mustafa Rafique (Rochester Institute of Technology), Bogdan Nicolae (Argonne National Laboratory (ANL))
Abstract: Modern HPC workloads produce massive amounts of distributed intermediate data that needs to be checkpointed concurrently in real-time at scale. One such popular scenario is the use of checkpoint-restore for revisiting previous states (intermediate data) to advance computations, such as adjoint methods. In this context, GPUs have shown tremendous performance improvements during computations but demonstrate I/O limitations while managing high-frequency large-volume data movement across heterogeneous memory tiers. Existing data movement runtimes are not well suited for such I/O because of factors such as imbalance in checkpoint distribution across fast memory tiers, slow memory allocation, and restore oblivious cache eviction and prefetching strategies. We address these challenges by designing a set of transparent, asynchronous checkpoint-restore techniques that minimize the blocking time of the application during I/O using three novel contributions. First, we design techniques to evenly distribute checkpoints across fast memory tiers (e.g. peer GPUs) using collaborative checkpointing that leverages fast interconnects such as NVLinks and NVSwitches for load balancing. Second, we mitigate the slow cache allocation for storing checkpoints on both GPU and host by leveraging techniques such as CUDA's virtual memory management functions, eager memory mapping, and lazy pinning. Third, we design a restore-order aware eviction and prefetching approach that is coordinated by a finite state machine based on a unified checkpoint-restore abstraction for optimal evictions. Our evaluations across real-world and synthetic benchmarks demonstrate significant speedup in both checkpoint and restore phases of the application compared to the current state-of-the-art data movement engines.
Thesis Canvas: pdf