Analyzing the Energy Consumption of Synchronous and Asynchronous Checkpointing Strategies
Description
With exascale computing, the number of components that comprise high-performance computing (HPC) systems has increased by more than 70%, leading to a shorter mean time between failures (MTBF) and larger power budgets. These issues create the need for (1) checkpoint/restart (C/R) and (2) energy reduction techniques. C/R has evolved alongside software and hardware advances, so it is crucial to understand how its energy usage differs across storage tiers and between synchronous and asynchronous operation. In this paper, we compare the energy consumption of two leading, state-of-the-art C/R libraries, VELOC and GenericIO. We perform weak and strong scalability tests of the C/R libraries and show that asynchronous C/R provides 4x greater throughput while using 33% less energy than synchronous C/R. Data size and throughput are directly correlated with energy consumption. Therefore, C/R developers should focus on improving or maintaining high throughput in order to reduce energy consumption and address exascale needs.
Event Type
Workshop
Time
Monday, 14 November 2022, 9:35am - 9:55am CST
Location
C143-149
Session Formats
Recorded
Tags
Reliability and Resiliency
Registration Categories
W