Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems
Description: Recent work demonstrated that power management (PM) can impact application performance in CPU-based HPC systems, even when machines have the same architecture and SKU. This variation occurs due to manufacturing variability and the chip’s PM. However, while modern HPC systems widely employ GPUs, there is limited work on how variability affects GPU applications. In this paper, we study 4 HPC clusters with state-of-the-art GPUs: Oak Ridge’s Summit, Sandia’s Vortex, TACC’s Longhorn, and Livermore’s Corona. The first three clusters use NVIDIA V100 GPUs, while the fourth uses AMD MI60 GPUs. After identifying applications that stress different GPU components, we gathered data from over 90% of the GPUs in the clusters. In total, we collected over 100,000 hours of data. Regardless of application and cluster, our results show significant variance: 32% (max 72%) average performance variation, despite GPU architecture and vendor SKU being the same.
Time: Thursday, 17 November 2022, 11am - 11:30am CST