Authors: Prasoon Sinha, Akhil Guliani, Rutwik Jain, and Brandon Tran (University of Wisconsin, Madison); Matthew Sinclair (University of Wisconsin, Madison; AMD Research); and Shivaram Venkataraman (University of Wisconsin, Madison)
Abstract: Recent work has demonstrated that power management (PM) can impact application performance in CPU-based HPC systems, even when machines share the same architecture and SKU. This variation arises from manufacturing variability and the chip's PM. However, while modern HPC systems widely employ GPUs, there is limited work on how variability affects GPU applications. In this paper, we study four HPC clusters with state-of-the-art GPUs: Oak Ridge's Summit, Sandia's Vortex, TACC's Longhorn, and Livermore's Corona. The first three clusters use NVIDIA V100 GPUs, while the fourth uses AMD MI60 GPUs. After identifying applications that stress different GPU components, we gathered data from over 90% of the GPUs in these clusters, collecting over 100,000 hours of data in total. Across all applications and clusters, our results show significant variation: 32% on average (up to 72% maximum), despite the GPUs sharing the same architecture and vendor SKU.
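As a rough illustration of the kind of analysis the abstract describes, the sketch below computes per-application performance variation across GPUs. The variation metric ((slowest − fastest) / fastest), the benchmark names, and the sample runtimes are assumptions for illustration only; they are not the paper's exact methodology or data.

```python
# Sketch: quantify cross-GPU performance variation for several benchmarks.
# NOTE: the metric ((slowest - fastest) / fastest), benchmark names, and
# sample runtimes below are illustrative assumptions, not the paper's data.
from statistics import mean

# Hypothetical per-GPU runtimes (seconds) for each benchmark on one cluster.
runtimes = {
    "compute_bound_kernel": [41.2, 43.0, 44.1, 52.8],
    "memory_bound_kernel":  [88.5, 90.1, 95.7, 103.2],
}

def variation(samples):
    """Relative spread between the slowest and fastest GPU."""
    return (max(samples) - min(samples)) / min(samples)

per_app = {app: variation(times) for app, times in runtimes.items()}
for app, v in per_app.items():
    print(f"{app}: {v:.1%} variation across GPUs")
print(f"average: {mean(per_app.values()):.1%}, max: {max(per_app.values()):.1%}")
```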