SC22 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

SUGAR: Speeding Up GPGPU Application Resilience Estimation with Input Sizing


Workshop: Women in HPC: Diversifying the HPC Community and Engaging Male Allies

Authors: Lishan Yang (George Mason University (GMU))


Abstract: As Graphics Processing Units (GPUs) become a de facto solution for accelerating a wide range of applications, their reliable operation is increasingly important. One major challenge is accurately measuring the error resilience of GPGPU applications. A typical GPGPU application spawns a huge number of threads and uses a large amount of potentially unreliable compute and memory resources on the GPU. Because the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on application error resilience is impractical. Instead, application resilience is evaluated via extensive fault injection campaigns that sample this large fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. We devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience through judicious input sizing.







