Mitigating Silent Data Corruptions in HPC Applications across Multiple Program Inputs
Description: With the ever-shrinking size of transistors, silent data corruptions (SDCs) are becoming a common yet serious issue in HPC. Selective instruction duplication (SID) is a widely used fault-tolerance technique that can obtain high SDC coverage with low performance overhead. However, existing SID methods are confined to a single program input in their assessment, assuming that the error resilience of a program remains similar across inputs. We observe that this assumption does not always hold, leading to a drastic loss in SDC coverage on different inputs and compromising HPC reliability. We notice that the SDC coverage loss correlates with a small set of instructions, which we call incubative instructions, that exhibit elusive error propagation characteristics across multiple inputs. We propose MINPSID, an automated SID framework that identifies incubative instructions in programs and re-prioritizes them. Evaluation shows that MINPSID can effectively mitigate the loss of SDC coverage across multiple inputs.
Time: Tuesday, 15 November 2022, 3:30pm - 4pm CST
Reliability and Resiliency
Best Student Paper Finalists