Authors: Yafan Huang (University of Iowa), Shengjian Guo (Baidu Security), Sheng Di (Argonne National Laboratory (ANL)), Guanpeng Li (University of Iowa), and Franck Cappello (Argonne National Laboratory (ANL))
Abstract: With the ever-shrinking size of transistors, silent data corruptions (SDCs) are becoming a common yet serious issue in HPC. Selective instruction duplication (SID) is a widely used fault-tolerance technique that can obtain high SDC coverage with low-performance overhead. However, existing SID methods are confined to single program input in its assessment, assuming that error resilience of a program remains similar across inputs. Nevertheless, we observe that the assumption cannot always hold, leading to a drastic loss in SDC coverage in different inputs, compromising HPC reliability. We notice that the SDC coverage loss correlates with a small set of instructions – we call them incubative instructions, which reveal elusive error propagation characteristics across multiple inputs. We proposed MINPSID, an automated SID framework that identifies incubative instructions in programs and re-prioritizes incubative instructions. Evaluation shows MINPSID can effectively mitigate the loss of SDC coverage across multiple inputs.
Back to Technical Papers Archive Listing