From Correctable Memory Errors to Uncorrectable Memory Errors: What Error Bits Tell
Description
We present an empirical study of memory reliability that correlates correctable errors (CEs) with uncorrectable errors (UEs), using large-scale field data covering 3 major DIMM manufacturers from a contemporary ByteDance server farm. Unlike traditional chipkill error correction code (ECC), the ECC on contemporary Intel server platforms is weakened and cannot tolerate some error-bit patterns from a single chip. Using obtainable coarse-grained ECC knowledge, we derive a new indicator from the error-bit information: the occurrence of risky CEs, defined in terms of the ECC's guaranteed coverage. We show that the new indicator is consistently sensitive and specific as a test for future UEs, which indicates that the weakened ECC contributes substantially to those UEs today. We empirically demonstrate how practically useful UE predictors can be constructed from the new indicator in conjunction with other useful attributes, such as certain micro-level fault indicators and DIMM part numbers, achieving state-of-the-art performance.
Event Type
Paper
Time
Thursday, 17 November 2022, 1:30pm - 2pm CST
Location
C146
Registration Categories
TP
Tags
Extreme Scale Computing
Memory Systems
Parallel Programming Systems
State of the Practice
Session Formats
Recorded