Authors: Cong Li (Intel Corporation), Yu Zhang (ByteDance Ltd), Jialei Wang and Hang Chen (Intel Corporation), Xian Liu (ByteDance Ltd), Tai Huang (Intel Corporation), Liang Peng (ByteDance Ltd), Shen Zhou (Intel Corporation), and Lixin Wang and Shijian Ge (ByteDance Ltd)
Abstract: We present an empirical study on memory reliability that correlates correctable errors (CEs) with uncorrectable errors (UEs), using large-scale field data across three major DIMM manufacturers from a contemporary ByteDance server farm. Unlike traditional chipkill error correction code (ECC), the ECC in contemporary Intel server platforms is weakened and cannot tolerate some error-bit patterns from a single chip. Using obtainable coarse-grained ECC knowledge, we derive a new indicator from the error-bit information: the occurrence of risky CEs, defined in terms of the ECC's guaranteed coverage. We show that the new indicator is consistently sensitive and specific in testing for future UEs, which indicates that the weakened ECC contributes substantially to UEs today. We empirically demonstrate how practically useful UE predictors can be constructed from the new indicator in conjunction with other useful attributes, such as certain micro-level fault indicators and DIMM part numbers, achieving state-of-the-art performance.