A Taxonomy of Error Sources in HPC I/O Machine Learning Models
Description
I/O efficiency is crucial to productivity in scientific computing, but the growing complexity of HPC systems and applications complicates efforts to understand and optimize I/O behavior at scale. Data-driven machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform after being deployed.
We analyze four years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application modeling, poor system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.
Event Type: Paper
Time: Tuesday, 15 November 2022, 4pm - 4:30pm CST
Location: C141-143-149
TP
Reliability and Resiliency
Recorded