Authors: Michael Ott (Leibniz Supercomputing Centre), Melissa Romanus (Lawrence Berkeley National Laboratory (LBNL)), Norm Bourassa (Lawrence Berkeley National Laboratory (LBNL)), Rachel Palumbo (Oak Ridge National Laboratory (ORNL)), Woong Shin (Oak Ridge National Laboratory (ORNL)), Jeff Hanson (Hewlett Packard Enterprise (HPE)), Torsten Wilde (Hewlett Packard Enterprise (HPE), Energy Efficient HPC Working Group), Jim Brandt (Sandia National Laboratories), Ben Schwaller (Sandia National Laboratories), Natalie Bates (Lawrence Livermore National Laboratory (LLNL), Energy Efficient High Performance Computing Working Group (EEHPCWG))
Abstract: Operational Data Analytics (ODA) provides unique opportunities to analyze, understand, and optimize operations of HPC systems. Readily available open-source frameworks make it increasingly easy to collect monitoring data from different domains of the HPC system (infrastructure, system hardware, software, applications). However, making the data work for HPC operations is not straightforward. AI-based methods seem interesting, but it is not obvious which tools and methods are suitable for this type of data. This BoF aims to bring together practitioners in HPC operations to share use cases for ODA, discuss problems, and provide feedback.
Long Description: Most sites that operate HPC systems are engaged in Operational Data Analytics in one way or another. Some may not even be aware of it, since they are “merely monitoring” their HPC system for faults or emergencies and don’t consider this to be ODA. Others try to collect as much data as possible from their HPC operations, covering the whole data center with its supporting infrastructure, the system hardware and software, and the applications running on the system. Many are overwhelmed by the amount of data they collect and find it difficult either to visualize the data in enough detail or to find the right tool or approach for extracting actionable knowledge from it. In the big data world, a plethora of tools and methods is available for analyzing such large amounts of data, but choosing the right ones is not trivial and requires expertise not only in data analytics but also in the respective domain.
The goal of this BOF is to bring together researchers in ODA, operators of HPC systems, and data analytics experts to share their use cases, ideas, experiences, and lessons learned. It builds on previously held BOFs on Operational Data Analytics that focused on technical infrastructure and collecting telemetry data. This BOF will focus on analyzing, interpreting, and using the data. For example, current threshold-based methods often cause nuisance alarms that can overwhelm operators. Anomaly-based methods promise to make alarms more relevant.
Previous BOFs: SC21: "Operational Data Analytics", SC19: "Operational Data Analytics", SC18: "Data Analytics for System and Facility Energy Management"