Authors: Verónica Melesse Vergara (Oak Ridge National Laboratory (ORNL)), Bilel Hadri (King Abdullah University of Science and Technology (KAUST))
Abstract: This BoF brings together experts from HPC centers around the globe to discuss future system testing methodologies. The session will include a panel focusing on HPC system testing at scale, including acceptance testing of Perlmutter, Frontier, Fugaku, and LUMI. Panelists will describe challenges faced and share their perspectives on how those could have been overcome. Then, we will host two speakers to spark ideas for the open discussion, in which attendees will be invited to identify key areas that HPC center staff and vendors should focus on to prepare for the next generation of compute and data resources.
Long Description: The goal of this BoF is to bring together experts from HPC centers around the globe to discuss the system testing methodologies used on today’s HPC systems and to look ahead to the post-exascale future of system testing. The BoF will invite international participation from HPC centers, as well as representatives from major vendor companies in the supercomputing space.
The session will kick off with a panel focusing on HPC system testing at scale, including acceptance testing of Perlmutter, Frontier, Fugaku, and LUMI. In addition to discussing the procedures and tools used, panelists will describe challenges faced and share their perspectives on how those could have been overcome. Then, we will host two invited speakers to spark ideas for the open discussion, in which attendees will be invited to identify key areas that we, as HPC center staff and vendors, should focus on to prepare for the next generation of compute and data resources. As machine learning (ML) and deep learning (DL) become more prevalent workloads, HPC centers must provide a wider range of services and more robust and resilient resources in order to support both traditional HPC and ML/DL workloads. Rather than focusing on a single monolithic system, we need to understand and define how centers can evaluate and support HPC ecosystems and complex workflows.
This session will build upon the knowledge gathered from centers including the IT Center for Science in Finland, Swiss National Supercomputing Centre, Indiana University, King Abdullah University of Science & Technology, National Center for Supercomputing Applications, National Energy Research Scientific Computing Center, Oak Ridge National Laboratory, Los Alamos National Laboratory, Lawrence Livermore National Laboratory, and RIKEN Center for Computational Science (R-CCS). Previous editions have shown that, although there is significant overlap in the types of tools and tests used, individual centers have custom setups that include both in-house and open-source frameworks to launch and monitor tests during acceptance and regression testing. Some centers (e.g., KAUST, CSCS) rely on ReFrame for regression testing, whereas others (e.g., ORNL, LANL, LLNL) use tools developed in-house that are now open source.
In this iteration of the HPC System Test BoF, however, we would like to give the audience an opportunity to think about future needs. Centers have clearly developed testing solutions that work for their current needs, but how should we adapt as the types of workloads and workflows we support shift? What are some clear gaps that must be addressed today? How will tests and tools need to change in the next couple of years? Beyond the technological aspects, what steps can we take to help the HPC system testing community prepare to conduct end-to-end testing of complex HPC ecosystems?
As a result of this session, we plan to produce a publicly accessible technical report, with contributions from speakers, panelists, attendees, and organizers, that covers the key ideas, gaps, and challenges identified, along with potential directions for addressing them. In addition, we will make the presentations and survey results publicly available at https://olcf.github.io/hpc-system-test-wg/
URL: https://olcf.github.io/hpc-system-test-wg/events/sc22bof.html