SC22 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Overcoming HPC System Management Challenges: An Open Source Approach


Workshop: HPC Systems Professionals Workshop (HPCSYSPROS22)

Authors: Michael Hartman (Stanford University)


Abstract: We outline four tools, each developed to solve a specific problem or streamline a workflow in the configuration and administration of an HPC cluster. The issues that prompted these tools were identified at the Stanford Research Computing Center while managing the Sherlock HPC cluster and Oak long-term data storage environments. The tools have since been used at multiple sites, both inside and outside Stanford University. In this paper, we describe the solutions developed for four areas of system management: filesystem (Lustre), drives (SAS), interconnect (InfiniBand), and job scheduler (Slurm).




