Overcoming HPC System Management Challenges: An Open Source Approach
Description
We outline four tools, each developed to solve a specific problem or streamline a workflow in the configuration and administration of an HPC cluster. The issues that prompted these tools were identified at the Stanford Research Computing Center while managing the Sherlock HPC cluster and Oak long-term data storage environments. The tools have since been used at multiple sites, both inside and outside Stanford University. In this paper, we describe the solutions developed for four areas of system management: filesystem (Lustre), drives (SAS), interconnect (InfiniBand), and job scheduler (Slurm).
Event Type
Workshop
Time
Monday, 14 November 2022, 1:55pm - 2:15pm CST
Location
C141
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
Recorded