Toward Scalable Resource Management for Supercomputers
Description
Today's supercomputers offer massive computation resources to execute a large number of user jobs. Effectively managing such large-scale hardware parallelism and workloads is essential for supercomputers. However, existing HPC resource management (RM) systems fail to capitalize on the hardware parallelism because they follow a centralized design adopted decades ago. They scale poorly and perform inefficiently on today's supercomputers, a problem that will worsen in exascale computing. We present ESlurm, a better RM for supercomputers. As a departure from existing HPC RMs, ESlurm implements a distributed communication structure. It employs a new communication tree strategy and uses job runtime estimation to improve communication and job scheduling efficiency. ESlurm has been deployed in production on a real supercomputer. We evaluate ESlurm on up to 100K nodes. Compared to state-of-the-art RM solutions, ESlurm exhibits better scalability, significantly reducing the resource usage of master nodes and improving data transfer and job scheduling efficiency by a large margin.
Event Type
Paper
Time
Tuesday, 15 November 2022, 2:30pm - 3pm CST
Location
C146
Registration Categories
TP
Tags
Resource Management and Scheduling
System Software
Reproducibility Badges
Session Formats
Recorded