Student: Baixi Sun (Indiana University)
Supervisor: Dingwen Tao (Indiana University)
Abstract: Deep learning surrogate models have drawn much attention in large-scale scientific simulations because they can approximate simulation results at lower computational cost. To process large amounts of scientific data, distributed training on high-performance computing (HPC) clusters is often used. Training a surrogate model with data parallelism consists of three major steps: (1) each device loads a subset of the dataset from the parallel filesystem; (2) each device computes its model update; (3) devices communicate to synchronize the model updates. During these steps, we observe that data loading is the main performance bottleneck for training surrogate models. To this end, we propose SurrogateTrain, an efficient data-loading approach for training surrogate models, including offline scheduling and on-demand buffering. Our evaluation on a scientific surrogate model demonstrates that SurrogateTrain reduces the amount of data loaded by 6.7× and achieves up to 4.7× speedup in data loading.
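The three data-parallel training steps named in the abstract can be sketched in plain Python. This is a minimal single-process simulation, not SurrogateTrain's actual implementation; the names `load_shard`, `local_gradient`, and `allreduce_mean`, the toy linear model, and the round-robin sharding scheme are all illustrative assumptions.

```python
def load_shard(dataset, rank, world_size):
    """Step (1): each device loads its subset of the dataset
    (round-robin sharding stands in for parallel-filesystem reads)."""
    return dataset[rank::world_size]

def local_gradient(shard, weight):
    """Step (2): each device computes its model update; here, the
    mean-squared-error gradient for a 1-D linear model y = w * x."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    """Step (3): devices synchronize by averaging their updates,
    mimicking an allreduce across the cluster."""
    return sum(grads) / len(grads)

# Toy dataset: pairs (x, y) with y = 3 * x, so the true weight is 3.
dataset = [(x, 3.0 * x) for x in range(1, 9)]
world_size = 4
weight = 0.0

for _ in range(50):  # a few synchronous SGD iterations
    grads = [local_gradient(load_shard(dataset, r, world_size), weight)
             for r in range(world_size)]
    weight -= 0.01 * allreduce_mean(grads)

print(round(weight, 2))  # converges toward the true weight 3.0
```

Because `load_shard` is called every iteration, this sketch also makes the bottleneck the abstract targets visible: step (1) repeats the same filesystem reads each epoch, which is exactly the cost that offline scheduling and on-demand buffering aim to reduce.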
ACM-SRC Semi-Finalist: no