Author: John Ravi (North Carolina State University)
Advisor: Michela Becchi (North Carolina State University)
Abstract: As computational resources scale up, applications often need to be refactored to address the bottlenecks that arise before they can gain the advantages of strong scaling. When these bottlenecks are not properly addressed, legacy workloads can use the available hardware inefficiently, which leads to poor throughput. One solution is multi-tenancy: allowing multiple tasks to share a system. Multi-tenant environments fall into two categories: time-sharing and space-sharing. Time-sharing has been an effective technique for letting multiple applications share the CPU and GPU at the node level. However, time-sharing can carry a heavy performance cost, such as saving and restoring architectural state (context-switch overhead), which is particularly costly on GPUs. While space-sharing can avoid this overhead and improve throughput, current hardware and software systems lack the full isolation needed to provide the necessary quality of service (QoS). In this work, we identify key challenges that arise when sharing resources in an HPC context. We evaluate real-world scenarios at both the node level and the cluster level. Using these insights, we propose middleware to mitigate these challenges and improve QoS. We introduce a runtime CUDA middleware that improves QoS for GPUs. We also introduce and study two new features of HDF5: GDS VFD and Async I/O. The former improves I/O latency, while the latter mitigates and hides variability in I/O latency.
Thesis Canvas: pdf