Workshop: Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems (ScalAH'22)
Authors: Yu-Hsiang Tsai and Pratik Nayak (Karlsruhe Institute of Technology); Edmond Chow (Georgia Institute of Technology); and Hartwig Anzt (University of Tennessee, Innovative Computing Laboratory; Karlsruhe Institute of Technology)
Abstract: Computation on architectures that feature fine-grained parallelism requires algorithms that overcome load imbalance, inefficient memory accesses, serialization, and excessive synchronization. In this paper, we explore an algorithm that completely removes the need for synchronization and instead allows for asynchronous updates in the spirit of chaotic relaxation. Methods of this type have been identified as highly competitive for computations on exascale machines, but practical implementations for GPU platforms featuring extreme parallelism levels are scarce. We present an asynchronous Richardson iteration optimized for high-end GPUs, demonstrate the superiority of the algorithm over a highly tuned synchronous Richardson iteration, and deploy the algorithm as a production-ready implementation in the Ginkgo open-source library. The ideas presented here on the algorithm design, implementation, and performance can help guide the design of other asynchronous algorithms on GPUs.
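
Illustrative sketch (not from the paper): the Richardson iteration updates the solution of A x = b as x <- x + omega * (b - A x). In an asynchronous variant in the spirit of chaotic relaxation, each component is updated from whatever values of x are currently available, with no barrier between sweeps. The Python code below only emulates this on a CPU by updating components in place in a random order; it is not the GPU implementation in Ginkgo, and the names async_richardson and tridiag, the test matrix, and the relaxation weight are assumptions chosen for illustration.

```python
import random


def tridiag(n, diag=4.0, off=-1.0):
    """Build a dense n x n tridiagonal test matrix (hypothetical example data)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = diag
        if i > 0:
            A[i][i - 1] = off
        if i < n - 1:
            A[i][i + 1] = off
    return A


def async_richardson(A, b, omega, sweeps=200, seed=0):
    """Sequential emulation of an asynchronous Richardson iteration.

    Each component x[i] is relaxed using the entries of x available at
    that moment, so a single sweep mixes old and new values. The random
    update order stands in for the unpredictable scheduling of threads.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        order = list(range(n))
        rng.shuffle(order)  # no fixed ordering, no synchronization point
        for i in order:
            # Residual of row i, computed from the current (mixed) state of x.
            r_i = b[i] - sum(A[i][j] * x[j] for j in range(n))
            x[i] += omega * r_i
    return x
```

For example, with A = tridiag(8), b chosen as the row sums of A (so that the exact solution is the vector of ones), and omega = 0.25, the call async_richardson(A, b, 0.25) approaches the vector of ones. This choice of A and omega keeps the spectral radius of |I - omega * A| below one, which is the classical sufficient condition for asynchronous iterations of this kind to converge.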