Computation on architectures that feature fine-grained parallelism requires algorithms that overcome load imbalance, inefficient memory accesses, serialization, and excessive synchronization. In this paper, we explore an algorithm that completely removes the need for synchronization and instead allows asynchronous updates in the spirit of chaotic relaxation. Methods of this type have been identified as highly competitive for computations on exascale machines, but practical implementations for GPU platforms featuring extreme parallelism levels remain scarce. We present an asynchronous Richardson iteration optimized for high-end GPUs, demonstrate its superiority over a highly tuned synchronous Richardson iteration, and deploy the algorithm as a production-ready implementation in the Ginkgo open source library. The ideas presented here on algorithm design, implementation, and performance can help guide the design of other asynchronous algorithms on GPUs.
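For context, the classic synchronous Richardson iteration that the asynchronous variant relaxes can be sketched as below. This is a minimal NumPy illustration, not the paper's GPU implementation: the `richardson` function and its parameters are hypothetical names chosen here, and the sketch keeps the per-iteration synchronization that the asynchronous algorithm removes.

```python
import numpy as np

def richardson(A, b, omega, iters=200):
    """Synchronous Richardson iteration: x_{k+1} = x_k + omega * (b - A x_k).

    Converges when the spectral radius of (I - omega * A) is below 1.
    """
    x = np.zeros_like(b)
    for _ in range(iters):
        # Every component of x is updated from the same, fully
        # consistent previous iterate -- this global step boundary
        # is the synchronization point an asynchronous (chaotic
        # relaxation) scheme gives up.
        x = x + omega * (b - A @ x)
    return x

# Small symmetric positive definite example system.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = richardson(A, b, omega=0.2)
```

In an asynchronous version, each component (or block) of `x` would be updated independently using whatever values of the other components happen to be visible at that moment, with no global step boundary.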