Scalable Distributed High-Order Stencil Computations
Description: Stencil computations lie at the heart of many scientific and industrial applications. On machines with a cache-based memory hierarchy, stencil algorithms are challenging because they re-use little of the data they load unless they are carefully optimized. This work shows that a novel stencil algorithm built on a localized communication strategy effectively exploits the second-generation Cerebras Wafer-Scale Engine (WSE-2), which has no cache hierarchy. The study focuses on a 25-point stencil finite-difference method for the 3D wave equation, a kernel frequently used in numerical simulation for earth modeling. In essence, the algorithm trades memory accesses for data communication and takes advantage of the fast communication fabric provided by the architecture. As a result, the algorithm, historically memory-bound, becomes compute-bound. This allows the implementation to achieve near-perfect weak scaling, reaching up to 503 TFLOPs on a single WSE-2, a figure otherwise attainable only by full clusters.
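To make the kernel concrete, here is a minimal NumPy sketch of one time step of a 25-point stencil for the 3D wave equation. It is a plain reference version for illustration, not the WSE-2 implementation described above. Two things are assumptions not stated in the abstract: the stencil is taken to be the standard 8th-order central difference (radius 4 along each axis, giving 1 + 3 × 2 × 4 = 25 points), and time integration is taken to be second-order leapfrog. The names `laplacian_25pt` and `wave_step` are hypothetical.

```python
import numpy as np

# Assumption: radius-4 (8th-order) central-difference weights for the second
# derivative. Index k holds the weight for the neighbours at distance k.
RADIUS = 4
COEFFS = np.array([-205 / 72, 8 / 5, -1 / 5, 8 / 315, -1 / 560])


def laplacian_25pt(u, h):
    """Approximate the Laplacian of u on interior points (halo of RADIUS cells)."""
    r = RADIUS
    n0, n1, n2 = u.shape
    core = (slice(r, n0 - r), slice(r, n1 - r), slice(r, n2 - r))
    # The centre point contributes once per axis.
    lap = 3.0 * COEFFS[0] * u[core]
    for axis in range(3):
        n = u.shape[axis]
        for k in range(1, r + 1):
            lo = list(core)
            hi = list(core)
            lo[axis] = slice(r - k, n - r - k)  # neighbour at distance -k
            hi[axis] = slice(r + k, n - r + k)  # neighbour at distance +k
            lap += COEFFS[k] * (u[tuple(lo)] + u[tuple(hi)])
    return lap / h**2


def wave_step(u_prev, u_curr, c, dt, h):
    """One leapfrog step: u_next = 2*u_curr - u_prev + (c*dt)^2 * laplacian."""
    r = RADIUS
    inner = (slice(r, -r),) * 3
    u_next = np.zeros_like(u_curr)  # halo cells are left at zero
    u_next[inner] = (
        2.0 * u_curr[inner]
        - u_prev[inner]
        + (c * dt) ** 2 * laplacian_25pt(u_curr, h)
    )
    return u_next
```

Each output point reads 25 input values but performs only a few dozen floating-point operations. That ratio is why the kernel tends to be memory-bound on cache-based machines, and why replacing memory traffic with neighbour-to-neighbour communication can shift the balance toward compute.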