With the slowing of Moore's Law and the explosion of data-flow compute in the domains of AI and HPC, there has been a renaissance in domain-specific architectures (DSAs) to help meet today's compute demands. A large swath of these architectures is spatial in nature, where compute is unrolled in space to expose more parallelism for data-flow-heavy workloads. With these spatial architectures comes the challenge of effectively mapping workloads to the available compute units. Parallelizing compilers are often touted as the means to this goal, but their effectiveness is largely limited by the abstraction exposed by hardware to software. Here we explore the inherent challenges faced by some existing spatial architectures, such as GPUs, and explain how focusing on deterministic compute can alleviate these challenges. We do this by diving deep into Groq's Tensor Streaming Processor (TSP), exploring how the architecture empowers software to efficiently map data-flow workloads to the chip's massive amounts of compute. We demonstrate how this "software-defined hardware" approach is well-suited for data-flow compute, showcasing >5x improvements over the current state of the art on LSTM and Transformer-based models. We also explore how the compiler and architecture enable powerful hardware-software co-design.