A Data-Centric Optimization Workflow for the Python Language
Description: Python's extensive software ecosystem leads to high productivity, making it the language of choice for scientific computing. However, executing Python code is often slow or impossible on emerging architectures and accelerators. To complement Python's productivity with the performance and portability required in high-performance computing (HPC), we introduce a workflow based on data-centric (DaCe) parallel programming. Python code with HPC-oriented extensions is parsed into a dataflow-based intermediate representation, which facilitates analysis of the program's data movement. The representation is optimized via graph transformations driven by users, performance models, and automatic heuristics. Subsequently, hardware-specific code is generated for supported architectures, including CPU, GPU, and FPGA. We evaluate this workflow through three case studies. First, to compare our work to other Python-accelerating solutions, we introduce NPBench, a collection of over 50 Python microbenchmarks across a wide range of scientific domains. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer. DaCe runs 10x faster than the reference Python execution and achieves 2.47x and 3.75x speedups over previous-best solutions and up to 93.16% scaling efficiency. Second, we re-implement in Python and optimize the quantum transport simulator OMEN. The application's DaCe version executes one to two orders of magnitude faster than the original code written in C++, achieving 42.55% of the Summit supercomputer's peak performance. Finally, we use our workflow to build Deinsum, an automated framework for distributed multilinear algebra computations expressed in Einstein notation. Deinsum runs up to 19x faster than state-of-the-art solutions on the Piz Daint supercomputer.
Time: Tuesday, 15 November 2022, 8:30am - 5pm CST
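
For readers unfamiliar with the Einstein notation mentioned in the description, the following is a minimal pure-Python sketch of what a two-operand contraction such as `ik,kj->ij` computes. It is only an illustration of the notation: the function name `contract` and its `sizes` parameter are hypothetical, and the code does not reflect the Deinsum or DaCe API or how either framework evaluates such expressions.

```python
from itertools import product


def contract(spec, a, b, sizes):
    """Evaluate a two-operand Einstein-notation contraction on nested lists.

    spec  -- string such as "ik,kj->ij" (hypothetical helper, not Deinsum's API)
    a, b  -- operands as nested Python lists
    sizes -- dict mapping each index letter to its extent, e.g. {"i": 2}
    """
    inputs, out = spec.split("->")
    ia, ib = inputs.split(",")
    # Indices that appear in the inputs but not in the output are summed over.
    summed = [s for s in dict.fromkeys(ia + ib) if s not in out]

    def get(tensor, names, env):
        # Walk the nested list using the current value of each index letter.
        for n in names:
            tensor = tensor[env[n]]
        return tensor

    def build(names, env):
        if not names:
            # All output indices are fixed: accumulate over the summed indices.
            total = 0
            for vals in product(*(range(sizes[s]) for s in summed)):
                full = {**env, **dict(zip(summed, vals))}
                total += get(a, ia, full) * get(b, ib, full)
            return total
        head, rest = names[0], names[1:]
        return [build(rest, {**env, head: i}) for i in range(sizes[head])]

    return build(out, {})


# Matrix multiplication written as a contraction over the shared index k.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = contract("ik,kj->ij", A, B, {"i": 2, "k": 2, "j": 2})
print(C)
```

The same helper covers other contractions by changing only the specification string, for example `i,i->` for a dot product, which is the property that lets a framework treat many multilinear algebra kernels uniformly.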