Authors: Gilad Shainer (NVIDIA Corporation), Dhabaleswar Panda (Ohio State University), Jithin Jose (Microsoft Corporation), Richard Graham (NVIDIA Corporation), Sergio Iserte (Barcelona Supercomputing Center (BSC))
Abstract: Being a standards-based interconnect, InfiniBand enjoys the continuous development of new capabilities. NDR 400G InfiniBand In-Network Computing and Data Processing Unit (DPU) technologies provide innovative hardware and programmable engines that offload and accelerate communication frameworks and application algorithms. The session will discuss the InfiniBand In-Network Computing technology and testing results from leading supercomputing platforms as well as the NVIDIA Selene AI supercomputer. As the need for higher data speeds grows, the InfiniBand Trade Association has been working to set the goals for future speeds (XDR and beyond). The session will also cover this topic and present the first NDR results.
Long Description: HPC and AI supercomputers are the essential tools we need to conduct research, enable scientific discoveries, design new products, and develop self-learning software algorithms. Supercomputing leadership means scientific leadership, which explains the investments made by many governments and research institutes to build faster and more powerful supercomputing platforms. The heart of a supercomputer is the network that connects the compute elements together, enabling parallel and synchronized computing cycles. Over the past decades, many network technologies have been created and many have disappeared. InfiniBand, an industry standard developed in 1999, continues to show a strong presence in the high-performance computing market. It connected one of the top three supercomputers in 2003, and today it is used in many of the leading supercomputers in the world, according to the TOP500 list. Being a standards-based interconnect, InfiniBand enjoys the continuous development of new capabilities, better performance, and scalability.
InfiniBand technology can be separated into three main pillars: connectivity, network, and communication. The connectivity pillar refers to the elements around the interconnect infrastructure, such as topologies. The network pillar refers to elements such as network transport and routing. The communication pillar refers to co-design elements related to communication frameworks such as MPI, SHMEM/PGAS, NCCL and more. In the past, smart interconnect development focused on offloading network functions from the CPU to the network. With the new efforts in the co-design approach, the new generation of smart interconnects will also offload data algorithms that will be managed within the network, allowing users to run these algorithms while the data is being transferred within the system interconnect, rather than waiting for the data to reach the CPU. This technology is referred to as In-Network Computing.
In-Network Computing transforms the data center interconnect into a “distributed CPU” and “distributed memory”, making it possible to overcome performance walls and to achieve faster and more scalable data analysis. NDR 400G InfiniBand In-Network Computing technology provides innovative engines that accelerate and improve each of the pillars, such as the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), a technology that was developed by Oak Ridge National Laboratory and Mellanox and received the R&D100 award, smart Tag Matching and rendezvous protocol, SHIELD and more.
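As a rough illustration of the hierarchical aggregation idea behind in-network reduction, the following Python sketch sums per-host vectors level by level, the way a tree of aggregation nodes in the network might, instead of sending every vector to one CPU. It is a conceptual model only and does not represent the SHARP protocol or any InfiniBand API; the function names `in_network_allreduce` and `tree_depth` and the `radix` parameter are hypothetical and chosen for this example.

```python
from typing import List


def in_network_allreduce(host_vectors: List[List[float]], radix: int = 2) -> List[float]:
    """Sum equal-length vectors level by level, as a tree of aggregation nodes might.

    Hypothetical model: each group of `radix` inputs is combined by one
    aggregation node, and the partial sums move up until one result remains.
    """
    if not host_vectors:
        raise ValueError("need at least one host vector")
    if radix < 2:
        raise ValueError("radix must be at least 2")

    level = [list(v) for v in host_vectors]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), radix):
            group = level[i:i + radix]
            # Element-wise sum of the vectors handled by this aggregation node.
            next_level.append([sum(vals) for vals in zip(*group)])
        level = next_level
    return level[0]


def tree_depth(num_hosts: int, radix: int = 2) -> int:
    """Number of aggregation levels needed to reduce `num_hosts` inputs to one."""
    depth = 0
    nodes = num_hosts
    while nodes > 1:
        nodes = -(-nodes // radix)  # ceiling division
        depth += 1
    return depth


if __name__ == "__main__":
    vectors = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(in_network_allreduce(vectors))
    print(tree_depth(len(vectors)))
```

In this model the number of aggregation steps grows with the depth of the tree rather than with the number of hosts, which is the intuition behind reducing data while it moves through the network.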
The recent introduction of the InfiniBand Data Processing Unit (DPU) brings a new tier of computing to further address performance bottlenecks. InfiniBand DPUs introduce a new programmability tier into the InfiniBand network to support algorithm overlap, effective process management, adaptive performance isolation and more.
The session will discuss the InfiniBand In-Network Computing technology and testing results from leading supercomputing platforms. The session will also deliver a deep dive into InfiniBand DPU computing. Performance results of a variety of HPC and AI applications will be presented as well.