Authors: Ke Fan and Sidharth Kumar (University of Alabama, Birmingham)
Abstract: The standard implementation of MPI_Alltoall uses a combination of techniques, including the spread-out and Bruck algorithms. The existing Bruck algorithm implementation is limited to a radix of two, so the total number of communication steps is fixed at log2(P) (P: total number of processes). The spread-out algorithm, on the other hand, requires P-1 communication steps. There remains a wide unexplored parameter area between these two extremities of the communication spectrum that can be tuned. In this paper, we formalize a generalized formula and implementation of the Bruck algorithm, whose radix can be varied from 2 to P-1. With this ability, both the total number of communication steps and the total amount of data transmitted can be tuned, which allows performance tuning. We performed an experimental investigation and demonstrated that the Bruck with the optimal radix is up to 57% faster than the vendor's optimized MPI_Alltoall on the Theta supercomputer.
Best Poster Finalist (BP): no
Poster summary: PDF
Back to Poster Archive Listing