BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20230124T171523Z
LOCATION:C1-2-3
DTSTART;TZID=America/Chicago:20221117T083000
DTEND;TZID=America/Chicago:20221117T170000
UID:submissions.supercomputing.org_SC22_sess275_rpost177@linklings.com
SUMMARY:Parameterized Radix-r Bruck Algorithm for All-to-All Communication
DESCRIPTION:Posters, Research Posters\n\nParameterized Radix-r Bruck Algor
 ithm for All-to-All Communication\n\nFan, Kumar\n\nThe standard implementa
 tion of MPI_Alltoall uses a combination of techniques, including the sprea
 d-out and Bruck algorithms. The existing Bruck algorithm implementation i
 s limited to a radix of two, so the total number of communication steps i
 s fixed at log2(P), where P is the total number of processes. The spread
 -out algorithm, on the other hand, requires P-1 communication steps. Bet
 ween these two extremes of the communication spectrum lies a wide, larg
 ely unexplored parameter space that can be tuned. In this paper, we for
 malize and implement a generalized Bruck algorithm whose radix can be v
 aried from 2 to P-1. Varying the radix makes it possible to tune both t
 he total number of communication steps and the total amount of data tr
 ansmitted. Our experiments on the Theta supercomputer show that the Br
 uck algorithm with the optimal radix is up to 57% faster than the vend
 or-optimized MPI_Alltoall.\n\nRegistration Category: Tech Program Reg Pa
 ss, Exhibits Reg Pass
END:VEVENT
END:VCALENDAR