Optimizing Communication in Parallel Deep Learning via Parameter Pruning
Large-scale neural network training is challenging due to the high ratio of communication to computation. Recent work has shown that these large networks contain sparse subnetworks consisting of 10-20% of the parameters, which, when trained in isolation, reach accuracy comparable to that of the full network. In this work, we propose a novel approach that exploits the existence of these sparse subnetworks to improve the efficiency of large-scale neural network training. By storing parameters in a sparse format while computing in a dense format, we drastically reduce the number of parameters that must be communicated while matching the compute efficiency of the original network. We exploit this reduced parameter set to optimize the communication time of AxoNN, a state-of-the-art framework for parallel deep learning. Our approach yields a 17% speedup when training a 2.7 billion parameter transformer model on 384 GPUs.
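To make the "store sparse, compute dense" idea concrete, the sketch below shows one way the communication savings could be realized: only the surviving 10-20% of gradient entries (selected by a fixed pruning mask) are all-reduced across GPUs, and the result is scattered back into a dense tensor before the next compute step. This is a minimal illustration under assumed conventions; the function names, the boolean-mask representation, and the use of PyTorch's `torch.distributed` all-reduce are our own choices here and are not taken from AxoNN's implementation.

```python
import torch
import torch.distributed as dist


def sparsify(dense: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Keep only the unpruned entries (roughly 10-20% of the values)
    # as a flat 1-D tensor suitable for communication.
    return dense[mask]


def densify(values: torch.Tensor, mask: torch.Tensor, shape: torch.Size) -> torch.Tensor:
    # Scatter the communicated values back into a dense tensor so that
    # subsequent compute kernels can run on the original dense layout.
    dense = torch.zeros(shape, dtype=values.dtype, device=values.device)
    dense[mask] = values
    return dense


def allreduce_pruned_gradient(grad: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Communicate only the unpruned gradient entries, then rebuild a
    # dense gradient for the optimizer step (illustrative helper, not AxoNN API).
    values = sparsify(grad, mask)
    dist.all_reduce(values, op=dist.ReduceOp.SUM)
    values /= dist.get_world_size()
    return densify(values, mask, grad.shape)
```

Under this scheme the volume of data moved per all-reduce shrinks in proportion to the sparsity of the subnetwork, which is the source of the communication-time savings described above.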