GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

SC22 Proceedings

Authors: Maxim Zvyagin (Argonne National Laboratory (ANL)); Alexander Brace (Argonne National Laboratory (ANL), University of Chicago); Kyle Hippe (Argonne National Laboratory (ANL)); Yuntian Deng (NVIDIA Corporation, Harvard University); Bin Zhang and Cindy Bohorquez (Cerebras Systems); Austin Clyde (Argonne National Laboratory (ANL), University of Chicago); Bharat Kale (Northern Illinois University); Danilo Perez-Rivera (Argonne National Laboratory (ANL), New York University (NYU)); Heng Ma (Argonne National Laboratory (ANL)); Carla M. Mann (Argonne National Laboratory (ANL), University of Chicago); Michael Irvin (Argonne National Laboratory (ANL)); J. Gregory Pauloski (University of Chicago); Logan Ward (Argonne National Laboratory (ANL)); Valerie Hayot-Sasson (Argonne National Laboratory (ANL), University of Chicago); Murali Emani, Sam Foreman, and Zhen Xie (Argonne National Laboratory (ANL)); Diangen Lin and Maulik Shukla (Argonne National Laboratory (ANL), University of Chicago); Weili Nie and Josh Romero (NVIDIA Corporation); Christian Dallago (NVIDIA Corporation, Technical University Munich); Arash Vahdat (NVIDIA Corporation); Chaowei Xiao (Arizona State University, NVIDIA Corporation); Thomas Gibbs (NVIDIA Corporation); Ian Foster and James J. Davis (Argonne National Laboratory (ANL), University of Chicago); Michael Papka (Argonne National Laboratory (ANL); University of Illinois, Chicago); Thomas Brettin (Argonne National Laboratory (ANL)); Rick Stevens (Argonne National Laboratory (ANL), University of Chicago); Anima Anandkumar (NVIDIA Corporation, California Institute of Technology); and Venkatram Vishwanath and Arvind Ramanathan (Argonne National Laboratory (ANL))

Abstract: We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.
