Colossal-AI: Scaling Large AI Models on Distributed Systems and Supercomputers
Description
The success of the Transformer model has pushed deep learning to the scale of billions of parameters. This proliferation of ever-larger models has outpaced advances in hardware, creating an urgent need to distribute the training of enormous models across multiple GPU clusters. Despite this trend, best practices for choosing an optimal parallelization strategy are still lacking, because doing so requires expertise in both deep learning and parallel computing.

The Colossal-AI system addresses this challenge with a unified interface that scales sequential model-training code to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods such as a zero-redundancy optimizer. Its design mirrors the familiar way the AI community writes non-distributed code, so existing training scripts can be adapted to efficient parallel training with little change.
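Among the methods listed above, data parallelism is the simplest to picture: each worker holds a shard of the batch, computes a local gradient, and the gradients are averaged (an all-reduce) before the shared weights are updated. The sketch below illustrates that idea in plain Python for a one-parameter linear model; the function names and setup are illustrative only and are not Colossal-AI's actual API.

```python
# Data-parallelism sketch for a 1-D linear model y = w * x with squared-error
# loss (w*x - y)^2. Each "worker" computes the mean gradient over its shard;
# averaging the shard gradients reproduces the full-batch gradient.

def local_grad(w, shard):
    # Mean gradient d/dw of (w*x - y)^2 over one shard of the batch.
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_workers, lr=0.1):
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    grads = [local_grad(w, s) for s in shards]   # runs on separate GPUs in practice
    avg_grad = sum(grads) / num_workers          # the all-reduce (average) step
    return w - lr * avg_grad                     # synchronized SGD update

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w0 = 0.0
w_parallel = data_parallel_step(w0, batch, num_workers=2)
w_serial = w0 - 0.1 * local_grad(w0, batch)      # single-worker full-batch step
print(abs(w_parallel - w_serial) < 1e-12)        # the two updates coincide
```

With equal-sized shards the averaged shard gradients equal the full-batch gradient, which is why data-parallel training matches sequential training step for step; pipeline, tensor, and sequence parallelism instead partition the model or the sequence dimension rather than the batch.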

We provide AWS computing instances with example code to help attendees get familiar with the system and apply it to scale their large AI models with minimal effort. More information about Colossal-AI is available at
Time: Monday, 14 November 2022, 1:30pm - 5pm CST
Registration Categories
AI-HPC Convergence
Cloud and Distributed Computing
Data Analytics
Data Management
Exascale Computing
Machine Learning and Artificial Intelligence
Resource Management and Scheduling