Scaling Out ML Training with the Oracle Cloud Infrastructure
Description
We have seen an increasing need for large-scale training of ML models as startups and established companies alike seek to gain an edge with ever larger and more powerful models. These models require hundreds or thousands of GPUs for extended periods of time, and performance is crucial both at the level of the individual GPU and in scaling efficiently across the network. A well-known example is Aleph Alpha, whose five-language, GPT-3-like model has up to 300 billion parameters and even offers visual understanding in full multimodality, significantly extending the range of established possibilities. Scaling the training of such large models can be complex and difficult to tune, requiring a cost-effective infrastructure with NVIDIA A100 GPUs and high-throughput, ultra-low-latency RDMA networking that can provide availability, resiliency, and performance at scale.

In this talk, we will discuss our approach to supporting the needs of these large-scale ML models for training and inference on Oracle Cloud, and showcase the full-stack foundation behind their transformative business potential. We will show usage examples from several companies and discuss the challenges that had to be addressed to run these models at such scale in a modern enterprise architecture. We will conclude with a discussion of some open research problems that remain in this area.
Event Type
Exhibitor Forum
Time
Thursday, 17 November 2022, 10:30am - 11am CST
Location
D171
Registration Categories
TP
XO/EX
Session Formats
Recorded