Fault-Tolerance for High-Performance and Big Data Applications: Theory and Practice
Description
Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance and big data applications, with a fair balance between theory and practice. This tutorial is organized across four main topics:

(i) Overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);

(ii) General-purpose techniques, which include several checkpoint and rollback-recovery protocols, replication, prediction, and silent error detection;

(iii) Application-specific techniques, such as user-level in-memory checkpointing, data replication (map-reduce), or fixed-point convergence for iterative applications (back-propagation);

(iv) Practical deployment of fault tolerance techniques with User Level Fault Mitigation (a proposed MPI standard extension). Examples include widely used routines such as Monte-Carlo methods, SPMD stencil, map-reduce, and back-propagation in neural networks.

A step-by-step approach will show how to protect these routines and make them fault-tolerant, using a variety of techniques, in a hands-on session (a Docker container will be provided).
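
As an illustration of topic (i), the sketch below draws failure inter-arrival times from the three distributions named above, each calibrated to the same mean time between failures (MTBF). It is not taken from the tutorial material: the function name, the Weibull shape (0.7), and the Log-Normal sigma (1.0) are assumptions chosen only for illustration, and it relies solely on Python's standard `random` and `math` modules.

```python
import math
import random


def sample_interarrival_times(dist, mtbf, n, rng,
                              weibull_shape=0.7, lognormal_sigma=1.0):
    """Draw n failure inter-arrival times whose expected value is mtbf.

    dist is one of "exponential", "weibull", "lognormal". The shape and
    sigma defaults are illustrative assumptions, not values from the tutorial.
    """
    if dist == "exponential":
        # Exponential with rate 1/mtbf has mean mtbf.
        return [rng.expovariate(1.0 / mtbf) for _ in range(n)]
    if dist == "weibull":
        # Weibull mean = scale * Gamma(1 + 1/shape); solve for the scale.
        scale = mtbf / math.gamma(1.0 + 1.0 / weibull_shape)
        return [rng.weibullvariate(scale, weibull_shape) for _ in range(n)]
    if dist == "lognormal":
        # Log-Normal mean = exp(mu + sigma^2 / 2); solve for mu.
        mu = math.log(mtbf) - 0.5 * lognormal_sigma ** 2
        return [rng.lognormvariate(mu, lognormal_sigma) for _ in range(n)]
    raise ValueError(f"unknown distribution: {dist}")
```

Calling, for example, `sample_interarrival_times("weibull", 3600.0, 10, random.Random(42))` yields ten inter-arrival times in the same unit as the MTBF. Matching the mean across distributions makes it easier to compare how the shape of the failure law alone changes the spread of failure times.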

The tutorial is open to all SC22 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific and big data applications. Background will be provided for all protocols and probabilistic models. Basic MPI knowledge will be helpful for the hands-on session.
Event Type
Tutorial
Time
Monday, 14 November 2022, 8:30am - 5pm CST
Location
D175
Registration Categories
TUT
Tags
Algorithms
Applications
Big Data
Cloud and Distributed Computing
Datacenter
Performance
Reliability and Resiliency
Session Formats
Recorded