BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20230124T171522Z
LOCATION:C143-149
DTSTART;TZID=America/Chicago:20221114T114500
DTEND;TZID=America/Chicago:20221114T115000
UID:submissions.supercomputing.org_SC22_sess439_ws_scsc105@linklings.com
SUMMARY:HPC Checkpoint-Restart Strategy Using NVRAM
DESCRIPTION:Workshop\n\nHPC Checkpoint-Restart Strategy Using NVRAM\n\nFri
 dman, Snir, Rusanovsky, Zvi, Levin...\n\nThe recent entrance of the High-P
 erformance Computing (HPC) world into the exascale era challenges how vast
  amounts of data are analyzed, manipulated, and stored. However, the alrea
 dy substantial performance gap between computing, memory, and storage expa
 nds rapidly in the presence of distributed large-scale applications on new
  generation supercomputers. The widest gap of all, the memory-storage one,
  is still 2-3 orders of magnitude wide. As a result, said applications str
 uggle with two main storage-oriented tasks – diagnostics and checkpointing
  – in which there is a need to persist data during runtime for further usa
 ge. Recently, novel interdependent introductions of non-volatile RAM (NVRA
 M) hardware and persistent memory file systems (PMFSs) were made to the st
 orage stack and are planned to collectively integrate into the next Aurora
  exascale system. Fridman et al. (FTXS@SC’21) benchmarked the diagnostics 
 (FIO, BT-IO) and checkpointing (SCR, DMTCP) use-cases as in supercomputers
  with the aid of NVRAM and several PMFSs, excluding block-oriented non-vol
 atile devices. Rather, this strategy solely relies on using RAM-NVRAM and 
 even pure-NVRAM memory-storage configurations. We review these results and
  introduce how NVRAM can be utilized not only for C/R mechanisms and diagn
 ostics via PMFSs, but also for Algorithm-Based Fault Tolerance (ABFT), wit
 h the PMDK library and MPI one-sided communication directly to byte-addres
 sable NVRAM. We specifically focus on Exact State Reconstruction of iterat
 ive linear solvers. We show that this strategy utilizes hardware properly 
 and reliably, achieving best-known performance for those use-cases and,
  as such, suggesting a new approach to devising recoverable HPC
  algorithms.\n\nSession Format: Recorded\n\nTag: Reliability and
  Resiliency\n\nRegistration Category: Workshop Reg Pass
END:VEVENT
END:VCALENDAR
