Scientific Visualization & Data Analytics Showcase
Recorded
TP
Description: The Advanced Visualization Lab at the National Center for Supercomputing Applications created a cinematic scientific visualization of the ArcticDEM survey and the Vavilov ice cap collapse for the documentary film "Atlas of a Changing Earth," in both digital fulldome and flatscreen television formats. While the ArcticDEM dataset is the main one featured here, the visualization fills in gaps using other datasets, including a climate simulation by Bates et al. and Landsat imagery. The visualization required a number of steps, including manual and algorithmic data cleaning, processing, and alignment; data fusion; virtual scene design; morphing interpolation; lighting design; camera choreography; compositing; and rendering on the Blue Waters supercomputer.
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
Description: Over the past three decades, ab initio electronic structure calculations of large, complex, and metallic systems have been limited to tens of thousands of atoms in computational accuracy and efficiency on leadership supercomputers. We present a massively parallel discontinuous Galerkin density functional theory (DGDFT) implementation, which adopts adaptive local basis functions to discretize the Kohn-Sham equation, resulting in a block-sparse Hamiltonian matrix. A highly efficient pole expansion and selected inversion (PEXSI) sparse direct solver is implemented in DGDFT to achieve O(N^1.5) scaling for quasi-two-dimensional systems. DGDFT allows us to compute the electronic structures of complex metallic heterostructures with 2.5 million atoms (17.2 million electrons) using 35.9 million cores on the new Sunway supercomputer. The peak performance of PEXSI reaches 64 PFLOPS (5% of theoretical peak), which is unprecedented for sparse direct solvers. This accomplishment paves the way for quantum mechanical simulations at the mesoscopic scale for designing next-generation electronic devices.
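The block-sparse structure produced by the adaptive local basis is what makes the solver's scaling possible: storage and work grow with the number of nonzero blocks rather than with the full matrix. As a language-neutral illustration (a toy sketch, not the DGDFT code), here is a block-sparse matrix-vector product in Python where only nonzero blocks are stored and touched:

```python
def block_sparse_matvec(blocks, x, block_size, n_blocks):
    """y = H @ x where H is stored as {(I, J): dense block} -- only nonzero
    blocks are kept, so storage and work scale with the number of blocks."""
    y = [0.0] * (n_blocks * block_size)
    for (I, J), B in blocks.items():
        for r in range(block_size):
            for c in range(block_size):
                y[I * block_size + r] += B[r][c] * x[J * block_size + c]
    return y

# 2x2 block matrix with one empty off-diagonal block (block size 2):
# H = [[A, 0], [C, D]]
A = [[1.0, 0.0], [0.0, 1.0]]
C = [[2.0, 0.0], [0.0, 2.0]]
D = [[3.0, 0.0], [0.0, 3.0]]
blocks = {(0, 0): A, (1, 0): C, (1, 1): D}
print(block_sparse_matvec(blocks, [1.0, 1.0, 1.0, 1.0], 2, 2))
# [1.0, 1.0, 5.0, 5.0]
```

A production solver such as PEXSI works on this kind of structure with distributed-memory factorization rather than a dense per-block loop.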
Awards Presentation
Recorded
Awards
TP
Description: Linking scientific instruments and computation: Patterns, technologies, experiences
Powerful detectors at modern experimental facilities that collect data at multiple GB/s require online computing to process the resulting data flows. I review common patterns associated with such online analyses, and present new methods for configuring and running the resulting distributed computing pipelines. I present experiences with the application of these methods to the processing of data from five scientific instruments, each of which engages powerful computers for data inversion, model training, or other purposes. I also discuss implications of such methods for operators and users of scientific facilities.
Awards Presentation
Recorded
Awards
TP
Description: From two strong oxen to billions of fleas: orchestrating computation and data in modern high-performance computing
Following Sidney Fernbach's legacy, we will explore how massively parallel distributed supercomputers are designed, programmed, and operated today. We focus on aspects of distributed-memory parallelism using Remote Direct Memory Access through the Message Passing Interface. We will close with an outlook on where technology will lead us and new problems for the HPC community to tackle in the coming years.
Awards Presentation
Recorded
Awards
TP
Description: Quotes from Seymour Cray—Are we living up to his legacy?
Seymour Cray, often regarded as the “father of supercomputing”, endowed us with valuable quotes during his stellar career, and many of those quotes can now be found online. One can say that HPC in general has made massive progress since his unfortunate passing, but the question is: has HPC advanced in a way that lives up to his ideals and his legacy, and is it moving forward properly? Moreover, it is difficult even for a genius to predict the future accurately; do the ideals in his quotes hold up in present-day HPC? We review his quotes against some historical supercomputing developments I have been involved in to address these questions.
Birds of a Feather
TP
XO/EX
Description: Data-intensive supercomputer applications are increasingly important workloads, especially for “Big Data” problems, but are ill-suited for most of today’s computing platforms (at any scale!). The Graph500 list has grown to over 328 entries and has demonstrated the challenges of even simple analytics. The new SSSP kernel introduced at SC17 has increased the benchmark’s overall difficulty. This BoF will unveil the latest Graph500 lists, provide in-depth analysis of the kernels and machines, and enhance the new energy metrics of the Green Graph500. It will offer a forum for the community and provide a rallying point for data-intensive supercomputing problems.
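The two Graph500 kernels are breadth-first search (BFS) and single-source shortest paths (SSSP). As a minimal, non-distributed illustration of what the SSSP kernel computes (a real Graph500 run uses massively parallel implementations on huge generated graphs), here is a Dijkstra-style sketch in Python:

```python
import heapq

def sssp(adj, src):
    """Single-source shortest paths (Dijkstra) on a weighted adjacency list."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, already found a shorter path
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Toy graph: edges 0->1 (1.0), 1->2 (2.0), 0->2 (5.0)
adj = {0: [(1, 1.0), (2, 5.0)], 1: [(2, 2.0)], 2: []}
print(sssp(adj, 0))  # {0: 0.0, 1: 1.0, 2: 3.0}
```

The benchmark's difficulty comes from running this kind of traversal on graphs with billions of edges distributed across thousands of nodes, where the priority queue and edge relaxations must be parallelized.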
Student Cluster Competition
TP
XO/EX
Description: The SDSC/UCSD SCC22 team is enthusiasm-driven, technically capable, fast-learning, and deeply experienced across the computer hardware and software stacks. Each team member is uniquely qualified and committed to using HPC to advance their field. We have one returning team member from the SCC21 virtual cluster competition, one team member graduating from previous competition training to the competition team, three former team members serving as team mentors, and four new students joining the competition team. We are confident in our team’s ability to tackle expected and unexpected challenges in the competition, using a combination of rigorous preparation, strong communication, robust planning, detailed learning, and efficient teamwork. Our team training activities are fully supported by SDSC through the HPC Students Program, and we are engaging directly with each of our sponsors for expert sessions on computer architecture, optimizing compilers, HPC in the cloud, containerization, and more.
Our team members exploit the full flexibility of the UCSD computer science, cognitive science, and computer engineering majors. Our technical stack includes: major programming languages (C, C++, Java, Python, Fortran), system administration, firmware engineering, parallel programming (MPI, OpenMP, CUDA), hardware design (SystemVerilog, Tcl, Cadence, Synopsys), scientific applications (LAMMPS, Quantum ESPRESSO, Avogadro, VMD), full-stack web development (Node.js, React, HTML), scripting and batch processing, and machine learning. Many team members have both undergraduate research and industry internship experience.
Edward Burns previously interned at SDSC, and he brings image processing, software engineering, and batch scheduler optimization experience to the team. He hopes that HPC experience will help him build highly scalable computer vision software throughout his career.
Davit Margarian brings a VLSI chip design and firmware background to the team. He hopes to use his HPC experience to accelerate computer-aided design tools for billion-gate integrated circuits.
Stefanie Dao is experienced across operating systems, computer vision, and high-performance software. She plans to apply her HPC experience to server-side processing and updating of augmented reality experiences in real time.
Longtian Bao has strong scripting, software engineering, and web development skills, and he participated in last year’s team training. He is excited to apply his skills to resource budgeting and performance monitoring during the competition.
Yuchen Jing has extensive networking and Linux system administration experience from hosting network proxies, file transfer servers, and version control systems. He is looking forward to strengthening his skills in developing, deploying, and maintaining high performance software.
Matthew Mikhailov competed at SCC21, and is the go-to person for the team. He specializes in VLSI chip design and computational materials science, and he uses the LAMMPS code for his research. He hopes to learn from his SCC experience to design the next generation of supercomputer chips.
Team advisor, Dr. Mary Thomas, SDSC HPC Training Lead, holds degrees in physics, computer science, and computational science, and taught parallel computing for 16 years. She has a personal commitment to the SCC program -- she has led 4 teams: SCC16 and 17 (San Diego State University) and SCC20-21 (UCSD). Her enthusiasm, knowledge, and practical experience will benefit the team.
Posters
Research Posters
TP
XO/EX
Description: Radio-frequency cavities are key components of high-energy particle accelerators, quantum computers, and other devices. Designing cavities involves many computational challenges, such as multi-objective optimization and the high-performance computing (HPC) requirements of handling large cavities. In particular, the multi-objective optimization requires an efficient 3D full-wave electromagnetic simulator, for which we rely on the integral equation (IE) method; this in turn requires a fast solver with HPC and ML algorithms to search for resonance modes.
We propose an HPC-based fast direct matrix solver for the IE, combined with hybrid optimization algorithms, to attain an efficient simulator for accelerator cavity modeling. First, we solve the linear eigenproblem for each trial frequency with a distributed-memory parallel, fast direct solver. Second, we propose combining the global Gaussian-process optimizer with the local downhill-simplex optimizer to generate the trial frequency samples, which successfully optimizes the corresponding 1D objective function with multiple sharp minima.
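The global-plus-local strategy can be sketched generically: a coarse global scan locates the basin of the sharpest minimum, and a local 1D method refines it. The toy Python version below uses uniform sampling plus golden-section search as stand-ins for the paper's Gaussian-process and downhill-simplex optimizers:

```python
import math

def golden_section(f, a, b, tol=1e-8):
    """Local 1D refinement (golden-section search) inside bracket [a, b]."""
    g = (math.sqrt(5) - 1) / 2
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2

def hybrid_minimize(f, lo, hi, n_coarse=200):
    """Coarse global scan to locate the deepest dip, then local refinement."""
    xs = [lo + (hi - lo) * i / (n_coarse - 1) for i in range(n_coarse)]
    x0 = min(xs, key=f)
    h = (hi - lo) / (n_coarse - 1)
    return golden_section(f, x0 - h, x0 + h)

# Objective with several sharp minima; the global minimum sits near x = 2.0
f = lambda x: -1.0 / ((x - 2.0) ** 2 + 1e-3) - 0.3 / ((x - 5.0) ** 2 + 1e-3)
x_star = hybrid_minimize(f, 0.0, 10.0)
print(round(x_star, 3))  # 2.0
```

The practical point is the division of labor: the global stage only needs to land inside the right narrow basin, after which a cheap local method finishes the job.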
Posters
Research Posters
TP
XO/EX
Description: We present a modern C++20 interface for MPI 4.0. The interface utilizes recent language features to ease development of MPI applications. An aggregate reflection system enables generation of MPI data types from user-defined classes automatically. Immediate and persistent operations are mapped to futures, which can be chained to describe sequential asynchronous operations and task graphs in a concise way. This work introduces the prominent features of the interface with examples. We further measure its performance overhead with respect to the raw C interface.
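The C++20 interface itself is not reproduced here, but the idea of mapping asynchronous operations to chainable futures can be illustrated in any language with a futures library. A hedged Python sketch using concurrent.futures (the names `then` and `recv` are illustrative, not the paper's API):

```python
from concurrent.futures import ThreadPoolExecutor

def then(pool, future, fn):
    """Chain: run fn on the result of future, returning a new future."""
    return pool.submit(lambda: fn(future.result()))

with ThreadPoolExecutor(max_workers=2) as pool:
    # Stand-in for an immediate (nonblocking) receive producing data later.
    recv = pool.submit(lambda: [1, 2, 3])
    # Each continuation runs only after its predecessor completes,
    # describing a small sequential task graph declaratively.
    squared = then(pool, recv, lambda xs: [x * x for x in xs])
    total = then(pool, squared, sum)
    print(total.result())  # 14
```

In the paper's setting the first future would wrap an MPI immediate or persistent operation rather than a thread-pool task, but the chaining pattern is the same.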
Workshop
Recorded
W
Description: In High-Performance Computing, new use cases are emerging in which classical numerical simulations are coupled with machine learning as a surrogate for complex physical models that are expensive to compute. In the context of simulating reactive thermo-fluid systems, the idea of replacing current state-of-the-art tabulated chemistry with machine learning inference is an active field of research. For this purpose, a simplified OpenFOAM application is coupled with an artificial neural network. In this work, we present a case study focusing solely on the performance of the coupled OpenFOAM-ML application. Our coupling approach features a heterogeneous cluster architecture combining pure CPU nodes and nodes equipped with two NVIDIA V100 GPUs. We evaluate our approach by comparing the inference performance and the communication it induces across various machine learning frameworks. Additionally, we compare the GPUs with the NEC Vector Engine Type 10B regarding inference performance.
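The baseline being replaced, tabulated chemistry, amounts to precomputing an expensive model on a grid and interpolating into the table at runtime. A minimal sketch of that pattern (the Arrhenius-shaped `rate` function is illustrative only, not the paper's chemistry model):

```python
import math
from bisect import bisect_left

def build_table(f, lo, hi, n):
    """Precompute the expensive model on a grid (the 'tabulated chemistry')."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return xs, [f(x) for x in xs]

def lookup(xs, ys, x):
    """Linear interpolation into the table, standing in for tabulated lookup."""
    i = min(max(bisect_left(xs, x), 1), len(xs) - 1)
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] * (1 - t) + ys[i] * t

# An 'expensive' reaction-rate-like model (Arrhenius-shaped, for illustration)
rate = lambda T: math.exp(-1.0 / T)
xs, ys = build_table(rate, 0.5, 2.0, 64)
print(abs(lookup(xs, ys, 1.3) - rate(1.3)) < 1e-3)  # True
```

An ML surrogate replaces the table with a trained network, trading grid storage (which explodes with the number of input dimensions) for inference cost, which is exactly why the paper benchmarks inference throughput on GPUs and vector engines.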
Workshop
Recorded
Quantum Computing
W
Description: Photons are natural resources in quantum information, and the last decade has shown significant progress in high-quality single-photon generation and detection. Furthermore, photonic qubits are easy to manipulate and do not require particularly strongly sealed environments, making them an appealing platform for QC. With the one-way model, the vision of universal and large-scale QCs based on photonics becomes feasible. In one-way computing, the input state is not an initial product state |0>^n but a so-called cluster state. A series of measurements on the cluster state's individual qubits and their temporal order, together with a feed-forward procedure, determine the quantum circuit to be executed. We propose a pipeline to convert a QASM circuit into a graph representation named the measurement graph (m-graph), which can be directly translated to hardware instructions on an optical one-way QC. Additionally, we optimize the graph using ZX-calculus before evaluating the execution on an experimental discrete-variable photonic platform.
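The correspondence between entangling gates and cluster-state structure can be sketched simply: vertices are qubits prepared in |+>, and each CZ interaction contributes an edge. The toy builder below illustrates only that mapping; it is not the paper's m-graph pipeline or its ZX-calculus optimization:

```python
def circuit_to_graph(n_qubits, cz_gates):
    """Toy graph-state builder: each CZ gate adds an edge between its qubits.

    This mirrors how graph states are specified (vertices = qubits in |+>,
    edges = CZ interactions); measurement order and feed-forward, which the
    one-way model also needs, are deliberately omitted here.
    """
    edges = set()
    for a, b in cz_gates:
        edges.add((min(a, b), max(a, b)))  # deduplicate undirected edges
    adjacency = {q: set() for q in range(n_qubits)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

# A 4-qubit linear cluster state: CZ between neighbors 0-1, 1-2, 2-3.
g = circuit_to_graph(4, [(0, 1), (1, 2), (2, 3)])
print(g[1])  # {0, 2}
```

A full pipeline additionally has to assign each vertex a measurement basis and a position in the temporal order, which is where the hardware translation happens.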
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
Description: Scientific applications are increasingly adopting Artificial Intelligence (AI) techniques to advance science. High-performance computing centers are evaluating emerging novel hardware accelerators to efficiently run AI-driven science applications. With a wide diversity in the hardware architectures and software stacks of these systems, it is challenging to understand how these accelerators perform. The state of the art in the evaluation of deep learning workloads primarily focuses on CPUs and GPUs. In this paper, we present an overview of dataflow-based novel AI accelerators from SambaNova, Cerebras, Graphcore, and Groq.
We present a first-of-its-kind evaluation of these accelerators with a diverse set of workloads, such as deep learning (DL) primitives, benchmark models, and scientific machine learning applications. We also evaluate the performance of collective communication, which is key for distributed DL implementations, along with a study of scaling efficiency. We then discuss key insights, challenges, and opportunities in integrating these novel AI accelerators into supercomputing systems.
Invited Talk
Recorded
TP
XO/EX
Description: Predictive understanding and actionable insights for sustainability in the modern era require an effective blend of theory and data-driven sciences. Relevant theory includes physics, biogeochemistry, and ecology within the natural sciences, and engineering principles and economic, social, and governance principles in human-engineered systems and the social sciences. The data-driven sciences need to consider Big Data, such as archived numerical model simulations along with remotely sensed observations, and relatively small data, such as historical observations or even prehistorical proxy records, as well as prior domain knowledge and lessons learned from rare events and extremes. The underlying spatiotemporal data-generation processes may be nonlinear dynamical, even chaotic, while the variability may be low-frequency, even 1/f noise. Data may be sparse or incomplete, prior knowledge and physics may be incomplete or over-parameterized, and falsifiability and comprehensive uncertainty characterization are critical to inform decisions and add to our collective knowledge. Understanding the implications for domain-aware high-performance computing may be critical both for the sciences and engineering and for investments or research directions in supercomputing. The first part of the presentation will describe these challenges and discuss how next-generation Artificial Intelligence may be able to provide solutions and where further developments may be necessary. The second part will discuss recent research at my Sustainability and Data Sciences Laboratory, specifically on the impacts of climate variability and weather extremes on ecology and biodiversity and on urban or regional critical lifeline infrastructures, with an emphasis on the associated challenges and opportunities in processing earth science data.
Doctoral Showcase
Posters
Recorded
TP
Description: Python's extensive software ecosystem leads to high productivity, rendering it the language of choice for scientific computing. However, executing Python code is often slow or impossible in emerging architectures and accelerators. To complement Python's productivity with the performance and portability required in high-performance computing (HPC), we introduce a workflow based on data-centric (DaCe) parallel programming. Python code with HPC-oriented extensions is parsed into a dataflow-based intermediate representation, facilitating analysis of the program's data movement. The representation is optimized via graph transformations driven by the users, performance models, and automatic heuristics. Subsequently, hardware-specific code is generated for supported architectures, including CPU, GPU, and FPGA. We evaluate the above workflow through three case studies. First, to compare our work to other Python-accelerating solutions, we introduce NPBench, a collection of over 50 Python microbenchmarks across a wide range of scientific domains. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer. DaCe runs 10x faster than the reference Python execution and achieves 2.47x and 3.75x speedups over previous-best solutions and up to 93.16% scaling efficiency. Second, we re-implement in Python and optimize the Quantum Transport Simulator OMEN. The application's DaCe version executes one to two orders of magnitude faster than the original code written in C++, achieving 42.55% of the Summit supercomputer's peak performance. Last, we utilize our workflow to build Deinsum, an automated framework for distributed multilinear algebra computations expressed in Einstein notation. Deinsum performs up to 19x faster over state-of-the-art solutions on the Piz Daint supercomputer.
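At its core, microbenchmarking of the NPBench kind boils down to timing a kernel over repeated runs and reporting the best measurement to suppress warm-up and scheduling noise. A minimal harness in that spirit (illustrative only; NPBench's actual API and kernels differ):

```python
import time

def benchmark(kernel, args, repeat=5):
    """Time a kernel over several repetitions; return the best run in seconds.

    Taking the minimum (rather than the mean) filters out one-off stalls
    from the OS scheduler, caches warming up, or JIT compilation.
    """
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        kernel(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Toy kernel standing in for a scientific microbenchmark.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = list(range(10_000))
t = benchmark(dot, (a, a))
print(t >= 0.0)  # True
```

Comparing the same kernel across backends (reference Python, DaCe-generated CPU/GPU code, and so on) is then a matter of swapping the `kernel` argument while keeping the harness fixed.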
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
Description: Scientific Workflow Management Systems (SWfMS) systematically capture and store diverse provenance information at various phases. Scientists compose a multitude of queries on this information. Support for integrated query composition and visualization in existing SWfMS is limited; most systems do not support any custom query composition. VisTrails and Taverna introduced the custom query languages vtPQL and TriQL to support limited workflow monitoring. Galaxy only tracks histories of operations and displays them in lists. No SWfMS offers a scientist-friendly user interface for provenance query composition and visualization. We propose a domain-specific composition environment for provenance queries on scientific workflows. As a proof of concept, we developed a provenance system for a bioinformatics workflow management system and evaluated it along multiple dimensions: one measuring participants' subjective perception of its usability using the NASA-TLX and SUS survey instruments, and another measuring its flexibility through plugin integration using NASA-TLX.
Workshop
Recorded
W
Description: There is an increasing demand to incorporate hybrid environments as part of workflows across edge, cloud, and HPC systems. In such a converging environment of cloud and HPC, containers are starting to play a more prominent role, bringing their networking infrastructure along with them. However, the current body of work shows that container overlay networks, which are often used to connect containers across physical hosts, are ill-suited for the HPC environment. They tend to impose significant overhead and noise, resulting in degraded performance and disturbance to co-located processes on the same host.
This presentation focuses on utilizing a novel class of hardware, the Data Processing Unit (DPU), to offload the networking stack of overlay networks away from the host onto the DPU. We intend to show that such ancillary offload is possible and that it will decrease overhead on host nodes, which in turn will improve the performance of running processes.
Workshop
Recorded
W
Description: Version 4.0 of the Message Passing Interface standard introduced the concept of Partitioned Communication, which adds support for multiple contributions to a communication buffer. Although initially targeted at multithreaded MPI applications, Partitioned Communication is currently attracting attention in the context of accelerators, especially GPUs. In this publication, we demonstrate that this communication concept can be implemented for SYCL-programmed FPGAs. This includes a discussion of the design space and the presentation of a prototype implementation. Experimental results show that a lightweight implementation on top of an existing MPI library is possible. The presented approach also reveals issues in both the SYCL and the MPI standards, which need to be addressed for improved support of the intended communication style.
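The contract behind Partitioned Communication is that several workers each fill and mark ready their own partition of one send buffer, and the transfer can complete once every partition is ready. The sketch below is a toy shared-memory model of that ready-flag discipline in Python; real code would use MPI_Psend_init/MPI_Pready (or, in this paper's setting, SYCL kernels on the FPGA), not threads and events:

```python
import threading

class PartitionedBuffer:
    """Toy model of an MPI-4 partitioned send: workers fill partitions and
    mark them ready; the 'send' completes once every partition is ready."""

    def __init__(self, n_parts, part_len):
        self.data = [None] * (n_parts * part_len)
        self.part_len = part_len
        self.ready = [threading.Event() for _ in range(n_parts)]

    def pready(self, i, values):
        lo = i * self.part_len
        self.data[lo:lo + self.part_len] = values
        self.ready[i].set()  # analogous to MPI_Pready(i)

    def wait_all(self):
        for ev in self.ready:
            ev.wait()
        return self.data

buf = PartitionedBuffer(n_parts=4, part_len=2)
workers = [threading.Thread(target=buf.pready, args=(i, [i, i]))
           for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(buf.wait_all())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The appeal for accelerators is visible even in this toy: no worker has to synchronize with the others before contributing its slice, so partial results can flow out as they are produced.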
Workshop
Recorded
W
Description: Background. Automated breast tumor segmentation for dynamic contrast-enhanced magnetic resonance (DCE-MR) imaging is a crucial step toward advancing and implementing radiomics for image-based, quantitative assessment of breast tumors and cancer phenotyping. Current studies focus on developing tumor segmentation that often requires initial seed points from expert radiologists or atlas-based segmentation methods. We develop a robust, fully automated end-to-end segmentation pipeline for breast cancers on bilateral breast MR studies.
Methods. On IRB-approved, diverse breast cancer MR cases, a deep learning segmentation algorithm was created and trained. The model’s backbone is UNet++, which consists of U-Nets of varying depths whose decoders are densely connected at the same resolution via skip connections; all constituent U-Nets are trained simultaneously to learn a shared image representation. This design not only improves overall segmentation performance but also enables model pruning at inference time. The model was trained on breast tumors located independently by a radiologist, with consensus review by a second radiologist with at least five years of experience. MRI was performed using a 3.0-T imaging system in the prone position with a dedicated 16-channel breast coil, and T1-weighted DCE-MR images were analyzed for the study. We used an 80:20 random split for training and validation of the model.
Results. A total of 124 breast cancer patients had pre-treatment MR imaging before the start of NST; the cohort comprised 49 HR+HER2-, 37 HR+HER2+, 11 HR-HER2+, and 27 TNBC cases (mean tumor size 2.3 cm ± 3.1 mm). The model was tested on 2,571 individual images. Overall, the model scored a 0.85 [0.84–0.86, 95% CI] Dice score and a 0.80 [0.79–0.81, 95% CI] IoU score. TNBC tumors scored a Dice of [0.88–0.89, 95% CI], HER2-negative and ER/PR-positive a Dice of [0.84–0.85, 95% CI], and HER2-positive a Dice of [0.84–0.85, 95% CI]. We observed that the model performed equally well for solid tumors and irregular shapes, and we did not observe any difference in segmentation performance between residual and non-residual tumor types: Dice scores of [0.85–0.86, 95% CI] and [0.83–0.84, 95% CI], respectively.
Conclusion. The proposed segmentation model performs equally well on various clinical breast cancer subtypes. The model has a high false-positive rate toward the biopsy clip and high background enhancement, which we plan to address by adding annotations of the clip and of high non-cancer enhancement to future training data. We will release the trained model under an open-source license to increase the scalability of radiomics studies with fully automated segmentation. Given the importance of breast cancer subtypes as prognostic factors in women with operable breast cancer, automated segmentation of varying breast tumor subtypes will help analyze imaging biomarkers embedded within standard-of-care imaging studies at larger scale, potentially helping radiologists, pathologists, surgeons, and clinicians understand features driving breast cancer phenotypes and paving the way for developing digital twins for breast cancer patients.
Workshop
Recorded
W
DescriptionGraphics Processing Units (GPUs) are nowadays used to accelerate applications in multiple scientific domains, and it is therefore necessary even for researchers outside of computer science to learn how to use them. However, traditional GPU programming courses are often aimed at people with a computer science or high-performance computing background.
To address this challenge we developed a GPU programming course, following the Carpentries pedagogical style, centered around live coding and the teaching of actionable skills. The course is open-source, freely available online in the Carpentries Incubator, and has been successfully taught both online and in-person.
Paper
Recorded
Applications
Computational Science
Scientific Computing
TP
DescriptionSimulations to calculate a single gravitational waveform (GW) can take several weeks, yet thousands of such simulations are needed for the detection and interpretation of gravitational waves, and future detectors will require even more accurate waveforms. Here we present the first large-scale, adaptive-mesh, multi-GPU numerical relativity (NR) code, along with performance analysis and benchmarking. While comparisons are difficult to make, our GPU extension of the Dendro-GR code achieves a 6x speedup over existing state-of-the-art codes. We achieve 800 GFlops/s on a single NVIDIA A100 GPU, with an overall 2.5x speedup over a two-socket, 128-core AMD EPYC 7763 CPU node running an equivalent CPU implementation. We present detailed performance analyses, parallel scalability results, and accuracy assessments for GWs computed for mass ratios q=1,2,4. We also present strong scaling up to 8 A100s and weak scaling up to 229,376 x86 cores on the Texas Advanced Computing Center's Frontera system.
Workshop
Recorded
Algorithms
Architectures
Compilers
Computational Science
Exascale Computing
Heterogeneous Systems
Hierarchical Parallelism
Memory Systems
Parallel Programming Languages and Models
Parallel Programming Systems
Resource Management and Scheduling
W
DescriptionWe introduce a new high-performance design for parallelism within the Quantum Monte Carlo code QMCPACK. We demonstrate that the new design exploits the hierarchical parallelism of heterogeneous architectures better than the previous GPU implementation. The new version achieves higher GPU occupancy via the new concept of crowds of Monte Carlo walkers, and by enabling more host CPU threads to effectively offload to the GPU. The higher performance is expected to hold independent of the underlying hardware, significantly improving developer productivity and reducing code maintenance costs. Scientific productivity is also improved with full support for fallback to CPU execution when GPU implementations are not available or CPU execution is more efficient.
Posters
Research Posters
TP
XO/EX
DescriptionHPC systems are at risk of being underutilized due to various resource requirements of applications and the imbalance of utilization among subsystems. This work provides a holistic analysis and view of memory utilization on a leadership computing facility, the Perlmutter system at NERSC, through which we gain insights about the resource usage patterns of the memory subsystem. The results of the analysis can help evaluate current system configurations, offer recommendations for future procurement, provide feedback to users on code efficiency, and motivate research in new architecture and system designs.
Posters
Research Posters
TP
XO/EX
DescriptionMonitoring the status of large computing systems is essential to identify unexpected behavior and improve their performance and uptime. However, due to the large-scale, distributed design of such systems and the large number of monitoring parameters, automated monitoring methods must be applied. Such methods should adapt to continuous changes in the computing system and identify behavioral anomalies quickly enough for appropriate reactions to be taken. This work proposes a general, lightweight, unsupervised method for near-real-time anomaly detection using operational measurement data on large computing systems. The proposed model requires as little as 4 hours of data and 50 epochs per training process to accurately capture the behavioral pattern of a computing system.
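The abstract does not specify the detection model, so as a hypothetical illustration of the kind of lightweight, unsupervised, near-real-time detector described, here is a rolling z-score sketch (window size, threshold, and warm-up length are invented parameters):

```python
from collections import deque

class RollingAnomalyDetector:
    """Flag measurements that deviate strongly from recent history
    (a simple z-score test over a sliding window)."""
    def __init__(self, window=30, threshold=3.0, warmup=10):
        self.buf = deque(maxlen=window)
        self.threshold = threshold
        self.warmup = warmup

    def observe(self, x):
        anomalous = False
        if len(self.buf) >= self.warmup:
            mean = sum(self.buf) / len(self.buf)
            var = sum((v - mean) ** 2 for v in self.buf) / len(self.buf)
            std = max(var ** 0.5, 1e-9)
            anomalous = abs(x - mean) / std > self.threshold
        self.buf.append(x)  # the window slides, adapting to system changes
        return anomalous

det = RollingAnomalyDetector()
stream = [9.9, 10.1] * 10 + [10.2, 55.0, 10.1]  # steady load, one spike
flags = [det.observe(v) for v in stream]
print(flags[21])  # the 55.0 spike is flagged
```

Because the window slides continuously, the detector adapts to gradual changes in the system's behavior, which is the adaptivity requirement the abstract raises.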
Birds of a Feather
TP
XO/EX
DescriptionCompute Express Link™ (CXL™) maintains memory coherency between the CPU memory space and memory on CXL attached devices. CXL enables a high-speed, efficient interconnect between the CPU, platform enhancements, and workload accelerators such as GPUs, FPGAs, and other purpose-built accelerator solutions.
This BoF session will feature a panel of experts from the CXL Consortium to discuss available CXL devices and what devices the industry can expect to see in the next year. The experts will also explore the new features in the CXL 3.0 specification and the new usage models it will enable.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionTighter integration of computational resources can foster superior application performance by mitigating communication bottlenecks. Unfortunately, not every application can use every compute unit or accelerator all the time, so co-locating resources often leads to under-utilization. In the next five years, HPC system architects will be presented with a spectrum of accelerated solutions, ranging from tightly coupled, single-package APUs to a sea of disaggregated GPUs interconnected by a global network. In this paper, we detail NEthing, our methodology and tool for evaluating the potential performance implications of such diverse architectural paradigms. We demonstrate our methodology on today’s and projected 2026 technologies for three distinct workloads: a compute-intensive kernel, a tightly-coupled HPC simulation, and an ensemble of loosely-coupled HPC simulations. Our results leverage NEthing to quantify the increased utilization disaggregated systems must achieve to match the superior performance of APUs and on-board GPUs.
Posters
Research Posters
TP
XO/EX
DescriptionReal-world HPC workloads are highly data-dependent and place heavy pressure on storage systems. At the same time, recent developments in storage hardware mean that storage diversity in upcoming HPC systems is expected to grow. This growing complexity presents challenges to users and often results in I/O bottlenecks due to inefficient usage. There have been several studies on reducing I/O bottlenecks: the earliest attempts combined I/O characteristics with expert insight, while more recent ones rely on performance analysis from I/O characterization tools. However, the problem is multifaceted, with many metrics to consider, and hence difficult to tackle manually, even for experts. In this work, we develop a methodology that produces a multifaceted view of the I/O behavior of a workload to identify potential I/O bottlenecks automatically.
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
DescriptionThis paper shares a perspective for the research software engineering (RSE) community to navigate the National Laboratory landscape. The RSE role is a recent concept that led to organizational challenges to place and evaluate their impact, costs and benefits. The premise is that RSEs are a natural fit into the current landscape and can use traditional career growth strategies in science: publications, community engagements and proposals. Projects funding RSEs can benefit from this synergy and be inclusive on traditional activities. Still, a great deal of introspection is needed to close gaps between the rapidly evolving RSE landscape and the well-established communication patterns in science. This perspective is built upon interactions in industry, academia and government in high-performance computing (HPC) environments. The goal is to contribute to the conversation around RSE career growth and understand their return on investment for scientific projects and sponsors.
Awards Presentation
Test of Time
Recorded
Awards
TP
DescriptionFor decades, the high-performance computing (HPC) community has focused on performance, where performance is defined as speed. To achieve better performance per compute node, microprocessor vendors have not only doubled the number of transistors (and speed) every 18-24 months, but they have also doubled the power densities. Consequently, keeping a large-scale HPC system functioning properly requires continual cooling in a large machine room, thus resulting in substantial operational costs. Furthermore, the increase in power densities has led (in part) to a decrease in system reliability, thus leading to lost productivity.
To address these problems, we propose a power-aware algorithm that automatically and transparently adapts its voltage and frequency settings to achieve significant power reduction and energy savings with minimal impact on performance. Specifically, we leverage a commodity technology called “dynamic voltage and frequency scaling” to implement our power-aware algorithm in the run-time system of commodity HPC systems.
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
DescriptionAlthough in situ visualization can reduce the amount of data written to storage, it can still generate a large amount of data for subsequent analysis, for instance images rendered from different viewpoints at every visualization time step. Since some of these images can be similar, an appropriate image selection to reduce the total number of images would help minimize the analysis time needed to understand the underlying simulation phenomena without missing important features. As one approach to such smart in situ visualization, we have worked on adaptive time step selection, which skips time steps with only a small amount of change between them. In this lightning talk, focusing on the set of images that can be generated from different viewpoints at every time step, we present a PSNR-based image selection approach that eliminates similar images to further reduce the total number of images, targeting smarter in situ visualization.
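A minimal sketch of PSNR-based image selection along these lines (the 40 dB threshold and the greedy keep-or-drop policy are assumptions for illustration, not the authors' exact method):

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means more similar."""
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def select_distinct(images, threshold_db=40.0):
    """Greedily keep an image only if it is dissimilar (PSNR below the
    threshold) to every image kept so far."""
    kept = []
    for img in images:
        if all(psnr(img, k) < threshold_db for k in kept):
            kept.append(img)
    return kept

rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
near_dup = base.copy()
near_dup[0, 0] = min(int(near_dup[0, 0]) + 1, 255)  # tiny change, high PSNR
different = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
kept = select_distinct([base, near_dup, different])
print(len(kept))  # near-duplicate eliminated, distinct image kept
```

In an in situ setting, the same test would run against the images rendered at the current time step before deciding which ones to write out.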
Workshop
Recorded
Quantum Computing
W
DescriptionWe present Q# implementations for arbitrary fixed-point arithmetic operations for a gate-based quantum computer based on lookup tables (LUT). In general, this is an inefficient way of implementing a function since the number of inputs can be large or even infinite. However, if the input domain can be bounded and there can be some error tolerance in the output (both of which are often the case in practical use-cases), the quantum LUT implementation of certain quantum arithmetic functions can be more efficient than their corresponding reversible arithmetic implementations. We discuss the implementation of the LUT using Q#, show examples of how to use the LUT to implement quantum arithmetic functions, and compare the resources required for the implementation with the current state-of-the-art bespoke implementations of exponential and Gaussian functions.
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
DescriptionResearch Software Engineering (RSE) provides methodological tools to develop software for deployment on High-Performance Computing (HPC) infrastructures, to follow good practices, and to achieve good software quality. RSE also supports the actors involved, from developers to users, covering development, deployment, interaction, and training. The oil and gas community is one of the most demanding contexts for scientific applications, spanning exploration to econometrics and market analysis. In this contribution, we present a development path, following RSE principles, to build robust research software.
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionSparse linear algebra routines are fundamental building blocks of a large variety of scientific applications. Direct solvers, which solve linear systems via the factorization of matrices into products of triangular matrices, are commonly used in many contexts. The Cholesky factorization is the fastest direct method for symmetric positive-definite matrices.
We present selective nesting, a method to determine the optimal task granularity for parallel Cholesky factorization based on the structure of sparse matrices. We propose the OPT-D algorithm, which automatically and dynamically applies selective nesting. OPT-D leverages matrix sparsity to drive complex task-based parallel workloads in the context of direct solvers. We run an extensive evaluation campaign considering a heterogeneous set of 35 sparse matrices and a parallel machine featuring the A64FX processor. OPT-D delivers an average performance speedup of 1.46x with respect to the best state-of-the-art parallel method for direct solvers.
Workshop
Recorded
W
DescriptionHigh Performance Computing (HPC) applications must be containerized to run in a Kubernetes (K8s) environment. The traditional model for running HPC applications in a K8s environment requires the Application Container (APP) to include the runtime environment and the launch support mechanisms, in addition to the application itself. This requirement can increase the APP size and introduce security vulnerabilities. The separated model presented here detaches the runtime from the APP, allowing system administrators to define, maintain, and secure the Runtime Environment Container (REC). A PMIx library connects the APP and REC, serving as a runtime communication conduit for HPC parallel libraries (like MPI) to perform necessary functions like inter-process wire-up. The APP is nested within the REC using unprivileged, rootless Podman. The separated model is demonstrated by running a set of HPC applications in an off-the-shelf K8s system.
Paper
Recorded
Reliability and Resiliency
TP
DescriptionI/O efficiency is crucial to productivity in scientific computing, but the growing complexity of HPC systems and applications complicates efforts to understand and optimize I/O behavior at scale. Data-driven machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and under-perform after being deployed.
We analyze four years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models under-perform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application and system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
DescriptionWe contribute a new approach for in situ automation of camera placement over time. Our approach incorporates triggers, regularly evaluating the current camera placement and searching for a new camera placement when a trigger fires. We evaluate our approach running in situ with five data sets from two simulation codes, considering camera placement quality (evaluated using a viewpoint quality metric) and overhead (number of camera positions evaluated). We find that our approach significantly reduces overhead while achieving similar quality, compared to the naive approach of searching for a new camera placement each cycle.
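A hypothetical sketch of the trigger pattern described, assuming a viewpoint quality metric `quality(t, cam)` and a fixed candidate camera set (both invented for illustration; the paper's metric and search are more sophisticated):

```python
def run_in_situ(steps, candidates, quality, drop_ratio=0.8):
    """Re-search camera placement only when a trigger fires: the current
    camera's quality has dropped below drop_ratio of its value at selection."""
    evaluations = 0
    cam = max(candidates, key=lambda c: quality(0, c))
    evaluations += len(candidates)
    q_at_selection = quality(0, cam)
    placements = [cam]
    for t in range(1, steps):
        q = quality(t, cam)
        evaluations += 1
        if q < drop_ratio * q_at_selection:      # trigger fires
            cam = max(candidates, key=lambda c: quality(t, c))
            evaluations += len(candidates)
            q_at_selection = quality(t, cam)
        placements.append(cam)
    return placements, evaluations

# Toy metric: camera c views the phenomenon best near time t == 10 * c.
candidates = [0, 1, 2, 3]
quality = lambda t, c: 1.0 / (1.0 + abs(t - 10 * c))
placements, evals = run_in_situ(steps=40, candidates=candidates, quality=quality)
naive_evals = 40 * len(candidates)  # naive approach: search every cycle
print(evals < naive_evals)
```

The overhead saving comes from evaluating one camera per cycle in the common case, paying the full candidate search only when the trigger fires.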
Posters
Research Posters
TP
XO/EX
DescriptionIn this work, we use sparse techniques to accelerate DD-Net, a deep learning model designed to enhance CT images of COVID-19 chest scans. The model follows an encoder-decoder architecture and has high dimensionality, and thus requires many compute hours of training. We propose a set of techniques that target these two aspects of the model: dimensionality and training time. We implement techniques to prune neurons, making the model sparse and reducing its effective dimensionality, with an accuracy loss of no more than 5% and minimal retraining overhead. We then propose a set of techniques tailored to the underlying hardware to better utilize its components (such as tensor cores) and thus reduce the time and cost required to train the model.
Workshop
Recorded
W
DescriptionKinetic equilibria are a fundamental aspect of tokamak plasma analysis, but are often highly specialized and labor intensive to produce. This has become a bottleneck to both deeper physics understandings and more sophisticated experiment controls. This project aims to remove these barriers by developing a rapid, fully-automated workflow to produce better-than-human, high-precision whole-discharge kinetic equilibria. The required elements in this workflow now exist separately, but what is missing is the coupling of different aspects and overall performance optimization. We have designed this workflow for the DIII-D national fusion facility with the goal of producing results quickly enough to be used for experiment planning in the 15-20 minute time window between subsequent discharges. The results will also be stored in a database for follow-up analysis and as the foundation for AI/ML surrogate models. Initial results suggest that it may be possible to achieve our goal within a target 10 minute window.
Workshop
Recorded
W
DescriptionEfficient data communication is a major goal for scalable and cost-effective use of datacenter and HPC system resources. To let applications communicate efficiently, exchanged data must be serialized at the source and deserialized at the destination. The serialization/deserialization process enables exchanging data in a language- and machine-independent format. However, serialization/deserialization overheads can negatively impact application performance. For example, a server within a microservice framework must deserialize all incoming requests before invoking the respective microservices. We show how data deserialization can be offloaded to fully programmable SmartNICs and performed on the data path, on a per-packet basis. This solution avoids intermediate memory copies, enabling on-the-fly deserialization. We showcase our approach by offloading Google Protocol Buffers, a widely used framework to serialize/deserialize data. We show through microservice throughput modeling how we can improve the overall throughput by pipelining the deserialization and actual application activities with PsPIN.
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
DescriptionDatalog, a bottom-up declarative logic programming language, has a wide variety of uses for deduction, modeling, and data analysis across application domains. Datalog can be efficiently implemented using relational algebra primitives such as join, projection, and union. While several multi-threaded and multi-core implementations of Datalog target CPU-based systems, our work makes an inroad towards a Datalog implementation for GPUs. We demonstrate the feasibility of a high-performance relational algebra backend for a small subset of Datalog applications that can effectively leverage the parallelism of GPUs using cuDF. cuDF is a library from the RAPIDS suite that uses the NVIDIA CUDA programming model for GPU parallelism; it provides functionality similar to Pandas, a popular data analysis engine. In this presentation, we analyze and evaluate the performance of cuDF versus Pandas for two graph mining problems implemented in Datalog: (1) triangle counting and (2) transitive closure computation.
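Triangle counting maps naturally onto the join primitives mentioned above; a small Pandas sketch of the Datalog rule (cuDF exposes an analogous `merge` API, so the same logic runs on the GPU — the toy graph is invented):

```python
import pandas as pd

# Undirected toy graph stored as a relation edge(a, b), both directions.
pairs = [(1, 2), (2, 3), (1, 3), (3, 4)]
edge = pd.DataFrame(pairs + [(b, a) for a, b in pairs], columns=["a", "b"])

# Datalog rule: triangle(x, y, z) :- edge(x, y), edge(y, z), edge(z, x).
# Implemented as two joins plus a selection over the edge relation.
xy = edge.rename(columns={"a": "x", "b": "y"})
yz = edge.rename(columns={"a": "y", "b": "z"})
zx = edge.rename(columns={"a": "z", "b": "x"})
tri = xy.merge(yz, on="y").merge(zx, on=["z", "x"])
tri = tri[(tri.x < tri.y) & (tri.y < tri.z)]  # count each triangle once
print(len(tri))  # the single triangle {1, 2, 3}
```

Transitive closure follows the same pattern, but iterates the join until no new tuples are derived (semi-naive evaluation).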
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionAs supercomputing infrastructures become increasingly distributed, centralizing the management of data that may span multiple data centers, public cloud providers, and edge locations is key to accelerating research. Whether organizations are looking to expand information sharing for science teams; enhance data management practices across collaborative platforms; or unlock access to cloud services for data practitioners, centralizing multi-cloud data management addresses these challenges by seamlessly integrating multiple public clouds and on-premises storage under a single namespace. Modern technologies improve the responsiveness of data workflows by supporting constant movement of data and applications across systems and automating data placement and lifecycle rules. This means that both on-premises and cloud applications can use the same data without negatively impacting performance. It also means that the right data is placed where and when it’s needed for the most effective, agile workflows. Finally, by synchronizing data across multiple cloud-based repositories, multi-cloud data management software enables data to be accessed independent of its physical location to eliminate vendor lock-in and minimize cloud egress fees.
Join us for this technical deep dive into how centralizing multi-cloud data management can maximize the value of cloud initiatives by creating new opportunities for collaboration and innovation across platforms and data-driven ecosystems.
Paper
Recorded
Applications
Numerical Algorithms
Security
TP
DescriptionThe Elliptic Curve Digital Signature Algorithm (ECDSA) is an essential building block of various cryptographic protocols. In particular, most blockchain systems adopt it to ensure transaction integrity. However, due to its high computational intensity, ECDSA is often the performance bottleneck in blockchain transaction processing. Recent work has accelerated ECDSA algorithms on the CPU; in contrast, success has been limited on the GPU, which has great potential for parallelization but is challenging for implementing elliptic curve functions. In this paper, we propose RapidEC, a GPU-based ECDSA implementation for SM2, a popular elliptic curve. Specifically, we design architecture-aware parallel primitives for elliptic curve point operations, and parallelize the processing of a single SM2 request as well as batches of requests. Consequently, our GPU-based RapidEC outperformed the state-of-the-art CPU-based algorithm by orders of magnitude. Additionally, our GPU-based modular arithmetic functions as well as point operation primitives can be applied to other computation tasks.
Workshop
Recorded
W
DescriptionMost high-fidelity physics simulation codes, such as Flash-X, need to save intermediate results (checkpoint files) to restart or to gain insight into the evolution of the simulation. These codes save such intermediate files synchronously: computation is stalled while the data is written to storage. Depending on the problem size and computational requirements, this file write time can be a substantial portion of the total simulation time. In this paper, we evaluate the overheads and the overall benefit of asynchronous I/O in HDF5. Results from real-world high-fidelity simulations on the Summit supercomputer show that I/O operations overlap with application communication, computation, or both, effectively hiding some or all of the I/O latency. Our evaluation shows that while using asynchronous I/O adds overhead to the application, the I/O time reduction is more significant, resulting in an overall performance speedup of up to 1.5x.
Workshop
Recorded
W
DescriptionIn this work, we accelerate the Kernel Ridge Regression algorithm on an adaptive computing platform, achieving higher performance within a shorter development time by employing a design approach based on high-level synthesis. To avoid storing the potentially huge kernel matrix in external memory, the designed accelerator computes the matrix on the fly in each iteration. Moreover, we overcome the memory bandwidth limitation by partitioning the kernel matrix into smaller tiles that are pre-fetched to small local memories and reused multiple times. The design is also parallelized and fully pipelined to achieve the highest performance. The final accelerator can be used for any large-scale data without kernel matrix storage limitations and with an arbitrary number of features. This work is an important first step towards a library that accelerates different kernel methods for machine learning applications on FPGA platforms and can be used conveniently from Python with a NumPy interface.
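The on-the-fly tiling idea can be sketched in NumPy for an RBF kernel matrix-vector product: the full kernel matrix is never materialized, only one tile at a time (a host-side illustration of the scheme, not the FPGA design; tile size and kernel choice are assumptions):

```python
import numpy as np

def rbf_tile(Xi, Xj, gamma):
    """One kernel-matrix tile, computed on the fly (never stored whole)."""
    d2 = ((Xi[:, None, :] - Xj[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_matvec(X, v, gamma, tile=64):
    """y = K v with K materialized one tile at a time and reused per row block."""
    n = len(X)
    y = np.zeros(n)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            y[i:i+tile] += rbf_tile(X[i:i+tile], X[j:j+tile], gamma) @ v[j:j+tile]
    return y

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
v = rng.normal(size=200)
gamma = 0.5
y_tiled = kernel_matvec(X, v, gamma)
K = rbf_tile(X, X, gamma)           # dense reference, small enough here
print(np.allclose(y_tiled, K @ v))  # tiled result matches the dense product
```

An iterative KRR solver (e.g., conjugate gradient on K + lambda*I) only needs this matvec, which is why the accelerator can handle datasets whose kernel matrix would never fit in memory.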
Paper
Recorded
Data Management
Storage
TP
DescriptionLossy compression is one of the most efficient solutions to reduce storage overhead and improve I/O performance for HPC applications. However, existing parallel I/O libraries cannot fully utilize lossy compression to accelerate parallel writes due to the lack of a deep understanding of compression-write performance. To this end, we propose to deeply integrate predictive lossy compression with HDF5 to significantly improve parallel-write performance. Specifically, we propose analytical models that predict the time of compression and parallel write before the actual compression, enabling compression-write overlapping. We also introduce an extra space to handle prediction uncertainty. Moreover, we propose an optimization that reorders compression tasks to increase overlapping efficiency. Experiments with up to 4,096 cores show that our solution improves write performance by up to 4.5x and 2.9x over the non-compression and lossy compression solutions, respectively, with only 1.5% storage overhead (relative to the original data) on two real-world applications.
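The benefit of compression-write overlapping can be illustrated with a toy pipeline time calculation: once predicted times are known, compressing chunk i+1 can proceed while chunk i is being written (the chunk times below are invented numbers, not measurements from the paper):

```python
def pipelined_time(comp, write):
    """Total time when compressing chunk i+1 overlaps writing chunk i."""
    t = comp[0]                       # first chunk must be compressed up front
    for i in range(len(comp)):
        w = write[i]
        c_next = comp[i + 1] if i + 1 < len(comp) else 0.0
        t += max(w, c_next)           # the two stages run concurrently
    return t

comp = [2.0, 2.0, 2.0]                # predicted compression times per chunk
write = [3.0, 3.0, 3.0]               # predicted write times per chunk
serial = sum(comp) + sum(write)       # no overlap: 15.0
print(pipelined_time(comp, write))    # 2 + 3 + 3 + 3 = 11.0
```

This is also why the time predictions matter: the schedule (and the task reordering the paper proposes) can only be planned before the actual compression if both stage times are known in advance.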
Tutorial
Recorded
Accelerator-based Architectures
Heterogeneous Systems
Parallel Programming Languages and Models
Performance Portability
Productivity Tools
Software Engineering
TUT
DescriptionThis half-day hands-on tutorial teaches how to accelerate HPC applications using the portable parallelism and concurrency features of the C++17 and C++20 standards, without any language or vendor extensions, such that a single version of the code is portable to multi-core CPU and to GPU systems. We further show how to integrate this approach with MPI to target CPU clusters and multi-GPU platforms. The tutorial exercises follow classical HPC themes like a PDE solver mini-application for the 2D unsteady heat equation. The exercises provide attendees with hands-on experience applying C++ parallel algorithms and execution policies to parallelize and accelerate HPC programs using only standard C++. The attendees are presented problem-solving strategies for common tasks like computing reductions or running iterative solvers for multi-dimensional problems. Furthermore, the tutorial and exercises give attendees hands-on experience in integrating C++ parallel algorithms into pre-existing MPI applications, teaching how to re-use the pre-existing MPI code to produce MPI/C++ applications that run on multi-CPU and multi-GPU systems. Finally, we conclude with a summary of our professional experience applying the ISO C++ parallel programming model to accelerate large real-world HPC applications and provide an outlook of future topics in C++ standard parallelism.
Workshop
Recorded
HPC Training and Education
W
DescriptionIn response to an increasing demand for digital skills in industry and academia, a series of credentialed short courses covering a variety of topics related to high performance computing was designed and implemented to enable university students and researchers to effectively utilize research computing resources, bridging the gap for users whose educational backgrounds do not include computational training. The courses cover a diverse array of topics, including programming, cybersecurity, artificial intelligence/machine learning, bioinformatics, and cloud computing. They are designed to enable students to apply the skills they learn to their own research involving large-scale computing systems. These courses offer advantages over generic online courses in that they teach computing skills relevant to academic research programs. Finally, the micro-credentials are transcriptable, may be stacked with existing programs to create a larger degree plan, and add to a student’s resume.
Awards Presentation
SC22 Opening Session & Turing Lecture
Recorded
Awards
Keynote
Turing
TP
W
TUT
XO/EX
DescriptionJoin us for the 2021 ACM A.M. Turing Award Lecture featuring Jack Dongarra. A longtime SC supporter, Jack has made pioneering contributions to numerical algorithms and libraries that enabled HPC software to keep pace with exponential hardware improvements for over four decades and, through the years, accelerated HPC. With our SC22 conference theme, HPC Accelerates, we’re honored that Jack selected SC22 as the location to present his award lecture.
Be sure to include the ACM A.M. Turing Lecture in your schedule when planning your SC22 conference experience. You won’t want to miss it! This lecture replaces our traditional keynote presentation.
Paper
Recorded
System Software
TP
DescriptionWe present a technique for applying reverse mode automatic differentiation (AD) on a non-recursive second-order functional array language that supports nested parallelism and is primarily aimed at efficient GPU execution.
The key idea is to eliminate the need for a tape by relying on redundant execution to bring into each new scope all program variables that may be needed by the differentiated code. Efficient execution is enabled by the observation that perfectly nested scopes do not introduce re-execution and that such perfect nests can be readily produced by application of known compiler transformations. Our technique differentiates loops and bulk-parallel operators---e.g., map, reduce(-by-index), scan, and scatter---by specific rewrite rules and aggressively optimizes the resulting nested-parallel code. We report an evaluation that compares with established AD solutions and demonstrates competitive performance on ten common benchmarks from recent applied AD literature.
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionThe Message Passing Interface (MPI) is the most dominant programming model on HPC systems and has been instrumental in developing efficient, large-scale parallel applications. However, it has a rather static view of compute resources, building on top of the concept of immutable communicators. While this provides ease of use and simplicity, it is limiting, in particular for modern workflow-based workloads as well as in its support for resource-adaptive systems. The newly introduced concept of MPI Sessions, however, opens the door to more dynamism and adaptivity. In this talk I will highlight the opportunities that can arise from such directions and discuss novel approaches we are pursuing as part of several EuroHPC projects. Our ultimate goal is to provide full malleability in MPI as well as the surrounding software layers - from system software to applications - and with that enable us to more efficiently harness the computational capabilities of current and future HPC systems.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionAdditional Questions, Community Discussion, and Supply Chain Issues ...
Birds of a Feather
TP
XO/EX
DescriptionLast year's panel "HPC's Growing Sustainability Challenges and Emerging Approaches" gave an excellent introduction to the carbon impact of HPC along with ideas for carbon mitigation. This BoF will focus on concrete actions that data center operators and users can undertake to reduce HPC's carbon footprint. These range from using more energy-efficient processors, to improved cooling, to extending the lifetime of computing equipment, to shifting load from regions with carbon-intense electricity to regions where the vast majority of electricity comes from renewable resources. Pros and cons of the various approaches will be discussed. Audience participation and ideas will be welcome.
Paper
Recorded
Applications
Numerical Algorithms
Security
TP
DescriptionSeveral scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of many small dense matrices of varying sizes. Matrix decompositions on such irregular workloads are rarely addressed on GPUs.
This paper addresses irregular workloads of matrix computations on GPUs and shows their impact on a sparse LU solver. We designed an interface for the basic matrix operations supporting problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition on matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on multifrontal sparse LU decomposition applied to linear systems arising from the simulation, using finite element discretization on unstructured meshes, of a high frequency indefinite Maxwell problem.
Tutorial
Recorded
Big Data
Cloud and Distributed Computing
Data Analytics
Data Management
Emerging Technologies
Exascale Computing
File Systems and I/O
In Situ Processing
Performance
Productivity Tools
Reliability and Resiliency
Resource Management and Scheduling
Software Engineering
Visualization
TUT
DescriptionAs concurrency and complexity continue to increase on high-end machines, storage I/O performance is rapidly becoming a fundamental challenge to scientific discovery. At the exascale, online analysis will become a dominant form of data analytics, and thus scalable in situ workflows will become critical, along with high performance I/O to storage. The many components of a workflow running simultaneously pose another challenge of evaluating and improving the performance of these workflows. Therefore, performance data collection needs to be an integral part of the entire workflow.
In this tutorial, we present ADIOS-2 which allows for building in situ and file-based data processing workflows for extreme scale systems, including interactive, on-demand, in situ visualization of the data, and including performance profiling of the entire workflow. Half of this tutorial will be hands-on sessions, where we provide access to the software, and build together a complete MiniApp with in situ analytics and performance analysis that users can run on their laptop and supercomputers at large scale. We will show how ADIOS-2 is fully integrated into three popular visualization and performance tools: Jupyter Notebook, ParaView, and TAU, creating a software ecosystem for in situ processing of both performance and scientific data.
Workshop
Recorded
W
DescriptionWe present efforts to encourage the adoption of modules for teaching heterogeneous parallel computing through a faculty development workshop. The workshop was held remotely, using a novel format to exploit the advantages of a virtual format and mitigate its disadvantages. Adoption at a wide variety of institutions demonstrated the modules' effectiveness and also generated feedback leading to several module improvements. We also report on the adoptions themselves, which show the importance of supporting adaptation of the modules for diverse settings.
Tutorial
Recorded
Algorithms
Cloud and Distributed Computing
Datacenter
Parallel Programming Languages and Models
Performance
TUT
DescriptionThe vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. Parallel system architectures are evolving to include complex, heterogeneous nodes comprising general-purpose CPUs as well as accelerators such as GPUs. At the same time, the MPI standard itself is evolving to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid programming (MPI + threads, shared memory, GPUs), topologies and topology mapping, neighborhood and nonblocking collectives, and some of the new performance-oriented features in MPI-4. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
Tutorial
Recorded
Accelerator-based Architectures
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Performance
TUT
DescriptionWith the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported, and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP, but rather from the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance.
We assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of vector units. All topics are accompanied by extensive case studies, and we discuss the corresponding language features in-depth. Continuing the emphasis of this successful tutorial series, we focus solely on performance programming for multi-core architectures. Throughout all topics, we present the recent additions of OpenMP 5.0, 5.1 and 5.2 and comment on developments targeting OpenMP 6.0.
Birds of a Feather
TP
XO/EX
DescriptionFPGAs have gone from niche components to being a central part of many data centers worldwide to being considered for core HPC installations. The last year has seen tremendous advances in FPGA programmability and technology, and FPGAs for general HPC appear to be within reach. This BoF has two parts. The first is a series of lightning talks presenting advances in tools and technologies, emphasizing work by new investigators. The second part of the BoF will be a general discussion driven by the interests of the attendees, potentially including additional topics.
Birds of a Feather
TP
XO/EX
DescriptionThe goal of this BoF session is to bring the HPC and QC communities closer together, with the objective of scrutinizing HPC codes and workflows for potential hybrid quantum-classical computing.
The focus will be primarily on the identification of the required tool set, including the infrastructure, and of the potential applications, and less on computational acceleration.
The format of the BoF will consist of three short impulse talks followed by a moderated panel discussion, inviting substantial contributions from the audience.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionNext-generation exascale supercomputers increasingly require converged HPC/AI systems, as evidenced by specifications for future systems from government, university, and commercial supercomputing labs, which call for AI performance targets in addition to the traditional HPC performance targets.
HPC and AI workloads are similar in that both are compute- and memory-intensive, as well as highly parallel. HPC and AI diverge, however, in the level of precision that is often required. The data analysis required for HPC applications typically needs double precision or possibly single precision. AI, however, frequently requires lower precision, with the reduced precision enabling much higher performance. Another key difference is that AI workloads benefit from sparsity to maximize performance and efficiency, whereas sparsity is generally not exploited in HPC.
This presentation compares HPC and AI workloads, reviews the trends that are driving AI and HPC convergence for supercomputers, and presents Tachyum's Prodigy Universal Processor and its revolutionary architecture, which unifies the functionality of CPU, GPU, and TPU to address the demands of both HPC and AI workloads in a single device without needing costly and power-hungry accelerators. Key features highlighted include Prodigy's advanced HPC and AI subsystems, the benefits of lower-precision and sparse data types for AI applications, and recent innovations Tachyum has made, unique to Prodigy, to enhance and accelerate AI processing.
Invited Talk
Recorded
TP
XO/EX
DescriptionRISC-V has grown from a university project into a global open ISA standard with a thriving computing ecosystem comprising hundreds of collaborating organizations, including most major computing companies. This talk will present how RISC-V is well-suited for future HPC computing needs. RISC-V's technical advantages include a greater inherent efficiency than competing architectures, a sophisticated vector processing extension, and natural support for customized instruction set extensions. RISC-V's non-technical advantages include an open standard model that encourages both competition and collaboration, and which ensures long-term stability to protect investment in the software ecosystem.
Birds of a Feather
TP
XO/EX
DescriptionThe Covid-19 pandemic has shone a light on the increasing importance of HPC in public health, particularly with respect to the genomics of key pathogens. This BoF aims to provide a starting point for building a new network of those from academic institutions, healthcare organizations, public health agencies, and industry who are responsible for the emerging HPC infrastructures that will be increasingly important in the delivery of public health. The BoF will be a forum to share experience and best practice, with the aim of creating a new network of professionals working together for global benefit.
Students@SC
Posters
Research Posters
TP
XO/EX
DescriptionThe LLVM Flang compiler ("Flang") is currently Fortran 95 compliant, and the frontend can parse Fortran 2018. However, Flang does not have a comprehensive 2018 test suite and does not fully implement the static semantics of the 2018 standard. We are investigating whether agile software development techniques, such as pair programming and test-driven development (TDD), can help Flang to rapidly progress to Fortran 2018 compliance. Because of the paramount importance of parallelism in high-performance computing, we are focusing on Fortran’s parallel features, commonly denoted “CoArray Fortran". We are developing what we believe are the first exhaustive, open-source tests for the static semantics of Fortran 2018 parallel features, and contributing them to the LLVM project. A related effort involves writing runtime tests for parallel 2018 features and supporting those tests by developing a new parallel runtime library: the CoArray Fortran Framework of Efficient Interfaces to Network Environments (Caffeine).
Invited Talk
Recorded
TP
XO/EX
DescriptionToday’s era of explosive data growth poses serious challenges for society in transforming massive, random, heterogeneous data streams and structures into useful knowledge, applicable to every aspect of modern life, including national security, economic productivity, scientific discovery, medical breakthroughs, and social interactions. The burgeoning data, which is increasing exponentially not only in volume, but in velocity, variety, and complexity, already far outpaces the abilities of current computing systems to execute the complex data analytics needed to extract meaningful insights in a timely manner.
The key problem with today’s computers is that they were designed to address yesterday’s compute-intensive problems rather than today’s data-intensive problems. Transforming massive data streams and structures into actionable knowledge and meaningful results in near real-time requires a complete rethinking of computing architectures and technologies – one that places the primary focus on data access and data movement rather than on faster compute power. The data of interest today and in the future is typically sparse, random, and heterogeneous, with minimal locality (it is randomly distributed across the computer), and characterized by poor data re-use, streaming updates flowing into the system, and fine-grain data movement and parallelism. The computations to be performed are determined by the data, and multiple applications might need simultaneous access to the same data. These are very different conditions than those characteristic of yesterday’s compute-intensive applications.
IARPA’s new AGILE Program aims to provide data-analytic results in time for appropriate response, e.g., to predict impending adversarial events rather than forensically analyzing them after the fact. It will accomplish this goal by developing new system-level intelligent mechanisms for moving, accessing, and storing large, random, time-varying data streams and structures that allow for the scalable and efficient execution of dynamic graph analytic applications. The program solicited system designs that emphasize optimizing the fully integrated system, not independent optimization of individual functionalities. AGILE aims to develop scalable, energy-efficient computing system designs that enable solutions to data-intensive problems as well as traditional compute-intensive problems. These designs will be cost-effective and realizable in silicon prior to the year 2030.
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionSolving quantum many-body problems is one of the most fascinating research fields in condensed matter physics. An efficient numerical method is crucial to understanding the mechanisms of novel physics, such as high-Tc superconductivity, as one has to find the optimal solution in an exponentially large Hilbert space. The development of Artificial Intelligence (AI) provides a unique opportunity to solve quantum many-body problems, but a large gap to that goal remains. In this work, we present a novel computational framework and adapt it to the Sunway supercomputer. With highly efficient scalability up to 40 million heterogeneous cores, we can drastically increase the number of variational parameters, which greatly improves the accuracy of the solutions. The investigations of the spin-1/2 J1-J2 model and the t-J model achieve unprecedented accuracy and time-to-solution far beyond the previous state of the art.
Birds of a Feather
TP
XO/EX
DescriptionHPC is increasingly employed in AI. Although HPC itself is natively ethically neutral, its use to enable AI applications that can have harmful impacts on humans and society can render HPC collusive and ethically liable. This BoF will consider the ethical implications of the coupling of AI and HPC and the formation of guidelines for the HPC community to ensure that researchers consider potentially harmful consequences of their research and adhere to best practices for sustainable and ethical use of HPC resources.
Paper
Recorded
Accelerator-based Architectures
Performance
Visualization
TP
DescriptionSparse Matrix-Vector multiplication (SpMV) is an important computational kernel. Tens of sparse matrix formats and implementations have been designed to speed up SpMV performance. We develop AlphaSparse, which goes beyond the scope of human-designed artificial formats and traditional auto-tuners bound to pre-existing artificial formats and implementations by automatically creating new machine-designed formats and SpMV kernel implementations entirely from knowledge of the input sparsity patterns and hardware architectures. Based on our proposed Operator Graph, which expresses the design path of SpMV code, it takes an arbitrary sparse matrix as input and outputs a machine-designed format and SpMV implementation that achieve high performance. Extensively evaluating 843 matrices from the SuiteSparse Matrix Collection, AlphaSparse achieves performance improvements of up to 22.2 times (3.2 times on average) compared to five state-of-the-art artificial formats and up to 2.8 times (1.5 times on average) over an up-to-date implementation of traditional auto-tuning.
Workshop
Recorded
W
DescriptionThe AMD Heterogeneous Accelerated Computing Program (HACC) is an initiative by AMD to provide an infrastructure and exchange platform for studying FPGA acceleration for HPC and data center workloads. The Paderborn Center for Parallel Computing (PC2) was accepted into the HACC initiative in spring 2022, which now comprises five centers worldwide. I will give a brief overview of the HACC program and will highlight the new Alveo U280 partition of our Noctua 2 supercomputer, which is accessible through the HACC program, and provides a particularly flexible software and networking environment.
Birds of a Feather
TP
XO/EX
DescriptionThe 2022 edition of the Americas High-Performance Computing (HPC) Collaboration BoF seeks to showcase collaborations that have resulted from the partnerships formed in previous editions. It will also present opportunities and experiences between different HPC Networks and Laboratories from countries in North, Central, South America, and the Caribbean. This BoF aims at showing the current state of the art in continental collaboration in HPC research, the latest developments of regional collaborative networks, and updating the roadmap for the next year for the Americas HPC partnerships.
Posters
Research Posters
TP
XO/EX
DescriptionThe fast Fourier transform (FFT), a reduced-complexity formulation of the discrete Fourier transform (DFT), dominates the computational cost in many areas of science and engineering. Due to the large scale of the data, multi-node heterogeneous systems are needed to meet the increasing demands of parallel FFT computation in High-Performance Computing (HPC). In this work, we present a highly efficient GPU-based distributed FFT framework by adapting the Cooley-Tukey recursive FFT algorithm. Two major types of optimizations, including automatic low-dimensional FFT kernel generation and an asynchronous strategy for multi-GPUs, are presented to enhance the performance of our approach for large-scale distributed FFT, and numerical experiments demonstrate that our work achieves more than 40x speedup over CPU FFT libraries and about 2x speedup over heFFTe, the currently available state of the art, on GPUs.
Workshop
Recorded
W
DescriptionOur team is developing a series of AI Bootcamps for Cyberinfrastructure (CI) Professionals to increase support expertise for researchers with Artificial Intelligence (AI) workloads running at research computing facilities. We have completed the first six-week, virtual program covering core foundational topics in AI and machine learning. Our next bootcamp is focused on CI professionals in software and data engineering roles. Our team comprises CI professionals and Computer Science and Engineering faculty, providing a comprehensive curriculum for the professional learner. We saw a great deal of enthusiasm among the CI professional community for this program, and those who attended rated it highly. We plan to refine the materials and make them generally available at the end of the project.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionWe demonstrate a continuous acceptance testing strategy used at NERSC that can be implemented in the broader HPC community. To accomplish this task, we designed a new framework that can handle the complex parts of HPC systems, allowing us to verify that a system is working optimally. buildtest [1] is an acceptance testing framework that can automate the testing of HPC systems and enable HPC support teams to painlessly create and run tests. Testing is initiated by changes to the system/software stack or at scheduled system outages, which require NERSC staff to build, run, and monitor test results using GitLab’s Continuous Integration (CI) [2]. Test results are clearly communicated to developers and users via the CDash [3] web interface, and test failures are documented as GitHub issues. Together this framework forms a robust method for verifying that cutting-edge software stacks function in challenging HPC environments.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionCryogenic electron microscopy (Cryo-EM) is a method applied to samples cooled to cryogenic temperatures that can reach a near-atomic resolution of biological molecules. Recent progress in methodology has created an entirely new set of challenges to overcome - among them, the specific environment of the HPC system and coordination and automation of the initial stages. Our solution is an automated Cryo-EM image pre-processing service tailored to an HPC environment with close to real-time feedback allowing the researchers to interact with the data acquisition session located in a facility remote to the HPC cluster. We automated the data transfer, created a service around the Pegasus Workflow Management System, kept the user interaction minimum, and offered the researcher an option to start the pre-processing right after initiating the microscope session. The users receive real-time feedback enabling them to interact with the data acquisition, adjust it and collect a better dataset.
Workshop
Recorded
HPC Training and Education
W
DescriptionDelivering training and education on hybrid technologies (including AI, ML, GPU, data and visual analytics including VR, and quantum computing) integrated with HPC resources is key to enabling individuals and businesses to take full advantage of digital technologies, hence enhancing processes within organizations and providing the skills to thrive in a digital economy. Supercomputing centers focused on solving industry-led problems face the challenge of having a pool of users with little experience in executing simulations on large-scale facilities, as well as limited knowledge of advanced computational techniques and integrated technologies. We aim not only to educate them in using the available facilities, but also to raise awareness of methods which have the potential to increase their productivity. In this presentation, we provide our perspective on how to efficiently train industry users, how to engage them with wider digital technologies, and how these, used effectively together, can benefit their business.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionExpanding upon their Scalable Vector Extension (SVE), Arm have introduced the Scalable Matrix Extension (SME) to improve in-core performance for matrix operations such as matrix multiplication. With no SME-capable hardware or cycle-accurate simulators yet available, it is unclear how effective this new instruction set extension will be, and for which types of applications it will provide the most benefit.
By adapting The Simulation Engine (SimEng) from the University of Bristol’s High Performance Computing Group to support SME, we aim to compare the simulated performance of a Fujitsu A64FX core (with native SVE support) to a like-for-like hypothetical core with added SME support. By simulating a wide range of Streaming Vector Lengths for our hypothetical SME core model, we provide and discuss first-of-a-kind results for an SME implementation, before discussing future work to further evaluate the suitability of SME.
Posters
Research Posters
TP
XO/EX
DescriptionAutotuning is a widely used method for guiding developers of large-scale applications toward high performance. However, autotuners typically employ black-box optimizations to recommend parameter settings, at the cost of users missing the opportunity to identify performance bottlenecks. Performance analysis fills that gap, identifying problems and optimization opportunities that can result in better runtime and utilization of hardware resources. This work combines the best of both worlds by integrating a systematic performance analysis and visualization approach into a publicly available autotuning framework, GPTune, to suggest to users which configuration parameters are important to tune, to what value, and how tuning the parameters affects hardware-application interactions. Our experiments demonstrate that a subset of the task parameters impacts the execution time of the Hypre application, and that memory traffic and page faults cause performance problems in the Plasma-DGEMM routine on Cori-Haswell.
Workshop
Recorded
W
DescriptionWe present an analysis of the collection of user-support tickets that were created during nearly nine years of operation of the Blue Waters supercomputer. The analysis was based on information obtained from the Jira ticketing system and its corresponding queues. The paper contains a set of statistics showing, in quantitative form, the distribution of tickets across system areas. It also shows the computed metrics related to management of the tickets by our staff. Additionally, we present an analysis, based on Machine-Learning and Sentiment Analysis techniques, conducted over the text entered in tickets, aimed at detecting trends on users' views and perspectives about Blue Waters. This kind of study, which is uncommon in the literature, could provide guidance for operators of future large systems about the expected volume of user support demanded by each system area, and about how to allocate support staff such that users receive the best possible assistance.
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionOpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and accelerators has gained the model a wide variety of users. This means it is also crucial to analyze the reliability of the compilers' implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite to verify the OpenACC implementations across various compilers with an infrastructure for a more streamlined execution. This paper will cover the following aspects: (a) the new developments since the last publication on the testsuite, (b) outline the use of the infrastructure, (c) discuss tests that highlight our workflow process, (d) analyze the results from executing the testsuite on various systems, and (e) outline future developments.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Recorded
TP
DescriptionOpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and accelerators has gained the model a wide variety of users. This means it is also crucial to analyze the reliability of the compilers' implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite to verify the OpenACC implementations across various compilers with an infrastructure for a more streamlined execution. This paper will cover the following aspects: (a) the new developments since the last publication on the testsuite, (b) outline the use of the infrastructure, (c) discuss tests that highlight our workflow process, (d) analyze the results from executing the testsuite on various systems, and (e) outline future developments.
Posters
Research Posters
Recorded
TP
DescriptionNOvA is a world-leading neutrino physics experiment that is making measurements of fundamental neutrino physics parameters and performing searches for physics beyond the Standard Model. These measurements must leverage high performance computing facilities to perform data intensive computations and execute complex statistical analyses. We outline the NOvA analysis workflows we have implemented on NERSC Cori and Perlmutter systems. We have developed an implicitly-parallel data-filtering framework for high energy physics data based on pandas and HDF5. We demonstrate scalability of the framework and advantages of an aggregated monolithic dataset by using a realistic neutrino cross-section measurement. We also demonstrate the performance and scalability of the computationally intensive profiled Feldman-Cousins procedure for statistical analysis. This process performs statistical confidence interval construction based on non-parametric Monte Carlo simulation and was applied to the NOvA sterile neutrino search. We show the NERSC Perlmutter system provides an order of magnitude computing performance gain over Cori.
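The pandas-based selection style behind the filtering framework described above can be sketched as follows; the column names and cut values here are hypothetical illustrations, not NOvA's actual schema (and in the real framework the records would be read from HDF5 files rather than constructed inline):

```python
import pandas as pd

# Hypothetical event table; illustrative columns only.
events = pd.DataFrame({
    "run":    [100, 100, 101, 101, 102],
    "energy": [1.2, 3.8, 0.4, 2.9, 5.1],   # GeV (illustrative)
    "nhits":  [12, 45, 3, 30, 80],
})

# A selection "cut" is just a boolean mask; cuts compose with & and |,
# which makes this style easy to apply independently across many files.
quality = events["nhits"] > 10
signal  = events["energy"].between(1.0, 4.0)
selected = events[quality & signal]

print(len(selected))  # number of events passing both cuts
```

Because each file's DataFrame can be filtered independently before aggregation, the same cuts scale naturally across a monolithic dataset.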
Birds of a Feather
TP
XO/EX
DescriptionParallel I/O performance can be a critical bottleneck for applications, yet users are often ill-equipped for identifying and diagnosing I/O performance issues. Increasingly complex hierarchies of storage hardware and software deployed on many systems only compound this problem. Tools that can effectively capture, analyze, and tune I/O behavior for these systems empower users to realize performance gains for many applications.
In this BoF, we form a community around best practices in analyzing parallel I/O and cover recent advances to help address the problem presented above, drawing on the expertise of users, I/O researchers, and administrators in attendance.
Workshop
Recorded
Reliability and Resiliency
W
DescriptionWith exascale computing, the number of components that comprise high-performance computing (HPC) systems has increased by more than 70%, leading to a shorter mean time between failure (MTBF) and larger power budgets. These issues induce the need for (1) checkpoint/restart (C/R) and (2) energy reduction techniques. C/R has evolved with different software and hardware advances, so it is crucial to understand how its energy usage differs under various storage tiers and synchronicity. In this paper, we present a comparison of the energy consumption of leading, state-of-the-art C/R libraries, VELOC and GenericIO. We perform weak and strong scalability tests of the C/R libraries and show that asynchronous C/R provides 4x greater throughput while using 33% less energy than synchronous C/R. Data size and throughput are directly correlated with energy consumption. Therefore, C/R developers should focus on ways to improve/maintain high throughput in order to reduce energy consumption to address exascale needs.
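The throughput/energy relationship underlying the finding above can be illustrated with a toy model in which energy is power times time, and time is data size over throughput. All numbers below are hypothetical; note the measured saving (33%) is smaller than this naive model would suggest, plausibly because asynchronous checkpointing overlaps computation that continues to draw power:

```python
# Toy model: energy = power * time, time = data_size / throughput.
# All values are illustrative, not measurements from the paper.
data_gb = 512.0
sync_throughput_gbs = 2.0                        # hypothetical synchronous C/R
async_throughput_gbs = 4 * sync_throughput_gbs   # the paper reports ~4x throughput

power_w = 300.0                                  # hypothetical average power draw

sync_energy_j = power_w * (data_gb / sync_throughput_gbs)
async_energy_j = power_w * (data_gb / async_throughput_gbs)

# Under this naive model the async checkpoint uses 1/4 of the energy;
# real savings are smaller because other system activity also draws power.
print(async_energy_j / sync_energy_j)
```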
Workshop
Recorded
W
DescriptionCosmology simulations are among the largest simulations currently run on supercomputers, generating terabytes to petabytes of data per run. Consequently, scientists are seeking to reduce the amount of storage needed while preserving enough quality for analysis and visualization of the data. One of the most commonly used visualization techniques for cosmology simulations is volume rendering. Here, we investigate how different types of lossy error-bounded compression algorithms affect the quality of volume-rendered images generated from reconstructed datasets. We also compute a number of image quality assessment metrics to determine which are the most effective at identifying artifacts in the visualizations.
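As an illustration, one of the simplest image quality assessment metrics, peak signal-to-noise ratio (PSNR), can be computed as below. The images here are synthetic stand-ins; the study itself evaluates several such metrics against renderings of reconstructed data:

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB, one of the simplest IQA metrics."""
    mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Synthetic "rendered image" and a slightly degraded reconstruction of it.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(float)
noisy = img + rng.normal(0, 2.0, size=img.shape)  # small "compression" error

print(psnr(img, noisy))
```

Perceptual metrics such as SSIM tend to track visible artifacts better than PSNR, which is one reason comparing several metrics, as the study does, is informative.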
Birds of a Feather
TP
XO/EX
DescriptionThe ultimate goal of outreach activities is to connect with individuals outside or at the periphery of the HPC community and empower them to become the next generation of HPC professionals. While most large centers and organizations have some outreach staff, many small HPC centers find the development and maintenance of an outreach program a serious challenge. This BoF session will gather HPC Outreach facilitators from across the community to share challenges, experiences, lessons learned and strategies for developing sustainable Outreach programs. The discussions will be captured into a shared document that will guide future community efforts.
Birds of a Feather
TP
XO/EX
DescriptionThe goal of this BoF is to introduce the HPC community to the RISC-V ecosystem and how it can enable research and development. We will start with a short panel presentation (20 minutes) on the status of the RISC-V HPC ecosystem. This will be followed by a Q&A session with the panel and audience members. There will be directed questions as well as ad hoc questions from the audience.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionWhile many good development-oriented tools exist for analyzing and improving the performance of HPC applications, the capability to capture and analyze the dynamic behavior of applications in real production runs is lacking. Many heavily-used applications do keep some internal metrics of their performance, but there is no unified way of using them. In this paper we present the initial idea of AppEKG, both a concept and a prototype tool for providing a unified, understandable view of HPC application behavior in production. Our prototype AppEKG framework achieves less than 1% overhead, making it usable in production, while still providing dynamic data collection that captures time-varying runtime behavior.
Workshop
Recorded
W
Workshop
Recorded
HPC Training and Education
W
DescriptionGiven the anticipated growth of the high-performance computing market, HPC is challenged with expanding the size, diversity, and skill of its workforce while also addressing post-pandemic distributed workforce protocols and an ever-expanding ecosystem of architectures, accelerators and software stacks.
As we move toward exascale computing, training approaches need to address how best to prepare future computational scientists and enable established domain researchers to stay current and master tools needed for exascale architectures.
This paper explores adding in-person and virtual hackathons to the training mix to bridge traditional programming curricula and hands-on skills needed among the diverse communities. We outline current learning and development programs available; explain benefits and challenges in implementing hackathons for training; share specific use cases, including training “readiness,” outcomes and sustaining progress; discuss how to engage diverse communities—from early career researchers to veteran scientists; and recommend best practices for implementing these events into their training mix.
Paper
Recorded
Post-Moore Computing
Quantum Computing
TP
DescriptionAs computer system technology approaches the end of Moore's law, new computing paradigms that improve performance become a necessity. One such paradigm is approximate computing (AC). AC can present significant performance improvements, but a challenge lies in providing confidence that approximations will not overly degrade the application output quality. In AC, application domain experts manually identify code regions amenable to approximation. However, automatically guiding a developer where to apply AC is still a challenge.
We propose Puppeteer, a novel method to rank code regions based on amenability to approximation. Puppeteer uses uncertainty quantification methods to measure the sensitivity of application outputs to approximation errors. A developer annotates possible application code regions and Puppeteer estimates the sensitivity of each region. Puppeteer successfully identifies insensitive regions on different benchmarks. We utilize AC on these regions and we obtain speedups of 1.18x, 1.8x, and 1.3x for HPCCG, DCT, and BlackScholes, respectively.
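The ranking idea can be caricatured with a crude Monte Carlo sensitivity estimate: inject small relative errors into each candidate region and measure the effect on the final output. This is only a sketch of the concept with a hypothetical two-region pipeline, not Puppeteer's actual uncertainty-quantification machinery:

```python
import random

# Toy pipeline with two annotated "regions" (both hypothetical).
def region_a(x):
    return x * x            # stand-in for an expensive kernel

def region_b(x):
    return x + 100.0        # stand-in for cheap post-processing

def pipeline(x, err_a=0.0, err_b=0.0):
    # err_a / err_b model relative approximation error injected per region
    return region_b(region_a(x) * (1 + err_a)) * (1 + err_b)

def sensitivity(region, trials=1000, eps=0.01):
    rng = random.Random(0)
    base = pipeline(3.0)
    deltas = []
    for _ in range(trials):
        e = rng.uniform(-eps, eps)
        out = pipeline(3.0, **{region: e})
        deltas.append(abs(out - base) / abs(base))
    return sum(deltas) / trials

scores = {r: sensitivity(r) for r in ("err_a", "err_b")}
# The region with the lowest score is the safest candidate for approximation.
print(min(scores, key=scores.get))
```

Here errors in `region_a` are damped by the additive post-processing, so it ranks as the more approximation-amenable region.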
Workshop
Recorded
Architectures
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionUpdate on the Status of Argonne's New and Expected Systems.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF brings together the Arm HPC community to discuss how current and future standards will influence the growing diversity of Arm-related hardware and software. A panel composed of government, academic, and industry practitioners and vendors will discuss whether hardware standards (e.g., Armv9 and SBSA) and software standards (e.g., C++ Standard Parallelism and OpenMP) can sufficiently support the growing and diverse Arm hardware ecosystem. Audience participation is strongly encouraged with a focus on answering standards-related questions and facilitating the growth and interoperability of future Arm-based extreme scale systems.
Posters
Research Posters
TP
XO/EX
DescriptionHistorical temperature measurements are the basis of important global climate datasets, such as HadCRUT4 and HadCRUT5, used to analyze climate change. These datasets contain many missing values and are provided on low-resolution grids. Here we demonstrate that artificial intelligence can skillfully fill these observational gaps and upscale the data when combined with numerical climate model output. We show that recently developed image inpainting techniques perform accurate reconstructions via transfer learning. Moreover, higher resolution has long been a common and ongoing goal of the weather and climate community. We obtain a neural network that simultaneously reconstructs and downscales these important observational datasets (IPCC AR6), which is unique and state-of-the-art in climate research.
Birds of a Feather
TP
XO/EX
DescriptionAs ASEAN's significance in the global landscape has risen, so has its HPC. Multiple world-class supercomputers are now being planned and deployed, with a growing set of users conducting cutting-edge science. ASEAN has officially sanctioned an “HPC Task Force” among its coalition of major stakeholders to formulate a collective HPC infrastructure, federate it with advanced tools, and collaborate with other regions, e.g., Japan with Fugaku, as well as through a joint HPC school with Europe and Japan. The BoF will present the status quo of ASEAN HPC and discuss further outreach of ASEAN HPC to the global HPC community.
Workshop
Recorded
W
DescriptionSince 2009, Amazon has offered its unused compute capacity as AWS Spot Instances. For the first eight years of spot, pure market dynamics and high pricing variability created an ideal environment for time-series prediction. Following a pricing-scheme change in 2017, this extreme variability was removed as pricing is artificially smoothed for the end-user, therefore making it significantly easier to accurately predict prices. Nevertheless, the literature demonstrates ongoing efforts to accurately predict spot prices. To show prediction in the modern spot market is unnecessary, we train nearly 2.2 million ARIMA models on new and old data to demonstrate an order of magnitude improvement in accuracy for models trained on new data. Further, we show this new ease of price prediction makes spot instances ideal for large-scale, cost-aware cloud computing, as cost estimation is now trivial. Accordingly, we demonstrate that even naive prediction approaches waste less than $360 for 1,000,000 core hours.
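The claim that naive approaches now suffice can be illustrated with a persistence ("tomorrow equals today") predictor. The prices below are hypothetical, chosen only to mimic the low variability of post-2017 smoothed pricing, not actual AWS data:

```python
# Naive persistence predictor: predict the next spot price as the last one.
prices = [0.0932, 0.0931, 0.0933, 0.0934, 0.0934, 0.0935]  # $/core-hour, hypothetical

# Absolute prediction error at each step is just the step-to-step change.
errors = [abs(prices[i] - prices[i - 1]) for i in range(1, len(prices))]
mean_abs_error = sum(errors) / len(errors)

# Rough waste estimate when budgeting N core-hours at yesterday's price.
core_hours = 1_000_000
print(f"estimated waste: ${mean_abs_error * core_hours:.2f}")
```

With smoothed pricing, per-step changes are tiny, so even this baseline keeps cost-estimation error small at large scale.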
Workshop
Recorded
W
DescriptionMany of Los Alamos National Laboratory's HPC codes are memory-bandwidth bound. These codes exhibit high levels of sparse memory access that differ significantly from standard benchmarks. In this paper we present an analysis of the memory access of some of our most important code-bases. We then generate micro-benchmarks that preserve the memory access characteristics of our codes using two approaches: one based on statistical sampling of relative memory offsets in a sliding time window at the function level, and another at the loop level. The function-level approach is used to assess the impact of advanced memory technologies such as LPDDR5 and HBM3 using the gem5 simulator. Our simulation results show significant improvements for sparse memory access workloads using HBM3 relative to LPDDR5, and better scaling on a per-core basis. Assessment of two different architectures shows that higher peak memory bandwidth results in higher bandwidth on sparse workloads.
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
DescriptionWith dynamic imbalances caused by both software and ever more complex hardware, applications and runtime systems must adapt to load imbalances at runtime. We present a diffusion-based, reactive, fully asynchronous, and decentralized dynamic load balancer for a distributed actor library. With its asynchronous execution model, features such as remote procedure calls, and support for serialization of arbitrary types, UPC++ is especially well suited to implementing the actor model. While providing a substantial speedup for small- to medium-sized jobs with both predictable and unpredictable workload imbalances, the scalability of the diffusion-based approaches remains below expectations in most presented test cases.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionThis presentation consists of two parts: discussing the SX-Aurora TSUBASA vector supercomputer and introducing a digital annealer running on SX-Aurora TSUBASA called Vector Annealing. The first half of the presentation shows the vector architecture of SX-Aurora TSUBASA, especially its latest vector processors, which offer the highest level of memory bandwidth. Sustained performance and power efficiency are also discussed, as well as NEC’s future plans and roadmap. The second half of the presentation shows NEC’s quantum computing strategies and products to provide higher sustained performance in the annealing/optimization field. NEC developed Vector Annealing as a digital annealer and has a strong business relationship with D-Wave, provider of a quantum annealer. NEC aims at solving various social issues by using quantum/digital annealing technologies and by developing a hybrid platform combining a supercomputer with a quantum/digital annealer to provide much higher sustained performance.
Workshop
Recorded
W
DescriptionX-ray Bragg coherent diffraction imaging (BCDI) is widely used for materials characterization. However, obtaining X-ray diffraction data is difficult and computationally intensive. Here, we introduce a machine learning approach to identify crystalline line defects in samples from the raw coherent diffraction data. To automate this process, we compose a workflow coupling coherent diffraction data generation with training and inference of deep neural network defect classifiers. In particular, we adopt a continual learning approach, where we generate training and inference data as needed based on the accuracy of the defect classifier, instead of generating all training data a priori. The results show that our approach improves the accuracy of defect classifiers while using far fewer data samples.
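The continual-learning loop described above, generating training data only while classifier accuracy falls short of a target, can be sketched with stand-in functions. None of the functions below correspond to the paper's actual code; they merely mark where simulation, training, and evaluation would plug in:

```python
import random

# Stand-in for coherent-diffraction simulation of labeled defect samples.
def generate_samples(n, rng):
    return [(rng.random(), rng.random() > 0.5) for _ in range(n)]

# Stand-in training step: here the "model" is just a count of samples seen.
def train(model, samples):
    return model + len(samples)

# Stand-in accuracy curve that improves with more training data.
def evaluate(model):
    return min(0.99, model / 500.0)

rng = random.Random(0)
model, target, batch = 0, 0.95, 100

# Key idea: data is generated on demand, not all a priori.
while evaluate(model) < target:
    model = train(model, generate_samples(batch, rng))

print(evaluate(model))
```

The loop stops as soon as the classifier is good enough, which is what lets the approach use far fewer simulated samples than up-front data generation.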
Workshop
Recorded
Quantum Computing
W
DescriptionCurrent quantum computers suffer from noise that prohibits extracting useful results directly from longer computations. The figure of merit is often an expectation value, which experiences a noise-induced bias. A systematic way to remove such bias is probabilistic error cancellation (PEC). PEC requires noise characterization and introduces an exponential sampling overhead.
Probabilistic error reduction (PER) is a related method that systematically reduces the overhead. In combination with zero-noise extrapolation, PER can yield expectation values with an accuracy comparable to PEC. We present an automated quantum error mitigation software framework that includes noise tomography and application of PER to user-specified circuits. We provide a multi-platform Python package that implements a recently developed Pauli noise tomography technique and exploits a noise scaling method to carry out PER. We also provide software that leverages a previously developed toolchain, employing PyGSTi for gate set tomography and Mitiq for PER.
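Zero-noise extrapolation, the technique combined with PER above, can be illustrated with synthetic data: measure an expectation value at several amplified noise levels, then fit and extrapolate back to the zero-noise limit. The noise model below is a made-up linear bias, purely for illustration:

```python
# Stand-in for running a circuit with noise amplified by `scale`;
# the true value and bias are synthetic, not from any real device.
def noisy_expectation(scale, true_value=0.80, bias_per_unit_noise=-0.12):
    return true_value + bias_per_unit_noise * scale

scales = [1.0, 2.0, 3.0]
values = [noisy_expectation(s) for s in scales]

# Linear least-squares fit y = a*x + b; the intercept b is the
# extrapolated zero-noise estimate of the expectation value.
n = len(scales)
sx, sy = sum(scales), sum(values)
sxx = sum(x * x for x in scales)
sxy = sum(x * y for x, y in zip(scales, values))
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

print(round(b, 3))
```

In practice the noise-scaled measurements are themselves produced by PER, and richer fit models than a straight line are often used.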
Workshop
Recorded
Quantum Computing
W
DescriptionEmerging quantum algorithms that process data require that classical input data be represented as a quantum state. These data-processing algorithms often follow the gate model of quantum computing---which requires qubits to be initialized to a basis state, typically |0> ---and thus often employ state generation circuits to transform the initialized basis state to a data-representation state. There are many ways to encode classical data in a qubit, and the oft-applied approach of basis encoding does not allow optimization to the extent that other variants do. In this work, we thus consider automatic synthesis of addressable, quantum read-only memory (QROM) circuits, which act as data-encoding state-generation circuits. We investigate three data encoding approaches, one of which we introduce to provide improved dynamic range and precision. We present experimental results that compare these encoding methods for QROM synthesis to better understand the implications of and applications for each.
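Amplitude encoding, one common alternative to the basis encoding mentioned above, can be sketched classically: a data vector becomes the amplitude vector of a quantum state, so it must be normalized to unit Euclidean length. This illustrates the encoding itself, not the paper's QROM synthesis:

```python
import math

def amplitude_encode(data):
    """Map a classical vector onto the amplitudes of a quantum state."""
    norm = math.sqrt(sum(x * x for x in data))
    if norm == 0:
        raise ValueError("cannot encode the zero vector")
    # A length-2**n amplitude vector needs n qubits; real encoders
    # pad the data with zeros up to a power of two.
    return [x / norm for x in data]

state = amplitude_encode([3.0, 4.0])   # one qubit: 0.6|0> + 0.8|1>
print(sum(a * a for a in state))       # probabilities must sum to 1
```

Packing values into amplitudes like this is what opens the optimization room that plain basis encoding lacks, at the cost of state-preparation circuit depth.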
Workshop
Recorded
W
DescriptionUse of heterogeneous architectures has steadily increased during the past decade. However, non-homogeneous systems present a challenge to the programming model, as the execution models of CPU and accelerator may differ considerably. OpenMP, since version 4.0, has been trying to bridge this gap by allowing a code block to be offloaded to a target device. Among the subsequent additions to the OpenMP offloading API, the most notable is probably asynchronous execution between device and host. By default, offloaded regions are executed synchronously, so the host thread blocks until their completion. The nowait clause allows work to overlap between the host and target device. However, nowait must be added manually by the user, along with the task's data dependencies and appropriate synchronization to avoid race conditions, increasing program complexity and developer burden.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionProvenance registration is becoming more and more important as we increase the size and number of experiments performed using computers. In particular, when provenance is recorded in HPC environments, it must be efficient and scalable. We propose a provenance registration method for scientific workflows that is efficient enough to run on supercomputers (and thus could also run in environments with more relaxed restrictions, such as distributed ones). It must also be scalable in order to deal with the large workflows more typical of HPC. We also target transparency for the user, shielding them from having to specify how provenance must be recorded. We implement our design using the COMPSs programming model as a Workflow Management System (WfMS) and use RO-Crate as a well-established standard to record provenance. Experiments are provided, demonstrating the efficiency and scalability of our solution.
Workshop
Recorded
W
DescriptionThis presentation introduces a Cloud orchestrator controller that enables the autoscaling of containerized HPC clusters in the Cloud. This controller triggers the creation or suppression of containerized HPC compute nodes according to metrics collected at the containerized HPC scheduler’s job-queue level. Our approach modifies neither the Cloud orchestrator nor the HPC scheduler. The scheme followed is generic and can be applied to any HPC scheduler. Moreover, the containerization extends experimental reproducibility by adding the HPC scheduler itself to the environment replayed by the end user. The presentation exemplifies Cloud and HPC convergence, allowing a high degree of flexibility for users and community platform developers. It also explores continuous integration/deployment approaches from Cloud computing to orchestrate multiple, potentially different HPC job schedulers that scale under the supervision of the Cloud orchestrator.
Birds of a Feather
TP
XO/EX
DescriptionHPC centers around the world use benchmarks to evaluate their machines and to engage with vendors during procurement. The goal of this BoF is twofold. First, a series of short presentations will gather information on the state of the art methodologies for creating and validating the benchmarking sets. Second, an open discussion will gather community feedback on pitfalls of the current methodologies and how these methodologies should evolve to accommodate the growing diversity of the computational workloads and HPC architectures. The intended audience is HPC application developers and users, teams benchmarking HPC data centers, HPC vendors, and performance researchers.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionFortran DO CONCURRENT has emerged as a new way to achieve parallel execution of loops on CPUs and GPUs. This paper studies the performance portability of this construct on a range of processors and compares it with the incumbent models: OpenMP, OpenACC and CUDA. To do this study fairly, we implemented the BabelStream memory bandwidth benchmark from scratch, entirely in modern Fortran, for all of the models considered, which include Fortran DO CONCURRENT, as well as two variants of OpenACC, four variants of OpenMP (2 CPU and 2 GPU), CUDA Fortran, and both loop- and array-based references. BabelStream Fortran matches the C++ implementation as closely as possible, and can be used to make language-based comparisons. This paper represents one of the first detailed studies of the performance of Fortran support on heterogeneous architectures; we include results for AArch64 and x86_64 CPUs as well as AMD, Intel and NVIDIA GPU platforms.
Tutorial
Recorded
Cloud and Distributed Computing
Containers
Datacenter
Productivity Tools
Resource Management and Scheduling
Software Engineering
TUT
DescriptionThe use of cloud computing technologies in HPC has grown considerably over the last few years. The complexity and scale that come with cloud environments can make the first experience a daunting proposition. Cloud technologies offer a number of new capabilities that streamline tasks for HPC users and administrators; however, how to use these in HPC may not be immediately clear.
This tutorial provides a foundation for running HPC workloads in the cloud. It is organized as four series of progressive lectures and labs that provide a hands-on learning experience. It starts with a primer on cloud foundations and how they map to common HPC concepts, dives deeper into cloud core components, and presents best practices for running HPC in the cloud.
This tutorial uses a combination of lectures and hands-on labs on provided temporary Amazon Web Services (AWS) accounts to provide both conceptual and hands-on learning.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionJoin this technical deep dive into Google Cloud’s latest high-performance computing (HPC) advancements, covering the latest VMs, processors, accelerators, and storage solutions. We’ll also discuss our new HPC tools for deploying and managing your HPC environments, and how our customers are benefiting from running their HPC in the cloud.
Birds of a Feather
TP
XO/EX
DescriptionGiven the anticipated growth of the HPC market, HPC is challenged with expanding the size, diversity, and skill of its workforce. As we move toward exascale computing, how best do we prepare future computational scientists, and enable established domain researchers to stay current and master tools needed for exascale architectures?
This BoF invites scientists, researchers, trainers, educators, and the RSEs that support them to discuss current learning and development programs, explore adding in-person and virtual hackathons to existing training modalities, and brainstorm implementation strategies to bridge between traditional programming curricula and hands-on skills needed by diverse communities within different environments.
Tutorial
Recorded
Applications
Computational Science
Productivity Tools
Software Engineering
TUT
DescriptionProducing scientific software is a challenge. The high-performance modeling and simulation community, in particular, faces the confluence of disruptive changes in computing architectures and new opportunities (and demands) for greatly improved simulation capabilities, especially through coupling physics and scales. Simultaneously, computational science and engineering (CSE), as well as other areas of science, are experiencing an increasing focus on scientific reproducibility and software quality.
Computer architecture changes require new software design and implementation strategies, including significant refactoring of existing code. Reproducibility demands require more rigor across the entire software endeavor. Code coupling requires aggregate team interactions including integration of software processes and practices. These challenges demand large investments in scientific software development and improved practices. Focusing on improved developer productivity and software sustainability is both urgent and essential.
This tutorial will provide information about software practices, processes, and tools explicitly tailored for CSE and HPC. Goals are improving the productivity of those who develop CSE software, increasing the sustainability of software artifacts, and trustworthiness in their use. Topics include the software processes for (small) teams, including agile processes, collaboration via version control workflows, reproducibility, and scientific software design, refactoring, and testing (including test design strategies and continuous integration).
Invited Talk
Recorded
TP
XO/EX
DescriptionStorage and compute technologies are no longer improving at pace with exponentially growing global demand. The world’s largest data storage stakeholders already face hard choices about what data to keep in the face of limited capacity, and compute stakeholders are rapidly approaching the resource scaling limits of massive data centers for training the largest AI models.
Biology offers a guide for solving these problems. Living systems store information in DNA with extraordinary density, enough to store all the world’s data in one small room. Living systems also implement natural intelligence – still an aspirational goal for AI – using low-power neural circuit “wetware” that fits between our ears. If we can understand and exploit these capabilities, we can overcome the scaling issues facing the HPC field.
In this talk, I will describe IARPA’s high-risk, high-payoff research programs to address fundamental problems in storage and computing using biology as a guide. This includes the Molecular Information Storage (MIST) program, which is developing DNA data storage technologies that will eventually allow us to store exabytes of data in a tabletop form factor, and the Machine Intelligence from Cortical Networks (MICrONS) program, which has densely mapped the structure and function of neural circuits to guide the development of next-generation computing architectures.
Paper
Recorded
Big Data
Computational Science
TP
DescriptionOut-of-core graph processing is an attractive solution for processing very large graphs that do not fit in the memory of a single machine. The new class of ultra-low-latency SSDs should expand the impact and utility of out-of-core graph processing systems. However, current out-of-core systems cannot fully leverage the high IOPS these devices can deliver.
We introduce Blaze, a new out-of-core graph processing system optimized for ultra-low-latency SSDs. Blaze offers high-performance out-of-core graph analytics by constantly saturating these fast SSDs with a new scatter-gather technique called online binning, which allows value propagation among graph vertices without atomic synchronization. Blaze offers succinct APIs that let programmers write efficient out-of-core graph algorithms without the burden of managing complex IO executions. Our evaluation shows that Blaze outperforms current out-of-core systems by a wide margin on six datasets and a set of representative graph queries on Intel Optane SSD.
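The scatter-gather idea behind online binning can be sketched roughly as follows; this is our illustrative Python reconstruction (the partition function, bin count, and all names are assumptions, not Blaze's code):

```python
from collections import defaultdict

# Sketch of a binned scatter-gather step: edge updates are scattered into
# per-destination-partition bins, then each bin is gathered by the single
# worker owning that partition, so vertex values are combined without
# atomic synchronization.
NUM_BINS = 4

def partition(v):
    return v % NUM_BINS  # hypothetical partition function

def scatter(edges, values):
    bins = defaultdict(list)
    for src, dst in edges:
        bins[partition(dst)].append((dst, values[src]))
    return bins

def gather(bins, out):
    # Each bin touches a disjoint set of vertices, so one worker per bin
    # can apply its updates with plain (non-atomic) additions.
    for b in bins.values():
        for dst, val in b:
            out[dst] = out.get(dst, 0) + val
    return out

edges = [(0, 1), (2, 1), (0, 3)]
values = {0: 1.0, 2: 2.0}
result = gather(scatter(edges, values), {})
assert result == {1: 3.0, 3: 1.0}
```

In the real system the scatter side also overlaps with asynchronous SSD reads, which is what keeps the device saturated.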
Workshop
Recorded
W
DescriptionThe choice of programming model for accelerated computing applications depends on a wide range of factors, which weigh differently across application domains, institutions, and even countries. Why does one application use standard programming languages like C++, while another uses embedded programming models like Kokkos or directives such as OpenACC, and yet another directly programs in vendor-specific languages like CUDA or HIP? This panel will work through a comparison of the various choices, and share hands-on experience from developers in different countries and fields of expertise. We’ll explore both technical and non-technical reasons for how the various approaches are mixed. Join us for a fun and insightful session!
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
DescriptionResearch to accelerate matrix multiplication, pushed by the growing computational demands of deep learning, has sprouted many efficient architectural solutions, such as NVIDIA’s Tensor Cores. These accelerators are designed to process efficiently a high volume of small dense matrix products in parallel. However, it is not obvious how to leverage these accelerators for sparse matrix multiplication. A natural way to adapt the accelerators to this problem is to divide the matrix into small blocks, and then multiply only the nonzero blocks. In this paper, we investigate ways to reorder the rows of a sparse matrix to reduce the number of nonzero blocks and cluster the nonzero elements into a few dense blocks. While this pre-processing can be computationally expensive, we show that the high speed-up provided by the accelerators can easily repay the cost, especially when several multiplications follow one reordering.
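A minimal sketch of the row-reordering idea follows, assuming a simple greedy similarity heuristic (our illustration; the paper's actual reordering strategies may differ):

```python
# Reorder sparse-matrix rows so that rows with similar column patterns are
# adjacent, reducing the number of nonzero BxB blocks that a
# tensor-core-style kernel must multiply. Rows are given as column-index
# lists; values are omitted for brevity.
B = 2  # block size

def block_cols(row_cols):
    return frozenset(c // B for c in row_cols)

def count_nonzero_blocks(rows):
    blocks = set()
    for i, cols in enumerate(rows):
        for c in cols:
            blocks.add((i // B, c // B))
    return len(blocks)

def greedy_reorder(rows):
    # Repeatedly pick the row whose block-column set overlaps most with
    # the previously placed row (a simple similarity heuristic).
    remaining = list(range(len(rows)))
    order = [remaining.pop(0)]
    while remaining:
        last = block_cols(rows[order[-1]])
        best = max(remaining, key=lambda r: len(last & block_cols(rows[r])))
        remaining.remove(best)
        order.append(best)
    return [rows[i] for i in order]

rows = [[0, 1], [4, 5], [0, 2], [4, 6]]
before = count_nonzero_blocks(rows)          # 6 nonzero blocks
after = count_nonzero_blocks(greedy_reorder(rows))  # 4 nonzero blocks
assert after <= before
```

Fewer, denser nonzero blocks mean fewer small matrix products dispatched to the accelerator, which is where the speed-up that repays the reordering cost comes from.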
Workshop
Recorded
HPC Training and Education
W
DescriptionThe Blue Waters project pursued activities focused on national-scale education, outreach, and training. The activities began in 2009. During 2022, the final year of the project, the team focused on documenting the impact on the national community, lessons learned, and recommendations for programs that adopt or adapt similar activities.
The presentation to the attendees at this workshop will include the impact, lessons learned, and recommendations based on our experiences. If accepted, a full paper will be submitted for publication in the Journal of Computational Science Education that will expand upon the information provided in the presentation.
Paper
Recorded
Accelerator-based Architectures
Performance
Visualization
TP
DescriptionOptimizing application performance in today's hardware architecture landscape is an important, but increasingly complex task, often requiring detailed performance analyses. In particular, data movement and reuse play a crucial role in optimization and are often hard to improve without detailed program inspection. Performance visualizations can assist in the diagnosis of performance problems, but generally rely on data gathered through lengthy program executions. In this paper, we present a performance visualization geared toward analyzing data movement and reuse to inform impactful optimization decisions, without requiring program execution. We propose an approach that combines static dataflow analysis with parameterized program simulations to analyze both global data movement and fine-grained data access and reuse behavior, and visualize insights in-situ on the program representation. Case studies analyzing and optimizing real-world applications demonstrate our tool's effectiveness in guiding optimization decisions and making the performance tuning process more interactive.
Students@SC
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionAs High Performance Computing (HPC) moves from a specialist science to an everyday commodity, there is still an unreasonably large barrier to entry for new users. Traditionally, getting access to HPC resources is both expensive and time-consuming, and once you get access, moving between clusters is equally cumbersome.
The Alces Flight team has experimented with various concepts in the pursuit of the question, “How can we lower the barrier to entry for HPC users?” Starting in 2015, the team explored a free subscription model and the impact/usage by an individual user on public cloud, from which the base knowledge of the OpenFlightHPC open-source project emerged in 2019.
OpenFlightHPC is an open-source community developing a flexible, functional and stable HPC stack that can be launched on any platform. The project provides the knowledge and toolsets needed for HPC environment creation in a manner that anyone with basic-level HPC experience can utilize. The toolset assists in helping to create more portable HPC environments using process standardization to promote free interchange of knowledge for shared benefit.
This presentation covers:
- The importance of learning through experimentation and successful failures.
- The community and cultural shifts in people, skills, and sustainability that are feeding the need for greater flexibility in HPC.
- How OpenFlightHPC works, including bare-metal and cloud deployment techniques, process automation using tools including Ansible and Salt, and portability of workloads both in container and shared environments.
Workshop
Recorded
W
DescriptionHigh Performance Computing (HPC) is playing an increasingly important role in industry, research, and everyday life. A central core of the European HPC strategy is the Modular Supercomputing Architecture (MSA), which breaks with traditional HPC architectures by integrating heterogeneous computing resources in system-level modules. Nevertheless, HPC content, and MSA content in particular, only rarely finds its way into the curriculum of computer science courses at German universities. In addition, the competencies necessary for independent scientific research are hardly addressed, although these skills are essential for students writing their final theses.
We present a blended learning based module concept that promotes the understanding and application of modular supercomputing while connecting it with the techniques of scientific project work. The module was first implemented at Goethe University in Summer 2022. The initial feedback and evaluation results are quite encouraging both in terms of learning outcomes and student engagement and interest.
Birds of a Feather
TP
XO/EX
DescriptionScientific advances designed to address global challenges require researchers to have seamless access to data, computing, and increasingly high performance computing. A certain disconnect has characterized the relationship between the HPC and data communities, and this needs to be addressed in order to fully support today's data- and compute-intensive science. The BoF will openly explore the sociotechnical and technical differences between the two communities and describe open challenges on the path to closer collaboration. One BoF outcome is to draw in 'HPC-oriented' colleagues who wish to learn more or be more aligned with the data community.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionKubernetes has become the de-facto tool for orchestrating containerized workloads, and AI workloads are no different. But can an orchestrator built for long-running (micro)-services meet the needs of research experimentation and simulations? Can IT easily incorporate K8s into their AI & HPC workflows?
Join Gijsbert Janssen van Doorn of Run:ai for a crash course in Kubernetes for AI & HPC. Learn what’s working, what’s not, and some fixes for supporting these demanding environments with K8s.
In this session we will:
- Explain how and why Kubernetes is the top choice for AI & HPC workloads
- See where Kubernetes is challenged when it comes to AI & HPC workloads
- See how using GPUs instead of CPUs can accelerate your development cycles
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionAPEX (Autonomic Performance Environment for eXascale) is a performance measurement library for distributed, asynchronous multitasking runtime systems. It provides support for both lightweight measurement and high concurrency. To support performance measurement in systems that employ user-level threading, APEX uses a dependency chain in addition to the call stack to produce traces and task dependency graphs. APEX also provides a runtime adaptation system based on the observed system performance. In this paper, we describe the evolution of APEX from its design for HPX to support an array of programming models and abstraction layers and describe some of the features that have evolved to help understand the asynchrony and high concurrency of asynchronous tasking models.
Workshop
Recorded
W
DescriptionA computational storage device (CSD) must support background tasks for storage service applications (background tasks) without harming user I/O performance (foreground I/O). In practice, however, SPDK often increases foreground I/O latencies and under-utilizes CPU cores in the CSD. These problems stem from allocating foreground I/Os and background tasks to the same CPU core, because SPDK processes them as the same kind of request without distinguishing them. To tackle this, we propose a Background Task-aware Scheduler (BTS) for CSDs built using SPDK. BTS solves the following problems: (i) idle CPU cores in the CSD are not used, and (ii) the latency of foreground I/O increases due to interference with background tasks. For evaluation, we implemented a key-value interface CSD using SPDK. With BTS, the results show that idle CPU cores are used to process background tasks while guaranteeing low foreground I/O latency, with deduplication as the background task.
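The core scheduling idea can be sketched as follows; the class, queue structure, and names are our assumptions for illustration, not SPDK's or BTS's actual API:

```python
from collections import deque

# Keep foreground I/O and background tasks in separate queues and dispatch
# background work only to cores with no pending foreground I/O, so idle
# cores are used without delaying latency-sensitive requests.
class BackgroundAwareScheduler:
    def __init__(self, num_cores):
        self.cores = {c: deque() for c in range(num_cores)}
        self.background = deque()

    def submit_foreground(self, core, io):
        self.cores[core].append(io)

    def submit_background(self, task):
        self.background.append(task)

    def dispatch(self):
        # Background tasks go to idle cores only, so they never queue
        # behind (or interfere with) foreground I/O on a busy core.
        assignments = {}
        for core, q in self.cores.items():
            if q:
                assignments[core] = q.popleft()
            elif self.background:
                assignments[core] = self.background.popleft()
        return assignments

s = BackgroundAwareScheduler(num_cores=2)
s.submit_foreground(0, "read-4k")
s.submit_background("dedup-chunk")
out = s.dispatch()
assert out == {0: "read-4k", 1: "dedup-chunk"}
```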
Paper
Recorded
Architectures
Networks
TP
Best Paper Finalist
DescriptionHigh-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let data flow from the network to storage targets, reducing overall latency and CPU utilization. Yet, this approach still involves CPUs on the data path to enforce storage policies such as authentication, replication, and erasure coding. We show how storage policies can be offloaded to fully programmable SmartNICs, without involving host CPUs. By using PsPIN, an open-hardware SmartNIC, we show latency improvements for writes (up to 2x), data replication (up to 2x), and erasure coding (up to 2x), when compared to respective CPU- and RDMA-based alternatives.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionIn this session, we’ll paint a picture for removing infrastructure constraints to solve complex computational problems. Imagine agile scalable infrastructure with no fixed assets, and no waiting in the queue to start jobs. We’ll share progress on an extraordinary project using Virtual Flow to do extreme scale screening, and computational drug discovery at scale. Together with academic researchers and partners, we’ve built out a 5-10 billion molecular database to identify targets, using 2.2 million virtual CPUs. Learn how the most vexing societal problems of our generation will be solved through what we at AWS call Impact Computing.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionHigh performance computing has always offered batch computing services, but demand is growing for a wider range of workflow and data services. Container orchestration is a natural candidate for offering scheduling services for these types of workloads. By leveraging container orchestration with Kubernetes, you can build a platform that includes a service catalog and lets users run their own containerized services directly.
The power of such a platform lies in standing on the shoulders of giants. This starts with leveraging Kubernetes for container orchestration and for running these types of workloads. Next is using internal Kubernetes paradigms, such as Operators, to provide higher-level scheduling of specific types of applications and create a service catalog. Third is using the Kubernetes API to tie everything together under a single user experience.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionWhile most deployed networks today use C-Band, the L-Band has been available for decades and is also deployed on Dispersion Shifted Fiber. Using both (C+L) doubles the capacity per fiber pair, but requires additional equipment to be added to an in-service traffic bearing system and yields less than optimal performance due to band interaction of separate amplifiers. New C&L Band systems are being designed with fewer components and provide better performance by lighting the entire spectrum day one for lower cost per bit and superior reach. Hear when, where, and why Verizon is pushing the development of this new technology for its nationwide long haul network.
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
DescriptionThis paper presents the Communication-Avoiding 3D Matrix Multiplication (CA3DMM) algorithm, a simple and novel algorithm that has optimal or near-optimal communication cost. CA3DMM is based on a unified view of parallel matrix multiplication. Such a view generalizes 1D, 2D, and 3D matrix multiplication algorithms to reduce the data exchange volume for different shapes of input matrices. CA3DMM further minimizes the actual communication costs by carefully organizing its communication patterns. CA3DMM is much simpler than some other generalized 3D algorithms, and CA3DMM does not require low-level optimization. Numerical experiments show that CA3DMM has good parallel scalability and has similar or better performance when compared to state-of-the-art PGEMM implementations for a wide range of matrix dimensions and number of processes.
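The unified-grid view can be sketched with a simplified communication model (our illustration; the paper's cost model and tie-breaking are more refined): choose a p1 x p2 x p3 process grid for C = A(m x k) * B(k x n) that minimizes per-process communication, with degenerate grids recovering the classic 1D and 2D algorithms.

```python
# Enumerate all factorizations p = p1 * p2 * p3 and pick the grid with the
# smallest per-process communication volume under a simple model.
def factor_triples(p):
    for p1 in range(1, p + 1):
        if p % p1:
            continue
        for p2 in range(1, p // p1 + 1):
            if (p // p1) % p2:
                continue
            yield p1, p2, p // (p1 * p2)

def comm_volume(m, n, k, grid):
    p1, p2, p3 = grid
    # Per-process share of A, B, and C that must be exchanged
    # (simplified model, ignoring constant factors).
    return m * k / (p2 * p3) + k * n / (p1 * p3) + m * n / (p1 * p2)

def best_grid(m, n, k, p):
    return min(factor_triples(p), key=lambda g: comm_volume(m, n, k, g))

# Square matrices on 64 processes favor a cubic 4 x 4 x 4 grid,
# while skewed shapes collapse toward 2D or 1D grids.
assert best_grid(4096, 4096, 4096, 64) == (4, 4, 4)
```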
Workshop
Recorded
W
DescriptionThis paper provides an introduction to the CoArray Fortran Framework of Efficient Interfaces to Network Environments (Caffeine), a parallel runtime library built atop the GASNet-EX exascale networking library. Caffeine leverages several non-parallel Fortran features to write type- and rank-agnostic interfaces and corresponding procedure definitions that support parallel Fortran 2018 features, including communication, collective operations, and related services. One major goal is to develop a runtime library that can eventually be considered for adoption by LLVM Flang, enabling that compiler to support the parallel features of Fortran.
The paper describes the motivations behind Caffeine's design and implementation decisions, details the current state of Caffeine's development, and previews future work. We explain how the design and implementation offer benefits related to software sustainability by lowering the barrier to user contributions, reducing complexity through the use of Fortran 2018 C-interoperability features, and high performance through the use of a lightweight communication substrate.
Workshop
Recorded
Algorithms
Architectures
Compilers
Computational Science
Exascale Computing
Heterogeneous Systems
Hierarchical Parallelism
Memory Systems
Parallel Programming Languages and Models
Parallel Programming Systems
Resource Management and Scheduling
W
DescriptionWe present the open-source CAMP tool for assessing deep memory hierarchies through performance measurements of synthetic kernels. CAMP provides different access patterns and allows users to vary the kernels' operational intensities. We describe the tool's design and implementation, analyse measurements on a compute node of ARCHER2, the UK national supercomputer, and compare them with measurements on a NEXTGenIO compute node. We report results of a strong scaling study of contiguous, strided, and stencil access patterns for various operational intensities, and explore thread placement options and data sizes. The results confirm that bandwidth saturation can be achieved with a relatively small number of threads on AMD Rome, and that underpopulation may be beneficial, as performance drops when the node is fully populated for configurations with lower operational intensity; the effect is less pronounced on the less hierarchical Intel Cascade Lake. Finally, we discuss sub-NUMA-node awareness and directions for extending CAMP.
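A CAMP-style synthetic kernel can be sketched as follows; this Python version (our illustration, with invented names) shows how operational intensity can be tuned independently of the access pattern:

```python
import numpy as np

# A strided sweep whose arithmetic per loaded element is tunable, so
# operational intensity (flops per byte) varies independently of which
# bytes are moved.
def strided_kernel(data, stride, flops_per_elem):
    acc = 0.0
    touched = 0
    for i in range(0, len(data), stride):
        x = data[i]
        # Extra multiply-adds raise operational intensity without
        # changing the access pattern.
        for _ in range(flops_per_elem):
            x = x * 1.000001 + 0.000001
        acc += x
        touched += 1
    # Flops per byte loaded: one fused multiply-add pair per extra
    # iteration plus the final accumulate, over one element load.
    intensity = (flops_per_elem * 2 + 1) / data.itemsize
    return acc, touched, intensity

data = np.ones(1024)
_, touched, low = strided_kernel(data, stride=4, flops_per_elem=1)
_, _, high = strided_kernel(data, stride=4, flops_per_elem=8)
assert touched == 256 and high > low
```

Sweeping `stride` changes the memory-bound side of the roofline while `flops_per_elem` moves the kernel toward the compute-bound side.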
Paper
Recorded
Cloud and Distributed Computing
TP
DescriptionFunction-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful applications have been migrated to FaaS platforms due to their ease of deployment, scalability, and minimal management overhead. However, failures in FaaS have not been thoroughly investigated, thus making these desirable platforms unreliable for guaranteeing function execution and ensuring performance requirements. In this paper, we propose Canary, a highly resilient and fault-tolerant framework for FaaS that mitigates the impact of failures and reduces the overhead of function restart. Canary utilizes replicated container runtimes and application-level checkpoints to reduce application recovery time over FaaS platforms. Our evaluations using representative stateful FaaS applications show that Canary reduces the application recovery time and dollar cost by up to 83% and 12%, respectively over the default retry-based strategy. Moreover, it improves application availability with an additional average execution time and cost overhead of 14% and 8%, respectively, as compared to the ideal failure-free execution.
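The checkpoint-based recovery idea can be sketched as follows (our illustration; Canary also replicates container runtimes, which this sketch omits, and all names here are hypothetical):

```python
# Application-level checkpointing for a stateful function: progress is
# persisted after each step, so a retried invocation resumes from the
# checkpoint instead of restarting from scratch.
checkpoint_store = {}  # stands in for durable external storage

def run_with_checkpoints(func_id, items, fail_at=None):
    state = checkpoint_store.get(func_id, {"done": 0, "total": 0})
    for i in range(state["done"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += items[i]
        state["done"] = i + 1
        checkpoint_store[func_id] = dict(state)  # persist after each step
    return state["total"]

items = [1, 2, 3, 4]
try:
    run_with_checkpoints("f1", items, fail_at=2)  # crash mid-way
except RuntimeError:
    pass
# The retry resumes from the checkpoint and only redoes remaining work.
assert run_with_checkpoints("f1", items) == 10
```

Reducing the redone work on retry is what shrinks both recovery time and the dollar cost of re-execution.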
Invited Talk
Recorded
TP
XO/EX
DescriptionFor decades, Moore’s Law made the economics of specialized chips unattractive because the upfront costs couldn’t be justified when the alternative was fast-improving CPUs. As Moore’s Law fades, however, this is changing. Not only is specialization becoming more economically attractive, but it is now one of the best ways to get performance improvements for many applications. In this talk, I will discuss (1) how the economics of specialization have changed, (2) how specialization is fracturing computing in ways commonly seen in other technologies, and (3) how long we can expect the gains from specialization to make up for the slowdown in Moore’s Law.
Posters
Research Posters
TP
XO/EX
DescriptionQueries on large graphs use stored graph properties to generate responses. Most real-world graphs are dynamic: their topology changes over time, so the associated graph properties are also time-varying. In such cases, keeping stored graph properties correct requires recomputing or updating the previous values. Here, we present CANDY, an efficient framework for updating properties in large dynamic networks. We demonstrate the efficacy of this general framework by applying it to update graph properties such as Single-Source Shortest Path (SSSP), vertex coloring, and PageRank. Empirically, we show that our shared-memory parallel and NVIDIA GPU-based data-parallel implementations outperform state-of-the-art implementations.
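The incremental-update idea behind a property like SSSP can be sketched as follows (our illustration, not CANDY's exact algorithm): after an edge insertion, only vertices whose distance can improve are re-relaxed, instead of recomputing shortest paths from scratch.

```python
import heapq

# adj: adjacency lists {u: [(v, w), ...]}; dist: current shortest
# distances from a fixed source. Insert an edge and propagate only the
# improvement it causes.
def sssp_update(adj, dist, new_edge):
    u, v, w = new_edge
    adj.setdefault(u, []).append((v, w))
    if dist.get(u, float("inf")) + w >= dist.get(v, float("inf")):
        return dist  # the insertion changes nothing
    dist[v] = dist[u] + w
    heap = [(dist[v], v)]
    while heap:  # Dijkstra-style relaxation from the affected vertex only
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue
        for y, wy in adj.get(x, []):
            if d + wy < dist.get(y, float("inf")):
                dist[y] = d + wy
                heapq.heappush(heap, (dist[y], y))
    return dist

adj = {0: [(1, 5)], 1: [(2, 1)]}
dist = {0: 0, 1: 5, 2: 6}                 # shortest paths from source 0
dist = sssp_update(adj, dist, (0, 2, 2))  # new edge 0->2 with weight 2
assert dist == {0: 0, 1: 5, 2: 2}
```

The parallel versions process the affected frontier in bulk rather than through a serial priority queue, but the locality of the update is the same.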
Birds of a Feather
TP
XO/EX
DescriptionData centers consume nearly 1% of global electricity demand, contributing 0.3% of all global CO2 emissions, and this is expected to rise without proactive steps. Tempting as it may be to point the finger at big tech, the truth is that users of all sizes have had a hand in the growth of data centers' workloads. How can the HPC community do our part to drive down greenhouse gas emissions without sacrificing the computing power needed to support our mission and services as promised?
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionWe analyze a heart monitoring center for patients wearing electrocardiogram sensors outside hospitals. Such monitoring prevents serious heart damage and increases life expectancy and health-care efficiency. In this paper, we address the problem of providing a scalable infrastructure for the real-time processing scenario, serving at least 10,000 patients simultaneously, and an efficient fast-processing architecture for the postponed scenario, in which patients upload data after completed measurements. CardioHPC is a project to realize a simulation of these two scenarios using digital signal processing algorithms and artificial-intelligence-based detection and classification software for automated reporting and alerting.
We elaborate on the challenges we met in experimenting with different serverless implementations: 1) container-based on Google Cloud Run, and 2) Function-as-a-Service (FaaS) on AWS Lambda. Experimental results show the effect of overhead on request and transfer time, and the speedup achieved, by analyzing the response time and throughput of both the container-based and FaaS implementations as serverless workflows.
Students@SC
DescriptionThere are so many unique opportunities in HPC! While many core technical skills are applicable across a wide range of careers, there are also a lot of important differences. This panel brings together representatives from diverse career paths including industry, academia, and research labs. Come learn about the differences and similarities, and gain insight regarding the path that is best for you!
Posters
Research Posters
TP
XO/EX
DescriptionIn this work, we study the performance portability of offloaded lattice Boltzmann kernels and the trade-off between portability and efficiency. The study is based on a proxy application for the lattice Boltzmann method (LBM). The Kokkos performance-portability framework (with CUDA or SYCL backends) is used and compared with the native CUDA and native SYCL programming models. The Kokkos library supports the mainstream GPU products on the market. The performance of the code can vary with the accelerating model, the number of GPUs, the scale of the problem, the propagation pattern, and the architecture. Both the Kokkos library and the CUDA toolkit are studied on the ThetaGPU supercomputer (Argonne Leadership Computing Facility). We find that Kokkos (CUDA) has almost the same performance as native CUDA. The automatic data and kernel management in Kokkos may sacrifice efficiency, but parallelization parameters can also be tuned through Kokkos to optimize performance.
Student Cluster Competition
TP
XO/EX
DescriptionWe are proficient in distributed systems, parallel computing, algorithm optimization, operating systems, and other knowledge necessary for HPC, and we have participated in a large number of related research projects. Beyond the basic knowledge necessary for supercomputing, our team also has very wide-ranging expertise. Bo-Luo Ge has solid knowledge of computer operation and maintenance, networking, and computer systems. He is a core member of the cluster operation and maintenance team of the CUHKsz supercomputer club. Zi-Fan Liu has deep knowledge of reinforcement learning and deep learning, and has also done research on the application of reinforcement learning to smart grids. Yi-Liang He has profound compiler-level insights and is excellent at SIMD and RISC-V. Si-Wei Zhang has a unique comprehension of underlying compilation support and has done related research in the CUHKsz laboratory. Bo-Luo Ge, Yi-Liang He, Si-Wei Zhang, and Zi-Fan Liu also participated in ASC 2021 and won second prize. Yang-Lin Zhang has solid knowledge of computer vision and has worked on GPU parallel threading. Hao-Nan Xue has been involved in many hardware-related projects.
Beyond professional computer-domain knowledge, our team also has wide non-computer-domain knowledge, covering econometrics, electricity grids, operations management, data mining, and more. The diversity of our directions gives us an advantage in solving large-scale problems in various fields, and combining the thinking methods of different fields also improves the efficiency of discussion and problem solving within the group.
We are advised by an outstanding professor, Yeh-Ching Chung. Professor Chung established a supercomputing team at National Tsing Hua University before he came to CUHKsz, and led National Tsing Hua University to win first prize in the finals of ASC, ISC, and SC many times. Under his leadership, we participated in the ASC 2021 competition and won second prize. We have also participated in many parallel-optimization competitions, such as Intel's PAC, to test and improve our skills.
As we know, SCC was developed to provide an immersive high performance computing experience to undergraduate and high school students. As an international platform for students who are interested in HPC, we sincerely hope that we can compete with other groups all over the world to improve our skills and show our ability and knowledge to the world.
Posters
Research Posters
TP
XO/EX
DescriptionThe NERSC Perlmutter HPC system is the most recent large-scale, publicly available US system. NERSC chose to deploy the first phase of its GPU-based nodes in late 2021 using 2x Slingshot10 connections and has been upgrading them to 4x Slingshot11 connections starting in summer 2022. In this poster we provide benchmark numbers for CGYRO, a popular fusion turbulence simulation tool, comparing the original and the upgraded network setups. CGYRO has previously been shown to be communication-bound on many recent HPC systems, and we show that the upgraded networking provides a significant boost for fusion science.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionThe HPC landscape is larger, more complex, and more interconnected than ever before. With both cloud HPC and quantum computing entering as disruptors, users face many challenges managing software and data. We discuss some solutions with Covalent, a new open-source Pythonic toolkit for reproducible computational research. Using practical examples from classical and quantum machine learning, we demonstrate how users can rapidly iterate over hardware and software to efficiently identify novel research results. We also discuss in detail some of the challenges around quantum-classical interconnects and how hybrid quantum algorithms map to hybrid infrastructure.
Workshop
Recorded
W
DescriptionThis talk will share stories from the CAAR PIConGPU and ECP SOLLVE projects. The stories will present our experiences porting applications from pre-exascale systems to the exascale system Frontier. It will highlight challenges we faced preparing and using relevant software tools, including the alpaka, OpenMP, and OpenACC programming models, among others. The talk will also present insights we gathered from profiling and performance-analysis tools. Takeaways will be drawn from both projects to share with the IPDRM community, and at the same time we will seek input from the audience so that together we can improve our techniques and approaches.
Workshop
Recorded
W
DescriptionIntroducing undergraduate students to key concepts of distributed computing has become almost essential as the world continues to embrace cloud-based solutions to daily problems and as research continues to grow in scale, requiring distributed resources. Although distributed computing is an important part of the computer science curriculum, it can be difficult to introduce at some institutions. We explore some key challenges associated with introducing distributed computing into the computer science curriculum at a small liberal arts college. We focus on an initial failure from introducing a specialized distributed computing course too soon, and relay the successes and failures experienced over a one-year span of incorporating key distributed computing concepts across multiple systems-level courses. We discuss lessons learned from our first foray into teaching distributed computing and provide recommendations for new adopters of a distributed computing curriculum based on our experiences.
Workshop
Recorded
W
DescriptionMPI has been very successful, evolving from a parallel programming model for one process per core and node into the dominant internode programming model for HPC applications on today's clusters and extreme-scale systems. As MPI approaches its third decade in 2024, what are the challenges to be addressed and changes to be made in MPI? This talk will discuss some of the issues facing MPI, with examples from remote memory, I/O, and accelerator-rich nodes.
Birds of a Feather
TP
XO/EX
DescriptionWe envision scientific computing as a key beneficiary of the "deep programmable networks" paradigm, which provides advanced processing capabilities at terabit speeds. Together with high-performance compute nodes, this creates a large distributed system that pushes the performance envelope beyond currently known bounds. Despite holding a lot of promise, the paradigm is far from becoming mainstream; the key hurdles facing programmable networks are in building and operating them. This session will benefit the scientific computing, network programming, and operations communities. We intend to have a series of lightning talks followed by a moderated panel discussion. The audience will interact with the experts and hear their vision.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionScientific workflows are one of the well-established pillars of large-scale computational science and have emerged as a torchbearer for formalizing and structuring massive amounts of complex, heterogeneous data and for accelerating scientific progress. Scientific workflow management systems (SWfMSs) support the automation of repetitive tasks and capture complex analyses through workflows. However, the execution of workflows is costly and resource-intensive. At different phases of the workflow life cycle, most SWfMSs store provenance information, allowing result reproducibility, sharing, and knowledge reuse in the scientific community. But this provenance information can be many times larger than the workflow and input data, and managing provenance data grows in complexity with large-scale applications. We describe the challenges of managing and reusing provenance in e-science, focusing primarily on scientific workflow approaches by exploring different SWfMSs and provenance management systems. We also investigate ways to overcome these challenges.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionChapter Updates and Closing Remarks
Workshop
Recorded
W
DescriptionAs the scale and complexity of HPC systems keep growing, data compression techniques are often adopted to reduce the data-movement bottleneck. While lossy compression becomes preferable to lossless compression because of its potential to generate high compression ratios, it is not worth the effort without finding an optimal balance between volume reduction and information loss. The insight of this paper is that quantifying dominant coefficients at the block level reveals the right balance, potentially impacting overall compression ratios. Motivated by this, we characterize three transformation-based lossy compression mechanisms at the block level, using statistical features that capture data characteristics. We build several prediction models using the collected features and the characteristics of dominant coefficients, and evaluate the effectiveness of each model using six HPC datasets. Our results demonstrate that the random forest classifier captures the behavior of dominant coefficients precisely, achieving nearly 99% prediction accuracy.
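To make the notion of dominant coefficients at the block level concrete, here is a minimal sketch (our own simplification, not the paper's feature set) that counts how many transform coefficients carry most of a block's energy after a naive 1-D DCT:

```python
import math

def dct(block):
    """Naive DCT-II of a 1-D block (pure Python, illustration only)."""
    n = len(block)
    return [sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(block))
            for k in range(n)]

def dominant_count(block, energy_fraction=0.99):
    """Number of largest-magnitude coefficients needed to retain
    `energy_fraction` of the block's transform energy."""
    coeffs = sorted((c * c for c in dct(block)), reverse=True)
    total = sum(coeffs)
    acc, count = 0.0, 0
    for c in coeffs:
        acc += c
        count += 1
        if acc >= energy_fraction * total:
            break
    return count
```

A smooth block concentrates its energy in very few coefficients and so compresses well at a given error bound, while an irregular block needs many; block-level statistics like this are the kind of feature a classifier can learn from.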
Paper
Recorded
Post-Moore Computing
Quantum Computing
TP
DescriptionWhen quantum programs are executed on noisy intermediate-scale quantum (NISQ) computers, they experience hardware noise; consequently, the program outputs are often erroneous. To mitigate the adverse effects of hardware noise, it is necessary to understand the effect of hardware noise on the program output and more fundamentally, understand the impact of hardware noise on specific regions within a quantum program. Identifying and optimizing regions that are more noise-sensitive is the key to expanding the capabilities of NISQ computers.
Toward achieving that goal, we propose CHARTER, a novel technique to pinpoint specific gates and regions within a quantum program that are the most affected by the hardware noise and that have the highest impact on the program output. Using CHARTER's methodology, programmers can obtain a precise understanding of how different components of their code affect the output and optimize those components without the need for non-scalable quantum simulation on classical computers.
Toward achieving that goal, we propose CHARTER, a novel technique to pinpoint specific gates and regions within a quantum program that are the most affected by the hardware noise and that have the highest impact on the program output. Using CHARTER's methodology, programmers can obtain a precise understanding of how different components of their code affect the output and optimize those components without the need for non-scalable quantum simulation on classical computers.
Birds of a Feather
TP
XO/EX
DescriptionThe PMIx Standard APIs facilitate interaction between applications, tools, middleware, and system runtimes. PMIx addresses a range of use cases including: application launch and wire-up; inspection, steering, and debugging tools; dynamic application management, fault tolerance, and cross-library coordination; and communication across container boundaries.
We invite all SC attendees to hear about the current version of the PMIx Standard, significant activity in the PMIx Standard working groups, OpenPMIx and PRRTE implementation releases, and broadening adoption of PMIx. We will recap the activities of the past year, showcase community and working group efforts, and discuss the roadmap for the next year.
We invite all SC attendees to hear about the current version of the PMIx Standard, significant activity in the PMIx Standard working groups, OpenPMIx and PRRTE implementation releases, and broadening adoption of PMIx. We will recap the activities of the past year, showcase community and working group efforts, and discuss the roadmap for the next year.
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
DescriptionConvolutional neural networks (CNNs) are being incorporated into many image-based tasks across a variety of domains. Some of these are real-world safety-critical tasks, such as object detection and lane-line detection for self-driving cars. These applications have strict safety requirements and must be able to guarantee the reliable operation of the network. We propose selective triplication of important parts of the network, determined via weight-pruning methodologies, in order to maintain a reliable CNN in environments that may be resource-limited.
Paper
Recorded
Extreme Scale Computing
Memory Systems
Parallel Programming Systems
State of the Practice
TP
DescriptionThe rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers the possibility of transformational changes in computational science.
The HPL-AI benchmark is designed to measure the performance of mixed-precision arithmetic, as opposed to the HPL benchmark, which measures double-precision performance. Pushing the limits of systems at extreme scale is nontrivial; little public literature explores optimization of mixed-precision computations at this scale.
In this work, we demonstrate how to scale up the HPL-AI benchmark on the pre-exascale Summit and exascale Frontier systems at the Oak Ridge Leadership Computing Facility (OLCF) with a cross-platform design. We present the implementation, performance results, and a guideline of optimization strategies employed for delivering portable performance on both AMD and NVIDIA GPUs at extreme scale.
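The principle HPL-AI measures, solving in low precision and recovering double-precision accuracy via iterative refinement, can be sketched in a few lines. This is an illustrative toy, not the benchmark's implementation: low precision is emulated by mantissa truncation, and a 2x2 Cramer's-rule solver stands in for the LU factorization.

```python
import math

def lowprec(x, bits=8):
    """Emulate a low-precision float by keeping `bits` mantissa bits."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    s = 2.0 ** (bits - e)
    return round(x * s) / s

def solve2x2(m, b, prec=lambda v: v):
    """Solve a 2x2 system by Cramer's rule; every stored result is
    rounded through `prec` to mimic a low-precision factorization."""
    (a11, a12), (a21, a22) = m
    det = prec(a11 * a22 - a12 * a21)
    return [prec((b[0] * a22 - a12 * b[1]) / det),
            prec((a11 * b[1] - a21 * b[0]) / det)]

def refine(m, b, steps=3):
    """Mixed-precision iterative refinement: low-precision solve,
    double-precision residual, low-precision correction, repeat."""
    x = solve2x2(m, b, lowprec)
    for _ in range(steps):
        r = [b[i] - sum(m[i][j] * x[j] for j in range(2)) for i in range(2)]
        d = solve2x2(m, r, lowprec)
        x = [x[i] + d[i] for i in range(2)]
    return x
```

Each refinement step computes the residual in full precision and corrects the solution with another cheap low-precision solve, so the expensive factorization work runs at low precision while accuracy converges toward double precision.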
Posters
Research Posters
TP
XO/EX
DescriptionApplications of quantum machine learning algorithms are currently still being studied. Recent work suggests that classical gradient descent techniques can effectively train variational quantum circuits. We propose to train quantum variational circuits to find smaller text and image embeddings that preserve contrastive-learning distances based on CLIP large embeddings. This is a critical task since fine-tuning CLIP to produce low-dimensional embeddings is prohibitively expensive. We introduce CLIP-ACQUA, a model trained in a self-supervised configuration from CLIP embeddings to reduce the latent space. We use CLIP-ACQUA on a sizeable unlabelled corpus of text and images to demonstrate its effectiveness. Our experiments show that we can obtain smaller latent spaces that preserve the original embedding distances inferred during contrastive learning. Furthermore, using our model requires no fine-tuning of CLIP, preserving its original robustness and structure. The data used as a demonstration aids in modeling consumer-to-consumer online marketplaces to detect illicit activities.
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionIn-person and virtual discussion period covering presentations and position papers.
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionClosing remarks and awards of the Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022)
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
Workshop
Recorded
W
DescriptionAs a way to optimize the investment in computational resources, cloud bursting is attracting a lot of attention: organizations utilize the cloud computing environment in an on-demand fashion while preserving a minimum amount of on-premise resources for sensitive data processing. For practical cloud bursting, we need to achieve 1) secure job and data sharing, 2) a uniform job execution environment across on-premise and cloud resources, and 3) on-demand automatic deployment of the execution environment on the cloud. To provide these, we propose a meta-scheduling system called CloudQ. CloudQ 1) uses cloud object storage for data sharing, 2) utilizes container images to provide a uniform job execution environment, and 3) automatically deploys an execution environment on the cloud.
Student Cluster Competition
TP
XO/EX
DescriptionWe are ClusDur – a team of enthusiastic Durham University students who can’t wait to enter the world of HPC! We would love to participate in IndySCC since it is the ideal opportunity to get first insights into supercomputing and gain cluster-competition experience. This would be very valuable to us since none of us has previously participated in any cluster competition.
We would do well in IndySCC due to our interdisciplinarity, diversity of skill levels, and breadth of technical expertise. ClusDur consists of students from computer science, engineering, mathematics, and physics, with three of us in our very first year of study.
Interdisciplinarity and diversity of skill levels are key to enabling our students-teach-students approach and allow us to learn and thrive together. This will help us pull together and win through the difficulties, deadlines, and all-nighters of the competition.
To gain first HPC experience, Harrison, Joseph, and Matthew have already taken an introductory course on the usage of Durham University’s supercomputer Hamilton. Further, Allaida, Jack, Matthew, and Robert have successfully participated in classical hackathons such as DurHack. Hence, they have experience solving challenges under time pressure and as a team.
As third-year physics students, Harrison and Robert gained their first scientific computing experience through their computational physics projects. For Harrison, the project allowed him to encounter multiprocessing and apply it to a scientific simulation. Robert used his project as an incentive to gain experience with Arch Linux on a Raspberry Pi. He taught himself system administration skills that are a great asset to the team when it comes to cluster configuration and shell scripting.
Allaida is a first-year Computer Science student. She has already acquired a solid foundation in Python programming and adds experience with machine learning to the team’s skill set thanks to her participation in DurHack. Matthew is a first-year Engineering student and brings domain knowledge and programming experience in Python, C/C++ and MATLAB to the table. Allaida and Matthew aim to further their practical CS skills and gain insights into scientific simulations.
Jack is a first-year Mathematics student, with a strong interest in systems level programming and related industry experience. He is eager to share his software development knowledge with the team. As a second-year computer science student, Joseph has a background and keen interest in hardware optimization and novel computing approaches. Through IndySCC, Jack and Joseph aim to further their knowledge about performance optimization and gain insights into bare metal cloud computing.
Laura conducts research on parallel programming paradigms, especially on task parallelism in molecular dynamics simulations. She participated in the ISC SCC 2015 and aims to share her experience through mentoring. Adam’s research interests include the scheduling behaviour of task-based runtimes and heterogeneous computing. He competed as part of Team Durham in the CIUK SCC 2021 and is keen to mentor ClusDur through their first SCC. Tobias is conducting research on the efficient implementation of multiscale algorithms. He's strongly involved as PI in the UK's exascale programme ExCALIBUR.
Workshop
Recorded
W
DescriptionWith the increasing prevalence of scalable file systems in the context of HPC, accurate anomaly detection on runtime logs is increasingly important. As it currently stands, however, many log-based anomaly detection methods encounter numerous challenges when applied to logs from parallel file systems (PFSes) due to the irregularity and ambiguity of their time-based log sequences. To circumvent these problems, this study proposes ClusterLog, a log pre-processing method that clusters temporal sequences of log keys based on their semantic similarity. By grouping semantically and sentimentally similar logs, it aims to represent log sequences with the smallest number of unique log keys, intending to improve the ability of a downstream sequence-based model to learn the log patterns. The preliminary results indicate not only its effectiveness in reducing the granularity of log sequences without the loss of important sequence information, but also its generalizability to logs from different file systems.
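A minimal sketch of the clustering step might look like the following; plain token-set Jaccard similarity is used here as a stand-in for ClusterLog's semantic similarity, and the log keys are invented for illustration:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two log keys."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster_keys(keys, threshold=0.5):
    """Greedy single-pass clustering: each key joins the first cluster
    whose representative is similar enough, else it starts a new
    cluster. Returns a mapping of key -> cluster id."""
    reps, assign = [], {}
    for k in keys:
        for cid, rep in enumerate(reps):
            if jaccard(k, rep) >= threshold:
                assign[k] = cid
                break
        else:
            assign[k] = len(reps)
            reps.append(k)
    return assign
```

Replacing each raw key with its cluster id shrinks the vocabulary a downstream sequence model must learn, which is the effect the pre-processing aims for.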
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionMolecular dynamics (MD) simulations are widely used to study large-scale molecular systems. However, reaching the timescale necessary to detect rare processes is challenging, even with modern supercomputers. To overcome this timescale limitation, the simulation of a single long MD trajectory is replaced by multiple shorter simulations executed simultaneously in an ensemble. Analyses are usually co-scheduled with these simulations to efficiently process large volumes of data in situ. Executing a workflow ensemble of simulations and their in situ analyses requires sophisticated management of computational resources so that they do not slow each other down. In this paper, we propose an efficient method to co-schedule and allocate resources for a workflow ensemble such that the makespan is minimized. We evaluate the proposed approach using an accurate simulator based on the WRENCH simulation framework. Results demonstrate the significance of co-scheduling simulations and in situ analyses that couple data together to benefit from data locality.
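The makespan objective can be illustrated with a toy longest-processing-time scheduler; this is a deliberate simplification (the paper's co-scheduling and allocation method is far more sophisticated, and the member runtimes below are hypothetical):

```python
def makespan(members, nodes):
    """Greedy longest-processing-time schedule of ensemble members onto
    `nodes` identical node groups; each member is the runtime of one
    co-scheduled simulation + in situ analysis pair."""
    loads = [0.0] * nodes
    for t in sorted(members, reverse=True):
        i = loads.index(min(loads))  # place on least-loaded node group
        loads[i] += t
    return max(loads)  # the makespan is the heaviest group's load
```

Minimizing the maximum load across node groups is exactly the makespan objective; the real problem additionally decides how many resources each simulation and analysis receives within a group.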
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionGraph neural networks (GNNs) suffer from low GPU utilization due to frequent memory accesses. Existing concurrent training mechanisms cannot be directly adapted to GNNs because they fail to consider the impact of input irregularity, which requires pre-profiling the memory footprint of concurrent tasks based on input dimensions to ensure successful co-location on the GPU. Moreover, the massive numbers of training tasks generated by scenarios such as hyper-parameter tuning require flexible scheduling strategies. To address these problems, we propose CoGNN, which enables efficient management of GNN training tasks on GPUs. Specifically, CoGNN organizes the tasks in a queue and estimates the memory consumption of each task using cost functions on a per-operator basis. In addition, CoGNN implements scheduling policies to generate task groups, which are iteratively submitted for execution. The experimental results show that CoGNN achieves shorter completion and queuing times for training tasks from diverse GNN models.
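A sketch of the queue-and-estimate idea follows; the cost function and memory budget are invented for illustration and are not CoGNN's actual model:

```python
def estimate_memory(num_nodes, feat_dim, hidden_dim, bytes_per_val=4):
    """Toy per-operator cost function: activation memory of one
    graph-conv layer (input + output feature matrices)."""
    return num_nodes * (feat_dim + hidden_dim) * bytes_per_val

def make_groups(tasks, budget):
    """Greedy first-fit grouping of queued training tasks so that each
    group's estimated memory fits the GPU budget; groups would then be
    submitted for co-located execution one after another."""
    groups = []
    for t in tasks:
        need = estimate_memory(*t)
        for g in groups:
            if g["mem"] + need <= budget:
                g["tasks"].append(t)
                g["mem"] += need
                break
        else:
            groups.append({"tasks": [t], "mem": need})
    return groups
```

Estimating memory from input dimensions before co-location is what lets irregular GNN inputs share a GPU safely; the grouping policy then trades queuing time against utilization.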
Tutorial
Recorded
AI-HPC Convergence
Applications
Cloud and Distributed Computing
Data Analytics
Data Management
Exascale Computing
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
TUT
DescriptionThe success of the Transformer model has pushed the limits of deep learning to operate on the scale of billions of parameters. This growth in model size has outpaced advances in hardware, resulting in an urgent need to distribute the training of enormous models across multiple GPU clusters. Despite this trend, best practices for choosing an optimal parallelization strategy are still lacking due to the breadth of knowledge required across both deep learning and parallel computing.
The Colossal-AI system addresses this challenge by introducing a unified interface that scales sequential model-training code to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods such as the zero redundancy optimizer. The system's design mirrors the way the AI community is accustomed to writing non-distributed code, so existing code can easily be adapted to efficient parallel training.
We provide AWS computing instances with example code to help attendees get familiar with the system and apply it to scale their large AI models with minimal effort. More information about Colossal-AI is available at https://github.com/hpcaitech/ColossalAI.
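As a minimal stand-in for the data parallelism Colossal-AI automates (this is not its actual API), each worker computes a gradient on its own data shard, and the averaged gradient drives a single shared update, mimicking an all-reduce:

```python
# Pure-Python sketch of data-parallel training: fit y = w*x by gradient
# descent, with the gradient computed per "worker" shard and averaged.
def local_gradient(weight, shard):
    # gradient of mean squared error for y = w*x on one data shard
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(weight, shards, lr=0.1):
    grads = [local_gradient(weight, s) for s in shards]   # per-worker work
    avg = sum(grads) / len(grads)                         # the "all-reduce"
    return weight - lr * avg

# two workers, each holding half of a data set generated by y = 2x
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards)
```

After 50 steps the shared weight converges to the true slope of 2.0; real frameworks run the same pattern with tensors and collective communication instead of Python lists.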
Paper
Recorded
Post-Moore Computing
Quantum Computing
TP
DescriptionThis work presents a generalization of NchooseK, a constraint satisfaction system designed to target both quantum circuit devices and quantum annealing devices. Previously, NchooseK supported only hard constraints, which made it suitable for expressing problems in NP (e.g., 3-SAT) but not NP-hard problems (e.g., minimum vertex cover). In this paper, we show how support for soft constraints can be added to the model and implementation, broadening the classes of problems that can be expressed elegantly in NchooseK without sacrificing portability across different quantum devices.
Through a set of examples, we argue that this enhanced version of NchooseK enables problems to be expressed in a more concise, less error-prone manner than if these problems were encoded manually for quantum execution. We include an empirical evaluation of performance, scalability, and fidelity on both a large IBM Q system and a large D-Wave system.
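The hard/soft constraint split can be illustrated on minimum vertex cover. The brute-force encoding below is a hedged sketch of the idea only, not the NchooseK API: each edge contributes a hard "at least one endpoint chosen" constraint, and each vertex contributes a soft "prefer not chosen" constraint whose violations are minimized.

```python
# Toy hard/soft constraint solver for minimum vertex cover.
from itertools import product

def min_vertex_cover(vertices, edges):
    best = None
    for bits in product([0, 1], repeat=len(vertices)):
        chosen = {v for v, b in zip(vertices, bits) if b}
        # hard constraints: every edge must have a chosen endpoint
        if all(u in chosen or v in chosen for u, v in edges):
            # soft constraints: each chosen vertex costs one violation
            if best is None or len(chosen) < len(best):
                best = chosen
    return best

edges = [("a", "b"), ("b", "c"), ("c", "d")]
cover = min_vertex_cover(["a", "b", "c", "d"], edges)
```

On the path graph a-b-c-d, the minimum cover has two vertices; a quantum backend would instead translate these constraints into circuits or annealer penalties.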
Birds of a Feather
TP
XO/EX
DescriptionThis BoF will bring together academia, government research laboratories, and industry to discuss and contribute to the two active community-driven, vendor-neutral forums focusing on energy efficiency in HPC software stacks. For more than seven years, these two complementary forums, HPC-PowerStack and PowerAPI, have led the efforts in identifying and building software solutions across the software stack.
This highly interactive BoF will enable the community to discuss ongoing challenges in designing cost-effective, cohesive, portable, and interoperable implementations of HPC software that enable monitoring and control of system efficiency. Attendees will contribute by brainstorming solutions for addressing imminent exascale power challenges.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionLarge data sets are common in many areas of high-performance computing. Often, these data sets are so large that they far exceed the storage capacity of the system, which highlights an opportunity to employ compression methods to reduce the data to a manageable size. Given that reduction methods operate on data in different ways, it is important to compare these methods to determine the optimal approach for any given data set. This poster compares the effectiveness of different data reduction methods on image data from Los Alamos National Laboratory based on three major parameters: PSNR, compression ratio, and compression rate. Our analysis indicated that the SZ lossy compressor was the most effective for this data set, given that it offered the highest PSNR along with a very reasonable compression ratio.
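Two of the comparison metrics named above are easy to state precisely: PSNR in decibels against the original data, and compression ratio as original size over compressed size. The sample values below are illustrative, not the LANL data set.

```python
# Metrics used to compare lossy compressors: PSNR and compression ratio.
import math

def psnr(original, reconstructed, peak):
    """Peak signal-to-noise ratio in dB; infinite for a lossless result."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def compression_ratio(original_bytes, compressed_bytes):
    return original_bytes / compressed_bytes

orig = [0.0, 0.5, 1.0, 0.25]
exact = [0.0, 0.5, 1.0, 0.25]          # lossless reconstruction
lossy = [0.01, 0.49, 1.0, 0.26]        # small reconstruction errors
```

A higher PSNR at a comparable ratio is what made SZ the preferred compressor in the study above.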
Workshop
Recorded
W
DescriptionMPI includes persistent operations, which specify recurring communication patterns. The idea is that using these operations can yield a performance benefit compared to standard non-blocking communication, but in current MPI implementations this benefit is rarely observable. We identify message envelope matching as one cause of the overhead. Because persistent MPI requests can be used multiple times, the compiler can, in some cases, prove that message matching is only needed for the first occurrence and can be entirely skipped for subsequent usages.
We present the required compiler analysis and an implementation of a communication scheme that skips the message envelope matching. This allows us to substantially reduce the communication overhead that cannot be overlapped with computation. Using the Intel IMB-ASYNC Benchmark, we can see a communication overhead reduction of up to 95% for larger message sizes.
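The match-once idea can be modeled conceptually: a persistent request matches its envelope (source, tag) on the first start, caches the matched partner, and skips the expensive queue search on every restart. This is a stand-in model of the optimization only, not an MPI implementation.

```python
# Conceptual model of skipping envelope matching for persistent requests.
class PersistentRequest:
    def __init__(self, source, tag, match_fn):
        self.envelope = (source, tag)
        self.match_fn = match_fn      # stands in for the queue search
        self.partner = None           # cached after the first match
        self.match_count = 0

    def start(self):
        if self.partner is None:      # only the first start matches
            self.partner = self.match_fn(self.envelope)
            self.match_count += 1
        return self.partner

def match(envelope):                  # pretend matching-queue lookup
    return ("matched", envelope)

req = PersistentRequest(source=0, tag=7, match_fn=match)
for _ in range(100):                  # reused like MPI_Start in a loop
    req.start()
```

Even after 100 restarts, matching ran exactly once; the compiler analysis in the paper proves when this caching is safe.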
Workshop
Recorded
W
DescriptionRepeatability and reproducibility are important for science, but HPC application experiments have complex and changing software stacks and runtime environments which make this difficult to attain. This presentation describes a system for combining containers and system hardware environment capture to fully capture an application experiment’s provenance; the clear interfaces to hardware exposed by containers are a key element of this approach. The evaluation of this system demonstrates that this combination enables effective analysis of application dependencies, increasing the reproducibility of experimental results in HPC systems. It also demonstrates that this technique can provide a basis for Scientific Development Operations that result in verifiable application experiments.
Panel
Recorded
Big Data
Data Management
Emerging Technologies
TP
XO/EX
DescriptionRecent advances in both software and hardware enable the concept of composable system design. A composable system provides the flexibility to serve a variety of workloads by means of a software-defined infrastructure built on hardware disaggregated over a network fabric. The system offers a dynamic co-design platform that allows experiments and measurements in a controlled environment. This new paradigm aims to eliminate unused (jailed) hardware in a computing system and decouples the life cycles of components (e.g., CPU vs. memory). In addition, a composable system can accelerate the adoption of new hardware in software applications, as new devices can simply be plugged into an existing system. This panel will discuss the pros and cons of composable systems and the considerations involved in applying this design in data centers to accommodate a variety of workloads.
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
DescriptionIn this paper, we explore the composition capabilities of the Template Task Graph (TTG) programming model. We show how fine-grain composition of tasks is possible in TTG between DAGs belonging to different libraries, even in a distributed setup. We illustrate the benefits of this fine-grain composition on a linear algebra operation, the matrix inversion via the Cholesky method, which consists of three operations that need to be applied in sequence.
Evaluation on a many-core cluster shows that transparent fine-grain composition implements the complex operation without introducing unnecessary synchronizations, increasing the overlap of communication and computation and thus significantly improving the performance of the entire composed operation.
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
DescriptionProcessing large graphs has become an important irregular workload. We present Massively Parallel Log Graphs (MPLG) to accelerate GPU graph codes, including highly optimized codes. MPLG combines a compressed in-memory representation with low-overhead parallel decompression. This yields a speedup if the boost in memory performance due to the reduced footprint outweighs the overhead of the extra instructions to decompress the graph on the fly. However, achieving a sufficiently low overhead is difficult, especially on GPUs with their high-bandwidth memory. Prior work has only successfully employed similar ideas on CPUs, but those approaches exhibit limited parallelism, making them unsuitable for GPUs. On large real-world inputs, MPLG speeds up graph analytics by up to 67% on a Titan V GPU. Averaged over 15 graphs from several domains, it improves the performance of Rodinia’s breadth-first search by 11.9%, Gardenia’s connected components by 5.8%, and ECL’s graph coloring by 5.0%.
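The compressed in-memory representation plus on-the-fly decompression trade-off can be sketched with a standard technique: storing each sorted adjacency list as delta-encoded varints. The byte layout below is a simple illustration of the general idea, not the MPLG format.

```python
# Delta + varint compression of a sorted adjacency list, decompressed
# on the fly. Neighbor IDs close together compress to very few bytes.
def encode_varint(n, out):
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(byte)
            return

def compress_adjacency(neighbors):    # neighbors must be sorted ascending
    out, prev = bytearray(), 0
    for v in neighbors:
        encode_varint(v - prev, out)  # deltas are small, so few bytes each
        prev = v
    return bytes(out)

def decompress_adjacency(data):
    result, cur, shift, delta = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            cur += delta              # undo the delta encoding
            result.append(cur)
            delta, shift = 0, 0
    return result

adj = [3, 10, 200, 1000000]
packed = compress_adjacency(adj)
```

Four 32-bit neighbor IDs (16 bytes) shrink to 7 bytes here; the speedup question MPLG answers is whether this footprint reduction outweighs the extra decode instructions on a GPU.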
Workshop
Recorded
W
DescriptionThrough simulation, observation, and experiments, far more data is being generated today than can reasonably be stored to disk and later analyzed without any form of data reduction. Moreover, with deepening memory hierarchies, dwindling per-core memory bandwidth, and increasing heterogeneity, even on-node data movement between memory and registers makes for a significant performance bottleneck and primary source of power consumption. Hence, it is becoming increasingly important that the bits being moved and stored in numerical computations are free of redundancy and represent valuable information rather than error.
This talk gives an overview of zfp, a compressed number representation and multi-dimensional array container that mitigates the challenges of data movement using high-speed, lossy (but optionally error-bounded) compression. zfp reduces I/O time and off-line storage by 1-2 orders of magnitude depending on accuracy requirements, as dictated by user-set error tolerances. Unique among data compressors, zfp also supports constant-time read/write random access to individual array elements from compressed storage. zfp's compressed arrays appear to the user like conventional uncompressed arrays and can often be integrated into existing applications with minimal code changes. When used in numerical computations, zfp arrays provide a fine-grained knob on precision while achieving accuracy comparable to IEEE floating point at half the storage or less, reducing both memory footprint and bandwidth. Several application use cases are presented that demonstrate reduced storage, increased accuracy, and improved performance.
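The error-bounded guarantee described above can be shown with the simplest possible analogue: uniform quantization with step 2*tolerance, which guarantees |decoded - original| <= tolerance. Real zfp is block-transform based and far more sophisticated; this is only a conceptual sketch of the user-set tolerance contract.

```python
# Error-bounded lossy compression in miniature: quantize to multiples of
# 2*tol, so the reconstruction error never exceeds tol.
def compress(values, tol):
    step = 2.0 * tol
    return [round(v / step) for v in values]   # small ints, cheap to store

def decompress(quantized, tol):
    step = 2.0 * tol
    return [q * step for q in quantized]

data = [0.113, 2.718, -1.414, 3.141]
tol = 0.01
decoded = decompress(compress(data, tol), tol)
```

The appeal of zfp's compressed arrays is that this tolerance knob is exposed per array while the data still behaves like an ordinary indexable array.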
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Recorded
TP
DescriptionQuantum circuit simulation can be carried out as a contraction over many quantum tensors. QTensor, a library built for quantum circuit simulation using a bucket elimination algorithm, contracts tensors to return a final energy value. As bucket elimination advances, tensors can grow large and memory becomes a bottleneck. To address the memory limitations of circuit simulation while enabling more complex circuits to be simulated, we focus on implementing a lossy compressor that can compress the floating-point data stored in quantum circuit tensors while preserving a final energy value within an error bound after decompression. We study the effects of various lossy compression/decompression strategies on data compressibility, throughput, and result error to ensure that compression/decompression is effective and fast and does not heavily distort the data. This work is in progress; preliminary results for the proposed preprocessing/postprocessing strategies and compressor optimizations will be showcased.
Panel
Recorded
Applications
Cloud and Distributed Computing
Emerging Technologies
TP
XO/EX
DescriptionFor generations, NASA’s space missions have captured the imagination of those around the world. From Artemis to the International Space Station, there is a need for High Performance Computing, analytics, and simulations at the edge – with planned data flows to the core. Demand is skyrocketing for use cases spanning operational decision-making at the edge, ensuring health safety of our astronauts, and advancing scientific discovery. In fact, this edge case ability, with AI/ML, is changing the business models for the evolving space economy. Hear from and engage with our panel of experts about how recent missions have expanded our concept of computing at the edge – addressing both space-based and terrestrial challenges.
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
DescriptionData representation and coupling between scientific libraries is a key challenge to building a vibrant ecosystem of HPC simulation tools. From bespoke data structures to hundreds of file-based data models, the myriad of possible choices involved both enables key features and blocks adoption of others. Connecting data between code bases requires agreeing on or adapting between data representations. While in some cases this process is trivial, for more complicated cases, adapting data becomes a costly barrier. Conduit was designed within this context to help meet the key challenge of sharing data across HPC simulation tools by providing a dynamic API to describe in-memory data. It supports coupling simulations and connecting simulations to analysis and I/O libraries. This paper provides a broad overview of Conduit, background on the evolution of the project, and details on recently added features relevant to in situ use cases.
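The "dynamic API to describe in-memory data" pattern can be illustrated with a toy hierarchical node addressed by slash-separated paths. Conduit's real API differs in many ways; this only shows the coupling pattern of naming data by path so producer and consumer agree on a description rather than a binary layout.

```python
# Toy hierarchical node: set/get values by slash-separated path, the
# style of in-memory data description that Conduit popularized.
class Node:
    def __init__(self):
        self.children = {}
        self.value = None

    def fetch(self, path):
        """Walk (and create) the tree along a slash-separated path."""
        node = self
        for part in path.split("/"):
            node = node.children.setdefault(part, Node())
        return node

    def set(self, path, value):
        self.fetch(path).value = value

    def get(self, path):
        return self.fetch(path).value

mesh = Node()
mesh.set("coordsets/coords/values/x", [0.0, 1.0, 2.0])
mesh.set("fields/density/values", [1.0, 0.5])
```

A simulation fills such a tree once, and any analysis or I/O library that understands the agreed paths can consume it without bespoke adapters.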
Workshop
Recorded
W
DescriptionWith increasing system performance and complexity, it is becoming increasingly crucial to examine the scaling behavior of an application and thus determine performance bottlenecks at early stages. Unfortunately, modeling this trend is a challenging task in the presence of noise, as the measurements can become irreproducible and misleading thus resulting in strong deviations from the real behavior. While noise impacts the application runtime, it has little to no effect on some hardware counters like floating-point operations. However, selecting the appropriate counters for performance modeling demands some investigation. We perform a noise analysis on various hardware counters. Using our noise generator, we add on top of the system noise additional noise to inspect the counters' variability. We perform the analysis on five systems with three applications in the presence of various noise patterns and categorize the counters across the systems according to their noise resilience.
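The screening step described above can be sketched by ranking counters by their coefficient of variation (stddev/mean) across repeated noisy runs; a low value marks a noise-resilient counter. The sample measurements below are made up for illustration.

```python
# Rank hardware counters by variability across repeated runs under noise.
import statistics

def variability(samples):
    """Coefficient of variation: relative spread across runs."""
    return statistics.stdev(samples) / statistics.mean(samples)

runs = {
    "FLOP_COUNT": [1e9, 1e9, 1e9, 1e9],          # unaffected by noise
    "RUNTIME_S":  [1.00, 1.31, 1.12, 1.55],      # noise-sensitive
}
resilient = sorted(runs, key=lambda c: variability(runs[c]))
```

A floating-point operation count is identical run to run, while runtime absorbs the injected noise, so the ranking puts the counter ahead of the timing.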
Workshop
Recorded
W
DescriptionIn high-performance computing (HPC) environments, SingularityCE has been widely used, primarily because it can significantly reduce system administrators’ work in deploying applications. Traditionally, HPC administrators may need thousands of hours to deploy a broad stack of bioinformatics applications. SingularityCE has the potential to transform traditional methods of installing and managing applications. We introduce how our HPC center used SingularityCE to deploy about 600 containerized bioinformatics applications, tested by staff with expertise in bioinformatics, onto 7 production systems. This presentation also explores how, leveraging LMOD, containerization was made transparent to users through environment modules for these container images. Finally, it discusses how we deployed applications with graphical user interfaces to Open OnDemand as interactive applications, and how we modified Python-based containers to support Jupyter Notebook. The sum of these contributions provides a robust and reproducible computing ecosystem for life science researchers.
Birds of a Feather
TP
XO/EX
DescriptionHigh-performance computing systems that have traditionally been deployed at a single site are expected to significantly expand their reach to include a variety of remote edge systems. These edge systems include computing platforms located near instruments as well as the instruments themselves. Examples range from interconnected ecosystems of large science instruments and supercomputers to vehicle networks orchestrated by large-scale AI. These interconnected systems form a continuum wherein computation is distributed in various stages from the edge to the core. This BoF will address the challenges and best practices associated with designing, implementing, and operating such complex computing ecosystems.
Birds of a Feather
TP
XO/EX
DescriptionCloud computing technologies such as elastic scaling, application containerization, and container orchestration are gaining prevalence in HPC due to their benefits of resource dynamism, automation, reproducibility, and resilience. Similarly, HPC technologies for application performance optimization and sophisticated scheduling of complex resources are being integrated into modern cloud infrastructures. This trend is leading to a new domain of Converged Computing, an environment that combines the best capabilities from both worlds. In this highly-interactive BoF, we invite experts from both communities and the audience to discuss their current experiences with converged computing and share their views on its future.
Workshop
Recorded
Correctness
Software Engineering
W
DescriptionOpening remarks for the Correctness '22 Workshop.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionExperimental and observational science pipelines are increasingly turning to supercomputing resources to handle their large-scale data analysis. Many of these pipelines serve experiments that are running 24/7, and must shutdown or find alternatives for their real-time data analysis during outages. Workflows from experimental and observational facilities are usually architected with a specific network and computing facility in mind, and are very difficult to switch between compute resources. What's more, the assumptions built into the architecture of most high-performance computing (HPC) centers makes moving workflows to new locations more complicated. By carefully targeting well-understood cosmology and genomics pipelines, we have researched the capabilities needed to run these workflows at multiple computing sites. In this process, we have identified several of the pain points and key future research topics for automated workflow migration, and have made substantial progress towards a future where fully automated workflows can run across the DOE complex.
Posters
Scientific Visualization & Data Analytics Showcase
Recorded
TP
DescriptionMarine macroalgae in the Gulf of Mexico is an important potential source for biofuel. However, identifying locations with the correct biogeochemical and hydrodynamic conditions for cultivation on a large enough scale to meet the needs of the U.S. private energy sector is impossible from purely observational studies. Large-scale, HPC modeling of earth systems processes enables researchers to study complex physical relationships with high fidelity. Here, we present novel visualization techniques showing the results of a global run of the E3SM's MPAS-Ocean model data with biogeochemistry extensions to improve ongoing research in macroalgae cultivation.
Posters
Research Posters
TP
XO/EX
DescriptionThe k-nearest neighbor search is used in applications such as machine learning, computer vision, database search, and information retrieval. Because the computational cost of exact nearest neighbor search is enormous, approximate nearest neighbor search (ANNS) has attracted much attention; IVFPQ is one such ANNS method. Although we can leverage the high bandwidth and low latency of shared memory to compute the search phase of IVFPQ on NVIDIA GPUs, throughput can degrade due to shared memory bank conflicts. To reduce bank conflicts and improve search throughput, we propose a custom 8-bit floating-point format. This format has no sign bit and can be converted from/to FP32 with a few instructions. We use this format for IVFPQ on GPUs and achieve better performance without significant recall loss compared to FP32 and FP16.
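A sign-free 8-bit float can be sketched as follows. The poster does not specify its exact bit layout, so the split here (4 exponent bits, 4 mantissa bits, bias 7, no subnormals) is an illustrative assumption showing how such a format converts to and from FP32.

```python
# Toy unsigned 8-bit float: one byte laid out as eeeemmmm, value
# (1 + m/16) * 2**(e - 7). No sign bit, no subnormals; positive
# in-range inputs only. Layout is a hypothetical example.
import math

BIAS = 7

def encode(v):
    frac, exp = math.frexp(v)      # v = frac * 2**exp, frac in [0.5, 1)
    m = round((frac * 2 - 1) * 16) # 4-bit mantissa of (1.m) * 2**(exp-1)
    e = exp - 1 + BIAS
    if m == 16:                    # mantissa rounded up: carry into exponent
        m, e = 0, e + 1
    assert 0 <= e <= 15, "value out of range for this toy format"
    return (e << 4) | m            # pack into one byte

def decode(byte):
    e, m = byte >> 4, byte & 0x0F
    return (1 + m / 16) * 2.0 ** (e - BIAS)

roundtrip = decode(encode(3.25))   # 3.25 is exactly representable here
```

Dropping the sign bit buys one extra bit of range or precision when all stored values (such as quantized distances) are known to be non-negative, and a one-byte element width helps avoid shared memory bank conflicts.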
Workshop
Recorded
Reliability and Resiliency
W
DescriptionWe present nascent work that introduces a path forward for leveraging CXL 2.0 and 3.0 subprotocols, notably the CXL.cache and CXL.mem subprotocols, in a novel fashion. We intend to demonstrate CXL Type 1 and Type 2 device functionality in order to make a strong case for more and larger cache levels. This will not only create a novel memory hierarchy for us to accelerate precomputed (speculated) memory paths but will effectively operate as a live checkpoint and snapshot option since we are moving many traditional RAM operations onto redefined non-volatile media, initially CXL-enabled NVMe drives.
Workshop
Recorded
HPC Training and Education
W
DescriptionThe NSF Research Experience for Undergraduates Site: Cyberinfrastructure (CI) Research 4 Social Change at The University of Texas at Austin (UT) Texas Advanced Computing Center (TACC) aims to engage students historically excluded from careers in Science, Technology, Engineering, and Mathematics (STEM) in research projects that are focused on social change. This nine-week summer program places students in a paid full-time research position working directly with researchers at UT. TACC is a leader in advanced computing and provides researchers access to supercomputers, tools, and support services. The REU activities are designed to leverage that knowledge and teach students computational competencies while helping students develop a sense of belonging in computing.
In this presentation, we summarize strategies that boosted student engagement and retention in the program. We also detail our efforts and outcomes for training and placement of students from diverse backgrounds and disciplines.
Workshop
Recorded
HPC Training and Education
W
DescriptionSummer computing camps for middle and high school students are rapidly becoming a staple at high-performance computing (HPC) centers and computer science departments around the country. Developing a curriculum that targets specific computing subfields with unmet needs remains a challenge. Here, we report on developments in the two-week Summer Computing Academy that focus on two such subfields. The first week, ‘Computing for a Better Tomorrow: Data Sciences’, introduced students to real-life applications of big data processing; topics included genomics and bioinformatics, cloud computing, and machine learning. The second week, ‘Camp Secure: Cybersecurity’, focused on principles of cybersecurity; students were taught online safety, cryptography, and internet structure. The two weeks are unified by a common thread of Python programming. Modules from the SCA program may be implemented at other institutions with relative ease and promote cybertraining efforts nationwide.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionUpdate on the status of NIST 800-171 and other HPC security topics.
Birds of a Feather
TP
XO/EX
DescriptionDAOS (https://docs.daos.io/) is an open-source scale-out object store that delivers extremely high performance to the most data-intensive HPC/AI workloads. With growing adoption, DAOS has seen significant community contributions like domain-specific container types, additional hardware support beyond x86_64 (e.g. ARM), and enabling DAOS in the cloud.
This BoF brings together the DAOS community to discuss, share experiences, and brainstorm on future enhancements of DAOS. Topics include practical experiences with on-prem and cloud deployments, application use cases, and the software roadmap. This session targets end users, HPC/AI middleware developers, system administrators, DAOS core software developers, and vendors of DAOS-based hardware/software/cloud offerings.
Workshop
Recorded
W
DescriptionIn situ models represent a relevant alternative to classical post hoc workflows, as they bypass disk accesses and thus reduce the I/O bottleneck. However, as most in situ data analytics tools are based on MPI, they are complicated to use, especially for parallelizing irregular algorithms. Deisa, a task-based in situ analytics tool, couples MPI with Dask, providing a higher-level, easier way to write in situ analytics. In this work, we improve Deisa's design by introducing three main concepts: Deisa virtual arrays, contracts, and external tasks in Dask distributed. These refinements reduce the load on Dask's centralized scheduler and transparently integrate selected simulation data into Dask task graphs, improving Deisa's performance and productivity.
Workshop
Recorded
W
DescriptionScientific exploration is increasingly dependent on the convergence of scientific modeling, data analytics, and machine learning. The result is data-intensive workflows that are composed of multiple stages of computation and communication between distributed and heterogeneous computing resources. Data movement through storage systems is frequently the most significant bottleneck, which is compounded by increasingly large data volumes and rates. To identify opportunities for optimizing data movement, we are developing novel workflow telemetry that highlights data objects’ dynamic flow, reuse, lifetime, and locality. Our objective is to enable modeling and reasoning about task-data locality, especially compared to default placement and data exchange, and the scheduling of anticipatory data movement that selects what data should be staged in memory and when.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionToday’s scientific projects and simulations often require repeated transfer of large data volumes between the storage system and the client. This increases the load on the network, leading to congestion. In order to mitigate these effects, regional data storage cache systems are used to store data locally. This project examines the XCache storage system to closely analyze data trend patterns in the data volume and data throughput performance, while also creating a model for predicting how caches could potentially impact network traffic and data transfer performance overall. The results of the data access patterns demonstrated that traffic volume was reduced by an average factor of 2.35. The hourly and daily prediction models also showed low error values, reinforcing the learning methods used in this effort.
Paper
Recorded
Resource Management and Scheduling
System Software
TP
DescriptionHPC applications are increasingly being designed as dynamic workflows for the ease of development and scaling. This work demonstrates how the serverless computing model can be leveraged for efficient execution of complex, real-world scientific workflows, although serverless computing was not originally designed for executing scientific workflows. This work characterizes, quantifies, and improves the execution of three real-world, complex, dynamic scientific workflows: ExaFEL (workflow for investigating the molecular structures via X-Ray diffraction), Cosmoscout-VR (workflow for large scale virtual reality simulation), and Core Cosmology Library (a cosmology workflow for investigating dark matter). The proposed technique, DayDream, employs the hot start mechanism for warming up the components of the workflows by decoupling the runtime environment from the component function code to mitigate cold start overhead. DayDream optimizes the service time and service cost jointly to reduce the service time by 45% and service cost by 23% over the state-of-the-art HPC workload manager.
Workshop
Recorded
W
DescriptionModern sky surveys conducted by powerful telescopes are some of the largest data generators in science, popular and visible to a broad audience. What goes largely unnoticed, however, is that cosmological simulations often have to produce even larger data sets in order to scientifically interpret these observations: we need many models employing different plausible physics, and we must ensure that the statistical errors of our predictions are smaller than the observational errors. To maintain the desired accuracy, modern simulations track the time evolution of trillions of elements over thousands of timesteps. For such large runs, storing many time steps for later analysis is no longer a viable strategy, and beyond-exascale forecasts point to growth in flops continually outpacing growth in disk space and network bandwidth, making the post-processing strategy increasingly infeasible. In this talk, I will go over the difficulties we face with large data sizes, which present a major technological roadblock. I will then present some of our existing lines of attack on this problem, including different compression methods, surrogate modeling, and our design for running multiple codes in situ: using coroutines and position-independent executables we enable cooperative multitasking between simulation and analysis, allowing the same executables to post-process simulation output as well as to process it on the fly, both in situ and in transit.
Workshop
Recorded
HPC Training and Education
W
DescriptionThe Data-Enabled Advanced Computational Training Program for Cybersecurity Research and Education (DeapSECURE) is a non-degree training program consisting of six modules covering a broad range of cyberinfrastructure techniques, including high performance computing, big data, machine learning, and advanced cryptography, aimed at closing the gap between current cybersecurity curricula and the requirements of advanced research and industrial projects.
Since 2020, these lesson modules have been updated and retooled for fully online delivery. Hands-on activities were reformatted to accommodate self-paced learning. In this presentation, we summarize the four years of the project, comparing in-person and online-only instruction methods and outlining lessons learned. The module content and hands-on materials are being released as open-source educational resources. We also outline our future direction for scaling up and increasing adoption of the DeapSECURE training program to benefit cybersecurity research everywhere.
Posters
Research Posters
TP
XO/EX
DescriptionIt is difficult to implement a CNN for edge processing in satellites, automobiles, and other settings where machine resources and power are limited. FPGAs meet the resource and power constraints associated with CNNs: they have low power consumption, but limited machine resources. Quantized neural networks (QNNs) use fewer bits per parameter than CNNs and achieve better estimation accuracy than binarized neural networks (BNNs).
Although CNNs for regression problems are rarely implemented on FPGAs, our study deployed debris pose estimation on an FPGA using the latest edge techniques such as quantized neural networks. Pose estimation was run on a workstation using 32-bit floating-point precision and on an FPGA using 8-bit integer precision. The average errors were 4.98% and 5.38%, respectively. This demonstrates that the regression problem can be transferred to an FPGA without a significant loss of accuracy. The FPGA's power efficiency is more than 218k times that of the workstation implementation.
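The float32-to-int8 trade-off described above can be illustrated with a toy per-tensor quantization sketch. The scheme (symmetric scaling to the int8 range) is a common, illustrative choice and is not taken from the paper's FPGA toolchain:

```python
# Toy symmetric int8 quantization: map floats to [-127, 127] with one
# per-tensor scale, then dequantize and measure the relative error.
import random

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]   # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(1000)]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
rel_err = (sum(abs(a - b) for a, b in zip(weights, restored))
           / sum(abs(a) for a in weights))
print(f"mean relative error: {rel_err:.3%}")  # small, echoing the
# abstract's finding that 8-bit precision loses little accuracy
```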
Workshop
Recorded
Reliability and Resiliency
W
DescriptionTesting the correctness of either a new MPI implementation or a transparent checkpointing package for MPI is inherently difficult. A bug is often observed when running a correctly written MPI application and produces an error. Tracing the bug to a particular subsystem of the MPI package is difficult due to complex parallelism, race conditions, and similar issues. This work provides tools to decide whether the bug is in the subsystem implementing collective communication, in the subsystem implementing point-to-point communication, or in some other subsystem. The tools were produced in the context of testing a new system, MANA. MANA is not a standalone MPI implementation, but rather a package for transparent checkpointing of MPI applications. In addition, a short survey of other debugging tools for MPI is presented. The strategy of transforming the execution for purposes of diagnosing a bug appears to be distinct from most existing debugging approaches.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionToday’s HPC platforms leverage GPU technologies from NVIDIA and AMD to maximize compute capabilities for advanced scientific and research applications. Developers utilize MPI in combination with CUDA, HIP, OpenMP and other parallel languages and must deal with the complexities of running code both on the CPU and GPU. Debugging these mixed environments can be a real challenge, especially when dealing with thousands of nodes and GPUs at a time.
This interactive session highlights GPU debugging on each of these architectures with the TotalView for HPC debugger. You will learn:
• How CUDA debugging on NVIDIA GPUs compares with HIP debugging on AMD GPUs
• Significant architecture and terminology differences between the GPU environments
• How to easily debug multi-node and multi-GPU code in the same session
• What is the state of debugging OpenMP on GPUs
• How to combine debugging features to efficiently debug tough parallel problems
Learning how the TotalView for HPC debugger helps in each of these areas will enable you to understand your code faster, find bugs in your code quicker, and improve the quality of your code.
Tutorial
Recorded
AI-HPC Convergence
Applications
Cloud and Distributed Computing
Datacenter
Machine Learning and Artificial Intelligence
Resource Management and Scheduling
TUT
DescriptionDeep learning is rapidly and fundamentally transforming the way science and industry use data to solve problems. Deep neural network models have been shown to be powerful tools for extracting insights from data across a large number of domains. As these models grow in complexity to solve increasingly challenging problems with larger and larger datasets, the need for scalable methods and software to train them grows accordingly.
The Deep Learning at Scale tutorial aims to provide attendees with a working knowledge of deep learning on HPC-class systems, including core concepts, scientific applications, performance optimization, tips, and techniques for scaling. We will provide training accounts on some of the world's largest GPU systems, example code, and datasets to allow attendees to experiment hands-on with optimized, scalable distributed training of deep neural network machine learning models.
Student Cluster Competition
TP
XO/EX
DescriptionThe Monash HPC SCC team is part of a larger team of undergraduate students called DeepNeuron (https://www.deepneuron.org/). DeepNeuron is a student team focused on improving the world with Artificial Intelligence and High-Performance Computing.
Computational background
Many students in the team are undertaking Computer Science or Engineering degrees, learning subjects such as data structures and algorithms and programming languages such as C, C++, and Python. On top of that, with the assistance of Monash eResearch (https://www.monash.edu/researchinfrastructure/eresearch, which maintains two local HPC clusters), the Monash Student Cluster Competition (SCC) team goes through practical and theoretical training sessions on HPC topics, which include:
Cloud Technologies (OpenStack on our local Nectar system)
Compilers (gcc, icc)
Container technology (Singularity, Docker)
Working on local High Performance Computer Clusters
Parallel Programming (MPI, OpenMP)
Non-computational science domain
Monash University gives students the opportunity to undertake double degrees which allows the team to have a more diverse background in fields such as Mathematics, Physics, Commerce, Biomedicine, and more.
Broad background
Some members of the team also have experience with hardware. We have built a small model cluster with four Raspberry Pi nodes for training and testing. The team has also constructed a small HPC cluster with the support of Dell and NVIDIA, gaining practical experience in assembling and maintaining a server.
Have any of the members participated before
The team has competed in several well-known competitions around the globe, such as ASC21, IndySCC21, and ISC22. In fact, the Monash DeepNeuron SCC team won first prize and the Application award in ASC (http://www.asc-events.org/ASC20-21/) and came second in IndySCC21. The team for IndySCC22 will consist of experienced members who won ASC21 and IndySCC21, and new members who are joining the competition for the first time. Pascal and Nick participated in IndySCC21 and are currently competing in ISC22. They are both eager to reinforce the skills they learned last year and achieve a better result. The experienced members will help new members learn the new technologies they encounter and ensure they make a meaningful contribution to IndySCC22.
How does SCC help academic career
SCC is one of the best venues to challenge students and inspire them to become future High Performance Computing researchers. The opportunity to attend a major conference and meet peers and experts from around the world broadens students' views of their future academic careers. With the growing fields of artificial intelligence and data science, the experience of working with scientific programs, optimising them, and speeding them up with parallel computing techniques on HPC will serve us well in our future endeavours.
Introduce the advisor and advisor's background
The team advisor is Mr Simon Michnowicz. Simon was the team advisor for the Monash SCC 2018 team, as well as the ASC21 and IndySCC21 teams. Simon works at Monash eResearch, where he manages two HPC clusters. As an accredited Software Carpentry teacher, Simon is passionate about educating the next generation of HPC professionals.
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionThe landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware requirements, etc. With such diversity, designing a versatile inference system is challenging. Our system addresses these challenges with (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU/NVMe/GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. It reduces latency by 6.4× and increases throughput by 4× over the state of the art while achieving 260 TFLOPS/GPU (over 80% of A100 peak). It enables inference at trillion-parameter scale under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference, and can serve models 25× larger than GPU-only solutions while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
Paper
Recorded
Applications
Numerical Algorithms
Security
TP
DescriptionMultilinear algebra kernel performance on modern massively-parallel systems is determined mainly by data movement. However, deriving data movement-optimal distributed schedules for programs with many high-dimensional inputs is a notoriously hard problem.
State-of-the-art libraries rely on heuristics and often fall back to suboptimal tensor folding and BLAS calls. We present Deinsum, an automated framework for distributed multi-linear algebra computations expressed in Einstein notation, based on rigorous mathematical tools to address this problem. Our framework automatically derives data movement-optimal tiling and generates corresponding distributed schedules, further optimizing the performance of local computations by increasing their arithmetic intensity.
To show the benefits of our approach, we test it on two important tensor kernel classes: Matricized Tensor Times Khatri-Rao Products and Tensor Times Matrix chains. We show performance results and scaling on the Piz Daint supercomputer, with up to 19x speedup over state-of-the-art solutions on 512 nodes.
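The MTTKRP kernel named above can be written directly in Einstein notation, which is the input form a framework like Deinsum consumes. The sketch below is single-node NumPy for illustration only; deriving the data-movement-optimal distributed schedule is the framework's contribution, not shown here:

```python
# MTTKRP in Einstein notation: M[i,r] = sum_{j,k} T[i,j,k]*B[j,r]*C[k,r]
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
T = rng.standard_normal((I, J, K))   # 3-way input tensor
B = rng.standard_normal((J, R))      # factor matrix
C = rng.standard_normal((K, R))      # factor matrix

# Direct einsum formulation of the kernel.
M = np.einsum('ijk,jr,kr->ir', T, B, C)

# Reference: fold the tensor and call BLAS via a Khatri-Rao product --
# the fallback strategy the abstract attributes to heuristic libraries.
khatri_rao = np.einsum('jr,kr->jkr', B, C).reshape(J * K, R)
M_ref = T.reshape(I, J * K) @ khatri_rao
assert np.allclose(M, M_ref)
print(M.shape)  # (4, 3)
```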
Workshop
Recorded
W
DescriptionHPC facilities have deployed flash-based storage tiers near compute nodes to absorb the high I/O demand of HPC applications during periodic system-level checkpoints. To accelerate these checkpoints, proxy-based distributed key-value stores (PD-KVS) have gained particular attention for their flexibility in supporting multiple backends and network configurations. PD-KVS rely internally on monolithic KVS, such as RocksDB, to exploit the KV interface and query support. However, PD-KVS are unaware of the high redundancy in checkpoint data, which can range from gigabytes to terabytes, and therefore tend to generate high write and space amplification on these storage layers. We propose DENKV, a deduplication-extended node-local LSM-tree-based KVS. DENKV employs asynchronous partially inline deduplication (APID) and aims to maintain the performance characteristics of LSM-tree-based KVS while reducing the write and space amplification problems. We implemented DENKV atop BlobDB and showed that our solution maintains performance while reducing write and space amplification.
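The core deduplication idea can be sketched with a content-addressed store: checkpoint data is chunked, each chunk is keyed by its hash, and a repeated chunk is stored only once, which is what cuts write and space amplification. This toy class is illustrative; DENKV implements the idea asynchronously inside an LSM-tree (BlobDB), which is not modeled here:

```python
# Toy content-addressed dedup store for checkpoint-like data.
import hashlib

class DedupStore:
    def __init__(self):
        self.chunks = {}   # content hash -> chunk bytes (stored once)
        self.index = {}    # user key -> ordered list of chunk hashes

    def put(self, key, data, chunk_size=4):
        hashes = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)   # skip write if duplicate
            hashes.append(h)
        self.index[key] = hashes

    def get(self, key):
        return b"".join(self.chunks[h] for h in self.index[key])

store = DedupStore()
store.put("ckpt-1", b"AAAABBBBAAAA")   # 'AAAA' repeats within the value
store.put("ckpt-2", b"AAAACCCC")       # shares 'AAAA' across checkpoints
assert store.get("ckpt-1") == b"AAAABBBBAAAA"
print(len(store.chunks))  # 3 unique chunks stored for 5 logical chunks
```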
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
DescriptionThe partitioned global address space (PGAS) model with one-sided communication has recently received attention as an easy and intuitive method for describing remote data access between nodes. PGAS can be implemented using remote direct memory access, which provides lightweight one-sided communication and low-overhead synchronization semantics. In this paper, to enable portable, lightweight, and efficient one-sided communication on the Fugaku supercomputer, we designed and implemented Universal Communication X (UCX) support for the Tofu Interconnect D. An evaluation using OpenSHMEM-UCX and OSHMPI indicates that OpenSHMEM with UCX on the Tofu Interconnect D achieves lower latency and better efficiency than OpenSHMEM with MPI, and that it benefits several applications based on PGAS models.
Workshop
Recorded
Reliability and Resiliency
W
DescriptionDynamic resource management opens up numerous opportunities in High Performance Computing, improving system-level services as well as application performance. Checkpointing can be deemed a system-level service and can reap the benefits offered by dynamism: a checkpointing system can achieve better resource availability by integrating with a malleable resource management system, and, in addition to fault tolerance, it can serve the data redistribution demands of malleable applications during resource changes. We therefore propose iCheck, an adaptive application-level checkpoint management system that efficiently utilizes system- and application-level dynamism to provide better checkpointing and data redistribution services to applications.
Invited Talk
Recorded
TP
XO/EX
DescriptionThis talk will review the use of HPC in numerical weather prediction, climate monitoring and projection. Starting from the early foundations of modern numerical weather prediction, this talk will describe advances in the field, culminating in the ongoing efforts to create digital replicas of the Earth system such as the European Commission's Destination Earth initiative or NOAA's digital twin for Earth observations. Digital Twins of Earth encapsulate both the latest science and technology advances to provide near-real time information on extremes and climate change adaptation in a wider digital environment, where users can interact, modify, and ultimately create their own tailored information.
Recent work has demonstrated that global, coupled storm-resolving (or km-scale) simulations are no longer a dream: thanks to recent advances in Earth system modeling, supercomputing, and the ongoing adaptation of weather and climate codes for accelerators, they are feasible and can contribute to building such information systems. Such simulations start to explicitly represent essential climate processes, e.g. detailed inland water and land-use representation, deep convection, and mesoscale ocean eddies, which today need to be fully parameterized even at the highest resolution used in global weather and climate information production. These simulation outputs, combined with novel, data-driven deep learning advances, thus offer a window into the future, with a promise to significantly increase the realism and timeliness of delivery of Earth system information to a broad range of users. Despite the significant compute and data challenges, including both memory-intensive and extreme-scaling workflows, there is a real prospect to better support global-to-local warning systems and complement existing climate change mitigation and adaptation efforts.
Workshop
Recorded
W
DescriptionHEDP experiments commonly involve a dynamic wave-front propagating inside a low-density foam. To classify a foam's quality, accurate information is required. For each foam, five images are taken: two 2D images representing the top and bottom surface foam planes, and three images of side cross-sections from 3D scans. An expert has to do the complicated, harsh, and exhausting work of manually classifying the foam's quality from the image set, and only then determine whether the foam can be used in experiments. In this work, we present a novel, state-of-the-art multi-view deep-learning classification model that determines foam quality and thus aids the expert. Our model achieved 86% accuracy on the upper and lower surface foam planes and 82% on the entire set, suggesting interesting heuristics for the problem. A significant added value of this work is the ability to regress the foam quality and even explain the decision visually.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionCoronary artery disease (CAD) is a highly prevalent type of heart disease in the US, causing more than 360,000 deaths in 2017 alone. In 20% of cases, these lesions occur at arterial bifurcations or branch points in the arterial tree. Determining how best to treat these lesions remains a particular challenge, as they may involve the main branch, the side branch, or both vessels, and bifurcation stenting is associated with a higher risk of adverse cardiac events. However, it is unclear whether restoring blood flow in the main branch alleviates the disturbed hemodynamics in the side branch as well. This question becomes even more challenging due to the complex anatomic features of bifurcation lesions, such as the curvature, percentage stenosis, and length. To ascertain the influence of anatomic changes in bifurcation lesion geometries and how different stenting options affect the resulting flow, we produced a synthetic database of 360 different bifurcation lesion morphologies. As each individual simulation is computationally costly, we developed methods to optimize parallel simulations of coronary interventions on NVIDIA-based GPU instances in the Microsoft Azure cloud computing platform. By establishing a cloud-based, high-throughput framework for computing coronary flow, we were able to efficiently calculate flow for the full database and quantify the influence of lesion- and treatment-specific parameters on the resulting flow. Here we will discuss the technical hurdles overcome to complete this large-scale fluid dynamics analysis using a cloud-based platform.
Workshop
Recorded
HPC Training and Education
W
DescriptionThe use of computing technologies is expanding exponentially in every sector of our lives, creating a need for access to high-quality education and training materials to conduct research computing. Instructional materials for both teaching and learning are needed on a broad range of topics related to both developing and applying research computing technologies in all disciplines. The critical need for quality materials applies to formal classroom learning as well as to informal and self-paced learning.
This presentation will build upon the strong interest generated during the Workshop at PEARC22, to engage a larger community of participants. A working group has been formed within the ACM SIGHPC Education Chapter to continue to adopt metadata standards that will build on our shared foundations. The working group is pursuing a collaboration with the NSF funded ACCESS projects. The presenters hope to use this presentation to encourage participation by organizations and consortia internationally.
Workshop
Recorded
W
DescriptionSubgraph isomorphism is a fundamental problem in graph analytics and has been applied in many domains. It is well known that subgraph isomorphism is an NP-complete problem, and much effort has been devoted to it over the past two decades. However, GPU-based subgraph isomorphism systems are relatively rare, since GPU memory is not big enough to hold all the instances generated during the matching process. Most current GPU subgraph isomorphism frameworks suffer from limited GPU main memory and redundant computation. These issues restrict them to smaller patterns and graphs and limit their performance. To overcome these issues, we designed a new GPU-based subgraph isomorphism system named DGSM. We validate our techniques by comparing against two state-of-the-art systems, the CPU-based DAF and the GPU-based GSI. Our experimental results show that our system is two orders of magnitude faster than DAF and GSI on both labeled and unlabeled graphs.
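To make the instance explosion concrete, here is a naive backtracking subgraph matcher: every partial mapping is an "instance", and the recursion tree is exactly what overwhelms GPU memory in systems like DGSM. This sketch is the textbook baseline, not DGSM's algorithm:

```python
# Naive backtracking subgraph isomorphism (edge-preserving, injective).
def subgraph_matches(pattern, data):
    """Count injective mappings of pattern vertices onto data vertices
    that preserve every pattern edge. Graphs are adjacency dicts."""
    p_nodes = list(pattern)

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            return 1                      # one complete embedding found
        u = p_nodes[len(mapping)]         # next pattern vertex to place
        count = 0
        for v in data:
            if v in mapping.values():
                continue                  # injectivity
            # every already-mapped pattern neighbor must stay adjacent
            if all(mapping[w] in data[v] for w in pattern[u] if w in mapping):
                mapping[u] = v
                count += extend(mapping)  # each branch is an 'instance'
                del mapping[u]
        return count

    return extend({})

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(subgraph_matches(triangle, triangle))  # 6 (all vertex orderings)
print(subgraph_matches(triangle, square))    # 0 (no triangle in a 4-cycle)
```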
Student Cluster Competition
TP
XO/EX
DescriptionOur team is composed of students across various fields and interests. In total, our team consists of students from Computer Science, Computational Engineering, Physics, Mathematics, and Electrical and Computer Engineering. All team members have experience with computing, but it is only one member’s primary focus of study. Our differing experiences and backgrounds ensure our team has a diverse skill set that is ready to meet the challenges of SCC.
Mathew Abraham is a computational engineering major who has experience with C++ and Python. He has gained exposure to machine learning libraries and training models on GPUs, and is interested in applying his classroom knowledge to this competition.
Surendra Anne is a physics major with experience in Python. Having had limited exposure to other aspects of computing, participating in SC22 is a great opportunity for him to learn more about resource management in computing. Knowledge of HPC can translate to his interest in astrophysics research, as sifting through copious amounts of data is an important part of imaging in astrophysics and particle experiments.
Jorge Bolivar is a computational engineering major with a solid understanding of Python and C++. This competition will lead him to explore new branches of computing, and he hopes HPC will create an environment where he can challenge himself and learn more about his interests in real world applications.
Saiprathik Chalamkuri is a computer science student who has delved into user-level concepts such as data structures and algorithms, and has experience working with lower-level concepts such as computer architecture and operating systems. He looks forward to learning in more depth about how hardware relates to reaping computing power, and hopes this competition will increase his skills in maximizing hardware and software efficiency.
Jenna May is a computational engineering major in the process of transferring into electrical and computer engineering with plans to concentrate in computer architecture. She has experience with C++, Python, and assembly language along with exposure to CPU architecture. She hopes to learn more on how to best optimize the working of both software and hardware, which will be a great benefit to her career.
Benjamin Nederveld is currently a third year mathematics major pursuing a minor in Chinese. He hopes to combine the practical aspects of competition with some of the more theoretical aspects of his education. Additionally, he hopes to learn more about collaboration on large-scale technical projects.
It is all members’ first time participating in SCC, and we all hope to gain valuable technical skills and experience directly transferable to our future careers. Furthermore, the technical and problem-solving skills we gain from participating will help us academically.
Our advisors work full time at the Texas Advanced Computing Center. Joe Garcia and Nick Thorne are system administrators who build, maintain, and ensure high availability of various TACC clusters. Matthew Cawood is part of the Performance and Architectures team within the High Performance Computing group, where he conducts performance benchmarking and analysis along with software development.
Workshop
Recorded
W
DescriptionIn this paper, we propose a direct GPU compilation scheme that leverages the portable target offloading interface provided by LLVM/OpenMP. Utilizing this infrastructure allows us to compile an existing host application for the GPU and execute it there with only a minimal wrapper layer for the user code, command line arguments, and a compiler provided GPU implementation of C/C++ standard library functions. The C/C++ library functions are partially implemented for direct device execution and otherwise fallback to remote procedure call (RPC) to call host functions transparently. Our proposed prototype will allow users to quickly compile for, and test on, the GPU without explicitly handling kernel launches, data mapping, or host-device synchronization. We evaluate our implementation using three proxy applications with host OpenMP parallelism and three microbenchmarks to test the correctness of our prototype GPU compilation.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF will be a forum to discuss most recent topics of research around disaggregated heterogeneous architectures, their operation and use. “Disaggregated” aka “modular supercomputing” refers to a system-level architecture in which heterogeneous resources are organized in partitions or modules, each one with a different type of node-configuration. This approach is gaining traction in the HPC landscape, with Perlmutter, Lumi, JUWELS and MeluXina representing just some examples. This BoF discusses the challenges seen by operators, vendors, developers of system software, programming models and tools, as well as application developers when adapting their codes to make use of such machines.
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
Security
W
DescriptionHPC datacenters are large and shared across many users for efficiency. The last generation of HPC users was made up of highly technical and conscientious experts. Increasingly, HPC infrastructure is being democratized to span a diverse set of users with varying concerns for, and proficiency in, security. There is an inevitable threat from malicious attacks or inadvertent interference from bad actors, supply chain threats, noisy neighbors, or nosy admins. To thwart this, we recommend a defensible architecture with continuous improvements in applying the principles of zero trust in a range of settings, from the edge to the cloud to federations of multiple clusters.
This talk will present the principles that NVIDIA follows to design in security by default. This involves attribute-based access control, modular design, and clear ownership of responsibility. The security friction is reduced enough for novice users to get an easy onramp, while customers retain full control over policies, with both visibility and ownership over security issues. We employ integrated datacenter-wide hardware and software solutions to create isolation of the network, storage, compute, and tenants from administrators.
We introduce new hardware in DPUs, switches, and features like confidential computing, as well as software solutions that involve scheduling, monitoring, AI-driven analysis, management, and security services. We’ll connect participants to the application of these principles by providing a wide range of real-world examples. Join us for a fun talk that’s sure to spark stimulating discussion on what we can do together to build upon a zero trust foundation and move the needle against the adversary!
Posters
Research Posters
TP
XO/EX
DescriptionMissing climatological data is a general problem in climate research that leads to uncertainty in prediction models that rely on these data resources. So far, existing approaches for infilling missing precipitation data are mostly numerical or statistical techniques that require time-consuming computations and are not suitable for large regions with missing data. Recent machine learning techniques have proven to perform well on infilling missing temperature or satellite data. However, these techniques consider only spatial variability in the data, whereas precipitation data is much more variable in both space and time. We propose a convolutional inpainting network that additionally considers temporal variability and atmospheric parameters in the data. The model was trained and evaluated on the RADOLAN data set over Germany. Since training on this high-resolution data set requires a large amount of computational resources, we apply distributed training on an HPC system to maximize performance.
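As an illustrative baseline only (the work above uses a convolutional inpainting network; this sketch merely shows the simplest spatial-infilling idea such networks improve upon), missing grid cells can be filled iteratively with the mean of their valid neighbors. All names here are hypothetical.

```python
def infill(grid):
    """grid: 2D list of floats, with None marking missing values.
    Iteratively replaces each missing cell with the mean of its
    valid 4-neighbors until no missing cells remain."""
    rows, cols = len(grid), len(grid[0])
    grid = [row[:] for row in grid]          # do not mutate the input
    while any(v is None for row in grid for v in row):
        updates = {}
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] is None:
                    nbrs = [grid[nr][nc]
                            for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                            if 0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] is not None]
                    if nbrs:
                        updates[(r, c)] = sum(nbrs) / len(nbrs)
        if not updates:                      # nothing fillable this pass
            break
        for (r, c), v in updates.items():
            grid[r][c] = v
    return grid
```

A purely spatial scheme like this ignores exactly what the proposed network adds: temporal variability and atmospheric covariates.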
Panel
Recorded
Heterogeneous Systems
Machine Learning and Artificial Intelligence
Workflows
TP
XO/EX
DescriptionHigh-performance computing (HPC) systems and machine learning (ML) have many common design goals. Despite this commonality, large-scale systems are increasingly heterogeneous, with SmartNICs, DPUs, CPUs, GPUs, and FPGAs all intermingled in a system organization that can exploit that heterogeneity across the entire system. The interconnection network ties together these heterogeneous processing elements to provide a consistent system-wide programming model to ply those heterogeneous resources.
Every large-scale workload requires both computation and communication as two sides of the same coin – computed results must be communicated and consumed by other cooperating processing elements. This panel discussion seeks to explore whether domain-specific accelerators (GPUs, TPUs, TSPs, etc.) require a similar domain-specific network to extract performance from the accelerator at the system level. This raises the question: “Are we converging (toward converged HPC/ML) or diverging for these performance-critical workloads?”
Workshop
Recorded
W
DescriptionThe complex software and hardware I/O stack of HPC platforms makes it challenging for end-users to extract performance and understand the root causes of I/O bottlenecks they encounter. Despite continuous efforts from the community to profile I/O performance and propose new optimization techniques and tuning options to improve performance, there is still a translation gap between profiling and tuning. In this paper, we propose Drishti, a solution to guide end-users in optimizing I/O in their applications by detecting typical I/O performance pitfalls and providing recommendations. We illustrate its applicability in two case studies and evaluate its robustness and performance by summarizing the issues detected in over a hundred thousand Darshan logs collected by the National Energy Research Scientific Computing Center on the Cori supercomputer. Drishti can empower end-users and guide them in the I/O optimization journey by shedding some light on everyday I/O performance pitfalls and how to fix them.
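The rule-based triage that a tool like Drishti performs can be sketched as mapping aggregated I/O counters to textual recommendations. The counter names and thresholds below are purely illustrative assumptions, not Drishti's actual rules or Darshan's actual counter names.

```python
def diagnose(counters):
    """Map a dict of aggregated I/O counters (Darshan-like summary,
    hypothetical keys) to a list of tuning recommendations."""
    findings = []
    total = counters.get("total_ops", 0)
    # Rule 1: many tiny requests dominate the workload.
    if total and counters.get("small_ops", 0) / total > 0.5:
        findings.append("High fraction of small requests: consider "
                        "aggregating I/O (e.g., collective buffering).")
    # Rule 2: shared-file access without collective operations.
    if counters.get("shared_file_ranks", 0) > 1 and not counters.get("uses_collectives", False):
        findings.append("Shared-file access without collectives: "
                        "consider MPI-IO collective operations.")
    # Rule 3: access pattern is mostly random.
    if counters.get("seq_fraction", 1.0) < 0.2:
        findings.append("Mostly random access: consider reordering "
                        "or buffering writes.")
    return findings
```

Each rule closes part of the "translation gap" the paper describes: a raw profile number becomes an actionable suggestion.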
Workshop
Recorded
W
DescriptionWith the massive upsurge in data, systems that combine deduplication with distributed storage continually suffer from a low deduplication ratio while providing the required throughput. This is because distributed storage requires sharding data across different nodes, while global deduplication needs to eliminate redundancies in a unified view. In this paper, we present D-Shard, a clustering-based sharding method for distributed deduplication storage systems that achieves deduplication efficiency comparable to a single system while supporting high throughput. First, we use a dynamic k-means approach to cluster super-blocks, extracting each cluster's center feature as the anchor point for sharding; second, we construct a secondary deduplication index based on the Compact Hamming Index. Preliminary results show that super-block clustering converges, and that the anchor-point routing strategy achieves a higher deduplication ratio than the state-of-the-art approach while greatly improving system throughput.
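The anchor-point routing idea can be sketched as follows (features, sizes, and names are hypothetical stand-ins for D-Shard's super-block features): cluster super-block feature vectors with k-means, keep each cluster center as a shard anchor, and route each incoming super-block to the shard with the nearest anchor, so similar (likely duplicate-rich) data lands on the same node.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers (the anchors)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist2(p, centers[j]))
            groups[i].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def route(feature, anchors):
    """Route a super-block to the shard whose anchor is nearest."""
    return min(range(len(anchors)), key=lambda j: dist2(feature, anchors[j]))
```

Because routing depends only on the feature vector, identical super-blocks deterministically land on the same shard, which is what lets per-shard deduplication approximate global deduplication.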
Paper
Recorded
Data Management
Storage
TP
DescriptionError-bounded lossy compression has been considered a promising solution to the big-data issue for scientific applications. However, existing lossy compressors are all built around fixed designs that cannot adapt to the diverse quality metrics favored by different users. In this paper, we propose QoZ, a dynamic, quality-metric-oriented, error-bounded lossy compressor. Our key contributions include: (1) We propose a highly parameterized, multi-level, interpolation-based data predictor that significantly improves compression quality at the same compressed size. (2) We design the lossy compression framework QoZ around the proposed predictor, which can auto-tune parameters and optimize compression based on user-specified quality metrics. (3) We evaluate QoZ carefully against multiple state-of-the-art compressors on real-world scientific application datasets. Experiments show that, compared with the second best, QoZ achieves up to 70% compression ratio improvement under the same error bound, or 150% (270%) compression ratio improvement under the same PSNR (SSIM).
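The error-bound mechanism underlying this family of compressors can be sketched in one dimension (QoZ's actual predictor is a multi-level interpolation with auto-tuned parameters; this previous-value predictor only illustrates how the bound is guaranteed): predict each value, quantize the residual in steps of twice the error bound, and have the encoder track exactly what the decoder will reconstruct.

```python
def compress(data, eb):
    """Return quantized residual codes; guarantees |x - x'| <= eb."""
    codes, recon = [], []
    for i, x in enumerate(data):
        pred = recon[i - 1] if i else 0.0     # trivial 1D predictor
        q = round((x - pred) / (2 * eb))      # quantize the residual
        codes.append(q)
        recon.append(pred + q * 2 * eb)       # value the decoder will see
    return codes

def decompress(codes, eb):
    """Mirror of compress(): rebuild values from residual codes."""
    recon = []
    for i, q in enumerate(codes):
        pred = recon[i - 1] if i else 0.0
        recon.append(pred + q * 2 * eb)
    return recon
```

Because the residual is rounded to the nearest multiple of 2·eb, the reconstruction error never exceeds eb; the real compressor then entropy-codes the (mostly small) integer codes.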
Workshop
Recorded
Correctness
Software Engineering
W
DescriptionDataRaceBench is a dedicated benchmark suite for evaluating tools aimed at finding data races in OpenMP programs. Using microbenchmarks with or without data races, DRB is able to generate standard quality metrics and provide systematic, quantitative assessments of data race detection tools. However, as the number of microbenchmarks grows, it is challenging to manually identify similar code patterns within DRB, whether for identifying duplicated kernels or guiding the addition of new ones. In this paper, we experiment with a transformer-based, deep learning approach to similarity analysis. A state-of-the-art transformer model, CodeBERT, has been adapted to find similar OpenMP code regions. We explore the challenges, and our solutions, in applying transformer-based similarity analysis to source code unseen by pre-trained transformers. Using comparative experiments across different variants of similarity analysis, we comment on the strengths and limitations of the transformer-based approach and point out future research directions.
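The similarity pipeline reduces to embedding each code region as a vector and comparing vectors by cosine similarity. In the real pipeline the embedding comes from CodeBERT (e.g., pooled hidden states); in this runnable sketch a bag-of-tokens vector stands in for the model so only the comparison step is shown.

```python
import math
import re
from collections import Counter

def embed(code):
    """Stand-in for a CodeBERT embedding: bag of word-like tokens."""
    return Counter(re.findall(r"\w+", code))

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity(code1, code2):
    return cosine(embed(code1), embed(code2))
```

Swapping `embed` for a real transformer encoder is exactly the adaptation the paper studies; the bag-of-tokens baseline is also a useful sanity check, since a learned model should beat it on renamed-variable clones.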
Early Career Program
Inclusivity
TP
DescriptionThis session is designed for potential SC volunteers who want to learn what committee roles are available and why becoming a part of the SC family is beneficial and rewarding. It is an informal opportunity to meet several committee leads and ask questions.
Early Career Program
Inclusivity
TP
DescriptionMentorship is a dynamic, career-long phenomenon spanning many different relationships that support our personal and professional development. A wealth of scholarship on mentorship practices has emerged across many disciplines studying how mentorship happens in the workplace, its benefits, and what companies can do to foster those relationships. Of note, numerous studies have linked mentorship with diversity and inclusion; mentorship can support the growth and retention of workers from underrepresented and marginalized groups by “bringing them into the fold” and empowering them. As a software engineering researcher, Reed has actively been investigating the instrumental role that mentorship can play in the careers of women and LGBTQIA+ individuals in tech. In this talk, he will make the case for how we can leverage these insights to build stronger mentor-mentee relationships and to foster more inclusive and equitable communities.
Early Career Program
Inclusivity
TP
DescriptionFinding the right career path early may be one of the most rewarding discoveries in a young professional's life. This panel discussion will feature insightful stories and kernels of wisdom from four panelists whose diverse careers span start-ups and large companies, non-profit organizations and universities, and government labs and government agencies. They offer their practical wisdom to paint a broader picture of the different workplaces in the HPC community, helping young individuals better match their strengths and objectives to the challenges and rewards of those workplaces.
Early Career Program
Inclusivity
TP
DescriptionA series of talks on ways to manage Work/Life balance.
Early Career Program
Inclusivity
TP
DescriptionThis is a succession of conversations between a mentor and a small group of mentees. Mentees rotate among the tables, giving each an opportunity to talk with several mentors for a specified amount of time, ask questions, and try to establish a connection.
Early Career Program
Inclusivity
TP
DescriptionThe job interview skills session is designed to equip participants with the skills and tools needed to succeed in interviews. Participants learn how to present themselves effectively and develop strategies to help secure their ideal role. The first half of the session covers topics such as interview dos and don'ts, standing out after the interview, online interviews, and tips for answering common interview questions. The second half is focused on mock interviews, resume coaching, and feedback.
Early Career Program
Inclusivity
TP
DescriptionAn interactive simulation aimed at helping participants communicate across diverse backgrounds. As the workplace becomes increasingly multicultural, this exercise helps improve effective oral communication in the workplace and enhances the productivity of people from different linguistic backgrounds. Redundancìa is designed to increase empathy by letting participants experience how communicating in a second language impacts our thought patterns, perceptions, and connections with others.
Early Career Program
Inclusivity
TP
DescriptionTo start the Early Career Program this year, the Speed Networking activity will allow time for participants to interact and introduce themselves. The activity will include a short hands-on activity followed by an opportunity to mingle with the larger group.
Workshop
Recorded
Performance Portability
W
DescriptionThe OpenMP language continues to evolve with every new specification release, as does the need to validate and verify the new features that have been introduced. With the release of OpenMP 5.0 and OpenMP 5.1, plenty of new target offload and host-based features have been introduced to the programming model. While OpenMP continues to grow in maturity, there is an observable growth in the number of compiler and hardware vendors that support OpenMP. We focus on evaluating the conformity and implementation progress of various compiler vendors such as Cray, IBM, GNU, Clang/LLVM, NVIDIA, and Intel. We specifically address the 4.5, 5.0, and 5.1 versions of the specification.
Tutorial
Recorded
Accelerator-based Architectures
Big Data
Cloud and Distributed Computing
Datacenter
Exascale Computing
Heterogeneous Systems
Parallel Programming Languages and Models
Performance
Productivity Tools
Resource Management and Scheduling
TUT
DescriptionOver the past years, GPUs have become ubiquitous in HPC installations around the world. Today, they provide the majority of the performance of some of the largest supercomputers (e.g., Summit, Sierra, JUWELS Booster). This trend continues in the recently deployed and upcoming pre-exascale and exascale systems (LUMI, Leonardo; Frontier, Perlmutter): GPUs are chosen as the core computing devices to enter this next era of HPC.
To take advantage of future GPU-accelerated systems with tens of thousands of devices, application developers need the proper skills and tools to understand, manage, and optimize distributed GPU applications. In this tutorial, participants will learn techniques to efficiently program large-scale multi-GPU systems. Programming multiple GPUs with MPI is explained in detail, and advanced tuning techniques as well as complementary programming models such as NCCL and NVSHMEM are also presented. Analysis tools are shown and used to motivate and implement performance optimizations. The tutorial teaches fundamental concepts that apply to GPU-accelerated systems in general, taking the NVIDIA platform as an example. It is a combination of lectures and hands-on exercises, using Europe’s fastest supercomputer, JUWELS Booster, for interactive learning and discovery.
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
Best Paper Finalist
DescriptionThe exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements for data layout and lack of support for efficiently manipulating the low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with a comparable accuracy for end-to-end sparse Transformer inference.
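The core SpMM operation Magicube accelerates can be illustrated in NumPy (this shows only the CSR data layout and int8-input/int32-accumulate arithmetic; it deliberately omits the Tensor-core tiling and data-layout tricks that are the paper's actual contribution, and the function name is ours).

```python
import numpy as np

def spmm_csr_int8(indptr, indices, values, dense):
    """Sparse (m x k, CSR, int8 nonzeros) times dense (k x n, int8),
    accumulating in int32 as low-precision Tensor-core MMA does."""
    m, n = len(indptr) - 1, dense.shape[1]
    out = np.zeros((m, n), dtype=np.int32)
    for row in range(m):
        # walk this row's nonzeros and accumulate scaled dense rows
        for p in range(indptr[row], indptr[row + 1]):
            out[row] += values[p].astype(np.int32) * dense[indices[p]].astype(np.int32)
    return out
```

The widening to int32 before multiply-accumulate mirrors why efficient low-precision sparse kernels are hard: the hardware wants packed int8 operands and int32 accumulators in very specific layouts.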
Posters
Research Posters
TP
XO/EX
DescriptionThis poster presents GPU optimizations for sparse deep neural networks (SpDNNs) using Apache TVM. Among the many deep neural network models available, SpDNNs have shown great improvements in network size and memory footprint, but they pose unique scalability challenges that leave room for optimization and advancement. Apache TVM is a machine learning compiler framework for CPUs and GPUs that has shown promising improvements in network performance, deployment, and optimization. To evaluate its effectiveness for SpDNNs, this work builds SpDNNs with Apache TVM and compares them with current implementations. When tested with various datasets, the TVM-based implementation can achieve faster and more efficient optimizations.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionAchieving full automation of program optimization is still an open problem for compiler writers. This work explores machine learning as a potential solution to learn data locality optimizations for tensor applications. Training models with supervised-learning for loop-nest optimization often requires prohibitively expensive training data generation for learning the combined effects of a transformation sequence. As a solution, this work proposes a novel learning strategy called Composed Singular Prediction (CSP) that significantly reduces the training data generation cost in the context of learned loop transformation models. The learned models are then deployed to predict data locality optimization schedules for Conv2d kernels to achieve performance improvements up to 4x against Intel oneDNN while saving over 100x in training data collection time over exhaustive search.
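The CSP idea, as described above, is to train cheap models for single transformations and compose their predicted effects to estimate a whole transformation sequence, instead of collecting training data for every sequence. A minimal sketch of that composition (the multiplicative rule and all names here are our illustrative assumptions, not the paper's exact formulation):

```python
def predict_sequence_speedup(transforms, singular_model):
    """Estimate the speedup of applying `transforms` in order.
    singular_model(t, context) -> predicted speedup of transform t
    alone, optionally conditioned on transforms applied so far."""
    speedup, context = 1.0, {}
    for t in transforms:
        s = singular_model(t, context)
        speedup *= s          # compose singular predictions
        context[t] = s        # later predictions may condition on this
    return speedup
```

The training-cost saving follows directly: with T transforms you label O(T) singular examples rather than O(T!) sequences, at the price of assuming the composed effects are (approximately) separable.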
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionDeep learning Recommendation Model (DLRM) plays an important role in various application domains. However, existing DLRM training systems require a large number of GPUs due to the memory-intensive embedding tables. To this end, we propose EL-Rec, an efficient computing framework harnessing the Tensor-train (TT) technique to democratize the training of large-scale DLRMs with limited GPU resources. Specifically, EL-Rec optimizes TT decomposition based on key computation primitives of embedding tables and implements a high-performance compressed embedding table which is a drop-in replacement of Pytorch API. EL-Rec introduces an index reordering technique to harvest the performance gains from both local and global information of training inputs. EL-Rec also highlights a pipeline training paradigm to eliminate the communication overhead between the host memory and the training worker. Comprehensive experiments demonstrate that EL-Rec can handle the largest publicly available DLRM dataset with a single GPU and achieves 3× speedup over the state-of-the-art DLRM frameworks.
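A Tensor-train compressed embedding lookup of the kind EL-Rec relies on can be sketched in NumPy (shapes, rank, and names are illustrative): a (V1·V2) × (D1·D2) table is stored as two small TT cores, and a row lookup contracts one slice from each core instead of reading a stored row.

```python
import numpy as np

def tt_lookup(g1, g2, idx):
    """g1: (V1, D1, R), g2: (R, V2, D2); returns the (D1*D2,)
    embedding row for flat index idx = i1 * V2 + i2."""
    V1, D1, R = g1.shape
    _, V2, D2 = g2.shape
    i1, i2 = divmod(idx, V2)
    # contract over the TT rank:
    # row[d1, d2] = sum_r g1[i1, d1, r] * g2[r, i2, d2]
    row = np.einsum("dr,re->de", g1[i1], g2[:, i2])
    return row.reshape(D1 * D2)
```

Storage drops from V1·V2·D1·D2 values to V1·D1·R + R·V2·D2, which is how multi-hundred-gigabyte embedding tables can fit a single GPU, at the cost of a small contraction per lookup.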
Workshop
Recorded
Reliability and Resiliency
W
DescriptionA framework for efficient in-network data transfer between a parallel application and an independent storage server is proposed. The case of an unexpected and unrecoverable interruption of the application is considered, where the server takes the role of an emergency backup service preventing the unnecessary loss of valuable information. Cleanup time buffers can be optimally exploited by the framework through RDMA transport and redistribution of data by means of the Maestro middleware. Experiments are performed on an HPE/Cray EX system to construct a heuristic for the amounts of data that can realistically be backed up during a given time buffer. The method proves to be faster than VELOC and plain MPI-IO even with a single server node, for up to a hundred user ranks, with the promise of better scalability in the long run due to the in-network approach as opposed to filesystem transport.
Tutorial
Recorded
Architectures
Big Data
Data Management
Emerging Technologies
File Systems and I/O
TUT
DescriptionThe future of memory and storage technologies will be diverse, from existing hardware such as Intel Optane DIMMs through to CXL-enabled storage and memory devices, and beyond. These new forms of memory require both different programming approaches to exploit the persistent functionality and storage performance, and potential redesign of applications to benefit from the full performance of the hardware and ensure correctness and data integrity.
This tutorial aims to educate attendees on the persistent memory hardware currently available, the future technologies such as CXL, the software methods to exploit such hardware, the choices that users of systems and system designers have when deciding which functionality and configurations to utilize, and bespoke storage systems that exploit such hardware. The tutorial will provide hands-on experience using and programming against the DAOS object store and DAOS API, along with information on programming persistent memory directly through PMDK and CXL, as well as a range of information on the hardware and software ecosystem and potential performance and functionality benefits. The tutorial will include hands-on practical sessions on systems with persistent memory, PMDK, and DAOS.
Panel
Recorded
Diversity Equity Inclusion (DEI)
HPC Training and Education
TP
XO/EX
DescriptionThis panel will raise several important questions about the role of education, society, the workforce, and artificial intelligence in gender disparities and biases. Despite many initiatives to reduce the gender gap, we are still exposed to stereotypes via advertisement, imbalanced leadership, and social media. Everyday AI applications also encode gender biases, which deepens this inequality even further. More often, soft skills have traditionally been associated with women and have rarely been considered an asset for a STEM workplace. The paradoxical skills asymmetry is shown at its best in the recent trend of “soft” skills sought by tech companies (e.g., communication, team collaboration, writing, conflict resolution). This diverse panel of women entrepreneurs, researchers, educators, and leaders will share their expertise and advice on capitalizing on “soft” skills and gaining “hard” skills to offset the paradoxical skills asymmetry in society and at the workplace.
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
DescriptionThis paper reports on Catalyst usability and initial adoption by SPARC analysts. The use-case-driven approach highlights the analysts’ perspective. Impediments to adoption can be due to deficiencies in software capabilities, but analysts identify many mundane inconveniences and barriers that prevent them from fully leveraging Catalyst. That said, for many analyst tasks Catalyst provides enough relative advantage that they have begun applying it in their production work, and they recognize its potential to solve problems they currently struggle with. The findings in this report include specific issues and minor bugs in ParaView Python scripting, which are viewed as having straightforward solutions, and a broader adoption analysis.
Posters
Research Posters
TP
XO/EX
DescriptionEnergy systems research strongly relies on large modeling frameworks. Many of them use linear optimization approaches to calculate blueprints for ideal future energy systems, which become increasingly complex, as do the models. The state of the art is to compute them with shared-memory computers combined with approaches to reduce the model size. We overcome this and implement a fully automated workflow on HPC using a newly developed solver for distributed memory architectures. Moreover, we address the challenge of uncertainty in scenario analysis by performing sophisticated parameter variations for large-scale power system models, which cannot be solved in the conventional way. Preliminary results show that we are able to identify clusters of future energy system designs, which perform well from different perspectives of energy system research and also consider disruptive events. Furthermore, we also observe that our approach provides the most insights when being applied to complex rather than simple models.
Birds of a Feather
TP
XO/EX
DescriptionTraditional interest in increasing parallelism for individual jobs in HPC systems is being conditioned by the variety and dynamicity of resource demands of jobs at runtime. Malleability techniques can help to adapt resource usage dynamically to achieve maximum efficiency. Malleable HPC systems, however, face a series of fundamental research challenges, such as resource management, scheduling, malleability control, applications co-design, and data movement. All aforementioned issues will be addressed in the proposed Birds of a Feather session, which aims at building a community of developers and users around the topic of malleability in High Performance Computing, Networking, and Storage.
Workshop
Recorded
W
DescriptionHost-FPGA connectivity is critical for enabling a vast number of FPGA use-cases. This interface must be reliable, robust, and uniform, while supporting necessary protocols and functionality. Existing support for host-FPGA connectivity has a number of drawbacks, including a lack of portability and poor upstream support. Native VirtIO drivers in the host OS can help address these limitations, but implementing device-side support for VirtIO is challenging due to the hardware complexity involved.
We present a framework for enabling FPGAs to interface with native operating system VirtIO drivers on the host. To reduce implementation overhead and improve portability, this framework uses both generic RTL blocks and modified, chip/device-specific PCIe IP blocks. We test the framework using Xilinx IP, implemented on an Alinx board, with a host machine running Fedora. Our results show that the FPGA can be successfully enumerated as a VirtIO device and interfaced with using only native Linux VirtIO drivers.
Birds of a Feather
TP
XO/EX
DescriptionThe DoD has invested significant time and funding to support a large base of users on a variety of HPC-backed projects. This BoF will use lightning talks about current research, technology acquisition plans, and software development needs and interests to illustrate DoD goals and opportunities for engagement. These lightning talks are intended to help external organizations and researchers connect with DoD users and sites to encourage partnerships and help solve problems. External engagement will help DoD users and HPC sites grow expertise and connect to the larger HPC community.
Workshop
Recorded
W
DescriptionScience is the practice of systematically studying something and offering data and evidence to reach a conclusion. In first-principles simulations, basic physics is used to model a phenomenon, leading to consistent, repeatable results. When the physics model is incomplete, or too complex or costly to run for a given task, AI or ML is used to estimate what the missing physics would provide if we could meet our goals with a first-principles approach. Our work has been exploring how to ensure ML is capable of offering a scientific level of consistency, so that we can trust science applications that incorporate ML models.
Our earlier work examined the impact of pseudorandom numbers on model quality. For this study, we have examined the pseudo-random number generation algorithms used to seed essentially all ML algorithms to ensure that model generation can be performed by other scientists to achieve identical results.
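To make the reproducibility requirement above concrete, here is a minimal sketch (illustrative only, not the authors' code) of seeding the pseudo-random number generators so that a stochastic computation is exactly repeatable by other scientists:

```python
import random

import numpy as np

def train_like_computation(seed):
    """Stand-in for a stochastic ML step: seeded shuffling plus noisy init."""
    random.seed(seed)                  # seed Python's global PRNG
    rng = np.random.default_rng(seed)  # explicit, reproducible NumPy generator
    data = list(range(10))
    random.shuffle(data)               # shuffle order is now deterministic
    weights = rng.normal(size=4)       # "model" initialization
    return data, weights.round(6).tolist()

run_a = train_like_computation(42)
run_b = train_like_computation(42)  # same seed: identical results
run_c = train_like_computation(7)   # different seed: different results
```

With the seeds pinned, `run_a` and `run_b` are bit-for-bit identical, which is the kind of cross-scientist repeatability the study examines.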
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
Birds of a Feather
TP
XO/EX
DescriptionThis follow-up to the broadly attended SC19 and SC21 BoFs will expand the conversation related to ethical considerations in the field of HPC and its role in shaping society. The BoF is highly interactive and aims to be an exchange for the community to discuss and relate ethical behavior and societal norms to the design of HPC solutions and autonomous/intelligent systems, for example, so that they do not intentionally perpetuate global inequality. By furthering this dialogue, we can work to ensure the HPC community is advancing its commitment to technology for the benefit of all of humanity.
Birds of a Feather
TP
XO/EX
DescriptionIn 2018, PRACE engaged in the coordination of European HPC activities, including access to HPC systems, user support, training, policy, technology, operations, and dissemination. The initiative led to the development of the "HPC in Europe" portal, a mechanism to structure and present European HPC services.
Since then, the European HPC strategy has undergone strong changes, with the entry of EuroHPC JU and new coordination actors. The objective of this BoF is to present the current status of the ecosystem, discuss further exploitation of the HPC portal, and include the user experience from the Castiel/EuroCC network of European Competence Centres.
Workshop
Recorded
W
DescriptionCurrent HPC systems provide memory resources tightly coupled with compute nodes. But HPC applications are evolving: diverse workloads demand different memory resources to achieve both high performance and high utilization. In this study, we evaluate a memory subsystem leveraging CXL-enabled memory to provide configurable capacity and bandwidth. We propose an emulator to explore the performance impact of various memory configurations, and a profiler to identify optimization opportunities. We evaluate the performance of seven HPC workloads and six graph workloads on the emulated system. Our results show that three and two HPC workloads see less than 10% and 18% performance impact, respectively, with 75% of memory pooled. Also, a dynamically configured high-bandwidth system could effectively support bandwidth-bottlenecked workloads like grid-based solvers. Finally, we identify interference through shared memory pools as a practical challenge for HPC systems adopting CXL-enabled memory.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionRecent revisions to the ISO C++ standard have added specifications for parallel algorithms. These additions cover common use-cases, including sequence traversal, reduction, and even sorting, many of which are highly applicable in HPC, and thus represent a potential for increased performance and productivity.
This study evaluates the state of the art for implementing heterogeneous HPC applications using the latest built-in ISO C++17 parallel algorithms. We implement C++17 ports of representative HPC mini-apps that cover both compute-bound and memory bandwidth-bound applications. We then conduct benchmarks on CPUs and GPUs, comparing our ports to other widely-available parallel programming models, such as OpenMP, CUDA, and SYCL.
Finally, we show that C++17 parallel algorithms are able to achieve competitive performance across multiple mini-apps on many platforms, with some notable exceptions. We also discuss several key topics, including portability, and describe workarounds for a number of remaining issues, including index-based traversal and accelerator device/memory management.
Workshop
Recorded
W
DescriptionMotivated by maturing programming models and portability for heterogeneous computing, we describe the challenges posed by hardware architectures and programming models when migrating an optimized implementation of nonuniform reduction from CUDA to HIP and SYCL. We explain the migration experience, evaluate the performance of the reduction on GPU-based computing platforms, and provide feedback on improving portability for the development of the SYCL programming model.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
DescriptionWe are motivated by newly proposed methods for mining large-scale corpora of scholarly publications (e.g., the full biomedical literature), which consist of tens of millions of papers spanning decades of research. In this setting, analysts seek to discover relationships among concepts. They construct graph representations from annotated text databases, formulate the relationship-mining problem as an all-pairs shortest paths (APSP) computation, and validate connective paths against curated biomedical knowledge graphs (e.g., SPOKE). In this context, we present COAST (Exascale Communication-Optimized All-Pairs Shortest Path) and demonstrate 1.004 EF/s on 9,200 Frontier nodes (73,600 GCDs). We develop hyperbolic performance models (HYPERMOD), which guide optimizations and parametric tuning. The proposed COAST algorithm achieved a memory-constant parallel efficiency of 99% in the single-precision tropical semiring. Looking forward, COAST will enable the integration of scholarly corpora like PubMed into the SPOKE biomedical knowledge graph.
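The tropical (min-plus) semiring mentioned above can be illustrated with a toy APSP sketch (not the COAST implementation): in this semiring, "addition" is min and "multiplication" is +, so repeated min-plus squaring of the weighted adjacency matrix yields all-pairs shortest path distances.

```python
import numpy as np

INF = np.inf

def min_plus(A, B):
    """Tropical 'matrix product': C[i, j] = min_k (A[i, k] + B[k, j])."""
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

def apsp(adj):
    """All-pairs shortest paths via repeated min-plus squaring."""
    n = len(adj)
    dist, steps = adj.copy(), 1
    while steps < n - 1:            # shortest paths use at most n-1 edges
        dist = min_plus(dist, dist)  # doubles the path length covered
        steps *= 2
    return dist

# 4-node weighted digraph; INF marks a missing edge, 0 on the diagonal.
adj = np.array([[0.0, 3.0, INF, 7.0],
                [8.0, 0.0, 2.0, INF],
                [5.0, INF, 0.0, 1.0],
                [2.0, INF, INF, 0.0]])
dist = apsp(adj)
```

Only O(log n) semiring products are needed, which is why the distributed work can be organized around a small number of large matrix multiplications.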
Workshop
Recorded
HPC Training and Education
W
DescriptionThis talk will give an overview of ECP’s Broadening Participation Initiative, which has the mission of establishing a sustainable plan to recruit and retain a diverse workforce in the DOE high-performance computing community by fostering a supportive and inclusive culture within the computing sciences at DOE national laboratories. We will describe key activities within three complementary thrusts: establishing an HPC Workforce Development and Retention Action Group, creating accessible ‘Intro to HPC’ training materials, and launching the Sustainable Research Pathways for High-Performance Computing (SRP-HPC) workforce development program. We are leveraging ECP’s unique multilab partnership to work toward sustainable collaboration across the DOE community, with the long-term goal of changing the culture and demographic profile of DOE computing sciences.
Invited Talk
Recorded
TP
XO/EX
DescriptionCERN faces an unprecedented data challenge at the High Luminosity LHC, with an increase in exabytes produced annually and the processing, storage and analysis needs rising by an order of magnitude. In order to fulfill these new requirements, exascale technologies – including heterogeneous architectures, high performance computing, and machine learning – will be needed.
Alongside data-intensive sciences and leaders in technology, CERN is pushing the frontiers of development to explore new technologies. In this presentation, I will provide an overview of the innovative R&D program currently undertaken by our community, which will be used to face the future computing challenges at CERN.
Birds of a Feather
TP
XO/EX
DescriptionThe Exascale Computing ALgorithms & Infrastructures Benefiting UK Research (ExCALIBUR) program is a research effort aiming to enable exploitation of future exascale supercomputers by the next generation of high-performance simulation software. Funded by the UK government, and running between 2019 and 2025, the program focuses on targeting high priority codes, algorithms, and techniques to meet the demands of computational scientists and engineers. Currently at the mid-point of the program, in this BoF we will highlight some of the activities and successes to date, as well as explore opportunities for collaborating more widely with global exascale computing research activities and programs.
Birds of a Feather
TP
XO/EX
DescriptionEfforts like the US Exascale Computing Project (ECP) have focused on accelerating scientific codes for next-generation HPC systems as well as bringing modern software engineering practices to these applications. Efforts like ECP focus large amounts of developer resources on a few important codebases, but a much larger body of scientific and research codes would benefit from the same attention, especially in terms of making codes accessible, interoperable, and reliable. This BoF will engage a set of expert panelists and the audience in understanding how we can bring best practices for software engineering to the wider audience of scientific software developers.
Workshop
Recorded
Performance Portability
W
DescriptionSparse matrices and linear algebra are at the heart of scientific simulations. The adoption of dynamic sparse matrices that can change the underlying data structure to match the computation at runtime, without introducing prohibitive overheads, has the potential of optimizing performance through dynamic format selection. We introduce Morpheus, a library that provides an efficient abstraction for dynamic sparse matrices. The adoption of dynamic matrices aims to improve the productivity of developers and end-users who want to take advantage of this optimization opportunity to improve the performance of their applications while remaining unaware of format-specific details. We demonstrate that by porting HPCG to use Morpheus, and without further code changes, 1) HPCG can now target heterogeneous environments, and 2) the performance of the SpMV kernel is improved by up to 2.5x and 7x on CPUs and GPUs, respectively, through runtime selection of the best format on each MPI process.
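Why the storage format matters for SpMV can be seen in a toy sketch (not Morpheus's API): a kernel written against a CSR layout touches only the stored nonzeros, so the format determines both memory traffic and loop structure.

```python
import numpy as np

def to_csr(dense):
    """Compress a dense matrix to CSR: values, column indices, row pointers."""
    vals, cols, rowptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                vals.append(v)
                cols.append(j)
        rowptr.append(len(vals))       # one pointer per row boundary
    return np.array(vals), np.array(cols), np.array(rowptr)

def csr_spmv(vals, cols, rowptr, x):
    """y = A @ x touching only the stored nonzeros of A."""
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        lo, hi = rowptr[i], rowptr[i + 1]
        y[i] = np.dot(vals[lo:hi], x[cols[lo:hi]])
    return y

A = np.array([[4.0, 0.0, 0.0, 1.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [2.0, 0.0, 5.0, 0.0]])
x = np.array([1.0, 2.0, 3.0, 4.0])
vals, cols, rowptr = to_csr(A)
y = csr_spmv(vals, cols, rowptr, x)
```

A COO or ELL kernel over the same matrix would loop very differently, which is exactly the trade-off a dynamic-format abstraction can exploit at runtime.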
Workshop
Recorded
Quantum Computing
W
DescriptionA central challenge of applying near-term quantum optimization algorithms to industrially relevant problems is the need to incorporate complex constraints. In general, such constraints cannot be easily encoded in the circuit, and the quantum circuit measurement outcomes are not guaranteed to respect the constraints. Therefore, the optimization must trade off the in-constraint probability and the quality of the in-constraint solution by adding a penalty for constraint violation into the objective. We propose a new approach for solving constrained optimization problems with unconstrained, easy-to-implement quantum ansätze. Our method leverages the in-constraint energy as the objective and adds a lower-bound constraint on the in-constraint probability to the optimizer. We demonstrate significant gains in solution quality over directly optimizing the penalized energy. We implement our method in QVoice, a Python package that interfaces with Qiskit for quick prototyping in simulators and on quantum hardware.
Workshop
Recorded
W
DescriptionAdditive manufacturing is a rapidly growing area that has the potential to revolutionize society. In order to better understand and improve this process, scientists and engineers conduct detailed studies on the applicability of various materials and the process that additively constructs the object. One method of analyzing the additive process is to use cameras to take images of the object as it is built layer by layer. As the complexity of the process, image resolution, and image capture frequency increases, so too does the volume of data generated, which can lead to data storage/movement issues. In this paper, we present an exploratory study of applying various lossless and lossy reduction techniques to an additive manufacturing data set from Los Alamos National Laboratory. Results show that SZ gives the best reduction ratio, ZFP yields the best accuracy, and Hybrid Data Sampling is the fastest method.
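The lossless-versus-lossy trade-off described above can be sketched with standard-library tools (illustrative only; SZ and ZFP use far more sophisticated predictors): quantizing to an absolute error bound before entropy coding trades bounded accuracy loss for a much higher reduction ratio.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
# Smooth synthetic field (random walk): raw float64 bytes compress poorly.
field = np.cumsum(rng.normal(size=4096))
raw = field.tobytes()
lossless = zlib.compress(raw, level=9)

# Lossy: quantize to a user-chosen absolute error bound, then entropy-code.
error_bound = 1e-2
quantized = np.round(field / error_bound).astype(np.int64)
lossy = zlib.compress(quantized.tobytes(), level=9)

# Decompression reverses the quantization with a guaranteed error bound.
recovered = quantized.astype(np.float64) * error_bound
max_err = float(np.abs(recovered - field).max())
ratio_lossless = len(raw) / len(lossless)
ratio_lossy = len(raw) / len(lossy)
```

The quantized stream compresses far better than the raw floating-point bytes while every value stays within the chosen error bound, the same kind of ratio/accuracy trade-off the study measures for SZ, ZFP, and sampling.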
Posters
Research Posters
TP
XO/EX
DescriptionThe Influence Maximization (IM) problem on a social network is the problem of identifying a small cohort of vertices that, when initially activated, results in a cascading effect that activates the maximum expected number of other vertices in the network. While the problem is NP-hard under budget constraints, it has a submodular structure that leads to efficient approximation.
In this work, we present the techniques and performance analysis that we are using to drive the design of efficient FPGA acceleration for the seed-selection step within the IMM algorithm. Currently, we achieve speedups from 0.75x to 4.78x, with the main bottleneck being a static overhead determined by the size of the input graph. We discuss future work to improve on the current architecture, and hope to provide techniques for making "almost-regular" applications fast and efficient on FPGAs.
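The submodular structure noted above admits the classic greedy heuristic; a toy sketch (a max-coverage stand-in, not the IMM algorithm itself): repeatedly pick the candidate seed with the largest marginal gain in coverage.

```python
def greedy_seeds(influence_sets, k):
    """Greedy seed selection for a submodular coverage objective:
    repeatedly pick the vertex with the largest marginal gain."""
    covered, seeds = set(), []
    for _ in range(k):
        best, best_gain = None, -1
        for v, reach in influence_sets.items():
            if v in seeds:
                continue
            gain = len(reach - covered)  # marginal gain shrinks as coverage grows
            if gain > best_gain:
                best, best_gain = v, gain
        seeds.append(best)
        covered |= influence_sets[best]
    return seeds, covered

# Toy per-seed 'reachable sets' (stand-ins for sampled cascade estimates).
influence_sets = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6, 7},
    "d": {1, 2},
}
seeds, covered = greedy_seeds(influence_sets, k=2)
```

For monotone submodular objectives this greedy loop is guaranteed to be within a (1 - 1/e) factor of optimal, which is what makes the seed-selection step a natural acceleration target.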
Workshop
Recorded
Quantum Computing
W
DescriptionThe theoretical gains promised by quantum computing remain unrealized across practical applications given the limitations of current hardware. But the gap between theory and hardware is closing, assisted by developments in quantum algorithmic modeling. One such recent development is QuantumCircuitOpt (QCOpt), an open-source software framework that leverages commercial optimization-based solvers to find provably optimal compact circuit decompositions, which are exact up to global phase and machine precision. While such circuit design problems can be posed using non-linear, non-convex constraints, QCOpt implements a Mixed-Integer Linear Programming model, where non-linear constraints are reformulated using well-known linearization techniques. In this work, we instead explore whether the QCOpt model could be effective with continuous Non-Linear Programming (NLP) formulations. We are able to present not only multiple potential enhancements to QCOpt's run times, but also opportunities for more generally exploring the behavior of gradient-based NLP solvers.
Posters
Research Posters
TP
XO/EX
DescriptionThe GeoCAT-comp program is a Python toolkit used by the geoscience community to analyze data. This project explores ways to port GeoCAT-comp to run on GPUs, as recent supercomputers are shifting to include GPU accelerators as a major resource. Although GeoCAT-comp's routines are all sequential or use Dask parallelization on the CPU, the data processing is embarrassingly parallel and computationally costly, enabling optimization on GPUs. GeoCAT uses NumPy, Xarray, and Dask arrays for CPU parallelization. In this project, we examined different GPU-accelerated Python packages (e.g., Numba and CuPy). Taking into account the deliverability of the final porting method to the GeoCAT team, CuPy was selected. CuPy is a CUDA-enabled Python array backend module that closely mirrors NumPy. We analyzed the performance of the GPU-accelerated code compared to the Dask CPU-parallelized code over various array sizes and resources, and through strong and weak scaling.
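The "drop-in" relationship between CuPy and NumPy that motivated this choice can be sketched as follows (a hypothetical kernel, not GeoCAT code): the same array code runs on the GPU when CuPy is present and falls back to NumPy otherwise.

```python
# CuPy mirrors the NumPy API, so array kernels can be written once against
# an 'xp' namespace; hypothetical helper, not GeoCAT code.
try:
    import cupy as xp   # GPU arrays, if CuPy and a CUDA device are available
except ImportError:
    import numpy as xp  # CPU fallback with the same API

import numpy as np

def saturation_like(t, p):
    """Toy element-wise geoscience kernel; runs on whichever backend xp is."""
    return xp.clip(xp.exp(t / 30.0) * (p / 1000.0), 0.0, 1.0)

t = xp.linspace(-20.0, 20.0, 8)
p = xp.full(8, 850.0)
out = saturation_like(t, p)
# CuPy arrays need an explicit device-to-host copy; NumPy arrays do not.
host = np.asarray(out.get() if hasattr(out, "get") else out)
```

Because the kernel body never names the backend, the same source can be handed to the maintainers with only the import swap to deliver, which is the "deliverability" consideration mentioned above.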
Birds of a Feather
TP
XO/EX
DescriptionFor the past 30 years, PCI-SIG® has delivered specifications that remain ahead of the industry demand for a high-bandwidth, low-latency I/O interconnect. With each new PCI Express® (PCIe®) specification, PCI-SIG has consistently delivered enhanced performance, unprecedented speeds, and low latency. With the release of the PCIe 6.0 specification in 2022, PCI-SIG moved into the PAM4 era, delivering 64 GT/s data rate while maintaining full backwards compatibility. PCI-SIG also introduced official PCIe 5.0 Compliance Testing in 2022. In this session, attendees will learn how PCIe 6.0 architecture enables next-generation HPC applications. Presenters will also highlight PCIe 5.0 technology adoption and applications.
Workshop
Recorded
Algorithms
Architectures
Compilers
Computational Science
Exascale Computing
Heterogeneous Systems
Hierarchical Parallelism
Memory Systems
Parallel Programming Languages and Models
Parallel Programming Systems
Resource Management and Scheduling
W
DescriptionThe optimization processes of modern compilers are often guided by performance experts who express optimization via so-called scheduling languages. We present our work-in-progress results toward a novel compiler whose scheduling language is based on the approach of Multi-Dimensional Homomorphisms (MDH). We argue that our MDH-based scheduling language enables a structured, hierarchical optimization process, by offering scheduling commands that systematically de- and re-compose computations to/from the memory and core hierarchies of state-of-the-art architectures (GPUs, multi-core CPUs, etc.). Thereby, a performance expert expresses hierarchical code optimizations in a concise and structured way, contributing to a simplified code optimization process. Our first experiments on an NVIDIA GPU and an Intel CPU show that our scheduling language is capable of expressing as hierarchical code optimizations the optimization decisions of the popular deep learning compiler TVM. Our experiments also confirm that via auto-tuning, we are able to achieve better performance than TVM on both architectures.
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionThe architectures of supercomputers are increasing in heterogeneity. It is important to maintain efficient code portability to take advantage of the computing capabilities of the evolving hardware in these systems. Intel has adopted an open standard programming interface for heterogeneous systems called oneAPI, designed to allow code portability across different processor architectures. This paper evaluates oneAPI by porting the dense linear algebra library Matrix Algebra on GPU and Multicore Architectures to Data Parallel C++, the direct programming language of oneAPI. Performance of the migrated code for GEMM is compared to MKL, OpenMP GEMM, and native CUDA implementations on multicore CPUs and GPUs. The initial migrated code demonstrates impressive performance on multicore CPUs. It also retains the performance of CUDA on NVIDIA GPUs. It performs poorly on the Intel GPU but is improved through autotuning. Intel's oneAPI allowed for a successful extension of MAGMA portability to multicore CPUs and Intel GPUs.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionThe architectures of supercomputers are increasing in diversity. It is important to maintain efficient code portability to take advantage of the computing capabilities of the evolving hardware in these systems. Intel has adopted an open standard programming interface for heterogeneous systems called oneAPI, designed to allow code portability across different processor architectures. This report evaluates oneAPI by migrating a general matrix-matrix multiplication CUDA algorithm from the dense linear algebra library Matrix Algebra on GPU and Multicore Architectures to Data Parallel C++, the direct programming language of oneAPI. Performance of the migrated code is compared to native CUDA implementations on multicore CPUs and GPUs. The initial migrated code demonstrates impressive performance on multicore CPUs. It retains the performance of CUDA on NVIDIA GPUs. It performs poorly on the Intel GPU but is improved with tuning. Intel's oneAPI allowed for a successful extension of MAGMA portability to multicore CPUs and Intel GPUs.
Workshop
Recorded
Reliability and Resiliency
W
DescriptionMANA is an MPI-Agnostic, Network-Agnostic transparent checkpointing tool for MPI applications and a recent breakthrough in transparent checkpointing. NERSC has been collaborating with the MANA team at Northeastern University and MemVerge, Inc. to enable MANA for NERSC's top applications, in order to support DOE experimental facilities' real-time workloads by checkpointing lower-priority jobs and resuming them later. MANA employs a novel split-process approach and works by intercepting the MPI APIs, both to ensure that transparent checkpointing occurs at a consistent state across MPI processes and to achieve network agnosticism. Thus, writing proper wrapper functions for MPI APIs is critical for MANA to checkpoint and restart MPI applications correctly and efficiently. While it is straightforward to implement a wrapper function for most MPI APIs, some APIs are not trivial to intercept correctly, and the major challenge is to ensure the same behavior after interception. In this lightning talk, we will review the current status of MPI API support in MANA, present the challenges in supporting various MPI APIs (communicators, objects, data types, environments, etc.), and outline the roadmap for extending MPI API support to current and future versions of the MPI standard. What we learned from supporting MPI APIs in MANA will be helpful to similar approaches that intercept MPI APIs.
MANA uses DMTCP as its checkpointing tool, and is implemented in the DMTCP framework as a plugin. MANA is an open source project.
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
DescriptionHeterogeneous supercomputing systems are becoming mainstream thanks to their powerful accelerators. However, the accelerators' special memory models and APIs increase development complexity and call for innovative programming model designs. To address this issue, OpenMP has added target offloading for portable accelerator programming, and MPI allows transparent send-receive of accelerator memory buffers. Meanwhile, Partitioned Global Address Space (PGAS) languages like OpenSHMEM are falling behind in heterogeneous computing because their special memory models pose additional challenges.
We propose language and runtime interoperability extensions for both OpenMP and OpenSHMEM to enable portable remote access to GPU buffers with minimal code changes. Our modified runtime systems work in coordination to manage accelerator memory, eliminating the need for staging communication buffers. Compared to the standard implementation, our extensions attain a 6x point-to-point latency improvement, 1.3x better collective operation latency, 4.9x random-access throughput, and up to 12.5% higher strong scalability.
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
DescriptionWe develop a stochastic finite element method with ultra-large numbers of degrees of freedom that discretizes probabilistic and physical spaces using unstructured second-order tetrahedral elements in double precision, with a mixed-precision implicit iterative solver that scales to the full Fugaku system and enables fast Uncertainty Quantification (UQ). The solver, designed to attain high performance on a variety of CPU/GPU-based supercomputers, enabled solving a 37-trillion degrees-of-freedom problem with 19.8% of peak FP64 performance on the full Fugaku system (89.8 PFLOPS) with 87.7% weak-scaling efficiency, corresponding to a 224-fold speedup over the state-of-the-art solver running on the full Summit system. This method, which has shown its effectiveness by solving huge (32-trillion degrees-of-freedom) practical problems, is expected to be a breakthrough in damage mitigation, to facilitate the scientific understanding of earthquake phenomena, and to have a ripple effect on other fields that similarly require UQ.
Posters
Research Posters
TP
XO/EX
DescriptionAccurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, and applies to atmospheric flight, supersonic combustion, materials discovery and climate prediction. In this poster, we describe our experiences in extending the capabilities of and modernizing CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using a highly-scalable AMR library, AMReX, and add support for a fully curvilinear solver. We also port the computational kernels in CRoCCo to NVIDIA GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo-AMR, on the Summit system, demonstrating a 5× to 24× speedup over the CPU-only version.
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
DescriptionSimilarity search is one of the most fundamental computations regularly performed on ever-increasing protein datasets. Scalability is of paramount importance for uncovering novel phenomena that occur at very large scales. We unleash the power of over 20,000 GPUs on the Summit system to perform all-vs-all protein similarity search on one of the largest publicly available datasets, with 405 million proteins, in less than 3.5 hours, cutting the time-to-solution for many use cases from weeks to hours. The variability of protein sequence lengths, as well as the sparsity of the space of pairwise comparisons, make this a challenging problem in distributed memory. Due to the need to construct and maintain a data structure holding indices to all other sequences, this application has a huge memory footprint that makes it hard to scale to large problem sizes. We overcome this memory limitation with innovative matrix-based blocking techniques, without introducing additional load imbalance.
Panel
Recorded
Applications
Reliability and Resiliency
TP
XO/EX
DescriptionRecent experiences with COVID restrictions, supply chain disruptions, and utility service uncertainties have been creating an elevated challenge for managing resilience and risk for supercomputing centers around the globe. This challenge will only escalate as the impacts of extreme weather-related events (temperature extremes, flooding, fires) become more severe and more frequent. These challenges highlight the need for supercomputing operational and laboratory directors to reassess their risks in the face of climate change. This panel will bring together directors from across the globe to share experiences, articulate concerns, and describe strategies for managing these elevated and new risks.
Workshop
Recorded
W
DescriptionWhile FPGAs have enjoyed success in accelerating high-frequency financial workloads for some time, their use for quantitative finance, the use of mathematical models to analyze financial markets and securities, has been far more limited to date. In this presentation, we extend our previous work accelerating the industry-standard Securities Technology Analysis Center (STAC) derivatives risk analysis benchmark STAC-A2 by first porting it from the existing Xilinx implementation to an Intel Stratix-10 FPGA, exploring the challenges encountered when moving from one FPGA architecture to another and the suitability of our techniques. We then present a host-data-streaming approach that ultimately outperforms our previous version on a Xilinx Alveo U280 FPGA by up to 4.6 times while requiring 9 times less energy at the largest problem size, and outperforms the CPU and GPU versions by up to 8.2 and 5.2 times, respectively.
Tutorial
Recorded
Algorithms
Applications
Big Data
Cloud and Distributed Computing
Datacenter
Performance
Reliability and Resiliency
TUT
DescriptionResilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance and big data applications, with a fair balance between theory and practice. This tutorial is organized across four main topics:
(i) Overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);
(ii) General-purpose techniques, which include several checkpoints and rollback recovery protocols, replication, prediction, and silent error detection;
(iii) Application-specific techniques, such as user-level in-memory checkpointing, data replication (map-reduce), or fixed-point convergence for iterative applications (back-propagation);
(iv) Practical deployment of fault tolerance techniques with User Level Fault Mitigation (a proposed MPI standard extension). Examples include widely used routines such as Monte-Carlo methods, SPMD stencil, map-reduce, and back-propagation in neural networks.
A step-by-step approach will show how to protect these routines and make them fault-tolerant, using a variety of techniques, in a hands-on session (a docker container will be provided).
The tutorial is open to all SC22 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific and big data applications. Background will be provided for all protocols and probabilistic models. Basic MPI knowledge will be helpful for the hands-on session.
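As one concrete instance of the trade-offs covered under checkpoint-rollback protocols: Young's classical first-order formula gives the checkpoint interval that minimizes expected lost work under fail-stop failures. The MTBF and checkpoint-cost numbers below are illustrative only, not taken from the tutorial.

```python
import math

def young_interval(mtbf_s, ckpt_cost_s):
    """Young's first-order approximation of the optimal checkpoint
    interval: W_opt = sqrt(2 * C * MTBF), valid when C << MTBF."""
    return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)

# Illustrative numbers: platform MTBF of 1 day, checkpoint cost of 60 s.
w = young_interval(mtbf_s=86400.0, ckpt_cost_s=60.0)
# Under these assumptions, checkpoint roughly every 54 minutes.
```

Checkpointing more often than this wastes time writing checkpoints; less often wastes work recomputing after a failure.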
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionThe lifetime of a leadership computing system spans about 10 years: it takes 5 years from conception to delivery, and then the system stays in production for an additional 5 years. A lot can change in 10 years: strategic priorities, policies, technologies, usage patterns, the economic landscape. Yet, we have to make this work. In this talk, I will discuss some of the approaches we can use to design a high-performance computing system that will still be relevant 10 years in the future. I will also discuss why we sometimes have to look beyond what we actually know in order to produce true game-changing systems.
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionProgramming exascale systems was seen as a major challenge at the start of the efforts to reach that level of performance. Perhaps not surprisingly, despite predictions of the likely dominance of new languages, users of DOE exascale systems still rely heavily on the MPI + OpenMP model that has dominated HPC for several years. Even emerging C++ abstraction layers such as Kokkos and RAJA often use the familiar MPI + OpenMP model in their backends. Thus, this talk will describe the implementation of the MPI + OpenMP model on the El Capitan and Frontier DOE exascale systems, as well as how OpenMP has evolved, and will continue to evolve, to remain a key part of the large-scale programming ecosystem.
Workshop
Recorded
Reliability and Resiliency
W
DescriptionDAOS is an open-source scale-out object store designed from the ground up to deliver extremely high bandwidth/IOPS and low-latency I/O to the most demanding data-intensive workloads. It aims at supporting next-generation scientific workflows combining simulation, big data, and AI in a single storage tier. DAOS presents a rich and scalable storage interface that allows efficient storage of both structured and unstructured data. DAOS supports multiple application interfaces, including a parallel filesystem, a Hadoop/Spark connector, TensorFlow-IO, native Python bindings, HDF5, and MPI-IO, as well as domain-specific data models like SEGY. Many DAOS deployments are underway, including a 230PB installation connected to the ALCF's Aurora system and a 1PB DAOS system for LRZ's SuperMUC-NG phase 2. In this presentation, we will provide an overview of the DAOS architecture, the software ecosystem, and the Aurora deployment.
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
W
DescriptionThe EOSC Compute Platform, delivered by the EGI Federation, is a system of federated compute and storage facilities, complemented by diverse access, data management and compute platform services. Following the requirements of four scientific use cases, EGI has expanded the EOSC Compute Platform with a set of HPC systems, allowing the execution of combined cloud-HPC workflows for open science projects. In this presentation, we will show our approach for federation of HPC providers, including EOSC-compliant user access management, monitoring and accounting.
Paper
Recorded
Correctness
System Software
TP
DescriptionTesting code for floating-point exceptions is crucial as exceptions can quickly propagate and produce unreliable numerical answers. The state-of-the-art to test for floating-point exceptions in GPUs is quite limited and solutions require the application's source code, which precludes their use in accelerated libraries where the source is not publicly available. We present an approach to find inputs that trigger floating-point exceptions in black-box GPU functions, i.e., functions where the source code and information about input bounds are unavailable. Our approach is the first to use Bayesian optimization (BO) to identify such inputs and uses novel strategies to overcome the challenges that arise in applying BO to this problem. We implement our approach in the Xscope framework and demonstrate it on 58 functions from the CUDA Math Library and functions from ten HPC programs. Xscope is able to identify inputs that trigger exceptions in about 72% of the tested functions.
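The black-box setting can be sketched as follows. Note this is a deliberately naive stand-in: plain random sampling replaces Xscope's actual Bayesian-optimization machinery, and the input range and target function are illustrative assumptions, not from the paper.

```python
import math
import random

def triggers_exception(y):
    """Classify a floating-point result as exceptional (inf or nan)."""
    return math.isinf(y) or math.isnan(y)

def search_exception_inputs(fn, lo, hi, trials=10000, seed=0):
    """Black-box search for inputs that make fn produce a floating-point
    exception. Xscope steers this search with Bayesian optimization; plain
    random sampling is used here only to keep the sketch short."""
    rng = random.Random(seed)
    found = []
    for _ in range(trials):
        x = rng.uniform(lo, hi)
        try:
            y = fn(x)
        except (OverflowError, ValueError):
            found.append(x)  # Python raises instead of returning inf/nan
            continue
        if triggers_exception(y):
            found.append(x)
    return found

# Example black box: math.exp overflows for x above ~709.78.
bad = search_exception_inputs(math.exp, -800.0, 800.0)
```

The hard part, which Bayesian optimization addresses, is finding such inputs efficiently when the exception-triggering region is a tiny, unknown fraction of the input space.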
Workshop
Recorded
AI-HPC Convergence
Emerging Technologies
Memory Systems
Networks
Resource Management and Scheduling
W
Posters
Scientific Visualization & Data Analytics Showcase
Recorded
TP
DescriptionA jet of fluid -- when we open a garden hose, for instance -- exhibits a rich tapestry of flow physics, including the rupture of fluid films and a cascade of filament and droplet breakup and coalescence. In addition to its breathtaking beauty, jet atomization is a critical component of a broad spectrum of energy and healthcare applications. Simulating and visualizing jet atomization is an ideal way to understand and control this phenomenon. However, the multiscale nature of jet atomization makes this a very challenging problem. Here, we visualize one of this phenomenon's highest-resolution simulation datasets. The dataset consists of over 120,000 time steps of an adaptively resolved spatial mesh spanning a wide range of length scales. We describe the parallel workflow and associated challenges in visualizing the time evolution of the jet. We show how this visualization produces a deep qualitative understanding of fluid dynamics from the outputs of these massive simulations.
Student Cluster Competition
TP
XO/EX
DescriptionWe as a team represent the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). In the past decade, FAU has participated in many cluster competitions around the world, among them the SCC for 8 years. The support team in the background has developed constantly since then, but support staff from the early days are still involved. This experience gives the students broader exposure and strengthens ongoing cooperation with our sponsors. Furthermore, the university has provided continuous hardware and financial support since our first participation in 2013. Additionally, one team member was on last year's team, which benefits us since he can pass on the knowledge he gained in handling the hardware and preparing for the competition in general.
Our team has the advantage that most members study Computer Science and one studies Computational Engineering, which gives us strong programming skills. Hardware architecture and a basic understanding of computers are part of those studies as well, so we are prepared for all aspects of building a system capable of being successful. We have chosen a variety of specializations, which provides valuable insights from various fields; for example, some of us have minors in physics, theoretical computer science, and artificial intelligence.
All of us have student jobs related to our field of study, at the university or at a company. This lets us gain practical knowledge that helps us overcome the challenges we face in the competition. For example, one student's work is closely related to Siemens MRI scanners, giving him a clear idea of how to handle especially time-sensitive hardware. Two of us have hands-on experience in robotics, especially robotic arms and navigation, and thereby knowledge of machine architecture. Two other students have worked as tutors, giving them the ability to explain and articulate their thoughts well.
It is also beneficial that team members have worked on servers and robots as private projects in their free time, giving them a good grip on Linux and its command-line tools.
Our Advisor, Dominik Ernst, is a PhD student at FAU and the NHR@FAU’s GPU expert. He holds a master's degree with honors in Computational Engineering. His research combines analytic performance modelling for GPUs and automatic code analysis in support of code generation and kernel execution decisions. His broad background in GPU porting and optimization in various applications fields includes internships at NVIDIA in Santa Clara and at CERN in Geneva.
As a member of the first ever team that competed in the SCC for the FAU in 2013 and 2014, he is no stranger to this competition.
We as a team look forward to gaining practical knowledge as well as making use of the networking opportunities. We hope the skills and background we have attained will help us successfully handle the different applications of the competition.
Paper
Recorded
Extreme Scale Computing
Memory Systems
Parallel Programming Systems
State of the Practice
TP
DescriptionWe present an empirical study on memory reliability, correlating correctable errors (CEs) with uncorrectable errors (UEs) using large-scale field data across 3 major DIMM manufacturers from a contemporary server farm at ByteDance. Unlike the traditional chipkill error-correction code (ECC), the ECC in contemporary Intel server platforms is weakened and unable to tolerate some error-bit patterns from a single chip. Using obtainable coarse-grained ECC knowledge, we derive a new indicator from the error-bit information: risky CE occurrence in terms of ECC guaranteed coverage. We show that the new indicator is consistently sensitive and specific in predicting future UEs, indicating the substantial contribution of the weakened ECC to today's UEs. We empirically demonstrate how practically useful UE predictors can be constructed from the new indicator in conjunction with other useful attributes, such as certain micro-level fault indicators and DIMM part numbers, achieving state-of-the-art performance.
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionOcto-Tiger, a large-scale 3D AMR code for the merger of stars, uses a combination of HPX, Kokkos and explicit SIMD types, aiming to achieve performance-portability for a broad range of heterogeneous hardware. However, on A64FX CPUs, we encountered several missing pieces, hindering performance by causing problems with the SIMD vectorization. Therefore, we add std::experimental::simd as an option to use in Octo-Tiger's Kokkos kernels alongside Kokkos SIMD, and further add a new SVE (Scalable Vector Extensions) SIMD backend.
Additionally, we amend missing SIMD implementations in the Kokkos kernels within Octo-Tiger's hydro solver. We test our changes by running Octo-Tiger on three different CPUs: an A64FX, an Intel Icelake, and an AMD EPYC CPU, evaluating SIMD speedup and node-level performance. We get a good SIMD speedup on the A64FX CPU, as well as noticeable speedups on the other two CPU platforms. However, we also experience a scaling issue on the EPYC CPU.
Workshop
Recorded
Performance Portability
W
DescriptionMeeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the computations themselves, we use Kokkos to turn these tasks into compute kernels capable of running on hardware ranging from a few CPU cores to powerful accelerators. There is a missing link, however: while the fine-grained parallelism exposed by HPX is useful for scalability, it can hinder GPU performance when the tasks become too small to saturate the device, causing low resource utilization. To bridge this gap, we investigate multiple different GPU work aggregation strategies within Octo-Tiger, adding one new strategy, and evaluate the node-level performance impact on recent AMD and NVIDIA GPUs, achieving noticeable speedups.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionNow that the exascale Frontier system is here, it is instructive to compare its properties to those projected in the 2008 Exascale technology report and ask: what's different, why did it seemingly take so long to get here, and what lessons should we take away for future machines? We discuss these points from the aspects of the original ground rules for the Exascale report, the technologies involved, and the changes in architecture and microarchitecture.
Workshop
Recorded
W
DescriptionClosing remarks and conclusion of FTXS 2022.
Workshop
Recorded
W
DescriptionPresentation on silent data corruption by our featured speaker, Harish Dixit from Facebook.
Workshop
Recorded
W
DescriptionIntroduction and welcome to the 12th Workshop on Fault Tolerance for HPC at eXtreme Scale.
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
DescriptionPartitioned Global Address Space (PGAS) programming models, typified by systems such as UPC and Fortran coarrays, expose one-sided Remote Memory Access (RMA) communication as a key building block for HPC applications. Architectural trends in supercomputing make such programming models increasingly attractive, and newer, more sophisticated models such as UPC++, Legion and Chapel that rely upon similar communication paradigms are gaining popularity.
GASNet-EX is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models in emerging exascale machines. We present microbenchmark results which demonstrate the RMA performance of GASNet-EX is competitive with MPI implementations on four recent, high-impact, production HPC systems. The networks measured are representative of hardware currently used in six of the top ten fastest supercomputers in the world, and all of the exascale systems on the U.S. DOE road map.
Student Cluster Competition
TP
XO/EX
DescriptionThe GeekPie_HPC team unites members with broad backgrounds and different STEM minds who aim to discover and solve challenging engineering problems. Our problem-solving philosophy is closely related to HPC applications and DevOps, and we bring the freedom and solid skills needed to explore the technology world.
As for the diversity of our research interests: Jiajun Cheng, a sophomore and this year's new captain, currently works for the Multi-Disciplinary AI VR/AR Studio (MARS), a place that brings AI and CV into everyday life. He is responsible for the backend of an animation app called Wand, and he has learned system architecture design, CI/CD, and Kubernetes.
Aibo Hu is a freshman majoring in computer science. He is interested in algorithms and data structures and has participated in many algorithmic programming contests. Besides, he is also a member of the GeekPie_ DevOps team and maintains the GeekPie_ mirror service. He is now working on learning computer architecture and systems.
Zecheng Li is a Computer Architecture and Parallel Computing enthusiast. His previous experience as a backup teammate at SC21 made him interested in fine-tuning the HPC program. He is currently doing an internship for a trading firm optimizing the code there.
Weiqi Wu, our lovely Mascot, is a Natural Language Processing enthusiast. She has conducted many small projects in these fields. She joined the team for a cross-discipline view of Machine Learning Systems.
Yichi Zhang is a freshman from the GeekPie_ Association. He was an algorithm competitor and won two silver medals in the ICPC Regional Contest. He is currently a member of the GeekPie_ DevOps team and studying computer systems and compiler technology.
Yining Zhang is a senior undergraduate student with a focus on architecture and systems. He has taken courses and completed projects on computer architecture, operating systems, distributed systems, and high-performance computing. Currently, he is also an assistant engineer of Biomedical Big Data Platform in the university, using his computer expertise to help other majors to do some scientific computing work.
Our previous teammates benefited greatly from the SC competition, and we are grateful to the committee. We met the software and hardware nerds of our age, shared our experiences, and found internship opportunities. They now pursue their studies at prestigious schools and world-leading companies thanks to the experience and connections built during the SC event. Specifically, Ms. Jia Du went to Carnegie Mellon University to study computer vision. Mr. Yanjie Song, Songhui Cao, and Guancheng Li landed research positions in Prof. Shu Yin's team, focusing on the application of non-volatile random-access memory in the general memory hierarchy. Mr. Jianwen Luo took an internship at Xilinx and joined Prof. Yajun's lab at ShanghaiTech, targeting FPGA acceleration in traditional computational models. Mr. Yuzhuo Jing started his Ph.D. at Johns Hopkins University, focusing on the security of Linux systems. Our former captain, Yiwei Yang, will start his Ph.D. at UC Santa Cruz in fall 2022, focusing on general systems.
Workshop
Recorded
W
DescriptionCall graphs, or caller-callee relationships, have been used for various kinds of static program analysis, performance analysis and profiling, and program safety or security analysis, such as detecting anomalies of program execution or code-injection attacks. However, different tools generate call graphs in different formats, which prevents efficient reuse of call-graph results. We present an approach that uses ontology and the Resource Description Framework (RDF) to create a knowledge graph specifying call graphs, facilitating the construction of full-fledged and complex call graphs of computer programs and realizing more interoperable and scalable program analyses than conventional approaches. We create a formal ontology-based specification of call-graph information that captures concepts and properties of both static and dynamic call graphs, so different tools can collaboratively contribute to comprehensive analysis results. Our experiments show that the ontology enables merging call graphs generated by different tools and flexible queries through a standard query interface.
Workshop
Recorded
W
DescriptionDrug discovery is a time-consuming process with successive stages, often taking ~10 to ~15 years to develop candidate molecules into molecular therapeutics. In computer-aided drug discovery, new technologies are being developed to shorten the first stage of the process: screening candidates for hit molecules. Given the large size of the chemical space from which a new drug molecule has to be selected, this screening step is a challenge, and reducing the number of costly experiments required is a priority.
A desirable solution for accelerating this process while keeping the cost under control is to generate drug molecules with desired properties via virtual design-build-test cycle. AI methods and HPC resources have shown potential for leveraging widely available small molecule libraries to generate new optimized molecules.
Recent progress has demonstrated the advantages of using generative models, specifically Transformer-based language models (LMs) that have been successfully implemented to predict desired chemical properties from sequence data (1, 2). These LMs are applied as powerful automated mutation operators, learning from commonly occurring chemical sequences available in databases. This calculated shift toward chemical sequences for model training points to a revolution in moving away from the time-consuming feature engineering and curation that has long relied on molecular properties and fingerprints. As an example, our recent work illustrated a possible LM-based efficient strategy for creating generalizable models for small target molecules and protein sequences (3).
Here we present a first-of-its-kind comparative study between an LM and a novel architecture in which an LM is efficiently deployed on a Generative Adversarial Network (GAN) platform to perform specific optimization tasks using genetic-algorithm-based mutations. Fundamentally, this hybrid architecture (LM-GAN) uses a traditional generator and discriminator but takes advantage of a pre-trained LM when predicting new molecules. During training, the mutation rate is varied from 10% to 100% across four different population sizes ranging from 5K to 50K. Random mutations are used to select μ parents from the population and to generate new molecules with the 5 top predictions for a given set of masks. The implemented genetic algorithm thus has a (μ+5μ) survivor selection scheme in which only novel, unique molecules are retained in the population.
Our results show that LM-GAN performs better with smaller populations (up to 10K), generating molecules with both better-optimized properties and greater numbers of atoms, but this trend reverses as the population size increases. On the other hand, the LM performs better in terms of generating more novel molecules. Finally, when estimating the ratio of accepted molecules to generated novel molecules with the desired optimized properties, LM-GAN performs consistently better at all population sizes.
Apart from drug or molecules discovery, in terms of HPC and AI, this work paves the way for further study in understanding the necessity of pre-training and fine-tuning of population data (type, sampling, diversity and size) requirements, the effect of GAN framework on LM models with variation in mutation rate, the effect of LM in replacing CNNs to capture non-local, long-range dependencies and addressing the problem of mode collapse.
A desirable solution for accelerating this process while keeping costs under control is to generate drug molecules with desired properties via a virtual design-build-test cycle. AI methods and HPC resources have shown potential for leveraging widely available small-molecule libraries to generate new optimized molecules.
Recent progress has demonstrated the advantages of generative models, specifically Transformer-based language models (LMs), which have been successfully used to predict desired chemical properties from sequence data (1, 2). These LMs act as powerful automated mutation operators, learning from commonly occurring chemical sequences available in databases. This deliberate shift toward chemical-sequence model training marks a move away from the time-consuming feature engineering and curation that has long relied on molecular properties and fingerprints. As an example, our recent work illustrated an efficient LM-based strategy for creating generalizable models for small target molecules and protein sequences (3).
Here we present a first-of-its-kind comparative study between an LM and a novel architecture in which an LM is deployed within a Generative Adversarial Network (GAN) to perform specific optimization tasks using genetic-algorithm-based mutations. Fundamentally, this hybrid architecture (LM-GAN) uses a traditional generator and discriminator but takes advantage of a pre-trained LM when predicting new molecules. During training, the mutation rate is varied from 10% to 100% across four population sizes ranging from 5K to 50K. Random mutations were used to select μ parents from the population and to generate new molecules from the top 5 predictions for a given set of masks. The implemented genetic algorithm thus has a (μ+5μ) survivor-selection scheme in which only novel, unique molecules are retained in the population.
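As a toy illustration of the (μ+5μ) survivor-selection scheme described above (the string encoding and the mutation helper are invented stand-ins, not the authors' code or the actual LM):

```python
import random

def lm_mutate(parent, rng):
    # Hypothetical stand-in for the pre-trained LM: mask one position
    # and emit a "prediction". Here we just substitute a random atom symbol.
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice("CNOS") + parent[i + 1:]

def ga_step(population, mu, rng, top_k=5):
    """One (mu + 5*mu) step: select mu parents at random, generate
    top_k candidates per parent (standing in for the LM's top-5
    predictions per mask), and retain only novel, unique molecules."""
    parents = rng.sample(population, mu)
    candidates = [lm_mutate(p, rng) for p in parents for _ in range(top_k)]
    seen = set(population)
    survivors = list(population)
    for c in candidates:
        if c not in seen:  # keep only molecules not already in the population
            seen.add(c)
            survivors.append(c)
    return survivors

rng = random.Random(0)
population = ["CCO", "CCN", "CCC", "COC", "CNC"]
new_population = ga_step(population, mu=2, rng=rng)
```

With μ=2 parents and 5 predictions each, at most 10 candidates join the population per step, and duplicates are filtered out exactly as the survivor-selection scheme requires.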
Our results show that LM-GAN performs better with smaller populations (up to 10K), generating molecules with both better-optimized properties and greater numbers of atoms, but this trend reverses as the population size increases. The LM alone, on the other hand, generates more novel molecules. Finally, when estimating the ratio of accepted molecules to generated novel molecules with the desired optimized properties, LM-GAN performs consistently better across all population sizes.
Beyond drug and molecule discovery, in terms of HPC and AI, this work paves the way for further study of the pre-training and fine-tuning requirements of population data (type, sampling, diversity, and size); the effect of the GAN framework on LM models as the mutation rate varies; and the effect of LMs in replacing CNNs to capture non-local, long-range dependencies and address the problem of mode collapse.
ACM Gordon Bell COVID Finalist
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
DescriptionWe seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators, utilizing 1.63 zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and a peak of 850 PFLOPS. We present initial scientific insights from using GenSLMs to track the evolutionary dynamics of SARS-CoV-2, paving the path to realizing this approach on other large biological data.
Student Cluster Competition
TP
XO/EX
DescriptionThe Gig 'em bytes Student Cluster Competition (SCC) team is multidisciplinary with members from the departments of Biomedical Engineering, Chemistry, Electrical & Computer Engineering, Management Information Systems, Physics, and Statistics at Texas A&M University (TAMU).
Patrick is the team lead and previously competed in the IndySCC. He will be able to help the less experienced team members navigate their way through the benchmarking of the system, as well as the dissection of the applications to optimize their performance. He would like to gain an even deeper understanding of HPC systems through this intense competition, so that he can help diversify the scope in which it is used in industry.
Catherine has always been interested in problem solving, and has been heavily involved in areas of science. She is a physics major, and will select the computational physics track, so that she can combine her study of the physical world with the world of computation. She has not competed in the SCC before, but is very willing to work hard and excited to learn more about PHASTA.
Becky is currently working as a student technician at TAMU's High Performance Research Computing (HPRC) facility. She has seen how researchers utilize HPC to advance their studies and innovations, and she believes that HPC will help to shape the future. This competition is a perfect opportunity for her to gain more knowledge and hands-on experience with HPC.
Emmanuel is also a student technician at HPRC. He has assisted users in accessing the HPRC systems, and editing scripts to run correctly or more efficiently. As technology advances, more and more users depend on HPC for their tasks. This competition will be a great opportunity to further his knowledge and experience for his personal and professional career.
Lius is very familiar with computer components: he has been building computers since he was young and has spent his free time learning about HPC. He has project-based experience in Python and C++ that will benefit the team. He is excited to work with HPC hardware, to discover how his current knowledge can grow, and to find ways he can contribute to the challenges this competition will present.
Curran has been involved in computational chemistry research and HPC since 2020. He competed in the IndySCC21 competition which allowed him to progress further in his research and prepared him to pursue research as a career. He is ready for this opportunity to experience even greater benefits with this year's competition and he couldn’t be more thrilled!
Dr. Lisa Perez, Associate Director at TAMU's HPRC is the team advisor and possesses an extensive background in the computational sciences and HPC system administration. She led the multi-institutional Ag-Jag (TAMU/TAMU-SA) VirtualSCC20 team and Gig 'em bytes (TAMU/PVAMU) IndySCC21 team. Co-advisor Dr. Xin Yang is an Assistant Research Scientist at HPRC. She has expertise in the area of computational chemistry and HPC.
The Gig 'em bytes team is well-rounded in scientific disciplines and skill sets necessary to succeed!
Birds of a Feather
TP
XO/EX
DescriptionThe world is transitioning to IPv6; many ISPs now see over 50% of their traffic via IPv6. This BoF provides a brief summary of the transition to, and an exploration of, the current state of IPv6-only networks. This transition has implications for, and will impact, HPC systems and other systems and networks of all sizes. The topic is pertinent to anyone who wants to learn more about IPv6 and IPv6-only networking. Dynamic quick talks on IPv6 themes covering global impacts, real-world applications, best practices, and lessons learned will guide a robust and interactive discussion with the audience.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionGraphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centers and computing facilities equipped with GPUs come with significant capital and environmental costs. The energy consumption of a GPU application depends greatly on how well it is optimized. Auto-tuning is an effective and commonly applied technique for finding the optimal combination of algorithm, application, and hardware parameters to optimize the performance of a GPU application. In this paper, we introduce new energy monitoring and optimization capabilities in Kernel Tuner, a generic auto-tuning tool for GPU applications. These capabilities enable us to investigate the differences between tuning for execution time and various approaches to improving energy efficiency, and to compare their tuning difficulty. Additionally, our model for GPU power consumption greatly reduces the large tuning search space by providing the clock frequencies at which a GPU is likely most energy efficient.
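The distinction between tuning for runtime and tuning for energy can be sketched with a toy search space (all numbers are invented and this is not Kernel Tuner's API; it only illustrates why the two objectives pick different configurations):

```python
# Candidate configurations: (core clock in MHz, runtime in s, power in W).
# Invented numbers that follow the usual trend: higher clocks reduce
# runtime while increasing power superlinearly.
configs = [
    (1000, 2.0, 150.0),
    (1200, 1.7, 180.0),
    (1400, 1.5, 230.0),
    (1600, 1.4, 300.0),
]

def tune(space, objective):
    """Generic exhaustive auto-tuning step: return the configuration
    that minimizes the given objective."""
    return min(space, key=objective)

fastest = tune(configs, lambda c: c[1])          # minimize runtime
greenest = tune(configs, lambda c: c[1] * c[2])  # minimize energy = P * t
```

In this toy space the time-optimal clock (1600 MHz) spends 420 J while the 1000 MHz clock finishes the same work for 300 J, which is exactly why a power model that narrows the clock-frequency range can shrink the energy-tuning search space.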
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionThe ensemble data assimilation of computational fluid dynamics simulations based on the lattice Boltzmann method (LBM) and the local ensemble transform Kalman filter (LETKF) is implemented and optimized on a GPU supercomputer based on NVIDIA A100 GPUs. To connect the LBM and LETKF parts, data-transpose communication is optimized by overlapping computation, file I/O, and communication based on the data dependency of each LETKF kernel. In two-dimensional forced isotropic turbulence simulations with an ensemble size of M=64 and N_x=128^2 grid points, the optimized implementation achieved a 3.80x speedup over the naive implementation, in which the LETKF part is not parallelized. The main computing kernel of the local problem is the eigenvalue decomposition (EVD) of M x M real symmetric dense matrices, computed by a newly developed batched EVD in EigenG. The batched EVD in EigenG outperforms that in cuSOLVER, achieving a 65.3x speedup.
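The batched-EVD pattern at the heart of the LETKF local problem can be sketched with NumPy, whose `eigh` already batches over leading dimensions (tiny sizes here for illustration; EigenG and cuSOLVER apply the same pattern to the M x M ensemble matrices on GPU):

```python
import numpy as np

rng = np.random.default_rng(0)
M, batch = 8, 4  # tiny stand-ins for the paper's M = 64 ensemble matrices
A = rng.standard_normal((batch, M, M))
A = (A + A.transpose(0, 2, 1)) / 2  # make each matrix real symmetric

# One call decomposes all `batch` M x M real symmetric matrices at once.
w, V = np.linalg.eigh(A)

# Check A = V diag(w) V^T for every matrix in the batch.
recon = V @ (w[..., :, None] * V.transpose(0, 2, 1))
```

Because each M x M problem is small and independent, batching many of them into a single kernel launch is what recovers throughput on GPUs, as the abstract's EigenG-vs-cuSOLVER comparison quantifies.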
Workshop
Recorded
W
DescriptionComplex diseases such as cancer and neurological disorders require a systemic approach to understand underlying causes and identify therapeutic targets. More comprehensive analyses, however, often bring significant computational challenges. EDDY (Evaluation of Differential DependencY) is a computational method to identify the rewiring of biological pathways between biological conditions such as drug responses or subtypes of disease [1]. Through its probabilistic framework with resampling and permutation, aided by the incorporation of annotated gene sets, EDDY demonstrated superior sensitivity to other methods. Further development integrated prior knowledge into these interrogations [2]. However, the considerable computational cost of this statistical rigor limited its application to larger datasets. Fortunately, ample and independent computation coupled with a manageable memory footprint positioned EDDY as a strong candidate for graphics processing unit (GPU) implementation. Custom kernels decompose the independence-test loop, network construction, network enumeration, and Bayesian network scoring to accelerate the computation. GPU-accelerated EDDY consistently benchmarked at two orders of magnitude of performance enhancement [3]. EDDY has been applied to determine the rewired pathways controlling differing small-molecule responses in cancer cell lines [4]. Further investigations extended this to pathways associated with pulmonary hypertension [5].
The recent emergence of single-cell and spatial transcriptomic data raises additional computational challenges, mainly due to an order-of-magnitude increase in sample size compared to bulk-cell transcriptomic data, often bringing the number of samples to analyze to hundreds of thousands of cells. This called for additional optimization of the existing EDDY-GPU code. By working with an NVIDIA team through the Princeton Hackathon 2022, we were able to dramatically increase the computational speed of EDDY-GPU. New sampling strategies have been implemented to adjust to sample counts at this scale. In addition, the latest code-development phase identified various performance bottlenecks, which not only improved acceleration but also allowed for the incorporation of even larger gene sets, such as immune pathways. Hence, EDDY's statistical rigor can now be brought to bear on the inference of specific diagnostic and treatment strategies for the individual patient, with an implementation that allows this data analysis to run on a physician's desktop within a reasonable time. We will present preliminary results using this newly improved EDDY-GPU with single-cell transcriptomic data from cancer, Alzheimer's disease, and pulmonary hypertension.
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionThe development of directive-based parallel programming models such as OpenACC has significantly reduced the cost of using accelerators such as GPUs. In this study, the sparse matrix-vector product (SpMV), often the most computationally expensive part of physics-based simulations, was accelerated by GPU porting using OpenACC. Further speedup was achieved by introducing the element-by-element (EBE) method in SpMV, an algorithm well suited to GPU architectures because it requires a large amount of computation but only a small amount of memory access. In a comparison on one compute node of the supercomputer ABCI, using GPUs yielded a 21-fold speedup over the CPU-only case even with the typical SpMV algorithm, and an additional 2.9-fold speedup with the EBE method. This approach was then applied to a seismic response analysis considering soil liquefaction, where using GPUs gave a 42-fold speedup over CPUs alone.
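For reference, the typical SpMV that such studies start from, here in CSR (compressed sparse row) format as a plain-Python sketch rather than the authors' OpenACC or EBE kernels:

```python
def spmv_csr(indptr, indices, data, x):
    """y = A @ x for a sparse matrix A in CSR form. Each nonzero is
    touched once, so the kernel is dominated by irregular memory
    access; that memory-bound profile is what GPU porting and the
    compute-heavy EBE variant aim to improve on."""
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# A = [[2, 0], [1, 3]], x = [1, 2]
y = spmv_csr([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 2.0])
```

The EBE method trades this stored-matrix traversal for on-the-fly recomputation of element matrices, raising arithmetic intensity at the cost of extra operations.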
Paper
Recorded
Data Analytics
Performance
TP
DescriptionProduction software in data centers often suffers from unnecessary memory inefficiencies. However, whole-program monitoring tools often incur prohibitively high overhead due to fine-grained memory-access instrumentation.
To this end, this work presents a novel learning-aided system, Puffin, that identifies three kinds of unnecessary memory operations (dead stores, silent loads, and silent stores) by applying gated graph neural networks to fused static and dynamic program semantics with relative positional embedding. To deploy the system in large-scale data centers, this work explores a sampling-based detection infrastructure with high efficacy and negligible overhead. We evaluate Puffin on the well-known SPEC CPU 2017 benchmark suite under four compilation options. Experimental results show that the proposed method captures the three kinds of memory inefficiencies with accuracy as high as 96%, at a 5.66x speedup over the state-of-the-art tool.
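The three inefficiency kinds have simple trace-level definitions, sketched below on a toy load/store trace (Puffin itself predicts them with a graph neural network rather than replaying full traces, so this is only a working definition):

```python
def classify(trace):
    """Classify (op, addr, value) events in a memory trace:
      silent load:  loads the same value as the previous load of addr
      silent store: writes the value already present at addr
      dead store:   store overwritten before any load of addr
    Returns the trace indices of each kind."""
    mem, last_load, pending_store = {}, {}, {}
    silent_loads, silent_stores, dead_stores = [], [], []
    for i, (op, addr, value) in enumerate(trace):
        if op == "load":
            if addr in last_load and last_load[addr] == value:
                silent_loads.append(i)
            last_load[addr] = value
            pending_store.pop(addr, None)  # stored value was read: not dead
        else:  # store
            if mem.get(addr) == value:
                silent_stores.append(i)
            if addr in pending_store:      # previous store never read
                dead_stores.append(pending_store[addr])
            pending_store[addr] = i
            mem[addr] = value
    return silent_loads, silent_stores, dead_stores

trace = [("store", 0x10, 1), ("store", 0x10, 1),  # dead store, then silent store
         ("load", 0x10, 1), ("load", 0x10, 1)]    # second load is silent
result = classify(trace)
```

Exactly this kind of fine-grained bookkeeping is what makes trace-based whole-program monitoring expensive, motivating Puffin's learned, sampling-based alternative.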
Paper
Recorded
Big Data
Computational Science
TP
DescriptionExisting streaming graph processing systems typically adopt two phases of refinement and recomputation to ensure the correctness of the incremental computation. However, severe redundant memory accesses exist due to unnecessary synchronization among independent edge updates. In this paper, we present GraphFly, a high-performance asynchronous streaming graph processing system based on dependency-flows. GraphFly features three key designs: 1) dependency trees (Dtrees), which quickly identify independent graph updates at low cost; 2) a dependency-flow based processing model, which exploits space-time dependent co-scheduling for cache efficiency; and 3) a specialized graph data layout, which further reduces memory accesses. We evaluate GraphFly, and the results show that it significantly outperforms the state-of-the-art systems KickStarter and GraphBolt by 5.81x and 1.78x on average, respectively. GraphFly also scales well with different update-batch sizes and compute resources.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionWith the slowing of Moore’s Law and explosion of data-flow compute in the domains of AI and HPC, there has been a new renaissance in domain specific architectures (DSAs) to help meet today’s compute demands. A large swath of these architectures are spatial in nature, where compute is unrolled in space to expose more parallelism for data-flow-heavy workloads. With these spatial architectures comes the challenge of effectively mapping workloads to the available compute units. Parallelizing compilers are often touted as the means to this goal, but their effectiveness is largely limited by the abstraction exposed by hardware to software. Here we explore the inherent challenges faced by some existing spatial architectures, such as GPUs, and explain how focusing on deterministic compute can alleviate these challenges. We do this by diving deep into Groq’s Tensor Streaming Processor (TSP), exploring how the architecture empowers software to efficiently map data-flow workloads to the chip’s massive amounts of compute. We demonstrate how this “software-defined hardware” approach is well-suited for data-flow compute, showcasing >5x improvements compared to current state-of-the-art on LSTM and Transformer-based models. We also explore how the compiler and architecture allow for powerful hardware-software co-design capabilities.
Paper
Recorded
File Systems and I/O
Storage
TP
DescriptionModern High-Performance Computing (HPC) data centers routinely store massive data sets resulting in millions of directories and billions of files. To efficiently search and sift through these files and directories we present the Grand Unified File Index (GUFI), a novel file system metadata index that enables both privileged and regular users to rapidly locate and characterize data sets of interest. GUFI uses a hierarchical index that preserves file access permissions such that the index can be securely accessed by users while still enabling efficient, advanced analysis of storage system usage by cluster administrators. Compared with the current state-of-the-art indexing for file system metadata, GUFI is able to provide speedups of 1.5x to 230x for queries executed by administrators on a real production file system namespace. Queries executed by users, which typically cannot rely on cluster-wide indexing, see even greater speedups using GUFI.
Paper
Recorded
Architectures
Networks
TP
Best Reproducibility Advancement Finalist
DescriptionNumerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.
Tutorial
Recorded
Accelerator-based Architectures
Applications
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
TUT
DescriptionSYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to an open standard, platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++.
In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language. The main benefit of using SYCL over other heterogeneous programming models is its single-source approach, which enables one to target multiple devices using the same programming model and therefore to write cleaner, more portable, and more readable code.
This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises.
Students@SC
DescriptionMachine learning can be used to solve a lot of seemingly disparate problems in different fields. Here, we'll focus on some computer vision and natural language applications, inspired by real-world examples from ML-centered projects at Oak Ridge National Lab! We'll learn some foundations of machine learning as we play with text classification, object detection, and maybe even some video analysis problems right on your laptop!
Tutorial
Recorded
Accelerator-based Architectures
Architectures
Data Management
Heterogeneous Systems
Performance
Resource Management and Scheduling
TUT
DescriptionThis tutorial presents state-of-the-art performance tools for leading HPC systems founded on the community-developed Score-P instrumentation and measurement infrastructure, demonstrating how they can be used for performance engineering of effective scientific applications based on standard MPI, OpenMP, a hybrid combination of both, and the increasingly common usage of accelerators. Parallel performance tools from the Virtual Institute – High Productivity Supercomputing (VI-HPS) are introduced and featured in hands-on exercises with Score-P, Scalasca, Vampir, and TAU. These platform-agnostic tools are installed and supported on many of the HPC systems coordinated via PRACE, ECP, XSEDE/ACCESS, and others. We present the complete workflow of performance engineering, including instrumentation, measurement (profiling and tracing, timers and hardware counters), data storage, analysis, tuning, and visualization. Emphasis is placed on how the tools are used in combination to identify performance problems and investigate optimization alternatives. Participants will use their notebook computers for guided exercises on contemporary CPU+GPU HPC systems, which will prepare them to locate and diagnose performance bottlenecks in their own parallel programs.
Further information about the tutorial – including the registration for a training account for the hands-on exercises on the Top500 #11 JUWELS-Booster quad-A100 GPU modular supercomputer nodes at Jülich Supercomputing Centre (JSC) – is available at https://www.vi-hps.org/training/other/sc22-score-p-tutorial.html
Birds of a Feather
TP
XO/EX
DescriptionHDF5 is a pivotal I/O library for scientific applications. In this BoF, we will present new features that target exascale and “cloud HPC” environments, HDF5’s role in the ECP project, and the HDF5 roadmap. We will moderate a panel with representatives from research, commercial, and government organizations who will present case studies on how they use HDF5 for both cloud and exascale systems. This will provide a forum for users to discuss their experiences with HDF5, including new features to access data in object stores and the cloud. Session leaders will moderate open discussion with attendees and solicit feedback.
Workshop
Recorded
Performance Portability
W
DescriptionIn order to take advantage of the burgeoning diversity in processors at the frontier of supercomputing, the HPC community is migrating and improving codes to utilize heterogeneous nodes, where accelerators, principally GPUs, are highly prevalent in top-tier supercomputer designs. Programs therefore need to embrace at least some of the complexities of heterogeneous architectures. Parallel programming models have evolved to express heterogeneous paradigms whilst providing mechanisms for writing portable, performant programs. History shows that technologies first introduced at the frontier percolate down to local workhorse systems. However, we expect there will always be a mix of systems, some heterogeneous, but some remaining as homogeneous CPU systems. Thus it is important to ensure codes adapted for heterogeneous systems continue to run efficiently on CPUs. In this study, we explore how well widely used heterogeneous programming models perform on CPU-only platforms, and survey the performance portability they offer on the latest CPU architectures.
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionGraph neural networks (GNNs) have been shown to significantly improve graph analytics. Existing systems for GNN training are primarily designed for homogeneous graphs. In industry, however, most graphs are actually heterogeneous in nature (i.e., having multiple types of nodes and edges). Existing systems train a heterogeneous GNN (HetGNN) as a composition of homogeneous GNNs and thus suffer from critical limitations such as a lack of memory optimization and limited operator parallelism. To address these limitations, we propose HGL, a heterogeneity-aware system for GNN training. At the core of HGL is an intermediate representation, called HIR, which provides a holistic representation for GNNs and enables cross-relation optimization for HetGNN training. We devise tailored optimizations on HIR, including graph stitching, operator fusion, and operator bundling. Experimental results verify that HGL significantly outperforms DGL and PyG.
Workshop
Recorded
Algorithms
Architectures
Compilers
Computational Science
Exascale Computing
Heterogeneous Systems
Hierarchical Parallelism
Memory Systems
Parallel Programming Languages and Models
Parallel Programming Systems
Resource Management and Scheduling
W
DescriptionFast Fourier transforms (FFTs) are used to solve a variety of scientific and engineering problems. For example, computational fluid dynamics (CFD) simulations employ a pseudo-spectral method that solves flow equations using FFTs. FFTs require global communication among all parallel processes, which often increases communication times. We present FFTOpt, a communication optimization library that leverages hierarchical communication within a compute node and across the nodes of a large-scale system. It is a generic library that can optimize any code or application that uses MPI_Alltoall and MPI_Sendrecv, without any dependency on the system or application. FFTOpt also uses topology information and runtime details about node allocation to aggregate messages at the node and switch levels, reducing communication times. We tested FFTOpt on our department cluster and the PARAM Sanganak supercomputer at IIT Kanpur using the FFTW, FFTK, and P3DFFT libraries. FFTOpt reduces communication time by up to 63%.
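The node-level aggregation idea can be illustrated with a toy message-count model (this is not FFTOpt's implementation; the function name and parameters below are invented for illustration): routing per-process messages through one leader per node turns an all-to-all's quadratic number of inter-node messages into one aggregated message per ordered node pair.

```python
def message_counts(num_nodes, procs_per_node):
    """Count inter-node messages for a full all-to-all exchange."""
    p = num_nodes * procs_per_node
    # Flat MPI_Alltoall: every process sends one message to every
    # process that lives on a different node.
    flat = sum(1 for sender in range(p) for receiver in range(p)
               if sender // procs_per_node != receiver // procs_per_node)
    # Hierarchical scheme: gather on a node leader, send one aggregated
    # message per ordered node pair, then scatter inside the receiving node.
    hierarchical = num_nodes * (num_nodes - 1)
    return flat, hierarchical

flat, hier = message_counts(num_nodes=4, procs_per_node=8)
# 32 processes: 768 inter-node messages flat vs. 12 aggregated messages
```

The aggregated messages are larger, but far fewer of them cross the network, which is where the latency savings come from.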
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionJobs on a High Performance Computing cluster are allocated system resources by a scheduling application such as SLURM. These schedulers are highly configurable: HPC administrators can set parameters that modify and customize scheduling behavior. Although the creators and maintainers provide default values for these parameters, it is unclear which settings would be optimal for a particular HPC system running the kinds of jobs its users typically submit. Using over 37,000 jobs from historic job logs of Kansas State University’s High Performance Computing cluster, this research combines a SLURM simulator, executing over 90,000 scheduler simulations requiring more than 840,000 compute hours, with gradient boosted tree regression to predict an optimal set of scheduler configuration parameters. The predicted configuration yields a 79% decrease in average job queue time compared with the default scheduler parameters.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionWe propose a new benchmark for ranking high-performance computers. The benchmark ranks computers by how fast they can solve a sparse linear system of equations, exhibiting computational and communication requirements typical of many scientific applications. Its main novelty is the flexibility to utilize lower-precision arithmetic. This is motivated by the observation that some new hardware architectures deliver lower-precision arithmetic at higher performance, although other machines do not follow this trend. Even so, using lower-precision arithmetic reduces the required amount of data transfer, which alone can improve solver performance. We present our initial design of the new benchmark, its reference implementation, and its performance on different architectures. We also discuss the challenges of designing such a benchmark.
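The core idea, solving in lower precision and then recovering accuracy cheaply, can be sketched with dense iterative refinement in NumPy. (The benchmark itself targets sparse systems; this toy, with an invented diagonally dominant matrix, only shows why a float32 solve plus float64 residual corrections can reach full double-precision accuracy.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Diagonally dominant system so the low-precision solve stays stable
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)

# "Low precision" solve: done entirely in float32
A32 = A.astype(np.float32)
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

# Iterative refinement: float64 residual, float32 correction solve
for _ in range(5):
    r = b - A @ x                                   # residual in float64
    d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
    x += d

rel_residual = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
```

Each refinement step shrinks the error by roughly the float32 solve accuracy, so a handful of cheap low-precision solves reaches double-precision residuals.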
Posters
Research Posters
TP
XO/EX
DescriptionWe have developed an efficient algorithm capable of computing N = 1 million elements over 0.1 million time steps. Strong-scaling analyses show that the algorithm exhibits good OpenMP/MPI scalability with 8 threads and more than 10,000 cores (~200 nodes). This capacity is necessary to simulate nationwide fault activity for the Japanese Islands on current HPC systems. We apply the algorithm to simulate 15 thousand years of earthquake recurrence history along one of the largest active faults in SW Japan, the Median Tectonic Line. We demonstrate that the optimized algorithm is a powerful tool for building a physics-based method for long-term forecasting of earthquake generation.
Workshop
Recorded
W
DescriptionWe present the fifth example in a series of assignments used in a Parallel Computing course to teach approaches to the same problem in different parallel programming models. It targets concepts of shared-memory programming, distributed-memory programming, and/or GPU programming. The assignment is based on a Monte Carlo probabilistic approach to a Hill Climbing algorithm that locates the maximum values of a two-dimensional function. The program is designed to be simple and easy for students to understand, and to include specific parallelization and optimization opportunities. It maintains the same core concepts used in four previously presented assignments, with a different design approach: it focuses on dealing with non-determinism during execution, the impact of randomization on load balance, and new relevant optimization challenges. The assignment has been successfully used in parallel programming contests during an optional Parallel Programming course in the third year of a Computer Engineering degree.
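A minimal serial sketch of the assignment's core idea, Monte Carlo restarts wrapped around a hill climber on a two-dimensional function, might look like the following (the function, names, and parameters are invented for illustration; the restart loop is the natural place to introduce parallelism and the load-balance issues the assignment targets):

```python
import random

def f(x, y):
    # Simple two-dimensional function with a known maximum of 0 at (0, 0)
    return -(x * x + y * y)

def montecarlo_hill_climb(restarts=200, steps=200, step_size=0.1, seed=42):
    rng = random.Random(seed)
    best_val, best_pt = float("-inf"), None
    for _ in range(restarts):                  # Monte Carlo random restarts
        x, y = rng.uniform(-5, 5), rng.uniform(-5, 5)
        for _ in range(steps):                 # local hill climbing
            nx = x + rng.uniform(-step_size, step_size)
            ny = y + rng.uniform(-step_size, step_size)
            if f(nx, ny) > f(x, y):            # accept only improving moves
                x, y = nx, ny
        if f(x, y) > best_val:
            best_val, best_pt = f(x, y), (x, y)
    return best_pt, best_val
```

Because each restart's hill climb takes a data-dependent path, restarts finish with different amounts of useful work, which is exactly the non-determinism and load-imbalance behavior students must handle when parallelizing the outer loop.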
Workshop
Recorded
Algorithms
Architectures
Compilers
Computational Science
Exascale Computing
Heterogeneous Systems
Hierarchical Parallelism
Memory Systems
Parallel Programming Languages and Models
Parallel Programming Systems
Resource Management and Scheduling
W
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionAn avalanche of data and new computing paradigms are driving the demand for hardware accelerators; there are currently more than 200 accelerator companies. Traditional scale-out servers with GPUs rely on Ethernet/InfiniBand networks to interconnect CPU and GPU resources. These configurations use RDMA for GPU sharing at the expense of latency, utilization, and wall-clock time: the "R" (remote) in RDMA brings protocol translations, added latency, and bounce buffers.
GigaIO’s dynamic memory fabric (FabreX) enables inter-cluster DMA, providing direct resource access and eliminating both bounce buffers and unintended latency. GigaIO’s disaggregated composability using FabreX delivers a dynamic computing environment with optimized GPU utilization (80%) across a 512Gbps interconnect. This drives more science with less hardware.
This talk will discuss how GigaIO’s dynamic memory fabric utilizes changing computing configurations to advance data analytics and research:
• Compose more GPUs than servers support while ensuring CPUs are used for value-added processes
• Propel research forward via computational paradigms that can’t exist without this type of flexibility
• Accommodate scratch storage using composable NVMe-oF over FabreX
• Interconnect all devices via memory fabric so that compute resources can be moved to data at the proper workflow stage instead of moving data to compute, thus eliminating idle GPU time
• Take advantage of CXL standards and supported devices as they become available
Considerations:
• Server BIOS must support dynamic allocation of devices
• CPUs (BUS, IDs, and MMIO)
• CXL will not be a seamless transition; it will have protocol and interconnect variabilities as the technology matures
HPC Accelerates Plenary
Recorded
TP
W
TUT
XO/EX
DescriptionThe essence of High Performance Computing (HPC) is the use of computational methods to improve understanding, leading to discovery or innovation in some dimension. Supercomputing, analytics, and artificial intelligence are all different facets of this approach.
All are important at SC22, where the theme is “HPC Accelerates.” HPC accelerates in the literal sense. Parallel programming accelerates time-to-insight, which in turn accelerates scientific discovery and product development. But HPC accelerates in a grander sense as well, taking us not only faster in the same direction, but down entirely new avenues of exploration and thought. At its best, can HPC accelerate the course of humanity?
The opening SC22 plenary panel explores the many ways in which HPC Accelerates. The panelists are all leaders, representing different perspectives in supercomputing, data analytics, hyperscale, and AI, setting up a conversation that promises to range from practical to inspirational, building on where we are today, accelerating through the SC22 program to come, and into the future beyond.
Workshop
Recorded
Reliability and Resiliency
W
DescriptionThe recent entrance of the High-Performance Computing (HPC) world into the exascale era challenges how vast amounts of data are analyzed, manipulated, and stored. The already substantial performance gap between computing, memory, and storage expands rapidly in the presence of distributed large-scale applications on new-generation supercomputers. The widest gap of all, between memory and storage, is still 2-3 orders of magnitude. As a result, these applications struggle with two main storage-oriented tasks in which data must be persisted during runtime for later use: diagnostics and checkpointing. Recently, interdependent introductions of non-volatile RAM (NVRAM) hardware and persistent memory file systems (PMFSs) have been made to the storage stack, and both are planned to be integrated into the upcoming Aurora exascale system. Fridman et al. (FTXS@SC’21) benchmarked the diagnostics (FIO, BT-IO) and checkpointing (SCR, DMTCP) use cases as found on supercomputers with the aid of NVRAM and several PMFSs, excluding block-oriented non-volatile devices; instead, that strategy relies solely on RAM-NVRAM and even pure-NVRAM memory-storage configurations. We review these results and show how NVRAM can be utilized not only for checkpoint/restart mechanisms and diagnostics via PMFSs, but also for Algorithm-Based Fault Tolerance (ABFT), using the PMDK library and MPI one-sided communication directly to byte-addressable NVRAM. We specifically focus on exact state reconstruction of iterative linear solvers. We show that this strategy uses the hardware properly and reliably, achieving best-known performance for these use cases and thereby suggesting a new approach to devising recoverable HPC algorithms.
Workshop
Recorded
W
DescriptionParallel computing is becoming more relevant in non-traditional disciplines. Moreover, given the dynamism of today’s computational environments, the traditional classroom approach to HPC pedagogy does not fit the needs of every level of education. Traditional computer science education, which typically mentions the concept of threading only briefly, is no longer adequate for preparing the future HPC workforce. Additionally, many K-12 and post-college personnel are encountering problems, or are involved in projects, where high performance computing can make a useful contribution. In recent years, the HPC community has initiated several pedagogical and andragogical approaches to increase instructional effectiveness and bridge gaps in HPC knowledge and skills.
This work aims to share experiences with educational challenges and opportunities that stimulate the acquisition of high performance computing skills.
Birds of a Feather
TP
XO/EX
DescriptionGovernment agencies, industry and academia are demanding a new generation of tools to efficiently solve large scale analytics problems in a variety of business, scientific and national security applications. This BoF gathers the community developing high-performance frameworks and workflows for large scale graph analytics to survey current approaches, identify new challenges and opportunities, and discuss interoperability of emerging infrastructures. A central goal is developing requirements and recommendations for future tools. As in previous editions, this BoF will explore, and compare and contrast conventional implementations as well as algebraic approaches, inviting the GraphBLAS community to discuss its state and evolution.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionThe Omnia software project has created an open-source, community-driven software platform for managing HPC systems and resources that embraces the emergence of high performance data analytics and AI, as well as the sustained and growing importance of parallel simulation and high throughput computing. Omnia is designed with modern workloads in mind, using modern software tools and approaches to enable flexible configuration of Slurm and Kubernetes cluster from existing resources, embracing containers, accelerators, and other technologies important in current and future expected HPC workloads. In this session, we will cover both the present and near-term state of the rapidly evolving Omnia software, including the benefits to customers and the opportunity for customers and partners to contribute directly to advancing the project. Tune in to hear some known and possible future directions of Omnia, and get a sneak peek into the future of the Dell Technologies HPC systems portfolio.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF brings together experts from HPC centers around the globe to discuss future system testing methodologies. The session will include a panel focusing on HPC system testing at scale including acceptance testing of Perlmutter, Frontier, Fugaku, and LUMI. Panelists will describe challenges faced and share their perspectives on how those could have been overcome. Then, we will host two speakers to spark ideas for the open discussion in which attendees will be invited to identify key areas that HPC center staff and vendors should focus on to prepare for the next-generation of compute and data resources.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionThis work presents a Hybrid Low-Rank Natural Gradient Descent method, called HyLo, that accelerates the training of deep neural networks. Natural gradient descent (NGD) requires computing the inverse of the Fisher information matrix (FIM), which is typically expensive at large scale. Kronecker factorization methods such as KFAC attempt to improve NGD’s running time by approximating the FIM with Kronecker factors; however, the size of the Kronecker factors grows quadratically with the model size. Instead, in HyLo, we use the Sherman-Morrison-Woodbury variant of NGD (SNGD) and propose a reformulation of SNGD that resolves its scalability issues. HyLo uses a computationally efficient low-rank factorization to achieve superior timing for Fisher inverses. We evaluate HyLo on large models including ResNet-50, U-Net, and ResNet-32 on up to 64 GPUs. HyLo converges 1.4x-2.1x faster than the state-of-the-art distributed implementation of KFAC and reduces computation and communication time by up to 350x and 10.7x, respectively, on ResNet-50.
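The Sherman-Morrison-Woodbury identity underlying SNGD-style methods can be checked numerically. The sketch below (illustrative only; HyLo's actual kernels are distributed and GPU-based, and the matrix here is a made-up stand-in) inverts a low-rank-plus-diagonal Fisher approximation by solving an r × r system instead of the full n × n one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 4              # model dimension, low rank
U = rng.standard_normal((n, r))
d = 0.1                   # damping

# Low-rank-plus-diagonal Fisher approximation: F = d*I + U U^T
F = d * np.eye(n) + U @ U.T

# Direct inverse: O(n^3)
F_inv_direct = np.linalg.inv(F)

# Woodbury: (d I + U U^T)^-1 = (1/d) I - (1/d^2) U (I_r + U^T U / d)^-1 U^T
# Only an r x r solve is needed, O(n r^2) overall.
small = np.eye(r) + (U.T @ U) / d
F_inv_woodbury = np.eye(n) / d - (U @ np.linalg.solve(small, U.T)) / d**2
```

When the rank r is much smaller than n, this replaces the cubic-in-n inversion with work that is linear in n, which is the scalability lever low-rank NGD methods exploit.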
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionHyperShell is an elegant, cross-platform, high-performance computing utility for processing shell commands over a distributed, asynchronous queue. It is a highly scalable workflow automation tool for many-task scenarios. The software was originally created several years ago at Purdue University to meet researchers' needs that existing solutions did not satisfy. Here we outline the context for its existence and focus on some of the unique capabilities it offers.
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
DescriptionAs many real-world graphs change rapidly, it is crucial to design dynamic algorithms that efficiently maintain graph statistics upon updates, since the cost of re-computation from scratch can be prohibitive. Furthermore, due to the high frequency of updates, we can improve performance by using parallelism to process batches of updates at a time. This talk presents new graph algorithms in this parallel batch-dynamic setting.
Specifically, we present the first parallel batch-dynamic algorithm for approximate k-core decomposition that is efficient in both theory and practice. Our algorithm is based on our novel parallel level data structure, inspired by the sequential level data structures of Bhattacharya et al. and Henzinger et al. Given a graph with n vertices and a batch of B updates, our algorithm maintains a (2 + epsilon)-approximation of the coreness values of all vertices (for any constant epsilon > 0) in O(B log^2(n)) amortized work and O(log^2(n) loglog(n)) span (parallel time) with high probability. We implement and experimentally evaluate our algorithm, and demonstrate significant speedups over state-of-the-art serial and parallel implementations for dynamic k-core decomposition.
We have also designed new parallel batch-dynamic algorithms for low out-degree orientation, maximal matching, clique counting, graph coloring, minimum spanning forest, single-linkage clustering, some of which use our parallel level data structure.
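For reference, the exact coreness values that the batch-dynamic algorithm approximates to within a (2 + epsilon) factor can be computed by the classical sequential peeling procedure; a compact (non-parallel, non-dynamic) sketch:

```python
from collections import defaultdict

def core_numbers(edges):
    """Exact coreness via peeling: repeatedly remove a minimum-degree vertex."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(ns) for v, ns in adj.items()}
    core, k = {}, 0
    remaining = set(adj)
    while remaining:
        v = min(remaining, key=lambda u: deg[u])   # min remaining degree
        k = max(k, deg[v])                         # core number only grows
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:                           # peeling lowers neighbors' degrees
            if w in remaining:
                deg[w] -= 1
    return core
```

On a triangle {1, 2, 3} with a pendant vertex 4 attached to 3, peeling assigns coreness 2 to the triangle vertices and 1 to the pendant. The sequential procedure is inherently order-dependent in its work, which is precisely what the parallel level data structure relaxes.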
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
DescriptionGraph algorithms and techniques are increasingly being used in scientific and commercial applications to express relations and explore large data sets. Although conventional or commodity computer architectures, such as CPUs or GPUs, can handle dense graph algorithms fairly well, they are often inadequate for processing large sparse graph applications. The memory access patterns, memory bandwidth requirements, and on-chip network communication in these applications do not fit the conventional program execution flow. In this work, we propose and design a new architecture for fast processing of large graph applications. To cope with the lack of spatial and temporal locality in these applications and to support scalable computational models, we design the architecture around two key concepts. (1) The architecture is a multicore processor of independently clocked processing elements. These elements communicate in a self-timed manner and use handshaking to perform synchronization, communication, and sequencing of operations. By being asynchronous, the operating speed at each processing element is determined by actual local latencies rather than global worst-case latencies. We create a specialized ISA to support these operations. (2) The application compilation and mapping process uses a graph clustering algorithm to optimize parallel computation of graph operations and load balancing. Through the clustering process, we make scalability an inherent property of the architecture, where task-to-element mapping can be done at the graph-node level or at the node-cluster level. A prototyped version of the architecture outperforms a comparable CPU by 10~20x across all benchmarks and provides 2~5x better power efficiency than a GPU.
Birds of a Feather
TP
XO/EX
DescriptionThe International Association of Supercomputing Centres (IASC), formed in June 2022, is a worldwide consortium of public-facing advanced computing user facilities sharing knowledge and know-how on their operations, management, and strategy. In this BoF, results from survey workshops held this fall will be discussed with an expert panel and the participants. Input from this session will guide working group formation around central topics (e.g. NetZero, open-source software, quantum computing in HPC centers, cloud strategy). Center directors, managers, program developers, technical topic leads, and anyone interested are warmly welcomed to join the conversation.
Paper
Recorded
Accelerator-based Architectures
Bioinformatics
File Systems and I/O
TP
DescriptionPtychography is a popular microscopic imaging modality that holds the record for the highest image resolution. Unfortunately, high image resolution requires a significant amount of memory and computation, forcing many applications to compromise their image resolution in exchange for a smaller memory footprint and a shorter reconstruction time. In this paper, we propose a novel image gradient decomposition method that significantly reduces the memory footprint by tessellating image gradients and measurements into tiles. In addition, we propose a parallel decomposition method that enables asynchronous point-to-point communication and pipelining with minimal parallel overhead. Our experiments on a large-scale titanate material dataset show that gradient decomposition reduces the memory footprint by 51 times and achieves a time-to-solution of 2.2 minutes by scaling to 4158 GPUs, with a super-linear speedup at 364% efficiency. This performance is 2.7 times more memory-efficient, 9 times more scalable, and 86 times faster than the state-of-the-art algorithm.
Birds of a Feather
TP
XO/EX
DescriptionCancer is a disease that touches us all. With the explosion in new data, availability of HPC, and access to AI resources, the opportunities are tremendous for individuals to make contributions. The importance of inclusivity, diversity and equity are also essential to progress in cancer from research to clinic, from workforce to patient. This BoF will focus on highlighting the critical need for a broad and global community involvement, avenues to get connected with HPC and AI in cancer research, and avenues to develop the workforce to ultimately broaden the impact of HPC on cancer and accelerate treatments.
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionThe move to larger, more powerful compute nodes on large-scale HPC systems has been significant in recent years. It is not uncommon for nodes now to have 128+ computational cores and substantial GPU resources. This provides potential scope for active middleware to run on these nodes, managing anything from storage and I/O to compute kernels and network traffic. However, we need a stronger understanding of the impact of on-node workloads on application performance, especially when aiming to scale to exascale systems with many millions of workers. I will discuss work we are doing to evaluate and characterize the impact of on-node workloads, and explore some of the active middleware that could enable scaling up to very large node and system sizes without requiring significant changes to user applications.
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionComputation on architectures that feature fine-grained parallelism requires algorithms that overcome load imbalance, inefficient memory accesses, serialization, and excessive synchronization. In this paper, we explore an algorithm that completely removes the need for synchronization but allows for asynchronous updates in the spirit of chaotic relaxation. Methods of this type have been identified as highly competitive for computations on exascale machines, but practical implementations for GPU platforms featuring extreme parallelism levels are a scarce resource. We present an asynchronous Richardson iteration optimized for high-end GPUs, demonstrate the superiority of the algorithm over a highly tuned synchronous Richardson iteration, and deploy the algorithm as production-ready implementation in the Ginkgo open source library. The ideas presented here on the algorithm design, implementation, and performance can help guide the design of other asynchronous algorithms on GPUs.
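For context, the synchronous baseline the paper compares against is the classical Richardson iteration, x ← x + ω(b − Ax); a minimal NumPy version on a small made-up well-conditioned system follows (the paper's contribution, the asynchronous GPU variant, removes the bulk-synchronous structure of this loop so components of x can be updated with stale neighbors):

```python
import numpy as np

def richardson(A, b, omega, iters=200):
    """Classical (synchronous) Richardson iteration: x <- x + omega * (b - A x)."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x + omega * (b - A @ x)   # every component updated in lockstep
    return x

rng = np.random.default_rng(7)
n = 30
A = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # eigenvalues near 1
b = rng.standard_normal(n)
x = richardson(A, b, omega=0.9)
```

Convergence requires the spectral radius of I − ωA to be below 1; the asynchronous variant relaxes when each component sees the latest x, trading strict consistency for the elimination of global synchronization.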
Workshop
Recorded
W
DescriptionThe User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to close a gap between current practice and the ideal, more asynchronous, recovery model in which the fault-tolerance activities of multiple components can be carried out simultaneously and overlap.
This work proposes to:
(1) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and
(2) overlap recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints).
Posters
Research Posters
Recorded
TP
DescriptionFast analysis of scientific data from X-ray free electron laser (XFEL) experimental facilities is key for supporting real-time decisions that efficiently use these facilities to speed up scientific discovery. Our research shows gains obtained using graphics processing units (GPUs) to accelerate 3D reconstruction of Single Particle Imaging (SPI) X-ray diffraction data. We achieve a 4X speedup over the previous GPU implementation, 50% better image reconstruction resolution, and 485X speedup when calculating resolution compared to the existing implementation. We showcase techniques to optimize per-node computational efficiency, increase scalability and improve the accuracy of SPI by using better algorithms, improving data movement and accesses, reusing data structures, and reducing memory fragmentation.
Workshop
Recorded
W
DescriptionThread-based MPI runtimes, which associate private communication contexts or endpoints with each thread, rather than sharing a single context across a multithreaded process, have been proposed as an alternative to MPI's traditional multithreading models. Adaptive MPI is one such implementation, and in this work we identify and overcome shortcomings in its support for point-to-point communication. We examine also the consequences of MPI's messaging semantics on its runtime and investigate how its design can be improved for applications that do not require the full messaging semantics. We show that the issues for AMPI reduce to similar problems first identified in the context of efficient MPI+X support. Our focus is on enhancing AMPI's support for asynchrony and concurrency while still optimizing for communication locality through a novel locality-aware message matching scheme. We compare performance with and without the relaxed messaging semantics and our associated optimizations.
Workshop
Recorded
Security
W
DescriptionAll modern computer systems, including supercomputers, are vulnerable to a wide variety of security exploits. Performance analysis tools are an often overlooked source of vulnerabilities. Performance measurement interfaces can have security issues that lead to information leakage, denial of service attacks, and possibly even full system compromise. Desktop systems can mitigate risk by disabling performance interfaces, but that is not always possible on HPC systems where performance (and thus measurement) is paramount. We investigate various ways of finding security issues in the performance measurement stack. We introduce the perf_fuzzer, a tool that methodically finds bugs in the Linux perf_event_open() system call. We also discuss the perf data fuzzer which looks for userspace bugs in the perf analysis tool. We describe the development of the fuzzing tools, examine the bugs found, and discuss ways to prevent such bugs from occurring in the future.
Invited Talk
Recorded
TP
XO/EX
DescriptionThere is a pressing need to bring machine learning to a diverse set of hardware devices. Current approaches typically rely on vendor-specific operator libraries and frameworks, and require significant engineering effort. In this talk, we will present an overview of the Apache TVM open source stack, which exposes graph- and operator-level optimizations to provide performance portability for machine learning workloads across diverse hardware back-ends. TVM solves compiler optimization challenges by employing a learning-based approach for rapid exploration of optimizations, saving months of engineering time and offering state-of-the-art performance in both edge and server use cases. We will discuss how TVM offers broad model coverage, and makes effective use of hardware resources. We will end the talk with a peek at the OctoML Platform which brings DevOps agility to ML deployment.
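The scale of the problem a learning-based tuner addresses can be seen even in a toy setup (everything below is invented for illustration; TVM's real schedule spaces contain billions of candidates, which is why it fits a cost model to past measurements instead of measuring every candidate on hardware):

```python
def simulated_cost(tile_x, tile_y):
    """Toy stand-in for compiling and timing one schedule on hardware:
    pretend the best tiling is (16, 8). A real autotuner measures here."""
    return (tile_x - 16) ** 2 + (tile_y - 8) ** 2 + 1.0

# the "schedule space": every combination of two tiling knobs
tiles = [(x, y) for x in (1, 2, 4, 8, 16, 32) for y in (1, 2, 4, 8, 16, 32)]

# exhaustive tuning: measure every candidate and keep the cheapest
best = min(tiles, key=lambda t: simulated_cost(*t))
best_cost = simulated_cost(*best)
```

With only two knobs this exhaustive sweep is 36 measurements; add loop ordering, vectorization, and unrolling knobs and the product explodes, which is the motivation for learned cost models that predict promising candidates before measuring.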
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
DescriptionIn this paper, we present a novel in situ algorithm for tracking flow features characteristic of multiphase dispersed systems. The work combines elements of feature detection, temporal analysis of geometric objects, and distributed data processing. We argue that high-fidelity simulations have a unique opportunity to measure certain statistical properties that are essential for building practical understanding of many scientific and industrial problems. We employ high-frequency temporal sampling to reconstruct a complete evolution of drops or bubbles. In particular, we can track various elementary properties such as position, volume, and surface area, but our main focus is on events like break-up or coalescence. An implementation of the algorithm is carried out in Catalyst V2 with OpenFOAM™ as the simulation code. Finally, we discuss issues that at present inhibit a highly scalable application of the solution.
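The core of break-up and coalescence detection can be sketched as linking features between consecutive timesteps by cell overlap (a minimal stand-in for the paper's distributed algorithm; here features are plain sets of cell indices and everything runs in one process):

```python
def track_features(prev, curr):
    """Link features (drops/bubbles) between two timesteps by cell overlap
    and classify topological events. Features are given as {id: set(cells)}."""
    children = {p: [] for p in prev}
    parents = {c: [] for c in curr}
    for p, pcells in prev.items():
        for c, ccells in curr.items():
            if pcells & ccells:      # any shared cells => same physical object
                children[p].append(c)
                parents[c].append(p)
    events = []
    for p, cs in children.items():
        if len(cs) >= 2:             # one parent, many children: break-up
            events.append(("break-up", p, sorted(cs)))
    for c, ps in parents.items():
        if len(ps) >= 2:             # many parents, one child: coalescence
            events.append(("coalescence", sorted(ps), c))
    return events

# a bubble occupying cells {1,2,3,4} splits into two smaller ones
events = track_features({1: {1, 2, 3, 4}}, {10: {1, 2}, 11: {4, 5}})
```

High-frequency sampling, as the abstract notes, is what makes this overlap heuristic reliable: between closely spaced snapshots a feature cannot move far enough to lose contact with its predecessor.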
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
DescriptionIn the context of numerical simulation, a surrogate model approximates the outputs of a solver at a low computational cost. In this article, we present an in situ visualization prototype, based on Catalyst 2, for monitoring the training of surrogate models based on deep neural networks. We believe that in situ monitoring can help solve a fundamental problem of this kind of training: standard metrics, such as the mean squared error, do not convey enough information about which simulation aspects are harder to learn. Our prototype allows interactive visualization of the current state of convergence of the spatial field of a physical quantity, complementing the traditional loss function value curve. We enable the steering of physical parameters during the training process for interactive exploration. We also allow the user to influence the learning process in real time by changing the learning rate. Results are illustrated on a Computational Fluid Dynamics use case.
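The learning-rate steering loop can be sketched in a few lines (a toy one-parameter model standing in for the deep network; the `steer` callback plays the role of the interactive monitor):

```python
def train(steer, epochs=200):
    """Minimal steerable training loop: `steer` is called every epoch and
    may return a new learning rate (or None to keep the current one),
    mimicking how an in situ monitor lets a user adjust training live."""
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [0.0, 2.0, 4.0, 6.0]    # ground truth: y = 2x
    w, lr = 0.0, 0.05            # one trainable weight
    for epoch in range(epochs):
        new_lr = steer(epoch, w)
        if new_lr is not None:
            lr = new_lr
        # gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

# e.g. a user halves the learning rate midway through training
w = train(lambda epoch, w: 0.025 if epoch == 100 else None)
```

In the actual prototype the callback boundary is where Catalyst 2 extracts the current predicted field for rendering, so the same hook serves both visualization and steering.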
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionSoftware development in a High-Performance Computing (HPC) environment is non-trivial and requires a thorough understanding of the application and the architecture. Static Code Analysis helps many developers understand, choose, re-use, or document the most suitable software or algorithm implementations.
After the initial development, all HPC software has many “Maintenance and Evolution” stages, each introducing new improvements, functionalities, or further optimizations. Those step-by-step refinements and tuning can create many feedback loops, easily jeopardizing a project’s integrity and sustainability.
The presentation addresses Historical DevOps and its positive impact on code sustainability challenges in HPC.
Historical DevOps provides new data points to identify and quantify structural changes in source code beyond traditional line-based churn. Historical DevOps relies on Static Code Analysis to create clear, granular, and actionable views of “what changed between versions of source code,” enabling new strategies that let developers maintain better control over the evolution of any software project. The mid-term result is fewer costly rewrites and project cancellations. The benefits are even greater when an HPC team relies on extensive cooperation and third-party projects and dependencies.
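The contrast with line-based churn can be illustrated with a small structural diff (a toy built on Python's `ast` module; the presenter's actual tooling is proprietary and language-independent): comparing normalized syntax trees means a formatting-only edit registers as no change at all.

```python
import ast

def function_fingerprints(source):
    """Map each function name to a normalized dump of its AST, so that
    formatting-only edits do not register as change."""
    tree = ast.parse(source)
    return {node.name: ast.dump(node)
            for node in ast.walk(tree)
            if isinstance(node, ast.FunctionDef)}

def structural_churn(old_src, new_src):
    """Classify functions as added, removed, or changed between versions."""
    old, new = function_fingerprints(old_src), function_fingerprints(new_src)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(n for n in set(old) & set(new) if old[n] != new[n]),
    }

# f is only reformatted (no structural change); g is dropped; h is new
old = "def f(x):\n    return x + 1\n\ndef g(x):\n    return x\n"
new = "def f(x):\n    return x+1\n\ndef h(x):\n    return x * 2\n"
report = structural_churn(old, new)
```

A line-based diff of the same two versions would flag `f` as modified; the structural view correctly reports only the removal of `g` and the addition of `h`.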
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
AI-HPC Convergence
Emerging Technologies
Memory Systems
Networks
Resource Management and Scheduling
W
Birds of a Feather
TP
XO/EX
DescriptionBeing a standards-based interconnect, InfiniBand enjoys the continuous development of new capabilities. NDR 400G InfiniBand In-Network Computing and Data Processing Unit (DPU) technologies provide innovative hardware and programmable engines that offload and accelerate communication frameworks and application algorithms. The session will discuss InfiniBand In-Network Computing technology and testing results from leading supercomputing platforms as well as the NVIDIA Selene AI supercomputer. As the need for faster data speeds accelerates, the InfiniBand Trade Association has been working to set the goals for future speeds (XDR and beyond). This topic, along with the first NDR results, will also be covered in the session.
Student Cluster Competition
TP
XO/EX
DescriptionThis is a first-time collaboration for the Purdue-IU Student Cluster Competition team; no team members have previously participated in the competition. IU has not competed in any major Student Cluster Competition in many years; Purdue last competed in a major Student Cluster Competition in 2019. All team members have some formal education in computational skills, ranging from basic programming through operating systems. However, their experiences are quite varied. The team spans first-year to final-year students; some students have robotics experience, while others have strong interests in foreign languages and philosophy. In forming the Purdue-IU team, our philosophy was that we would leverage each other's strengths to enable peer-mentoring and, thus, provide leadership opportunities for the students. The Purdue student members are all currently acting as student HPC systems administrators, while the IU students are engaged with software and applications.
We believe that our team is interdisciplinary. The Purdue-IU team has a strong technology focus in terms of area of study but still captures a wide swath of interests. The “departments” or fields of study represented here are Computer and Information Technology, Data Science, Computer Science, Unmanned Aerial Systems, and Intelligent Systems Engineering. In particular, we note that Intelligent Systems Engineering combines computing disciplines with “domain science” disciplines such as Cellular and Molecular Biology, Neuroscience/Neuroimaging, and Precision Manufacturing. Across the United States, it is becoming increasingly common for interdisciplinary programs of study to emerge that blur or blend the lines between computing and “domain science”.
The student team members have expressed interest in high performance computing because they view HPC skills as essential to their future careers or because they believe that HPC (systems administration, applications development, facilitation, etc.) is a viable career path. IU has not fielded a cluster competition team in 10+ years; Purdue, however, has fielded over 12 teams at SC, International SuperComputing (ISC), and Asia SuperComputing Cluster (ASC) competition events. Those efforts have directly impacted the lives, education, and careers of over 70 students. Purdue team alumni have gone into HPC-related fields and hyperscale businesses, and have pursued numerous advanced degrees. Additionally, there is excitement about being on a team with students from another university. Purdue and IU are the two premier public institutions in the State of Indiana. Traditionally, the two institutions have had rivalries in sports, but both serve the public good for the citizens of Indiana (and are welcoming of students from other states and nations).
The advising team is led by Erik Gough, Lead Computational Scientist in Purdue's Research Computing Department. Supporting Erik is a team of experienced HPC center staff and HPC faculty including: Elizabett A. Hillery (Director of High Performance Computing, Research Computing at Purdue), Dr. Winona Snapp-Childs (Chief Operating Officer for the Indiana University Pervasive Technology Institute), Robert Henschel (Program Director of Research Computing Engagement at IU), Dr. Deepak Nadig (Assistant Professor in the Department of Computer and Information Technology at Purdue), and Dr. Beth Plale (Professor of Computer Engineering at IU).
Invited Talk
Recorded
TP
XO/EX
DescriptionFor practical quantum computing, HPC infrastructures will integrate quantum computers and simulators (QCS) in addition to cloud access to stand-alone QCS.
As long-term experience in conventional supercomputing demonstrates, the successful integration of QCS into HPC systems requires a focus on all three fundamental components of the HPC ecosystem: users and their applications, software, and hardware.
The strategy of the Jülich Supercomputing Centre (JSC) for quantum computing is based on three pillars. The first pillar rests on the classification of quantum computers as “analog” or “digital” systems and their intermediate stages, aiming at the provision of all these variants. The goal is to use QCS of sufficient technological maturity as pilot production systems. The second pillar of JSC’s strategy is the tightest possible integration of QCS into JSC’s HPC systems. For this purpose, JSC employs the concept of the modular supercomputer architecture (MSA). The QCS are used as modules in the MSA, closely coupled with other specialized modules such as the general-purpose CPU and GPU acceleration systems or tiered common high-speed storage. Modular operation requires joint scheduling that ensures an efficient exploitation of available resources. The third pillar of JSC’s quantum computing strategy is the creation of a world-leading quantum computer user infrastructure. This includes the technical provision of the QCS (integrated into the HPC environment) under European legislation, the deployment of the systems via peer-reviewed calls for proposals, and the support of users in simulation labs, algorithm development groups, and cooperative research. These activities are carried out in the Jülich UNified Infrastructure for Quantum computing (JUNIQ).
By coordinating the EuroHPC JU project HPCQS, JSC together with European partners from science, engineering and industry is bringing the principle of JUNIQ to the European level, with JUNIQ forming a nucleus in an integrated and federated EuroQCS (European Quantum Computing and Simulation) infrastructure.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionIn the realm of high-performance heterogeneous computing, closely coupling FPGA accelerators with heterogeneous compute elements is a much-needed technology for combating the increasing development costs of monolithic solutions. With the development of standardized die-to-die interfaces, this technology is a reality. A new era of chiplets has begun, in which designers build tiny ICs that each contain a well-defined subset of functionality. Because chiplets are designed to be mixed and matched, FPGAs provide the ideal base die to connect these small ICs together in one package using standardized high-performance die-to-die interfaces such as Universal Chiplet Interconnect Express (UCIe). This brings the benefits of faster time to market, lower development costs, and scalable solutions for increasing FPGA functionality. This talk will discuss the “New Era of Chiplets,” and FPGA technology designed specifically to utilize chiplets and standardized interfaces. We examine the robust UCIe open standard and the development community that is accelerating HPC to include more customizable, package-level integration, combining best-in-class die-to-die interconnect and protocol connections from an interoperable, multi-vendor ecosystem. We also examine FPGA architecture as a standardized interface, including utilization of a two-dimensional network on chip (2D NoC). Built into the FPGA fabric, the 2D NoC interconnects I/O, memory, and internal functional blocks to transfer low-latency, high-bandwidth data both on chip and across die-to-die interfaces. A detailed description of the 2D NoC will be included, along with an example of how designers can integrate 2D NoC technology with standard die-to-die interfaces to build high-performance heterogeneous computing systems.
Workshop
Recorded
W
DescriptionWorkflow synthesis is important for automatically creating the data processing workflow in a FAIR data management system for HPC. Previous methods are table-based, rigid, and not scalable. This paper addresses these limitations by developing a new approach to workflow synthesis: interactive NLU-powered ontology-based workflow synthesis (INPOWS). INPOWS allows the use of natural language for queries, maximizes robustness in handling concept and language ambiguities through an interactive ontology-based design, and achieves superior extensibility by adopting a synthesis algorithm powered by Natural Language Understanding. In our experiments, INPOWS demonstrates its efficacy in enabling flexible, robust, and extensible workflow synthesis.
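The ontology-based core of workflow synthesis can be sketched as a search over a tool catalog whose inputs and outputs are ontology concepts (tool names and concepts below are invented for illustration; INPOWS additionally layers NLU and interactive disambiguation on top of such a search):

```python
from collections import deque

# Toy tool catalog: each tool maps an input concept to an output concept.
TOOLS = {
    "extract_metadata": ("raw_file", "metadata"),
    "normalize": ("metadata", "normalized_metadata"),
    "index": ("normalized_metadata", "search_index"),
    "plot": ("normalized_metadata", "figure"),
}

def synthesize(start, goal):
    """Breadth-first search over the tool catalog: find the shortest chain
    of tools transforming `start` into `goal` (a minimal stand-in for
    ontology-based workflow synthesis)."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        concept, path = queue.popleft()
        if concept == goal:
            return path
        for tool, (src, dst) in TOOLS.items():
            if src == concept and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [tool]))
    return None  # no workflow reaches the goal concept

plan = synthesize("raw_file", "search_index")
```

In this framing, the NLU component's job is to map an ambiguous natural-language query onto the `start` and `goal` concepts, with the interactive ontology resolving any remaining ambiguity before the search runs.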
Posters
Research Posters
TP
XO/EX
DescriptionChimbuko is a framework for detecting real-time performance anomalies incurred by large-scale applications. Understanding the source of anomalous behaviors is difficult due to the high volume of information stored by Chimbuko in a provenance database. This undergraduate research project aims to intuitively display this high volume of information without overwhelming users. We then integrate our analysis and visualization techniques into a publicly available framework called Dashing. This project facilitates interactive user investigation of anomaly provenance in large-scale applications.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionCompute Express Link™ (CXL™) maintains memory coherency between the CPU memory space and memory on CXL-attached devices. This enables fine-grained resource sharing for higher performance with heterogeneous processing, memory disaggregation, memory pooling, persistent memory, and emerging memory media. Last year at SC’21, CXL Consortium members showcased the first public demonstrations of CXL, proving to the industry that CXL’s vision to enable a new ecosystem of high-performance, heterogeneous computing is now a reality. There were also multi-vendor demos to illustrate interoperability between vendor solutions.
This year, the CXL Consortium released the CXL 3.0 specification to the public. CXL 3.0 builds on previous technology generations to increase scalability and optimize system level flows with advanced switching and fabric capabilities. In addition to doubling the data rate to 64 GT/s with no added latency over CXL 2.0, CXL 3.0 introduces fabric capabilities and management, improved memory sharing and pooling, enhanced coherency, efficient peer-to-peer communications, and fine-grained resource sharing across multiple compute domains.
With the completion of the first CXL Consortium Pre-FYI compliance event, CXL products are even closer to market release. In the CXL Consortium booth, member companies will be showcasing their CXL solutions, multi-vendor demos, and proof-of-concepts that will highlight use cases in the industry that will benefit from CXL’s high speed, low latency, and cache coherent interconnect.
Tutorial
Recorded
Algorithms
Emerging Technologies
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
TUT
DescriptionHybrid quantum-classical algorithms are expected to be among the most important quantum applications in the next few years. Frameworks based on C++ have demonstrated significant performance advantages compared to Python-based frameworks when handling scalable quantum applications. In this tutorial, we will introduce the principles of quantum computation and the basic concept of hybrid quantum-classical variational algorithms, and then we will present a C++ quantum SDK, developed by Intel, for efficient execution of variational algorithms. We will also demonstrate how a hybrid quantum-classical program is written, compiled, and executed on the platform. This tutorial will be highly interactive. Participants will get hands-on experience in writing C++ code to solve chemistry problems, such as estimating the ground state energy of an H2 molecule, and graph problems, such as MaxCut.
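The hybrid quantum-classical loop at the heart of variational algorithms can be sketched for a single qubit (Python here for brevity rather than the tutorial's C++; the exact expectation value stands in for the shot-based estimate a real device would return):

```python
import math

def expectation(theta):
    """Expectation value <Z> of a single qubit prepared by RY(theta)|0>.
    On hardware this would be estimated from measurement shot counts;
    here we use the exact closed form cos(theta)."""
    return math.cos(theta)

def parameter_shift_grad(theta):
    # exact gradient rule for gates generated by Pauli operators:
    # dE/dtheta = (E(theta + pi/2) - E(theta - pi/2)) / 2
    return 0.5 * (expectation(theta + math.pi / 2)
                  - expectation(theta - math.pi / 2))

# classical optimizer loop: gradient descent toward the ground state <Z> = -1
theta, lr = 0.3, 0.4
for _ in range(100):
    theta -= lr * parameter_shift_grad(theta)
energy = expectation(theta)
```

The same structure scales up to molecular Hamiltonians: the quantum side evaluates expectation values of a parameterized circuit, and the classical side updates the parameters to minimize the energy.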
Tutorial
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
TUT
DescriptionInfiniBand (IB), High-speed Ethernet (HSE), RoCE, Omni-Path, EFA, Tofu, Slingshot, and Aquila technologies are generating a lot of excitement toward building next-generation High-End Computing (HEC) systems including clusters, datacenters, file systems, storage, cloud computing, and Big Data (Hadoop, Spark, HBase, and Memcached) environments. This tutorial will provide an overview of these emerging technologies, their offered architectural features, their current market standing, and their suitability for designing HEC systems. It will start with a brief overview of IB, HSE, RoCE, Omni-Path, EFA, Tofu, Slingshot, and Aquila. An in-depth overview of the architectural features of IB, HSE (including iWARP and RoCE), and Omni-Path, their similarities and differences, and the associated protocols will be presented. An overview of the emerging NVLink/NVLink2/NVSwitch, Slingshot, Tofu, and Aquila architectures will be given. An overview of OpenFabrics stack which encapsulates IB, HSE, and RoCE (v1/v2) in a unified manner will be presented. An overview of libfabrics and UCX stacks will also be provided. Hardware/software solutions and the market trends behind these networking technologies will be highlighted. Sample performance numbers of these technologies and protocols will be presented. Finally, hands-on exercises will be carried out for the attendees to gain first-hand experience in running experiments with high-performance networks.
Tutorial
Recorded
Algorithms
Emerging Technologies
Post-Moore Computing
Quantum Computing
TUT
DescriptionQuantum computing offers the potential to revolutionize high-performance computing by providing a means to solve certain computational problems asymptotically faster than any classical computer. Quantum computing has advanced recently from merely a theoretical possibility to engineered reality, including commercial entities offering early prototype quantum processors, both special-purpose quantum annealers and general-purpose gate-model processors. The media have been showcasing each new development and implicitly conveying the message that quantum-computing ubiquity is nigh. Here, we will respond to this hype and provide an overview of the exciting but still early state of the field.
In this tutorial, we introduce participants to the computational models that give quantum computing its immense computational power. We examine the thought processes that programmers need to map problems to quantum computers. And we discuss hardware and algorithmic challenges that must be overcome before quantum computing becomes a component of every software developer's repertoire.
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionToday, many supercomputing organizations perform system evaluation and analysis for multiple reasons, from managing system services to application performance analysis to long-term system co-design, often using different data and tools for each. But now within sight is the ability to use high-fidelity, continuously collected, system-wide data, combined with vendor-independent community software tools, to carry out real-time system management and application performance improvement, all the way to long-term quantitative co-design. When merged with models and/or kernels of the next generation of applications and methods, we may be able to rapidly and fully evaluate many configurations and systems to optimize the next generation of technologies.
Workshop
Recorded
W
DescriptionRicardo will give a quick look at how CERN is using Kubernetes at scale.
Workshop
Recorded
W
DescriptionContainers have become an essential tool for the deployment of scientific data pipelines in a portable and reproducible manner across clusters and clouds. The immutability of containers is an essential concept in this context because it guarantees that the environment in which pipeline tasks are executed is not altered over time and is precisely replicable. However, this principle collides with the continuously evolving nature of scientific data analyses. Modern scientific workflows can be composed of several dozens of tasks and corresponding containers, which developers need to modify and deploy promptly to validate experiment hypotheses. This may require re-building dozens of containers and uploading them to remote repositories, which forces developers to waste a significant amount of time on “infrastructure” tasks and adds considerable overhead to the overall development and deployment cycle. This talk will introduce a novel solution to the problem of container provisioning for data analysis pipelines based on the concept of container augmentation, which allows the extension of the container content at runtime without the need to store it in a persistent repository.
Workshop
Recorded
W
DescriptionIn the event of a volcanic eruption it is vital to be able to predict the spread of volcanic gases. Such gases can cause respiratory problems and, if the concentrations are high enough, burns to skin and asphyxiation. This is particularly relevant near the eruption site, and when the eruption occurs in a region with a high degree of geographical complexity, such as Iceland, where dense gases can settle in low-lying areas. This presentation will describe some of the basics of modeling volcanic plume dispersion, including highly complex simulations that capture the spread of volcanic gases. In March 2021 and August 2022, there were fissure eruptions on the Reykjanes Peninsula in Iceland. Following the initial notification from Iceland, the National Center for Atmospheric Science obtained an emergency allocation of 256 processors from the UK Supercomputer, ARCHER2. A modified version of the Weather Research and Forecasting (WRF) model was then used to simulate volcanic gas dispersion operationally, and the simulations were shared with colleagues in Iceland. The model, implementation on the supercomputer, and workflow (from initial notification to uploading of results) will be described.
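As a baseline for the "basics of modeling plume dispersion", the classical steady-state Gaussian plume model is the usual starting point (a far simpler model than WRF, shown here only to illustrate the physics of advection plus turbulent spreading; in practice the dispersion widths grow with downwind distance and stability class):

```python
import math

def gaussian_plume(q, u, y, z, h, sigma_y, sigma_z):
    """Steady-state Gaussian plume concentration at crosswind offset y and
    height z, for a source of strength q (g/s) at effective height h (m),
    wind speed u (m/s), and dispersion widths sigma_y, sigma_z (m).
    Includes total ground reflection via the mirrored-source term."""
    lateral = math.exp(-y * y / (2 * sigma_y ** 2))
    vertical = (math.exp(-(z - h) ** 2 / (2 * sigma_z ** 2))
                + math.exp(-(z + h) ** 2 / (2 * sigma_z ** 2)))
    return q / (2 * math.pi * u * sigma_y * sigma_z) * lateral * vertical

# ground-level concentration on and off the plume centerline
c_center = gaussian_plume(q=1000.0, u=5.0, y=0.0, z=0.0,
                          h=50.0, sigma_y=80.0, sigma_z=40.0)
c_off = gaussian_plume(q=1000.0, u=5.0, y=200.0, z=0.0,
                       h=50.0, sigma_y=80.0, sigma_z=40.0)
```

Models like WRF go far beyond this by resolving terrain, time-varying winds, and dense-gas pooling, which is exactly what matters in Iceland's complex topography.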
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionThis is a particularly disruptive time for the development of future computing systems, and the next 10 years will see some very fundamental shifts in how those systems are architected and deployed. With the emergence of heterogeneous computing, including AI and quantum, we are seeing explosive growth in the computational techniques supported on future computing systems and in the computing solutions that support those techniques. In this talk we will present those opportunities and challenges, and we will offer thoughts on the implications for future system designs at both the hardware and software levels.
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionWe present a brief overview of machine learning techniques and show that certain methods of linear algebra, such as the eigenvalue problem or, more generally, singular value decomposition, constitute the foundations of these techniques. We consider some example applications, highlighting the essential role of these methods. The ever-increasing production of data requires new methodological and technological approaches to meet the challenge of its effective analysis. A new machine learning approach based on the Unite and Conquer methods will be presented. This intrinsically parallel and scalable technique can be implemented with synchronous or asynchronous communications. Experimental results demonstrating the value of the approach for effective data analysis, in the cases of clustering and anomaly detection, will be presented.
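The role of singular value decomposition in clustering can be shown with a tiny sketch (pure-Python power iteration on a toy dataset; real pipelines use parallel iterative eigensolvers, which is where methods like Unite and Conquer come in):

```python
import math

def top_singular_vector(rows, iters=200):
    """Power iteration on A^T A to approximate the dominant right singular
    vector of the data matrix A (given as a list of rows)."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # w = A^T (A v)
        av = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(rows[i][j] * av[i] for i in range(len(rows)))
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def cluster_by_projection(rows):
    """1-D spectral-style clustering: project each point on the dominant
    singular direction and split on the sign of the projection."""
    v = top_singular_vector(rows)
    return [0 if sum(r[j] * v[j] for j in range(len(v))) < 0 else 1
            for r in rows]

data = [[2.0, 2.1], [1.9, 2.0], [-2.0, -1.9], [-2.1, -2.0]]
labels = cluster_by_projection(data)
```

The expensive step is exactly the eigenvalue/SVD computation, which motivates restarted, asynchronous-capable hybrid methods for large distributed data.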
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionWe live in a world where large-scale systems for machine intelligence are increasingly being used to solve complex problems in scientific research. A convergence of machine learning model adoption alongside classical algorithms, the availability of purpose-built scale-out systems in the cloud, and maturing software ecosystems is paving the way for an exponential increase in the size of ML models being deployed at scale by research institutions. Models with trillions of parameters are not far in the future. New hybrid systems combining both classical and AI approaches will be required to meet the needs of these large-scale algorithms.
In this session, Graphcore will reveal how Intelligence Processing Unit (IPU) systems, purpose built for AI and particularly well suited to hybrid AI/HPC workloads, have been designed to tackle these compute scale-out challenges. This technology allows researchers to start small and then seamlessly scale to tackle mega-models, preparing them for the multi-trillion-parameter model era.
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
DescriptionMachine learning (ML) has become ubiquitous within the sciences due to its ability to perform a wide array of tasks that add value within traditional workflows. These models can provide advanced data analytics through dimensionality reduction, pattern recognition, and clustering. For large-scale simulations, post hoc data analysis requires writing and reading large quantities of data, which can severely limit the rate at which analysis can proceed. In situ analysis can reduce the frequency and quantity of data written to disk but requires the integration of simulations with ML methods, which poses a software development challenge. In this talk, we will present an approach to integrating simulations with ML models through an inference server and remote procedure calls (RPCs). By separating the machine learning into one or more independent processes, the inference calls can be made within drop-in functions using RPCs with minimal modifications to the existing code, and can be scaled across parallel processes with MPI. While the deep learning platform TensorFlow is typically considered a Python tool, RPCs can couple a TensorFlow model server with applications written in a wide variety of languages. We will demonstrate the computational efficiency and scalability of the approach across a series of use cases, such as deploying machine-learned surrogate models in simulations and enabling ML super-resolution in visualization tools.
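The decoupling described here can be sketched with Python's standard-library RPC machinery; the XML-RPC server and the trivial linear "model" below are placeholders standing in for a real TensorFlow model server, not the talk's actual system:

```python
# Sketch of decoupling a simulation from ML inference via RPC: the "model"
# runs behind an RPC server, and the simulation calls a drop-in function.
# Uses stdlib XML-RPC as a stand-in for a TensorFlow model server; the
# linear "model" below is a hypothetical placeholder.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def predict(x):
    """Placeholder surrogate model: y = 2x + 1, element-wise."""
    return [2.0 * v + 1.0 for v in x]

server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(predict)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Inside the "simulation", inference is one remote call:
client = ServerProxy(f"http://localhost:{port}")
result = client.predict([1.0, 2.0, 3.0])
server.shutdown()
```

Because the call crosses a process boundary, the simulation's language need not match the server's, which is the property that lets non-Python codes use a Python-hosted model.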
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionNeuromorphic computing is a popular technology for the future of computing. Much of the research and development in neuromorphic computing has focused on new architectures, devices, and materials, rather than on the software, algorithms, and applications of these systems. In this talk, I will give an overview of the field of neuromorphic computing, with a particular focus on challenges and opportunities in using neuromorphic computers as co-processors. I will discuss neuromorphic applications for both machine learning and non-machine-learning use cases.
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionIn this talk, we will discuss the challenges and opportunities of implementing deep learning algorithms at scale for scientific applications on leadership-class HPC systems. Using examples drawn from multiple application areas, we will see how challenges created by algorithmic complexity, as well as by multiple aspects of large data, lend themselves to parallelization schemes. Additionally, we will explore the implications and demands for existing and upcoming AI accelerators.
Birds of a Feather
TP
XO/EX
DescriptionWith the increasing importance of efficient IO to reaching peak computing performance, the IO500 is becoming the de facto standard for measuring HPC storage performance. First released in 2017, the IO500 has published two lists every year since, with the BoF highlight being the presentation of the new IO500 list.
This BoF’s goal is to foster the IO500 community in order to progress the common goals of creating, sharing, and benefiting from a large corpus of shared storage performance data. We are also building a detailed repository of high-performance production storage systems as they evolve over time, providing a knowledge base for HPC researchers and system designers.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionParallel file systems like Lustre contain complicated I/O paths from clients to storage servers. An efficient I/O path requires proper settings of multiple parameters, as the default settings often fail to deliver optimal performance, especially for the diverse workloads in the HPC environment. Existing tuning strategies are limited in being adaptive, timely, and flexible. We propose IOPathTune, which adaptively tunes the PFS I/O path online from the client side without characterizing the workloads, performing expensive profiling, or communicating with other machines. We leveraged CloudLab to conduct evaluations with 20 different Filebench workloads under three different test conditions: single-client standalone tests, dynamic workload changes, and multi-client executions. We observed performance on par with or better than the default configuration across all workloads, with the largest improvements being 231%, 113%, 96%, and 43%.
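Client-side online tuning of this kind can be sketched as a hill-climbing loop over a single I/O-path parameter; the parameter and the synthetic throughput function below are hypothetical stand-ins for live client measurements, not IOPathTune's actual algorithm:

```python
# Toy online tuner: adjust one parameter (say, a client RPC count) by hill
# climbing on observed throughput. The throughput function is a synthetic
# stand-in for live measurements; a real tuner adapts as the workload runs.
def throughput(p):
    return -(p - 32) ** 2  # synthetic: peaks at p = 32

def tune(p=8, step=4, rounds=50):
    best = throughput(p)
    for _ in range(rounds):
        for cand in (p + step, max(1, p - step)):
            t = throughput(cand)
            if t > best:          # keep a change only if it measurably helps
                p, best = cand, t
    return p

best_p = tune()
```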
Workshop
Recorded
W
DescriptionMost parallel profilers reveal only the MPI point-to-point communication patterns in a communication matrix. However, this may lead to sub-optimal process mapping due to incomplete information about the application's communications. In this work, we have developed a profiler, IPMPI, that reveals the complete communication information, including collective communications. We compare the results of two recent process mapping tools, TopoMatch and LPMS, with and without the input from IPMPI. We have run six applications and benchmarks: ROMS, LAMMPS, DFT, HPCG, miniAMR, and miniFE, on two supercomputers and our shared department cluster. We observed maximum average gains of between 5.7% and 8.6% across these systems.
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
Birds of a Feather
TP
XO/EX
DescriptionThe “Julia for HPC” birds-of-a-feather (BoF) aims to provide a gathering place for members of the high-performance computing (HPC) community with an interest in the Julia programming language. Julia proposes an integrated, end-to-end co-design model as an LLVM front-end for science, closing the gap between high-productivity languages and the desired performance of traditional compiled languages on extremely heterogeneous systems. We invite participants from academia, government, and industry to share and discuss their experiences, and to identify and learn about current opportunities and gaps. Potential topics include: community; adoption and support in leadership facilities; the Julia ecosystem; and programming models and packages targeting HPC workflows.
Student Cluster Competition
TP
XO/EX
DescriptionThe Universitas Indonesia team consists of senior-level undergraduate students majoring in physics (Mr. Abednego, Mr. Fahreza, Mr. Millennianno, and Mr. Ma’ruf), math (Mr. Al Josan), and computer science (Mr. Prasetya), with varying computational science experience. Mr. Abednego and Mr. Fahreza currently perform research with Quantum ESPRESSO software and Boltztrap2 Python package, and are taking a course on Fortran 90/95. Mr. Millennianno and Mr. Ma’ruf study Instrumentation Physics and Control Systems, both hardware and software, from the basics of Assembly and C as the main way of writing instructions/code. Mr. Prasetya (CS) and Mr. Al Josan (math) contribute their deep knowledge from the point of view of their respective fields. This broad spectrum of backgrounds will definitely contribute to the team’s performance in the competition.
This is our first HPC competition. We have great interest in participating in IndySCC, and we’re eager to learn new things, for example about parallelization and algorithms. This competition will not only provide a place to learn those things, but also give the team members a chance and the experience to apply the knowledge they have acquired previously and from academic courses. We also realize that HPC will become an integral part of scientific research, especially in the simulation and artificial intelligence worlds, in which every team member is currently interested. We hope to contribute our talents to the world of HPC.
Mr. Phan is an alumnus of Universitas Indonesia (S.Si. in Physics) who recently finished his MS in Physics at the University of Tennessee, Knoxville. His master’s thesis involved porting a dynamical density response code “Exciting-Plus” based on Time-Dependent Density Functional Theory (TD-DFT) to the Summit supercomputer. The original CPU-only code was awarded the 2010 ACM Gordon Bell Honorable Mention for Performance, and the ported code performed 12x faster (wall clock time) compared to CPU-only runs on Summit. He currently works as a Research Software Engineer at Sourcery Institute.
Dr. Cahaya is an assistant professor at the Physics Department, Universitas Indonesia. He received his doctorate degree from Institute for Materials Research, Tohoku University, Japan. He teaches mathematical and computational methods in physics at the undergraduate and master level. He utilizes Density Functional Theory calculations for modeling physical systems, supported by the research computing cluster at Theoretical & Computational Condensed Matter Physics (TCMP) lab. This cluster is also available for TCMP students, which includes Mr. Abednego and Mr. Fahreza.
Dr. Adhianto is a research staff member at the Department of Computer Science, Rice University. He received his doctorate degree from the University of Houston. His research interests are compilers and performance analysis.
Prof. Bustamam is a professor at the Mathematics Department, Universitas Indonesia. He received the PhD degree in bioinformatics from the University of Queensland, Australia. His research focuses on high-performance computing approaches on computational mathematics, computational biology, bioinformatics, and computer science.
Dr. Budiardja is a computational scientist in the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory. He earned his PhD from the University of Tennessee, Knoxville in computational astrophysics.
Birds of a Feather
TP
XO/EX
DescriptionSYCL is an open standard with a new release in 2020. After successful ISO C++ SYCL BoFs at SC17, SC18, SC19, SC20, and SC21, and with the increasing use of C++ in HPC, there has been popular demand for updates on the new SYCL 2020 features and current developments. These mean developers will be able to write their HPC software using the SYCL standard, enabling the same software to run on the forthcoming Aurora supercomputer at Argonne National Lab, at NERSC, LBL, and ORNL, and, potentially, on supercomputers with other architectures, including AMD, ARM, or RISC-V.
Posters
Research Posters
TP
XO/EX
DescriptionKokkos is a representative approach among template metaprogramming solutions: it offers programmers high-level abstractions for generic programming while most of the device-specific code generation and optimization is delegated to the compiler through template specializations. For this, Kokkos provides a set of device-specific code specializations in multiple back ends, such as CUDA and HIP. However, maintaining and optimizing multiple device-specific back ends for each new device type can be complex and error-prone. To alleviate these concerns, this paper presents an alternative OpenACC back end for Kokkos: KokkACC. KokkACC provides a high-productivity programming environment and, potentially, a multi-architecture back end. We have observed competitive performance; in some cases, KokkACC is faster than NVIDIA’s CUDA back end and much faster than OpenMP’s GPU offloading back end. This work also includes implementation details and a detailed performance study conducted with a set of mini-benchmarks (AXPY and DOT product) and two mini-apps (LULESH and miniFE).
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionTemplate metaprogramming is gaining popularity as a high-level solution for achieving performance portability. Kokkos is a representative approach that offers programmers high-level abstractions while most of the device-specific code generation is delegated to the compiler. OpenACC is a high-level, directive-based programming model that allows developers to insert hints (pragmas) into their code to help the compiler parallelize it. This paper presents an OpenACC back end for Kokkos: KokkACC. KokkACC provides a high-productivity programming environment back end. This work demonstrates the potential benefits of having a high-level, descriptive programming model based on OpenACC. We observe competitive performance; in some cases, KokkACC is faster than the CUDA back end and much faster than OpenMP’s GPU offloading back end. This work also includes a detailed performance study conducted with a set of mini-benchmarks (AXPY and DOT product) and three mini-apps (LULESH, miniFE, and SNAP, a LAMMPS proxy mini-app).
Paper
Recorded
Resource Management and Scheduling
System Software
TP
DescriptionTraditionally, I/O systems have been developed within the confines of a centralized OS kernel. This led to monolithic and rigid storage systems that are limited by low development speed, low expressiveness, and suboptimal performance. Various assumptions are forced onto users including reliance on the UNIX-file abstraction, the POSIX standard, and a narrow set of I/O policies. Recent hardware innovation and the explosion of I/O requirements of modern applications have questioned how storage services are developed. To support high-performance while maintaining flexibility, I/O subsystems are shifting to userspace designs. To that end, this paper introduces LabStor, a modular and extensible platform for developing high-performance, customized I/O stacks in userspace. By enabling direct access to device drivers from userspace, any number of composable I/O stack designs are now possible with high-velocity development. Evaluations show that I/O stacks developed in LabStor can yield performance improvements of up to 60% in various applications.
Birds of a Feather
TP
XO/EX
DescriptionIn this BoF, we present a vision from industry and academia of establishing a new forum that addresses the need for Next Generation Interfaces in data management in a federated environment. As part of the BoF, we will first introduce perspectives of this vision and the pressing challenges. Following up, we will then discuss promising approaches that address a subset of the vision, namely for heterogeneous storage and compute environments.
Birds of a Feather
TP
XO/EX
DescriptionWe propose to discuss application needs in the area of large-scale dynamic network analysis. Given the latest advancement in exascale computing and the volume of data available, an important problem is how to analyze large-scale networks to be meaningful for applications. Many challenges exist in developing software for analyzing dynamic graphs, including consensus on the output, reproducibility, and most critically whether existing parallel update algorithms support real-world applications’ needs. We aim to take steps toward forming a community of users of dynamic network software while spreading awareness about the tools available and the challenges related to large-scale dynamic graph analysis.
Paper
Recorded
Post-Moore Computing
Quantum Computing
TP
DescriptionQuantum computational chemistry (QCC) is the use of quantum computers to solve problems in computational quantum chemistry. We develop a high-performance variational quantum eigensolver (VQE) simulator on a new Sunway supercomputer. The major innovations include: (1) a Matrix Product State (MPS) based VQE simulator to reduce the amount of memory needed; (2) a combination of the Density Matrix Embedding Theory with the MPS-based VQE simulator to further extend the simulation range; (3) a three-level parallelization scheme to scale up to 20 million cores; (4) usage of the Julia script language as the main programming language (rather than, e.g., C), which both makes the programming easier and enables cutting-edge performance; (5) study of real chemistry systems based on the VQE simulator, achieving nearly linear strong and weak scaling. Our simulation demonstrates the power of VQE for large quantum chemistry systems, thus paving the way for large-scale VQE experiments on near-term quantum computers.
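The variational principle underlying VQE can be illustrated with a purely classical toy: minimize the energy expectation of a small Hamiltonian over a one-parameter ansatz. The 2x2 Hamiltonian below is an arbitrary hypothetical example, and the sketch has nothing to do with the paper's MPS-based simulator:

```python
# Toy variational eigensolver: classically minimize <psi(t)|H|psi(t)> over a
# one-parameter ansatz, recovering the ground-state energy of a 2x2 H.
# Pedagogical stand-in for VQE, not the paper's MPS-based simulator.
import math

def energy(theta, H):
    """Expectation of a real symmetric 2x2 H in |psi> = (cos t/2, sin t/2)."""
    a, b = math.cos(theta / 2), math.sin(theta / 2)
    Hpsi = (H[0][0] * a + H[0][1] * b, H[1][0] * a + H[1][1] * b)
    return a * Hpsi[0] + b * Hpsi[1]

H = [[1.0, 0.5], [0.5, -1.0]]            # hypothetical two-level "molecule"
thetas = [2 * math.pi * i / 1000 for i in range(1000)]
e_min = min(energy(t, H) for t in thetas)  # approaches the exact -sqrt(1.25)
```

On a quantum device the expectation value is estimated by measurement rather than computed, and a classical optimizer drives the parameter updates; the variational structure is the same.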
Workshop
Recorded
W
DescriptionFollowing the trend of heterogeneity, hardware manufacturers and vendors are releasing new architectures and their proprietary software stacks (e.g., libraries) that can harness the best possible performance for commonly used kernels, such as linear algebra kernels. However, kernels tuned for one architecture are not portable to others. Moreover, the co-existence of different architectures in a single node makes orchestration difficult. To address these challenges, we introduce LaRIS, a portable framework for LAPACK functionalities. LaRIS ensures a separation between linear algebra algorithms and vendor-library kernels using the IRIS runtime and the IRIS-BLAS library. Such abstraction at the algorithm level makes the implementation completely vendor-library and architecture agnostic. LaRIS uses the IRIS runtime to dynamically select the vendor-library kernel and a suitable processor architecture at runtime. Through LU factorization, we demonstrate that LaRIS can fully utilize different heterogeneous systems by launching and orchestrating different vendor-library kernels without any change to the source code.
Posters
Research Posters
TP
XO/EX
DescriptionIn past years, the world has switched to many-core and multi-core shared memory architectures. As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes, such as OpenMP, into software applications. Nevertheless, introducing OpenMP into code, especially legacy code, is challenging due to pervasive pitfalls in the management of parallel shared memory. To facilitate this task, many source-to-source (S2S) compilers have been created over the years, tasked with automatically inserting OpenMP directives into code. In addition to having limited robustness to their input format, these compilers still do not achieve satisfactory coverage and precision in locating parallelizable code and generating appropriate directives. In this work, we propose leveraging recent advances in machine learning, specifically the transformer model from natural language processing (NLP), to suggest the need for an OpenMP directive or specific clauses (reduction and private).
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionTraditionally, we have assumed that HPC users are fairly boring and that their workloads often do similar things repetitively. Their "boring" nature has served us well so far -- we could design "boring" systems and get away with it. But now things are changing, and changing fast. Our HPC workloads and users are becoming interesting and often surprise us with new trends and behavior. That springs excitement into our lives: we need to design interesting solutions and come out of our boredom. My talk will discuss specific examples from job resource consumption, system reliability, and performance tuning -- and their impact on system design and operations. I'll discuss some "speculative" use cases which could really disrupt our boredom and calmness, and assess whether we are ready for that.
Workshop
Recorded
W
DescriptionTo improve the reliability of storage systems while limiting data redundancy, erasure-coded storage is widely used. Traditional fault-recovery approaches rely mostly on passive recovery, which occurs only once data loss is detected; but passive recovery reduces the system's reliability. At present, machine learning methods can accurately predict soon-to-fail (STF) disks. Building on successfully predicted disk failures, we propose a method to proactively recover the data on such disks using local erasure coding within nodes, called LEC-PR (Local EC Proactive Recovery). By migrating and recovering the data in advance, data reliability can be improved. LEC-PR reduces cross-node recovery time and cross-node traffic during proactive data recovery by partially replicating a node's internal data blocks on other nodes. Compared to the existing method, LEC-PR can reduce cross-node traffic by 35% and shorten recovery time by up to 69%.
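The idea of proactively rebuilding a soon-to-fail disk's data from erasure-coded redundancy can be illustrated with the simplest possible code, single XOR parity; this is a toy stand-in for the local erasure coding in LEC-PR, not the actual scheme (production systems typically use Reed-Solomon codes):

```python
# Toy single-parity erasure code: parity = XOR of all data blocks, so any
# one lost (or soon-to-fail) block can be rebuilt from the survivors.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"disk0data", b"disk1data", b"disk2data"]
parity = xor_blocks(data)

# Disk 1 is predicted to fail: proactively rebuild its block from the rest,
# before any data is actually lost.
rebuilt = xor_blocks([data[0], data[2], parity])
```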
Paper
Recorded
Extreme Scale Computing
Memory Systems
Parallel Programming Systems
State of the Practice
TP
DescriptionHybrid MPI+threads programming is gaining prominence, but, in practice, applications perform slower with it compared to the MPI everywhere model. The most critical challenge to the parallel efficiency of MPI+threads applications is slow MPI_THREAD_MULTIPLE performance. MPI libraries have recently made significant strides on this front, but to exploit their capabilities, users must expose the communication parallelism in their MPI+threads applications. MPI 4.0 provides users with new performance-oriented options to do so, but the evaluation of these new mechanisms shows that they pose several challenges. An alternative design is MPI Endpoints. In this paper, we present a comparison of the different designs from the perspective of MPI’s end-users: domain scientists and application developers. We evaluate the mechanisms on metrics beyond performance such as usability, scope, and portability. Based on the lessons learned, we make a case for a future direction.
Workshop
Recorded
Performance Portability
W
DescriptionAccelerator-based heterogeneous computing is the de facto standard in current and upcoming exascale machines. These heterogeneous resources empower computational scientists to select a machine or platform well-suited to their domain or applications. However, this diversity of machines also poses challenges related to programming model selection: inconsistent availability of programming models across different exascale systems, lack of performance portability for those programming models that do span several systems, and inconsistent performance between different models on a single platform. We explore these challenges on exascale-similar hardware, including AMD MI100 and Nvidia A100 GPUs. By extending the source-to-source compiler OpenARC, we demonstrate the power of automated translation of applications written in a single front-end programming model (OpenACC) into a variety of back-end models (OpenMP, OpenCL, CUDA, HIP) that span the upcoming exascale environments. This translation enables us to compare performance within and across devices and to analyze programming model behavior with profiling tools.
Posters
Research Posters
TP
XO/EX
DescriptionIn structured-grid finite-difference, finite-volume, and finite-element discretizations of partial differential equation conservation laws, regular stencil computations constitute the core kernel of many temporally explicit approaches to such problems. For various blocking dimensions, the Spatial Blocking (SB) approach enables data reuse within multiple cache levels.
Introduced in GIRIH, the Multi-core Wavefront Diamond blocking (MWD) method optimizes practically relevant stencil algorithms by combining the concepts of diamond tiling and multi-core aware wavefront temporal blocking, leading to significant increase in data reuse and locality.
We evaluate the performance of MWD on a variety of recent multi-core architectures. Among all of them, the new AMD multi-processor, codenamed Milan-X, provides an unprecedented capacity for the Last Level Cache. We show that the Milan-X hardware design is ideal for the MWD method, and significant performance gain can be achieved relative to its predecessors Milan and Rome.
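Spatial blocking itself can be sketched in a few lines; the example below uses a hypothetical 1-D 3-point stencil, whereas MWD additionally layers diamond tiling and multi-core wavefront temporal blocking on top of this idea:

```python
# Spatial blocking for a 1-D 3-point stencil: sweep the grid block by block
# so each block's working set stays resident in cache. Toy sketch only; MWD
# combines this with diamond tiling and wavefront *temporal* blocking.
def stencil_blocked(a, block=64):
    n = len(a)
    out = a[:]
    for start in range(1, n - 1, block):
        end = min(start + block, n - 1)
        for i in range(start, end):          # work within one cache block
            out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out

def stencil_naive(a):
    n = len(a)
    return [a[0]] + [(a[i - 1] + a[i] + a[i + 1]) / 3.0
                     for i in range(1, n - 1)] + [a[-1]]

grid = [float(i % 7) for i in range(300)]
```

Both sweeps compute identical results; only the traversal order (and hence cache behavior) differs, which is why a very large Last Level Cache such as Milan-X's pays off.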
Workshop
Recorded
Correctness
Software Engineering
W
DescriptionIntegrating multiple disparate paradigms in a single execution model increases the complexity of OpenMP, making OpenMP programs prone to data races. Inspired by OpenMP’s task-oriented execution model, we extended SPD3, a data race detection algorithm designed for async-finish task parallelism, to support OpenMP programs. We found that by extending SPD3’s key data structure, the DPST, SPD3 can support the majority of OpenMP constructs. We have implemented a prototype, TSAN-SPD3, on top of Google’s ThreadSanitizer (TSAN). To conduct an apples-to-apples comparison with ARCHER, we compared TSAN-SPD3 with an ARCHER implementation that executes on the same version of TSAN. In addition, we evaluated ARCHER in two modes: the default mode using the original TSAN, and an accelerated mode enabling the use of SIMD instructions in TSAN. The evaluation was conducted on the BOTS benchmark suite. The results show that on eight of the nine benchmarks TSAN-SPD3 achieved overhead similar to ARCHER’s, while identifying more potential races.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionlibEnsemble is a Python toolkit for running dynamic ensembles of simulations. libEnsemble aims to minimize the effort of the user in describing their workflow via generator and simulator functions written in Python, and to maximize code reuse by maintaining a library of existing functions. Example generator functions perform optimizations, train models, and test candidate solutions. This work highlights how libEnsemble’s dynamic features have enabled practical multi-fidelity workflows.
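The generator/simulator pattern can be sketched in plain Python; the function names and the toy objective below are hypothetical, and the real libEnsemble API additionally manages workers, allocation, and parallel resources:

```python
# Minimal generator/simulator ensemble loop in the spirit of libEnsemble:
# a generator proposes candidate points, a simulator evaluates them, and
# results feed back to the generator. Plain-Python sketch, not the real API.
import random

random.seed(0)  # reproducible sketch

def generator(history, batch=4):
    """Propose candidates; after the first batch, search near the best point."""
    if not history:
        return [random.uniform(-5.0, 5.0) for _ in range(batch)]
    best_x = min(history, key=lambda h: h[1])[0]
    return [best_x + random.gauss(0.0, 0.5) for _ in range(batch)]

def simulator(x):
    """Stand-in simulation: a cheap objective with its minimum at x = 2."""
    return (x - 2.0) ** 2

history = []
for _ in range(25):                      # ensemble iterations
    for x in generator(history):
        history.append((x, simulator(x)))
best_x, best_f = min(history, key=lambda h: h[1])
```

In a multi-fidelity workflow the generator would also choose *which* fidelity of simulator to run for each candidate, trading cost against accuracy.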
Workshop
Recorded
W
DescriptionLinux containers bring several advantages for the deployment of High Performance Computing applications using the Message Passing Interface (MPI). However, efficiently leveraging high-speed network resources when employing portable container images able to work on different systems remains a challenging goal. Often, the adopted approach to combine image portability with performance is to replace at runtime the container's MPI libraries with a native implementation. While effective, such practice is heavily constrained by the requirement of binary compatibility between host and container libraries. This work presents two techniques based on libfabric which overcome the limitations of MPI replacement, while still granting containerized applications near-native communication performance and portability. These techniques were validated experimentally with synthetic benchmarks, demonstrating their effectiveness and, in the case of one technique, revealing a notable degree of flexibility at runtime. Two scale-out experiments also showed the capability to closely match native results using both synthetic and real-world benchmarks.
Workshop
Recorded
W
DescriptionServerless platforms have exploded in popularity in recent years, but, today, these platforms are still unsuitable for large classes of applications. They perform well for batch-oriented workloads that perform coarse transformations over data asynchronously, but their lack of clear service level agreements (SLAs), high per-invocation overheads, and interference make deploying online applications with stringent response time demands impractical. Our assertion is that beyond the glaring issues like cold start costs, a more fundamental shift is needed in how serverless function invocations are provisioned and scheduled in order to support these more demanding applications. Specifically, we propose a platform that leverages the observability and predictability of serverless functions to enforce multi-resource fairness. Finally, we propose a new distributed and hierarchical function scheduling architecture that combines lessons from multi-resource fair scheduling to create an approach that we believe will enable tighter SLAs on serverless platforms than has been possible in the past.
Workshop
Recorded
W
DescriptionApptainer (formerly known as Singularity) has, since its beginning, implemented many of its container features with the assistance of a setuid-root program. It still supports that mode, but as of version 1.1.0 it no longer uses setuid by default. This is feasible because it can now mount squash filesystems, mount ext2/3/4 filesystems, and use overlayfs through unprivileged user namespaces and FUSE. It also now enables unprivileged users to build containers, without requiring system administrators to configure /etc/subuid and /etc/subgid as other “rootless” container systems do. As a result, all the unprivileged functions can be used nested inside another container, even if the container runtime prevents any elevated privileges.
Workshop
Recorded
W
DescriptionUnfortunately, the serverless computing model is not traditionally designed for scientific applications. Serverless computing platforms pose multiple design and implementation challenges that make it difficult for scientific workflow execution. Even if one could engineer a system to leverage the serverless platform for scientific workflows, the resulting execution would be sub-optimal and may not respond to real-time needs. This talk will describe the design and implementation of a system that overcomes these challenges and makes serverless attractive for scientific workflows. We will show how we can leverage the serverless computing model for executing HPC workflows.
Workshop
Recorded
W
DescriptionPortability and reproducibility are primary goals of scientific communities that intend to use container technology, familiar from High Performance Computing (HPC) applications and environments on-premises and in the cloud, to address these concerns. Singularity and Sarus evolved to meet the needs of computational workload users and administrators in HPC environments, as well as container portability and reproducibility for scientific computing. To scale complex workloads on multi-node distributed-memory architectures, HPC applications often use the Message Passing Interface (MPI), while a Graphical User Interface (GUI) is an integral part of pre-processing, post-processing, data analysis, and result interpretation. This work presents approaches to building and executing containerized MPI and GUI applications on HPC in the cloud. We provide a comprehensive performance evaluation of MPI and GUI applications running in an isolated container environment under a workload manager, with reference to native execution on HPC in the cloud.
Workshop
Recorded
W
DescriptionWhat if you had access to a high performance desktop from any endpoint device? What if this desktop allowed you to launch interactive graphical applications like MATLAB or Jupyter Notebooks as well as compose scientific workflows using graphical and command line tools? What if this desktop had enough performance to run the applications and also provided tools to quickly manage and submit jobs to an HPC cluster? Indiana University built such a desktop in 2017 using ThinLinc and has evolved it over the years. This talk will summarize the capabilities of the IU Research Desktop and outline how it helped Indiana University attract more users to our HPC systems.
Workshop
Recorded
W
DescriptionModern high-precision radiation therapy (RT) applications require a rapid and accurate planning process. Since anatomical changes during treatment are mostly deformable, deformable image registration (DIR) is a core process used during treatment to account for changes in the shape and size of internal organs between the initial and adaptive planning images acquired during the treatment course. DIR methods have achieved considerable success in registration accuracy; however, they usually require long computation times, which limits clinical application. The research question (RQ) of this project is therefore: given that accurate deformable image registration requires a tremendous amount of computing time, how can we obtain significant acceleration while maintaining registration accuracy?
Different DIR algorithms behave differently; therefore, users need to be aware of the specifics of their software before clinical use. In our project, we are evaluating the capabilities of CLAIRE, a multi-GPU DIR framework, for radiotherapy treatments using lung data sets. CLAIRE aims to solve large-scale imaging problems, while we want to provide real-time capabilities for clinically relevant problem sizes. We believe that CLAIRE can benefit from a series of performance optimizations to improve strong-scaling scenarios, since its scalability is limited by high communication costs for small problem sizes.
Workshop
Recorded
W
DescriptionHPC is dominated by batch systems and rigid programming models. Jobs specify static resource allocations and wait to be scheduled, leading to node under-utilization and long wait times. In the cloud, serverless functions are used to dynamically assign computing resources and scale allocations exactly to application needs, providing resource flexibility and elastic offloading to spare data center resources. Employing the Function-as-a-Service (FaaS) programming model would bring online computations and more elastic resource management to HPC, but functions are not ready to handle the performance requirements of HPC workloads.
In rFaaS, we propose new allocation and computation policies for serverless that benefit from faster processing and RDMA networking. rFaaS enhances traditional FaaS computing with the low latency and high throughput needed in compute- and data-intensive HPC applications. With high-performance functions, we provide opportunistic computing that uses idle resources in the HPC cluster to handle dynamic, priority, and interactive workloads.
Workshop
Recorded
W
DescriptionInteractive HPC is still difficult to achieve in practice for many scientific researchers. Even with empowering tools such as Jupyter that provide a framework for interactive computing, there are still challenges that scientists face in developing interactive HPC components.
Some of the important challenges scientists face include adapting existing scientific code for HPC environments, interactive data visualization, interactive parameter exploration, streaming data, and the ability to effectively share and allow others to reproduce their work. We will discuss these challenges briefly and ways to address them.
Workshop
Recorded
W
DescriptionThe Regional HPC Center ROMEO supports research in two complementary ways: offering high-performance computing resources as a service, and contributing scientifically to R&D and research projects, whether multidisciplinary or purely HPC, building on a history of collaboration with national research structures, service and technology providers.
In March 2020, at the outbreak of COVID-19, researchers in AI and HPC focused on the use of Deep Learning to model and predict the evolution of COVID-19 with the scarce data available at that moment. Public health data, but also population and individual mobility patterns, were the first parameters added to the model, reaching accurate predictions. Later on, additional data on public health actions (lockdowns, curfews, vaccination) and more detailed population profiles were integrated into the model. Currently, the model is integrated into a decision support tool available to French public decision makers.
In parallel, a long-standing collaboration with biochemists on high-performance molecular docking using GPUs allowed us to launch a massive-scale campaign to virtually screen 326 thousand druggable molecules against 20 target proteins of the COVID-19 virus. The two most powerful French GPU supercomputers were mobilized, including half of the ROMEO supercomputer, to produce the equivalent of 1,500 years of computation in a few days and thus identify a group of drugs with potential virus-inhibition capabilities. The one hundred most promising molecules were synthesized and are now being assessed in vitro.
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionTransformer-based neural models are used in many AI applications. Training these models is expensive, as it requires substantial GPU resources and long run times. It is also challenging because typical data such as sentences have variable lengths, and the Transformer's computation patterns are more complex than those of convolutional neural networks. Existing systems either focus only on model inference or optimize only for BERT-like encoder models.
In this paper, we present LightSeq2, a system to accelerate training for a general family of Transformer models on GPUs. We propose a series of GPU optimization techniques tailored to the specific computation flow and memory access patterns of Transformer models. LightSeq2 supports many model architectures, including BERT (encoder-only), GPT (decoder-only), Transformer (encoder-decoder), and vision Transformer. Our experiments on a variety of models and benchmarks show that LightSeq2 is consistently faster (1.4-3.5x) than previous systems on different GPUs. In particular, it gains a 308% training speedup on the WMT14 English-German benchmark.
Birds of a Feather
TP
XO/EX
DescriptionLiquid cooling is key to dealing with heat density, reducing energy consumption, and increasing performance. Although large-scale supercomputing centers have more than a decade's experience with liquid cooling, many data centers are still facing challenges with adoption. This BoF will bring together people who are knowledgeable in liquid cooling from supercomputing sites, system integrators, liquid cooling vendors, and engineering design companies to identify common roadblocks and key learnings for helping to resolve them.
Posters
Research Posters
TP
XO/EX
DescriptionServices on edge and fog systems require mobility, owing to user or data mobility and the necessity of relocating to the cloud upon oversubscription. More specifically, live migration of containerized microservices is required for service mobility, elasticity, and load balancing. Although container runtimes and orchestrators have recently provided native live-migration support, they do not allow migration across autonomous computing systems with heterogeneous orchestrators. Our hypothesis is that non-native, non-invasive support for live container migration is needed and can unlock several new use cases. We develop a non-native, non-invasive live container migration method leveraging a nested container runtime. We design the architecture and develop the solution to enable container migration across heterogeneous orchestrators, and we evaluate its performance against other approaches. We observe that for microservices smaller than 512 MiB, the nested container runtime approach can be implemented within an acceptable overhead.
Doctoral Showcase
Posters
Recorded
TP
DescriptionTo enable efficient and productive programming of today's supercomputers and beyond, a variety of issues must be addressed, including: load balancing (i.e., utilizing all resources equally), fault tolerance (i.e., coping with hardware failures), and resource elasticity (i.e., allowing the addition/release of resources).
In this work, we address the above issues in the context of Asynchronous Many-Tasking (AMT) for clusters. Here, programmers split a computation into many fine-grained execution units (called tasks), which are dynamically mapped to processing units (called workers) by a runtime system.
Regarding load balancing, we propose a work stealing technique that transparently schedules tasks to resources of the overall system, balancing the workload over all processing units. Experiments show good scalability, and a productivity evaluation shows intuitive use.
Regarding fault tolerance, we propose four techniques to protect programs transparently. All perform localized recovery and continue the program execution with fewer resources. Three techniques write uncoordinated checkpoints of task descriptors in a resilient store. One technique does not write checkpoints, but exploits natural task duplication of work stealing. Experiments show failure-free running time overhead below 1% and a recovery overhead below 0.5 seconds. Simulations of job set executions show that makespans can be reduced by up to 97%.
Regarding resource elasticity, we propose a technique to enable the addition and release of nodes at runtime by transparently relocating tasks accordingly. Experiments show costs for adding and releasing nodes below 0.5 seconds. Additionally, simulations of job set executions show that makespans can be reduced by up to 20%.
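The work-stealing idea described above can be sketched in a few lines: each worker drains its own task queue and, when that runs dry, steals from a randomly chosen victim. This is a hypothetical, thread-based illustration of the general technique, not the AMT runtime presented in this work; all names (`run_work_stealing`, `worker`) are invented for the sketch.

```python
import random
import threading
from collections import deque

def run_work_stealing(tasks, n_workers=4):
    """Execute `tasks` (zero-argument callables) with per-worker deques and stealing."""
    queues = [deque() for _ in range(n_workers)]
    for i, t in enumerate(tasks):                  # round-robin initial distribution
        queues[i % n_workers].append(t)
    results, lock = [], threading.Lock()

    def worker(wid):
        while True:
            try:
                task = queues[wid].pop()           # take from own queue's tail
            except IndexError:
                victims = [v for v in range(n_workers) if v != wid and queues[v]]
                if not victims:
                    return                         # nothing left anywhere: terminate
                try:
                    task = queues[random.choice(victims)].popleft()  # steal from head
                except IndexError:
                    continue                       # lost the race for that task; retry
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Popping one's own tail while stealing from a victim's head is the classic way to reduce contention between owner and thieves; real AMT runtimes add task spawning, distributed termination detection, and locality-aware victim selection on top of this skeleton.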
Posters
Research Posters
TP
XO/EX
DescriptionGlobal pandemics can wreak havoc and lead to significant social, economic and personal losses. Preventing the spread of infectious diseases requires interventions at different levels needing the study of potential impact and efficacy of those preemptive measures. Modeling epidemic diffusion and possible interventions can help us in this goal. Agent-based models have been used effectively in the past to model contagion processes. We present Loimos, a highly parallel simulation of epidemic diffusion written on top of the Charm++ asynchronous task-based system. Loimos uses a hybrid time-stepped and discrete-event simulation to model disease spread. We demonstrate that our implementation of Loimos is able to efficiently utilize a large number of cores on different HPC platforms, namely, we scale to about 32k cores on Theta at ALCF and about 4k cores on Cori at NERSC.
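As a rough serial illustration of the time-stepped contagion component (this is not the Loimos/Charm++ implementation; the function and its parameters are invented for the sketch), a minimal agent-based SIR step can be written as:

```python
import random

def simulate_sir(n_agents=500, contacts=8, p_transmit=0.05, p_recover=0.1,
                 steps=100, seed=1):
    """Toy time-stepped agent-based SIR contagion over random contacts."""
    rng = random.Random(seed)
    state = ["S"] * n_agents
    state[0] = "I"                                 # patient zero
    for _ in range(steps):
        infected = [i for i, s in enumerate(state) if s == "I"]
        for i in infected:
            for _ in range(contacts):              # random contacts this time step
                j = rng.randrange(n_agents)
                if state[j] == "S" and rng.random() < p_transmit:
                    state[j] = "I"
            if rng.random() < p_recover:
                state[i] = "R"
        if not any(s == "I" for s in state):
            break                                  # epidemic has died out
    return {s: state.count(s) for s in "SIR"}
```

A scalable simulator like Loimos partitions agents across ranks and turns each contact into an asynchronous message or discrete event rather than a direct array write, which is what makes the hybrid time-stepped/discrete-event design pay off at thousands of cores.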
Workshop
Recorded
W
DescriptionIn recent years, deep learning-based models for electronic health records have shown impressive results in many clinical tasks. Deep learning classification models typically require large labeled training datasets and are designed to address specific clinical tasks. Transformers are powerful state-of-the-art language models designed to learn inherent patterns in unstructured text data in an unsupervised manner. The transformer model’s unsupervised training enables generalizability and reusability of the model across various clinical tasks, negating the need for labeled data in the training phase. The trained transformer can then be fine-tuned toward a specific clinical task using a small but task-curated training dataset. In the current work, we build a transformer model that can effectively accommodate the length of typical cancer pathology reports. We use 5.7 million pathology reports from six Surveillance, Epidemiology, and End Results Program (SEER) cancer registries to train the Big-Bird model “from scratch”. Big-Bird is a transformer model built for long documents (up to 4096 tokens), compared to popular models such as BERT (up to 512 tokens). As the memory requirement of a transformer model scales quadratically with the sequence length of the input text, Big-Bird utilizes sparse attention. In phase one, Big-Bird is trained in an unsupervised manner using the pre-training task called masked language prediction. This phase requires the largest amount of computation, and it leverages the secure CITADEL capability for working with protected health information (PHI) data on the Summit supercomputer at the Oak Ridge Leadership Computing Facility. In phase two, we fine-tune the pre-trained Big-Bird model to handle five information extraction tasks: site, sub-site, histology, laterality, and behavior.
For fine-tuning, we use data from six SEER registries with a 10-day window constraint before and after the date of cancer diagnosis, and the ground truth for the five tasks comes from the manually coded CTC (Cancer/Tumor/Case) report. One advantage of this two-phase approach is the reusability of the phase-one model for any pathology-relevant clinical task in phase two. Our results show that the proposed Big-Bird model, fine-tuned with SEER data on the five information tasks, outperforms the current state-of-the-art deep learning classification model by an average of 2% in micro F1 score and 8% in macro F1 score across all tasks. Among the most challenging tasks, subsite shows a 4% increase in micro F1 score and histology a 25% increase in macro F1 score. The results demonstrate the promise of using a single pretrained model for five related clinical tasks. We plan to further test the generalizability and reusability of the model by extending it to other clinically useful tasks such as bio-marker extraction and identification of malignant and metastatic disease.
Tutorial
Recorded
Algorithms
Applications
Big Data
Computational Science
Data Analytics
Data Management
File Systems and I/O
TUT
DescriptionLarge-scale numerical simulations, observations, experiments, and AI computations are generating or consuming very large datasets that are difficult to analyze, store, and transfer. Data compression is an attractive and efficient technique to significantly reduce scientific datasets. This tutorial reviews the state of the art in lossy compression of scientific datasets, covers the main compression techniques (e.g. decomposition, transforms, prediction, sampling, precision reduction, etc.) and discusses in detail lossy compressors (SZ, ZFP, TThresh, LibPressio), compression error assessment metrics, and the Z-checker tool to analyze the compression error. The tutorial addresses the following questions: Why lossless and lossy compression? How does compression work? How to measure and control compression error? What are the current use cases for simulations, experiments, and AI computations? The tutorial uses examples of real-world scientific datasets to illustrate the different compression techniques and their performance. From a participant perspective, the tutorial will detail how to use compression software as executables and as modules integrated in parallel I/O libraries (ADIOS, HDF5). This half-day tutorial, given by two of the leading teams in this domain and targeting primarily beginners interested in learning about lossy compression for scientific data, is improved from the highly rated tutorials given at ISC17-21 and SC17-21.
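To make one of the listed techniques concrete, here is a minimal sketch of precision reduction on float32 data. This is purely illustrative (the function name and sample field are invented); real compressors such as SZ and ZFP combine precision control with prediction, transforms, and entropy coding.

```python
import numpy as np

def truncate_mantissa(data, keep_bits):
    """Lossy precision reduction: zero the low-order mantissa bits of float32.

    float32 carries 23 mantissa bits; zeroing the tail loses precision but
    leaves a bit pattern that any lossless back end compresses very well.
    """
    raw = data.astype(np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (raw & mask).view(np.float32)

field = np.linspace(0.0, 1.0, 1000, dtype=np.float32) ** 2  # smooth sample data
lossy = truncate_mantissa(field, keep_bits=10)

# For normal numbers, truncation bounds the relative error by 2**-keep_bits,
# which is exactly the kind of pointwise error bound Z-checker can verify.
rel_err = np.abs(lossy - field) / np.maximum(np.abs(field), np.float32(1e-30))
```

Because the error bound follows directly from the number of retained mantissa bits, this technique gives users a simple knob for trading accuracy against compression ratio.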
Posters
Research Posters
TP
XO/EX
DescriptionMassive Multiple-Input-Multiple-Output (MIMO) is a crucial technology for Next-Generation networks (Next-G). It uses hundreds of antennas at transceivers to exchange data. However, its accurate signal detection relies on solving an NP-hard optimization problem within real-time latency constraints.
In this poster, we propose a new GPU-based detection algorithm that demonstrates the positive impact of low-precision arithmetic with multiple GPUs to achieve next-G latency/scalability/accuracy requirements. Our approach iteratively extends a solution with several symbols representing the best combination out of the aggregated levels. The computation at each iteration is formulated as a matrix multiplication operation to leverage GPU architectures.
Results obtained on an A100 GPU show a 1.7x improvement from exploiting half-precision arithmetic without loss in accuracy. Furthermore, our low-precision multi-GPU version with four A100 GPUs is 4x faster than the single-precision single-GPU version and 40x faster than a similar parallel CPU implementation executed on a two-socket 28-core IceLake CPU with 56 threads.
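The core trick, scoring many candidate symbol vectors at once as a single matrix multiplication so the work maps onto GPU tensor cores, can be mimicked on CPU with NumPy. The sketch below is illustrative only: the sizes, the BPSK constellation, and the candidate pool are assumptions, not the poster's actual algorithm, but it shows why half precision can rank candidates just as reliably as single precision when the score gaps are large.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rx, n_tx, n_cand = 64, 8, 256            # antennas, streams, candidate vectors

H = rng.standard_normal((n_rx, n_tx))      # channel matrix
x_true = rng.choice([-1.0, 1.0], n_tx)     # transmitted BPSK symbols
y = H @ x_true                             # noiseless received vector

# Candidate pool: the true vector plus random guesses.
cands = rng.choice([-1.0, 1.0], (n_cand, n_tx))
cands[0] = x_true

def best_candidate(dtype):
    # Residual norms for ALL candidates via one matrix multiplication.
    R = y.astype(dtype) - cands.astype(dtype) @ H.T.astype(dtype)
    return int(np.argmin(np.sum(R * R, axis=1)))
```

On a GPU the same `cands @ H.T` product would run on tensor cores, where float16 roughly doubles throughput relative to float32, which is the effect behind the reported 1.7x gain.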
Students@SC
Birds of a Feather
TP
XO/EX
DescriptionLustre is the leading open-source and open-development file system for HPC. Around two thirds of the top 100 supercomputers use Lustre. It is a community developed technology with contributors from around the world. Lustre currently supports many HPC infrastructures beyond scientific research, such as financial services, energy, manufacturing and life sciences. Lustre clients are available for broadly deployed instruction set architectures such as x86, POWER, and Arm.
At this BoF, Lustre developers, administrators, and solution providers will gather to discuss recent Lustre developments and challenges, including the role of Lustre in AI and its use in Cloud environments.
Doctoral Showcase
Posters
Recorded
TP
DescriptionWith the rise of Big Data, there has been a significant effort to increase compute power through GPUs, TPUs, and heterogeneous architectures. As a result, the bottleneck of applications is shifting toward memory performance. Prefetching techniques are widely used to hide memory latency and improve instructions per cycle (IPC). Data prefetching is a form of speculation that looks at memory access patterns to forecast near-future accesses and avoid cache misses. Traditional hardware data prefetchers use pre-defined rules, which are not powerful enough to adapt to the increasingly complex memory access patterns of new workloads.
We hypothesize that a machine learning-based prefetcher can be developed to achieve high-quality memory access prediction, leading to the improvement of IPC for a system. We develop several optimizations for ML-based prefetching. First, we propose RAOP, a framework for RNN augmented offset prefetcher, in which RNN provides temporal references for a spatial offset prefetcher, leading to the improvement of IPC. Second, we propose C-MemMAP, which provides clusters for downstream meta-models to balance the model size and prediction accuracy. We propose DM (delegated model) clustering method that learns latent patterns from long memory traces, which has significantly raised the prediction accuracy of the meta-models. Third, we propose TransFetch, an attention-based prefetcher that supports variable-degree prefetching by modeling prefetching as a multi-label classification problem. In addition, we propose ReSemble, a Reinforcement Learning (RL) based adaptive ensemble framework that enables multiple prefetchers to complement each other on hybrid applications and updates online.
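As a toy illustration of the offset-prefetching baseline that RAOP augments with an RNN (the class and parameter names below are invented for the sketch, not part of the proposed systems), an offset prefetcher scores candidate strides against recently seen cache-line addresses and prefetches with the best-scoring one:

```python
from collections import deque

class OffsetPrefetcher:
    """Toy best-offset-style prefetcher over cache-line addresses."""

    def __init__(self, offsets=range(1, 9), history=32):
        self.offsets = list(offsets)
        self.scores = dict.fromkeys(self.offsets, 0)
        self.recent = deque(maxlen=history)        # recently accessed lines

    def access(self, line):
        """Record an access and return the line to prefetch next."""
        for o in self.offsets:
            if line - o in self.recent:            # offset o would have predicted it
                self.scores[o] += 1
        self.recent.append(line)
        best = max(self.offsets, key=self.scores.get)
        return line + best                         # prefetch target
```

Rule-based prefetchers like this handle regular strides well; the ML-based components above target the irregular patterns where no single fixed offset keeps scoring.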
Workshop
Recorded
W
DescriptionThe current project seeks to enhance current 4DCT lung cancer patient image processing, image guidance, and adaptive radiotherapy verification through the integration of machine learning (Artificial Intelligence - AI) methods in an existing clinical radiation oncology framework. The current state-of-the-art for Lung SBRT treatment planning begins with the accurate delineation of target organ volumes and their surrounding structures, which is usually done using semi-automatic methods mixing computer-assisted tools and dedicated physicians. When it comes to 4DCT scans, what is usually done is to compute a visual average of the images across the different respiratory phases, and the contours of those organs are delineated in one specific phase. In the last few years, deformable image registration (DIR) techniques have been developed and used in this field to propagate the contour delineation from one specific phase to the rest of the respiratory phases in the CT. Results of the target region delineation are then used by physicians and clinicians to select an optimal treatment phase. Rather than the mostly manual and slow, iterative process introduced above, our current project seeks to create more accurate and more robust delineations through improved machine learning models, decreasing the time spent per patient plan and applying a more mathematically rigorous and objective manner of selecting the optimal radiation treatment gating window, while enhancing image resolution, target definition, and treatment delivery.
The project is divided into several phases. One phase consists of deriving the deformation parameters that describe the three-dimensional movement of the patient's target treatment region through deformation propagation. The second phase involves a surrogate model for fast reconstruction of the dose distribution in the gross tumor volume(s), GTV(s), and organs at risk, OAR(s), across all phases, accounting for the deformed target region in time. With these dose profiles, physicians will have bounds (within a confidence interval) for the dose absorbed by different organs and tumors across the whole respiratory cycle, and they will be able to determine whether the treatment plan for that patient is accurate and appropriate or whether it needs to be replanned.
Additional phases of the algorithm use AI approaches to enhance this step. The project's algorithm is unique and captures a higher degree of individualization based on the patient's specific organ movement compared with prior non-AI algorithms. Although other studies have explored integrating AI for image segmentation and auto-contouring, our project's novelty lies in the manner of initializing parameters and the specific operations performed. Our project furthers patient-specific treatment planning while adopting a more streamlined approach and helps make more informed decisions using AI, arriving at improved radiation treatment plans for lung cancer patients undergoing SBRT. We have tested our algorithm on several patients and have seen encouraging improvements.
Workshop
Recorded
W
DescriptionUnique scientific instruments designed and operated by large global collaborations are expected to produce exabyte-scale data volumes per year by 2030. These collaborations depend on globally distributed storage and compute to turn raw data into science. While all of these infrastructures have batch scheduling capabilities to share compute, Research and Education networks lack those capabilities. There is thus uncontrolled competition for bandwidth between and within collaborations. As a result, data "hogs" disk space at processing facilities for much longer than it takes to process, leading to vastly over-provisioned storage infrastructures. Integrated co-scheduling of networks as part of high-level managed workflows might reduce these storage needs by more than an order of magnitude. This presentation describes such a solution, demonstrates its functionality in the context of the Large Hadron Collider (LHC) at CERN, and presents the next steps toward its use in production.
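As a rough arithmetic sketch of why co-scheduling shrinks storage residency (a toy model with invented numbers, not the demonstrated system): under fair sharing, N equal-sized transfers all finish at the very end of the window, so every dataset sits on disk for the full duration; co-scheduled one at a time at full bandwidth, arrivals are staggered.

```python
# Toy model (invented numbers, not the LHC demonstration): N equal-sized
# transfers over one shared link. Fair sharing makes all transfers finish
# together at the very end; co-scheduling runs them one at a time at full
# bandwidth, so datasets arrive staggered and can be processed (and freed)
# earlier.

def arrival_times_fair(n, size, bandwidth):
    """All n transfers share the link equally and finish simultaneously."""
    total = n * size / bandwidth
    return [total] * n

def arrival_times_scheduled(n, size, bandwidth):
    """Transfers run back to back, each at full link bandwidth."""
    return [(i + 1) * size / bandwidth for i in range(n)]

n, size, bw = 4, 10.0, 1.0
fair = arrival_times_fair(n, size, bw)        # [40.0, 40.0, 40.0, 40.0]
sched = arrival_times_scheduled(n, size, bw)  # [10.0, 20.0, 30.0, 40.0]
print(sum(fair) / n, "vs", sum(sched) / n)    # mean arrival: 40.0 vs 25.0
```

In this toy model the mean time data waits on disk drops by nearly half; the presentation argues that once processing overlaps transfers, the savings in provisioned storage can exceed an order of magnitude.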
Tutorial
Recorded
Cloud and Distributed Computing
Containers
Productivity Tools
Reliability and Resiliency
Resource Management and Scheduling
Software Engineering
System Software
Workflows
TUT
DescriptionThe modern scientific software stack includes thousands of packages, from C, C++, and Fortran libraries, to packages written in interpreted languages like Python and R. HPC applications may depend on hundreds of packages spanning all of these ecosystems. To achieve high performance, they must also leverage low-level and difficult-to-build libraries such as MPI, BLAS, and LAPACK. Integrating this stack is extremely challenging. The complexity can be an obstacle to deployment at HPC sites and deters developers from building on each other's work.
Spack is an open source tool for HPC package management that simplifies building, installing, customizing, and sharing HPC software stacks. Its adoption has grown rapidly: it is used by end-users, by developers, and by the world's largest HPC centers. Spack provides a powerful and flexible dependency model, a simple Python syntax for writing package build recipes, and a repository of over 6,000 packages maintained by a community of over 1,000 contributors. This tutorial provides an introduction to Spack's capabilities: installing and authoring packages, integrating Spack with development workflows, and deploying software at HPC facilities. Attendees will learn foundational skills for automating day-to-day tasks, as well as deeper knowledge of Spack for advanced use cases.
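As a taste of the recipe syntax the tutorial teaches, a Spack package is a small Python class. The sketch below is a hypothetical recipe (the package name, URL, variant, and version are invented, and the checksum is elided), not one of the 6,000 real packages:

```python
# Hypothetical Spack recipe (package.py); names and URL are illustrative.
from spack.package import *

class Mylib(CMakePackage):
    """An example numerical library, for illustration only."""

    homepage = "https://example.com/mylib"
    url = "https://example.com/mylib-1.2.0.tar.gz"

    version("1.2.0", sha256="...")  # checksum elided

    variant("cuda", default=False, description="Build the GPU kernels")

    depends_on("mpi")
    depends_on("blas")
    depends_on("cuda", when="+cuda")

    def cmake_args(self):
        # Translate the variant into a CMake flag
        return [self.define_from_variant("ENABLE_CUDA", "cuda")]
```

A user would then build and install with a spec such as `spack install mylib +cuda ^openmpi`, letting Spack resolve and build the full dependency graph.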
Paper
Recorded
Machine Learning and Artificial Intelligence
Software Engineering
State of the Practice
TP
DescriptionHigh Performance Computing (HPC) software stacks have become complex, with the dependencies of some applications numbering in the hundreds. Packaging, distributing, and administering software stacks of that scale is a complex undertaking anywhere. HPC systems deal with esoteric compilers, hardware, and a panoply of uncommon combinations.
In this paper, we explore the mechanisms available for packaging software to find its own dependencies in the context of a taxonomy of software distribution, and discuss their benefits and pitfalls. We discuss workarounds for some common problems caused by using these composed stacks and introduce Shrinkwrap, a solution that produces binaries that directly load their dependencies from precise locations and in a precise order. Beyond simplifying the use of the binaries, this approach also speeds up loading by as much as 7x for a large dynamically linked MPI application in our evaluation.
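Loading dependencies "in a precise order" amounts to ordering the dependency graph so that every library is loaded after the libraries it depends on. A toy sketch of that computation (a topological sort in plain Python; not Shrinkwrap itself, and the library names are invented):

```python
# Toy sketch (not Shrinkwrap): given a binary's dependency graph, compute
# one valid load order in which every object appears after everything it
# depends on -- a depth-first topological sort.

def load_order(deps):
    """deps maps each object to the list of libraries it depends on."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for d in deps.get(node, []):
            visit(d)
        order.append(node)  # node goes after all of its dependencies

    for node in deps:
        visit(node)
    return order

deps = {
    "app": ["libmpi.so", "libm.so"],
    "libmpi.so": ["libc.so"],
    "libm.so": ["libc.so"],
}
print(load_order(deps))  # libc.so comes first, app comes last
```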
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionThe datatype engine in Message Passing Interface (MPI) libraries supports the communication layer by handling the transfer of non-contiguous datatypes. Basic datatypes (integer, float, etc.) serve as building blocks for more complex, and potentially non-contiguous, derived datatypes. In this context, the datatype engine facilitates the description of complex datatypes and provides an efficient and portable interface between complex types and the network communication layer. In this paper, we focus on the Open MPI datatype engine with its compact type representations and pipeline techniques used to hide communication latency. We identified cases where the current datatype representation is sub-optimal and provide an alternative based on range descriptions and Memory Access Rearrangements (MARs) for more efficient pack/unpack operations. As a result, we obtain a performance improvement between 1.2x and 3.2x compared to the current datatype description in Open MPI.
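The gather/scatter pattern a datatype engine performs for a strided ("vector") datatype can be sketched in plain Python (Open MPI's engine of course works on raw memory with the optimizations described above; this only illustrates the pack/unpack idea):

```python
# Sketch of pack/unpack for a strided "vector" datatype: gather `count`
# blocks of `blocklen` elements spaced `stride` apart into a contiguous
# buffer, and scatter them back.

def pack(buf, offset, count, blocklen, stride):
    """Gather a non-contiguous strided region into a contiguous list."""
    out = []
    for i in range(count):
        start = offset + i * stride
        out.extend(buf[start:start + blocklen])
    return out

def unpack(packed, buf, offset, count, blocklen, stride):
    """Scatter contiguous packed data back into the strided layout."""
    it = iter(packed)
    for i in range(count):
        start = offset + i * stride
        for j in range(blocklen):
            buf[start + j] = next(it)

matrix = list(range(16))           # a 4x4 matrix, row-major
col1 = pack(matrix, 1, 4, 1, 4)    # second column: one element every 4
print(col1)                        # [1, 5, 9, 13]
```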
Student Cluster Competition
TP
XO/EX
DescriptionPo Hao Chen is a third year student at Boston University studying computer science. His interests lie at the intersection of theory and practice. His independent research focuses on optimization through mathematics and distributed computing. He teaches a course on algorithmic problem solving targeted at upperclassmen at Boston University. He has been involved in the HPC community for the past three years. In addition to publications at HPC conferences and attending past SCs, he runs the only HPC student club in the Boston area with Carlton.
Carlton Knox is a third year student in computer engineering. He has been a member of the BUHPC club for three years and is the current President of the club. He is passionate about software and hardware performance in computer architecture design. He regularly hosts workshops related to HPC concepts for prospective students. Carlton's research involves leveraging machine learning to predict CPU temperatures to increase power efficiency, and he wishes to extend his work to the hardware used in HPC systems.
Andrew Nguyen is a third year student at Northeastern University majoring in electrical and computer engineering, with a minor in game design. In addition to computational science, Andrew studied quantum mechanics and modern physics. He became involved in architecture research working in GPU simulators. Although Andrew is relatively new to HPC, he has experience with the software stacks used by the community through his projects.
Vance Raiti is a sophomore at Boston University studying electrical engineering. He came from a performing arts background and is relatively new to the HPC scene. His interests have grown greatly after joining the club while working on optimizing the club's Raspberry Pi cluster. Vance was trained rigorously as a mathematician and is interested in various domains of science. He has studied advanced mathematics, quantum information theory, and fluid dynamics.
Yida Wang is a sophomore at Boston University studying computer science and business. He joined the BUHPC team this year and became interested in pursuing a career in the field. He plans on working in machine learning research and hopes to learn more about leveraging distributed systems through the competition.
Yiran Yin is a sophomore at Boston University studying computer engineering and mathematics. She has been part of the BUHPC club for a year and volunteers to manage the club's Jetson cluster. She is still exploring and seeking her field of interests and hopes to understand the areas HPC can be applied to through the competition.
Kurt Keville is a researcher at MIT. He has been involved in the Boston HPC scene for the past two decades and has mentored many competition teams, including past teams at ISC, SC, and ASC. He provides us access to resources, vendor connections, and insights into cluster design.
Benjamin Li is the team’s secondary advisor. He has previously competed in the last two Student Cluster Competitions and brings a wealth of experience in setting up applications. He will be assisting with training the team as well as coordinating logistics.
Tutorial
Recorded
Dataflow and Tasking
Directive Based Programming
Parallel Programming Languages and Models
TUT
DescriptionWith the increasing prevalence of multi-core processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported, and easy-to-use shared-memory model. Since version 3.0 released in 2008, OpenMP offers tasking to support the creation of composable parallel software blocks and the parallelization of irregular algorithms. Developers usually find OpenMP easy to learn. However, mastering the tasking concept of OpenMP requires a change in the way developers reason about the structure of their code and how to expose the parallelism of it. Our tutorial addresses this critical aspect by examining the tasking concept in detail and presenting patterns as solutions to many common problems.
We assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. We present the OpenMP tasking language features in detail and focus on performance aspects, such as introducing cut-off mechanisms, exploiting task dependencies, and preserving locality. All aspects are accompanied by extensive case studies. The full-day format allows us to include hands-on sessions. Throughout all topics, we present the recent additions of OpenMP 5.0, 5.1 and 5.2 and comment on the developments targeting OpenMP 6.0.
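The cut-off mechanism mentioned above can be illustrated outside OpenMP. The Python sketch below (a counting model, not parallel code) shows why cut-offs matter: below a chosen recursion depth, the work runs serially instead of spawning tasks, drastically reducing task-creation overhead.

```python
# Counting model of a tasking cut-off: how many tasks would a recursive
# fib(n) create, with and without cutting over to serial execution below
# a given depth? (Illustrative of the OpenMP pattern, not OpenMP itself.)

def count_tasks(n, cutoff=None, depth=0):
    """Count the tasks a task-parallel fib(n) would spawn."""
    if n < 2:
        return 0                      # leaf: no children
    if cutoff is not None and depth >= cutoff:
        return 0                      # serial below the cut-off: no tasks
    # two child tasks, plus whatever those children spawn
    return 2 + count_tasks(n - 1, cutoff, depth + 1) \
             + count_tasks(n - 2, cutoff, depth + 1)

print(count_tasks(15))            # 1972 tiny tasks without a cut-off
print(count_tasks(15, cutoff=4))  # 30 tasks with a cut-off at depth 4
```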
Workshop
Recorded
W
DescriptionPrediction, together with understanding and management, of pediatric neuroblastoma (NBL) outcomes, from spontaneous regression to relapse and death, remains limited and relies mostly on age, stage, and the one-gene test for MYCN amplification, none of which are NBL specific. Here, we use the generalized singular value decomposition (GSVD), formulated as a multi-tensor decomposition [1], to model whole genomes of patient-matched NBL and blood DNA. The GSVD discovers two orthogonal genome-wide patterns of copy-number alterations (CNAs) in the tumors that are correlated with survival. First, as in previous, experimentally validated models of, e.g., adult brain astrocytoma [2], one pattern is exclusive to the tumors. Previously unseen is a pattern that is common to both the blood and tumor genomes. Second, both patterns predict survival better than, and independent of, the existing predictors, as well as independent of each other. In both patterns, differential RNA expression consistently maps to the DNA CNAs. Third, the GSVD separates these patterns from normal variations that are conserved in the tumors but do not predict outcome, e.g., the male-specific X-chromosome deletion relative to the autosome. We computationally validate both patterns by using, and demonstrating for the first time, the pseudoinverse projection for transfer learning from the ≈3M-bin whole-genome to ≈10K-bin target-capture sequencing profiles of a mutually exclusive set of patients [3]. We show that the two patterns describe independent, yet complementary, cellular mechanisms that transform normal human cells to tumor cells, predict new personalized therapies, and may predict the response to existing therapies. The tumor-exclusive pattern includes co-occurrence of MYCN amplification with previously unrecognized druggable CNAs, including amplifications of genes encoding for extra-embryonic transcripts, which jointly predict survival.
The pattern that is common to the blood and tumor genomes describes an earlier stage in NBL development, where the embryonic program is hijacked toward aneuploidy and where the subsequent tumor development can spontaneously regress via embryonic self-correction.
[1] M. W. Bradley, K. A. Aiello, S. P. Ponnapalli,* H. A. Hanson* and O. Alter, "GSVD- and Tensor GSVD-Uncovered Patterns of DNA Copy-Number Alterations Predict Adenocarcinomas Survival in General and in Response to Platinum," Applied Physics Letters (APL) Bioengineering 3 (3), article 036104 (August 2019); https://doi.org/10.1063/1.5099268
[2] S. P. Ponnapalli, M. W. Bradley, K. Devine, J. Bowen, S. E. Coppens, K. M. Leraas, B. A. Milash, F. Li, H. Luo, S. Qiu, K. Wu, H. Yang, C. T. Wittwer, C. A. Palmer, R. L. Jensen, J. M. Gastier-Foster, H. A. Hanson, J. S. Barnholtz-Sloan and O. Alter, "Retrospective Clinical Trial Experimentally Validates Glioblastoma Genome-Wide Pattern of DNA Copy-Number Alterations Predictor of Survival," Applied Physics Letters (APL) Bioengineering 4 (2), article 026106 (May 2020); https://doi.org/10.1063/1.5142559
[3] O. Alter and G. H. Golub, "Integrative Analysis of Genome-Scale Data by Using Pseudoinverse Projection Predicts Novel Correlation between DNA Replication and RNA Transcription," Proceedings of the National Academy of Sciences (PNAS) USA 101 (47), pp. 16577–16582 (November 2004); https://doi.org/10.1073/pnas.0406767101
Workshop
Recorded
W
DescriptionComputations on structured grids using standard multidimensional array layouts can incur substantial data movement costs through the memory hierarchy. This presentation explores the benefits of using a framework (Bricks) to separate the complexity of data layout and optimized communication from the functional representation. To that end, we provide three novel contributions and evaluate them on several kernels taken from GENE, a phase-space fusion tokamak simulation code. We extend Bricks to support 6-dimensional arrays and kernels that operate on complex data types, and integrate Bricks with cuFFT. We demonstrate how to optimize Bricks for data reuse, spatial locality, and GPU hardware utilization achieving up to a 2.67× speedup on a single A100 GPU. We conclude with insights on how to rearchitect memory subsystems.
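The core layout idea behind a bricked framework can be shown in miniature (a toy 2D sketch; Bricks itself handles 6-dimensional arrays, complex types, cuFFT integration, and GPUs): store the grid as small contiguous blocks so that a stencil touching neighbors stays within a few contiguous regions.

```python
# Toy "bricked" layout for a 2D grid: flatten the array brick by brick
# and map logical (i, j) coordinates into the bricked flat array.

BRICK = 4  # brick edge length

def to_bricked(grid, n):
    """Flatten an n x n grid (n divisible by BRICK) brick by brick."""
    out = []
    for bi in range(0, n, BRICK):
        for bj in range(0, n, BRICK):
            for i in range(BRICK):
                for j in range(BRICK):
                    out.append(grid[bi + i][bj + j])
    return out

def bricked_index(i, j, n):
    """Position of logical element (i, j) in the bricked flat array."""
    bricks_per_row = n // BRICK
    brick = (i // BRICK) * bricks_per_row + (j // BRICK)
    return brick * BRICK * BRICK + (i % BRICK) * BRICK + (j % BRICK)

n = 8
grid = [[10 * i + j for j in range(n)] for i in range(n)]
flat = to_bricked(grid, n)
assert flat[bricked_index(5, 6, n)] == grid[5][6]  # layouts agree
```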
Panel
Recorded
Heterogeneous Systems
Memory Systems
TP
XO/EX
DescriptionMemory heterogeneity refers to memory architectures that combine multiple memory components with diverse characteristics (such as latency and bandwidth). It is common to see heterogeneous memory (HM) in supercomputers nowadays. With the emergence of processing-in-memory and resource disaggregation, there will be more memory components with increasingly different features (not only in terms of latency and bandwidth, but also in terms of computing capabilities and reliability).
Managing HM is challenging. The programmer often has to take care of memory allocation, decide on data placement and migration, and make the best use of the fast memory in HM. Memory heterogeneity also introduces complexity into programming models and creates new performance bugs caused by poor use of HM. As a result, the programming productivity of domain scientists is reduced. This panel will discuss how memory heterogeneity will impact the HPC ecosystem, including architectures, runtime systems, programming models, and applications.
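The data-placement burden described above can be made concrete with a toy heuristic (an illustration of the kind of decision HM forces on programmers, not any real runtime; the array names and sizes are invented): put the hottest data, measured in accesses per byte, into fast memory until it fills up.

```python
# Toy greedy data placement for heterogeneous memory: rank arrays by
# accesses-per-byte and fill the fast tier first.

def place(arrays, fast_capacity):
    """arrays: list of (name, size, accesses). Returns name -> tier."""
    placement, used = {}, 0
    hotness = lambda a: a[2] / a[1]          # accesses per byte
    for name, size, accesses in sorted(arrays, key=hotness, reverse=True):
        if used + size <= fast_capacity:
            placement[name] = "fast"
            used += size
        else:
            placement[name] = "slow"
    return placement

arrays = [("lookup", 1, 900), ("mesh", 8, 800), ("log", 4, 10)]
print(place(arrays, fast_capacity=8))
# the small, hot "lookup" table wins fast memory; the big "mesh" does not
```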
Paper
Recorded
Correctness
System Software
TP
DescriptionWe present a technique for introducing and optimizing the use of memory in a functional array language, aimed at GPU execution, that supports correct-by-construction parallelism. Using linear memory access descriptors as building blocks, we define a notion of memory in the compiler IR that enables cost-free change-of-layout transformations (e.g., slicing, transposition), whose results can even be carried across control flow such as ifs/loops without manifestation in memory. The memory notion allows a graceful transition to an unsafe IR that is automatically optimized (1) to mix reads and writes to the same array inside a parallel construct, and (2) to map semantically different arrays to the same memory buffer. The result is code similar to what imperative users would write. Our evaluation shows that our proposed optimizations offer significant speedups (1.1x-2x) and result in performance competitive to hand-written code from challenging public benchmarks, such as Rodinia's NW, LUD, and Hotspot.
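The idea of cost-free change-of-layout transformations via linear memory access descriptors can be mimicked in miniature (a plain-Python sketch of the descriptor idea; the paper's actual setting is a compiler IR for a functional array language on GPUs): a view of a flat buffer is described by an offset, strides, and a shape, so transposition and slicing only rewrite the descriptor and never touch memory.

```python
# A view = (buffer, offset, strides, shape). Layout changes rewrite the
# descriptor in O(1); the underlying buffer is never copied.

class View:
    def __init__(self, buf, offset, strides, shape):
        self.buf, self.offset = buf, offset
        self.strides, self.shape = strides, shape

    def __getitem__(self, idx):
        i, j = idx
        return self.buf[self.offset + i * self.strides[0] + j * self.strides[1]]

    def transpose(self):                 # O(1): swap strides and shape
        return View(self.buf, self.offset, self.strides[::-1], self.shape[::-1])

    def slice_rows(self, start, stop):   # O(1): bump the offset
        return View(self.buf, self.offset + start * self.strides[0],
                    self.strides, (stop - start, self.shape[1]))

buf = list(range(12))                    # a 3x4 row-major matrix
a = View(buf, 0, (4, 1), (3, 4))
assert a[1, 2] == 6
assert a.transpose()[2, 1] == 6          # same buffer, no copy
assert a.slice_rows(1, 3)[0, 0] == 4
```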
Workshop
Recorded
Quantum Computing
W
DescriptionQuantum optimal control problems are typically solved by gradient-based algorithms such as GRAPE, which suffer from exponential growth in storage with increasing numbers of qubits and linear growth in memory requirements with increasing numbers of time steps. Employing QOC for discrete lattices reveals that these memory requirements are a barrier to simulating large models or long time spans. We employ a nonstandard differentiable programming approach that significantly reduces the memory requirements at the cost of a reasonable amount of recomputation. The approach exploits the invertibility properties of the unitary matrices to reverse the computation during back-propagation. We have created QOC software in the differentiable programming framework JAX that implements this approach, and we demonstrate its effectiveness for lattice gauge theory.
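The invertibility trick can be sketched in a few lines (pure Python on 2x2 matrices; the paper's implementation is in JAX): instead of storing every intermediate state for back-propagation, recover state t from state t+1 by applying the inverse of the step's unitary, which is simply its conjugate transpose.

```python
# Reverse a forward pass of unitary steps without storing intermediates:
# for a unitary U, the inverse is its conjugate transpose (dagger).

def matvec(U, v):
    return [U[0][0] * v[0] + U[0][1] * v[1],
            U[1][0] * v[0] + U[1][1] * v[1]]

def dagger(U):  # conjugate transpose = inverse of a unitary
    return [[U[0][0].conjugate(), U[1][0].conjugate()],
            [U[0][1].conjugate(), U[1][1].conjugate()]]

s = 2 ** -0.5
H = [[s, s], [s, -s]]   # a Hadamard-like unitary
X = [[0.0, 1.0], [1.0, 0.0]]  # a bit-flip unitary

psi0 = [1.0, 0.0]
psi1 = matvec(H, psi0)          # forward step 1
psi2 = matvec(X, psi1)          # forward step 2

# Backward pass: recompute psi1 from psi2 without having stored it.
psi1_re = matvec(dagger(X), psi2)
assert all(abs(a - b) < 1e-12 for a, b in zip(psi1_re, psi1))
```

The memory saving is the point: only the current state is kept, at the cost of one extra matrix application per step during back-propagation.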
Paper
Recorded
File Systems and I/O
Storage
TP
DescriptionIn parallel and distributed file systems, caching can improve the performance of data and metadata operations. Currently, most distributed file systems adopt a write-back data cache for performance and a write-through metadata cache to simplify consistency. However, at modern file system scales and workloads, write-through metadata caching can impact overall file system performance, e.g., through lock contention and the heavy RPC loads required for namespace synchronization and transaction serialization.
This paper proposes a novel metadata writeback caching (MetaWBC) mechanism to improve the performance of metadata operations in distributed environments. To achieve extreme metadata performance, we developed a fast, lightweight, and POSIX-compatible memory file system as a metadata cache. Further, we designed a file caching state machine and included other performance optimizations. We coupled MetaWBC with Lustre and showed that MetaWBC can outperform the native parallel file system in throughput by up to 8x for metadata-intensive benchmarks and up to 7x for realistic workloads.
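The write-through vs. write-back difference for metadata traffic can be caricatured in a few lines (a toy RPC-counting model, not MetaWBC): write-through pays one synchronous RPC per operation, while write-back absorbs operations locally and flushes them in a batch.

```python
# Toy RPC-count comparison of write-through vs. write-back metadata
# caching: 1000 file creations cost 1000 RPCs write-through, but a
# single batched flush write-back.

class MetadataCache:
    def __init__(self, write_back):
        self.write_back = write_back
        self.dirty = []
        self.rpcs = 0

    def op(self, name):
        if self.write_back:
            self.dirty.append(name)   # absorb the operation locally
        else:
            self.rpcs += 1            # synchronous RPC to the server

    def flush(self):
        if self.dirty:
            self.rpcs += 1            # one batched RPC for all dirty ops
            self.dirty.clear()

wt, wb = MetadataCache(False), MetadataCache(True)
for cache in (wt, wb):
    for i in range(1000):
        cache.op(f"create file{i}")
    cache.flush()
print(wt.rpcs, "vs", wb.rpcs)  # 1000 vs 1
```

The real system must additionally keep the cache crash-consistent and POSIX-compatible, which is where the file caching state machine above comes in.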
Workshop
Recorded
AI-HPC Convergence
Emerging Technologies
Memory Systems
Networks
Resource Management and Scheduling
W
DescriptionTightly-coupled HPC systems have rigid memory allocation and can result in expensive memory resource under-utilization. As novel memory and network technologies mature, disaggregated memory systems are becoming a promising solution for future HPC systems. It allows workloads to use the available memory of the entire system. We propose a design framework to explore the disaggregated memory system design space. The framework incorporates memory capacity, network bandwidth, and local and remote memory access ratio, and provides an intuitive approach to guide machine configurations based on technology trends and workload characteristics. We apply our framework to analyze eleven workloads from five computational scenarios, including AI training, data analysis, genomics, protein, and traditional HPC. We demonstrate the ability of our methodology to understand the potential and pitfalls of a disaggregated memory system and motivate machine configurations. Our methodology shows that 10 out of our 11 applications/workflows can leverage disaggregated memory without affecting performance.
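A back-of-the-envelope model in the spirit of the framework described above (the formula and all numbers are illustrative, not taken from the paper) estimates the slowdown of serving a fraction of memory accesses remotely over the network:

```python
# Toy bandwidth model: remote bytes pay the (slower) network bandwidth,
# local bytes pay local memory bandwidth; baseline is all-local.

def slowdown(remote_fraction, local_bw, network_bw):
    """Relative time to move one byte when a fraction goes remote."""
    local_time = 1.0 / local_bw
    mixed_time = (1 - remote_fraction) / local_bw + remote_fraction / network_bw
    return mixed_time / local_time

# e.g. 200 GB/s local memory, 50 GB/s network, 25% of accesses remote
print(round(slowdown(0.25, 200, 50), 2))  # 1.75
```

Models of this flavor, combined with measured local/remote access ratios, are what let the authors argue that most of their workloads tolerate disaggregation without performance loss.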
Workshop
Recorded
Correctness
Software Engineering
W
DescriptionKokkos is a C++ library and ecosystem for writing parallel programs on heterogeneous systems. One of the primary goals of Kokkos is portability: programs in Kokkos are expressed through general parallel constructs which can enable the same code to compile and execute on different parallel architectures. However, there is no known formal model of Kokkos's semantics, which must be generic enough to support current and future CPU and accelerator architectures. As a first step toward formalizing Kokkos, we introduce MiniKokkos: a small language capturing the main features of Kokkos. We then prove that MiniKokkos ensures portability across all possible parallel executions. We also provide a case study of how MiniKokkos can help reason about Kokkos programs and help find a bug in the Kokkos implementation.
Paper
Recorded
Reliability and Resiliency
TP
Best Paper Finalist
Best Student Paper Finalists
DescriptionWith the ever-shrinking size of transistors, silent data corruptions (SDCs) are becoming a common yet serious issue in HPC. Selective instruction duplication (SID) is a widely used fault-tolerance technique that can obtain high SDC coverage with low performance overhead. However, existing SID methods are confined to a single program input in their assessment, assuming that the error resilience of a program remains similar across inputs. Nevertheless, we observe that this assumption does not always hold, leading to a drastic loss of SDC coverage on different inputs and compromising HPC reliability. We notice that the SDC coverage loss correlates with a small set of instructions that we call incubative instructions, which reveal elusive error propagation characteristics across multiple inputs. We propose MINPSID, an automated SID framework that identifies and re-prioritizes incubative instructions in programs. Evaluation shows that MINPSID can effectively mitigate the loss of SDC coverage across multiple inputs.
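The basic mechanism behind instruction duplication can be shown in a minimal sketch (an illustration of the duplicate-and-compare idea underlying SID, not the MINPSID tool; the fault injection is simulated): execute a protected computation twice and flag a mismatch as a detected corruption.

```python
# Duplicate-and-compare: run a computation twice; a silent corruption in
# one copy shows up as a mismatch instead of a silently wrong answer.

def protected(fn, *args):
    """Run fn twice; return the value only if both copies agree."""
    a, b = fn(*args), fn(*args)
    if a != b:
        raise RuntimeError("SDC detected: duplicated results disagree")
    return a

flip = {"armed": False}            # simulated single-use fault injector

def dot(xs, ys):
    s = sum(x * y for x, y in zip(xs, ys))
    if flip["armed"]:              # injected "bit flip" in one execution
        flip["armed"] = False
        s += 1 << 8
    return s

assert protected(dot, [1, 2, 3], [4, 5, 6]) == 32   # clean run passes
flip["armed"] = True
try:
    protected(dot, [1, 2, 3], [4, 5, 6])
except RuntimeError:
    print("corruption detected")
```

SID's contribution is *selectivity*: duplicating only the instructions whose corruption would go silent; the paper's observation is that the right selection can change from input to input.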
Birds of a Feather
TP
XO/EX
DescriptionCome and share your experiences with the state-of-the-art of mixed-precision techniques! By wisely trading off accuracy, we can mitigate data movement overheads and increase performance of applications. No free lunch however: these optimizations require support from the software/hardware ecosystem and strong numerical validation. This BoF invites the HPC community at large interested in applying mixed precisions into their workflows. Experts from scientific applications/software libraries/hardware architectures will briefly provide the context on this timely topic, share their own perspectives, engage with the audience via a set of questions, and eventually gather feedback to define a roadmap moving forward.
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionMulti-precision methods commonly follow an approximate-iterate scheme: first obtain an approximate solution from a low-precision factorization and solve, then iteratively refine the solution to the desired accuracy, which is often as high as what is possible with traditional approaches. Targeting symmetric/Hermitian eigenvalue problems of the form Ax=(lambda)x, we revisited the SICE algorithm: by applying the Sherman-Morrison formula to the diagonally-shifted tridiagonal systems, we propose an updated SICE-SM algorithm. We exploited asynchronous scheduling techniques to take advantage of the new computational graph enabled by the use of mixed precision in the eigensolver. By incorporating the latest two-stage algorithms from the PLASMA and MAGMA numerical linear algebra libraries, we achieved up to 3.6x speedup with the mixed-precision eigensolver using the blocked SICE-SM algorithm for iterative refinement, compared with full double-complex-precision solvers for cases where a portion of the eigenvalues and eigenvectors is requested.
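The approximate-iterate pattern can be demonstrated on a toy linear system (the paper targets eigenproblems with real factorizations; here "low precision" is merely simulated by rounding to a few significant digits): solve cheaply in low precision, then refine with residuals computed in full double precision.

```python
# Iterative refinement sketch: a low-precision solve supplies an
# approximate answer; double-precision residuals drive corrections that
# are themselves solved cheaply in low precision.

def low(x):
    """Simulate low-precision arithmetic: keep ~4 significant digits."""
    return float("%.3e" % x)

def solve2x2(a, b, c, d, e, f, prec=lambda v: v):
    """Solve [[a,b],[c,d]] @ [x,y] = [e,f] by Cramer's rule under prec."""
    det = prec(prec(a * d) - prec(b * c))
    return prec(prec(e * d - b * f) / det), prec(prec(a * f - e * c) / det)

A = (1.2345678, 2.3456789, 3.4567891, 4.5678912)
rhs = (5.0, 6.0)

x, y = solve2x2(*A, *rhs, prec=low)          # cheap low-precision solve
for _ in range(4):                           # iterative refinement
    rx = rhs[0] - (A[0] * x + A[1] * y)      # residuals in full precision
    ry = rhs[1] - (A[2] * x + A[3] * y)
    dx, dy = solve2x2(*A, rx, ry, prec=low)  # correction, again cheap
    x, y = x + dx, y + dy

print(abs(A[0] * x + A[1] * y - rhs[0]) < 1e-8)  # True: residual is tiny
```

Each refinement step multiplies the error by a small contraction factor, so a handful of cheap iterations recovers near-double accuracy, which is the economics the eigensolver above exploits at scale.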
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionTime-Dependent Density Functional Theory (TDDFT) workloads are an example of high-impact computational methods that require leveraging the performance of HPC architectures. However, finding the optimal values of their performance-critical parameters raises performance portability challenges that must be addressed. In this work, we propose an ML-based tuning methodology based on Bayesian optimization and transfer learning to tackle performance portability for TDDFT codes on HPC systems. Our results demonstrate the effectiveness of our transfer-learning proposal for TDDFT workloads, which reduced the number of executed evaluations by up to 86% compared to an exhaustive search for the globally optimal performance parameters on the Cori and Perlmutter supercomputers. Compared to a Bayesian-optimization search, our proposal reduces the required evaluations by up to 46.7% to find the same optimal runtime configuration. Overall, this methodology can be applied to other scientific workloads on current and emerging high-performance architectures.
Workshop
Recorded
W
DescriptionMachine learning (ML) algorithms are showing a growing trend in helping scientific communities across different disciplines and institutions address large and diverse data problems. However, many available ML tools are programmatically demanding and computationally costly. The MLExchange project aims to build a collaborative platform equipped with enabling tools that allow scientists and facility users who do not have a profound ML background to use ML and computational resources in scientific discovery. At the high level, we are targeting a full user experience where managing and exchanging ML algorithms, workflows, and data are readily available through web applications. Since each component is an independent container, the whole platform or its individual service(s) can easily be deployed on servers of different scales. Thus, MLExchange supports flexible usage scenarios: users can either access the platform from a remote server or run its individual service(s) within their local network.
Workshop
Recorded
W
DescriptionWith the slowing of CMOS technology scaling trends and the continued growth of compute requirements for applications like 5G wireless and machine learning, there has been a widespread emphasis on new accelerator architectures emphasizing heterogeneity. However, programming heterogeneous devices can be challenging, requiring heterogenous design tools supporting multiple levels of abstraction. This talk will discuss how MLIR, a new compiler infrastructure which directly supports multiple levels of abstraction, can enable these new design tools to support next generation heterogeneous systems.
Birds of a Feather
TP
XO/EX
DescriptionMachine learning applications are rapidly expanding into scientific domains and challenging the hallmarks of traditional high performance computing workloads. We present MLPerf, a community-driven system performance benchmark which spans a range of machine learning tasks. The speakers at this BoF are experts in the fields of HPC, science applications, machine learning, and computer architecture, representing academia, government research organizations, and private industry. In this session, we will cover the past year’s development within the MLPerf organization, provide an update on the latest round of submissions to the MLPerf-HPC benchmark suite, and solicit input from interested parties within the HPC community.
Posters
Research Posters
Recorded
TP
DescriptionIn recent years, despite remarkable progress in computing and network performance, HPC platforms have struggled to maintain satisfactory I/O throughput. Various solutions have been proposed to mitigate the contention and variability experienced by a growing number of concurrent applications, particularly on heavily shared parallel file systems. As a consequence, many large-scale platforms now offer complex hierarchies of storage resources using diverse architectures based on different hardware technologies such as persistent memory or flash. In that context, we propose to study how to efficiently allocate these heterogeneous storage resources. In our poster, we introduce StorAlloc, a modular and extensible simulator of a storage-aware job scheduler. We present the design of the tool and then show, through the concrete example of dimensioning a burst-buffer partition, the insights StorAlloc can provide into storage system design and resource scheduling algorithms.
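To make the kind of question StorAlloc explores concrete, here is a toy storage-aware allocator (not StorAlloc itself); the worst-fit heuristic, node names, and capacities are illustrative assumptions:

```python
# Toy storage-aware allocator in the spirit of questions StorAlloc studies
# (not StorAlloc itself): jobs request burst-buffer capacity in GB, and a
# worst-fit heuristic picks the node with the most free space to balance load.
def allocate(jobs, nodes):
    free = dict(nodes)                       # node -> free capacity (GB)
    placement = {}
    for job, need in jobs:
        node = max(free, key=free.get)       # worst-fit: most free space first
        if free[node] < need:
            placement[job] = None            # request cannot be satisfied
            continue
        free[node] -= need
        placement[job] = node
    return placement, free

jobs = [("j1", 400), ("j2", 300), ("j3", 600)]
nodes = {"bb0": 1000, "bb1": 800}
placement, free = allocate(jobs, nodes)
print(placement, free)
```

Swapping the `max` for a `min` (best-fit) or adding release events is exactly the kind of policy variation a simulator like StorAlloc lets one evaluate before committing to a partition size.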
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionGuaranteeing the data integrity of scientific workflows and their associated data products, in the face of non-malicious and malicious threats, is of paramount importance for the validity and credibility of scientific research. In this work, we describe how we can leverage two popular cybersecurity classification frameworks, OSCRP and MITRE ATT&CK®, to systematically model threats to the integrity of scientific workflows and data in a research setting. We enumerate non-malicious and malicious threats to the integrity of scientific workflows, and present the relevant assets, concerns, avenues of attack, and impact of the threats in typical scientific workflow execution scenarios.
Posters
Research Posters
TP
XO/EX
DescriptionSupraventricular Tachycardia (SVT) occurs when the heart’s upper chambers beat either too quickly or out of rhythm with the heart’s lower chambers. This out-of-step beating is a leading cause of strokes, heart attacks, and heart failure. The most successful treatment for SVT is catheter ablation, a process where an electrophysiologist (EP) maps the heart to find areas with abnormal electrical activity. The EP then runs a catheter into the heart to burn the abnormal area, blocking the electrical signals. Much remains unknown about what triggers SVT and where to place scar tissue for optimal patient outcomes. We have produced a dynamic model of the right atrium accelerated on NVIDIA GPUs. An interface allows researchers to insert ectopic signals into the simulated atria and ablate sections of the atria, allowing them to rapidly gain insight into what causes SVTs and how to terminate them.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionScientific software is required to be fast, painless to change, and easy to deploy. Historically, compiled languages such as C/C++ and Fortran have been preferred when writing software with the highest performance requirements. However, these languages are complex, and the resulting software is challenging to maintain and deploy across platforms. We present our recent software projects written in Rust, a fast-growing, ergonomic, systems-level programming language with a toolchain designed for high performance and simple cross-platform builds. We illustrate the current state of the scientific computing ecosystem in Rust through our experience developing high-performance MPI-distributed software for computational physics problems.
Workshop
Recorded
W
DescriptionThis talk considers building a from-scratch specification and implementations of a message-passing parallel programming interface and standard, conceptually akin to MPI, but built with Occam's razor in mind and without legacy concerns or 100% backward compatibility. It is intended to be divided into a) a small, performant core (with scalability and performance ideally better than MPI-5 on its evolutionary trajectory), and b) backward-compatible interfaces to allow existing MPI-4 applications to be used (with performance and scalability no worse than today). MPI-0 would be a small core subset of MPI concepts and capabilities, re-envisaged from the ground up with strong attention to resources, heterogeneity of the nodes, concurrency within MPI, and deep application optimization. This strategy may offer an effective, new way to think about "thin middleware" for the next decade of message-passing-based parallel programming for the exascale-plus era. The C++ language will serve as a core specification and likely implementation language for such middleware, with C and modern Fortran bindings delivered as part of backward compatibility (possibly with normative headers defined by the standards body). The ability to deliver primitive QoS, strong progress, and integration with other programming models will be considered. A stackable model of core plus non-core specs will enable procurements, applications, and implementations to focus first on highly efficient profiles that have recognized features and boundaries, again in recognition that many applications use but little of MPI's functions and concepts. How tools would fit in, and potential ABI compatibility, likely via a normative C layer, will be mentioned briefly.
Birds of a Feather
TP
XO/EX
DescriptionMPICH is a widely used, open-source implementation of the MPI message passing standard. It has been ported to many platforms and used by several vendors and research groups as the basis for their own MPI implementations. This BoF session will provide a forum for users of MPICH as well as developers of MPI implementations derived from MPICH to discuss experiences and issues in using and porting MPICH. Future plans for MPICH will be discussed. Representatives from MPICH-derived implementations will provide brief updates on the status of their efforts. MPICH developers will also be present for an open forum discussion.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionCells are the basic building blocks of human organisms. Single-cell RNA sequencing is a technology for studying the heterogeneity of cells of different organs, tissues, subjects, conditions, and treatments. Identification of cell types and states in sequenced data is an important and challenging task, requiring computational approaches that are accurate, robust, and scalable. Existing approaches use cluster analysis as the first step of cell-type prediction. Their performance remains limited because they optimize only one objective function. In this study, two evolutionary clustering approaches were designed, implemented, and systematically validated, namely a single-objective evolutionary algorithm and a multi-objective evolutionary algorithm. The algorithms were evaluated on synthetic and real datasets. The results demonstrated that the performance and the accuracy of both evolutionary algorithms were consistent, stable, and on par with or better than baseline algorithms. Running time analysis of multi-processing on an HPC system showed that evolutionary algorithms can efficiently handle large datasets.
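A minimal single-objective evolutionary clustering sketch, much simpler than the algorithms described above: individuals are cluster-label assignments and fitness is the within-cluster sum of squared distances; all parameters and data are illustrative:

```python
import random

# Minimal single-objective evolutionary clustering sketch (not the paper's
# algorithms): an individual is a list of cluster labels, and fitness is
# the within-cluster sum of squared distances (lower is better).
def fitness(labels, points, k):
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
    return total

def evolve(points, k=2, pop_size=24, generations=200, seed=1):
    rng = random.Random(seed)
    n = len(points)
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, points, k))
        survivors = pop[: pop_size // 2]                 # elitist selection
        children = []
        for parent in survivors:
            child = parent[:]
            child[rng.randrange(n)] = rng.randrange(k)   # point mutation
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda ind: fitness(ind, points, k))

# Two well-separated blobs; the GA should assign each blob its own label.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best = evolve(points)
print(best)
```

A multi-objective variant would evaluate each individual against several such objectives (e.g., compactness and separation) and keep a Pareto front instead of sorting by a single score.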
Workshop
Recorded
W
DescriptionWe present an infrastructure for executing multiscale scientific workflows in a hybrid cloud environment. We use end-to-end modelling of mayonnaise production as an example of an industrially relevant problem involving modeling of materials properties and their corresponding processing methods. We present a container image design that allows integration of alternative simulation services operating at the atomistic, mesoscopic, and continuum levels. The image allows adjustment at runtime to carry out different multiscale studies following diverse parallel execution patterns. We then discuss a prototype of an on-demand HPC facility implemented on two variants of the virtualized infrastructure to handle these workflows in the context of hybrid cloud deployments.
Workshop
Recorded
W
DescriptionHeterogeneity is, and has long been, a defining characteristic of computing architectures. For example, the microarchitecture implementing an instruction set architecture (ISA) contains a diverse collection of function blocks such as instruction decoders, ALUs, memory controllers, and page-walk hardware, to name just a few. Even at the ISA level, heterogeneity is available in instructions that control specialized units such as SIMD or vector processors, bit matrix multiply, and crypto engines.
In the past, application developers were largely shielded from these forms of heterogeneity through ISA, compiler analysis, intrinsics, or high level pragmas. However as microelectronics approaches fundamental scaling limits in feature size and power, it has become increasingly necessary to provide and expose specialized circuits purpose-built, each to its narrow function. Thus, heterogeneity has become the new normal in the world of computing, affecting virtually all levels from IoT to supercomputer.
In this talk, I will discuss current and future heterogeneous architectures that we will have to exploit to achieve required performance levels. Approaches and tools to make these complex resources accessible will be reviewed. Scalability challenges and the tension between portability and performance will be discussed.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionLarge Language Models are shifting “what’s possible” in AI, but distributed training across thousands of traditional accelerators is massively complex and always suffers diminishing returns as more compute is added. Always? No longer. In this talk, Natalia Vassilieva from Cerebras Systems will present a cluster of 16 Cerebras CS-2 nodes that achieves near-perfect linear scaling across more cores than the world’s most powerful supercomputer. And the programming model is radically simple: the code for 16 nodes is exactly the same as that for a single node. A new era of easy access to extreme-scale AI has just begun.
Workshop
Recorded
W
DescriptionNERSC is the primary scientific computing facility for DOE’s Office of Science. NERSC supports diverse production workloads across a wide range of scientific disciplines, which requires a rather complicated queue structure with various resource limits and priorities. It has been challenging for users to generate proper job scripts to optimally use the systems. We developed a Slurm job script generator, a web application to help users not only generate job scripts but also learn how the batch system works. The job script generator was first deployed in 2016 to help generate an optimal process/thread affinity for hybrid MPI + OpenMP applications on NERSC’s Cori system, and was recently extended to support more systems and use cases. In this talk, we will present the features supported in our job script generator, and describe the code design and implementation, which is easily adaptable to other centers that deploy Slurm.
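A stripped-down sketch of such a generator (not NERSC's actual tool) shows the core idea: derive `--cpus-per-task` and `OMP_NUM_THREADS` from the node geometry so that MPI tasks and OpenMP threads tile the cores; the flag values and application name are placeholders:

```python
# Minimal Slurm job-script generator sketch for hybrid MPI + OpenMP jobs
# (inspired by, but much simpler than, NERSC's web tool). Given a node
# count, cores per node, and MPI tasks per node, it derives the thread
# count per task and emits matching SBATCH directives and OpenMP settings.
def make_job_script(name, nodes, cores_per_node, mpi_tasks_per_node, walltime="00:30:00"):
    threads = cores_per_node // mpi_tasks_per_node
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={mpi_tasks_per_node}",
        f"#SBATCH --cpus-per-task={threads}",
        f"#SBATCH --time={walltime}",
        "",
        f"export OMP_NUM_THREADS={threads}",
        "export OMP_PLACES=threads",
        "export OMP_PROC_BIND=spread",
        f"srun -n {nodes * mpi_tasks_per_node} ./my_app",   # ./my_app is a placeholder
    ])

script = make_job_script("xthi", nodes=2, cores_per_node=64, mpi_tasks_per_node=8)
print(script)
```

A production generator additionally has to know each system's queue limits, NUMA layout, and hyperthreading policy, which is precisely what makes a center-maintained tool valuable.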
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
Quantum Computing
W
DescriptionEfficient quantum control is necessary for practical quantum computing implementations with current technologies. Conventional algorithms for determining optimal control parameters are computationally expensive, largely excluding them from use outside of simulation. Existing hardware solutions structured as lookup tables are imprecise and costly. By designing a machine learning model to approximate the results of traditional tools, a more efficient method can be produced. Such a model can then be synthesized into a hardware accelerator for use in quantum systems. We demonstrate a machine learning algorithm for predicting optimal pulse parameters. This algorithm is lightweight enough to fit on a low-resource FPGA and perform inference with a latency of 175 ns and a pipeline interval of 5 ns, with > 0.99 gate fidelity. In the long term, such an accelerator could be used near quantum computing hardware where traditional computers cannot operate, enabling quantum control at a reasonable cost and at low latencies.
Workshop
Recorded
W
DescriptionNeuromorphic computing technology continues to make strides in the development of new algorithms, devices, and materials. In addition, applications have begun to emerge where neuromorphic computing shows promising results. However, numerous barriers to further development and application remain. In this work, we identify several science areas where neuromorphic computing can either make an immediate impact (within 1 to 3 years) or the societal impact would be extremely high if the technological barriers can be addressed. We identify both opportunities and hurdles to the development of neuromorphic computing technology for these areas. Finally, we discuss future directions that need to be addressed to expand both the development and application of neuromorphic computing.
Posters
Research Posters
TP
XO/EX
DescriptionMulti-precision methods commonly follow an approximate-iterate scheme: first obtain an approximate solution from a low-precision factorization and solve, then iteratively refine the solution to the desired accuracy, which is often as high as what is possible with traditional approaches. Targeting symmetric/Hermitian eigenvalue problems of the form Ax=(lambda)x, we revisit the SICE algorithm: by applying the Sherman-Morrison formula to the diagonally shifted tridiagonal systems, we propose an updated SICE-SM algorithm. We exploit asynchronous scheduling techniques to take advantage of the new computational graph enabled by the use of mixed precision in the eigensolver. By incorporating the latest two-stage algorithms from the PLASMA and MAGMA numerical linear algebra libraries, we achieve up to 3.6x speedup with the mixed-precision eigensolver using the blocked SICE-SM algorithm for iterative refinement, compared with full double-complex-precision solvers for cases where only a portion of the eigenvalues and eigenvectors is requested.
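The Sherman-Morrison identity at the heart of SICE-SM can be illustrated in a few lines; the toy below uses a diagonal matrix in place of the shifted tridiagonal systems, so it only demonstrates the rank-one update formula, not the actual solver:

```python
# Sherman-Morrison solve sketch: for A' = A + u v^T, solve A' x = b using
# only solves with A, via x = A^{-1}b - A^{-1}u (v^T A^{-1} b)/(1 + v^T A^{-1} u).
# Here A is diagonal so A^{-1} is trivial; in SICE-SM the role of A is played
# by shifted tridiagonal systems (this example is purely illustrative).
def sherman_morrison_solve(diag, u, v, b):
    inv = lambda w: [wi / di for wi, di in zip(w, diag)]   # apply A^{-1}
    Ainv_b, Ainv_u = inv(b), inv(u)
    vt_Ainv_b = sum(vi * yi for vi, yi in zip(v, Ainv_b))
    vt_Ainv_u = sum(vi * yi for vi, yi in zip(v, Ainv_u))
    scale = vt_Ainv_b / (1.0 + vt_Ainv_u)
    return [yb - scale * yu for yb, yu in zip(Ainv_b, Ainv_u)]

diag = [2.0, 3.0, 4.0]
u, v = [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]
b = [1.0, 2.0, 3.0]
x = sherman_morrison_solve(diag, u, v, b)

# Verify against the dense system (A + u v^T) x = b.
A = [[(diag[i] if i == j else 0.0) + u[i] * v[j] for j in range(3)] for i in range(3)]
residual = [sum(A[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
print(max(abs(r) for r in residual))
```

The payoff is that the rank-one-updated system is solved with two solves against the original (easy) matrix rather than a fresh factorization, which is what makes the formula attractive inside an iterative refinement loop.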
Tutorial
Recorded
Accelerator-based Architectures
Benchmarking
Performance
TUT
DescriptionAs we move toward exascale, the gap between peak and application performance continues to widen. Paradoxically, bad node-level performance leads to highly scalable code, but at the price of increased overall time to solution. Consequently, valuable resources are wasted, often on a massive scale. If the user cares about time to solution on any scale, optimal performance on the node level is often the key factor. We convey the architectural features of current processor chips, multiprocessor nodes, and accelerators, as far as they are relevant for the practitioner. Peculiarities like SIMD vectorization, shared vs. separate caches, bandwidth bottlenecks, and ccNUMA characteristics are introduced, and the influence of system topology and affinity on the performance of typical parallel programming constructs is demonstrated. Performance engineering and performance patterns are suggested as powerful tools that help the user understand the bottlenecks at hand and to assess the impact of possible code optimizations. A cornerstone of these concepts is the roofline model, which is described in detail, including useful case studies, limits of its applicability, and possible refinements.
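The roofline bound itself is a one-liner: attainable performance is the minimum of peak compute and memory bandwidth times arithmetic intensity. The machine numbers below are illustrative, not tied to any specific system:

```python
# Roofline model sketch: attainable performance is bounded by peak compute
# or by memory bandwidth times arithmetic intensity (FLOP/byte), whichever
# is lower. The machine numbers below are illustrative only.
def roofline(peak_gflops, bandwidth_gbs, intensity):
    return min(peak_gflops, bandwidth_gbs * intensity)

peak, bw = 3000.0, 200.0                 # hypothetical node: GF/s and GB/s
triad = roofline(peak, bw, 0.08)         # stream-triad-like kernel: bandwidth bound
gemm = roofline(peak, bw, 50.0)          # dense-matmul-like kernel: compute bound
ridge = peak / bw                        # intensity where the two bounds cross
print(triad, gemm, ridge)
```

Kernels left of the ridge point (here 15 FLOP/byte) are memory bound and benefit from data-traffic reductions; kernels to the right are compute bound and benefit from vectorization and instruction-level optimizations.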
Posters
Research Posters
TP
XO/EX
DescriptionScientific software in high performance computing is becoming increasingly complex both in terms of its size and the number of external dependencies. Correctness and performance issues can become more challenging in actively developed software with increasing complexity. This leads to software developers having to spend larger portions of their time on debugging, optimizing, and maintaining code. Making software optimization and maintenance easier for developers is paramount to accelerating the rate of scientific progress. Fortunately, there is a wealth of data on scientific coding practices available implicitly via version control histories. These contain the state of a code at each stage throughout its development via commit snapshots. Commit snapshots provide dynamic insight into the software development process that static analyses of release tarballs do not. We propose a new machine learning based approach for studying the performance of source code across code modifications.
Paper
Recorded
Accelerator-based Architectures
Performance
Visualization
TP
DescriptionRecent work demonstrated that power management (PM) can impact application performance in CPU-based HPC systems, even when machines have the same architecture and SKU. This variation occurs due to manufacturing variability and the chip’s PM. However, while modern HPC systems widely employ GPUs, there is limited work on how variability affects GPU applications. In this paper, we study 4 HPC clusters with state-of-the-art GPUs: Oak Ridge’s Summit, Sandia’s Vortex, TACC’s Longhorn, and Livermore’s Corona. The first three clusters use NVIDIA V100 GPUs, while the fourth uses AMD MI60 GPUs. After identifying applications that stress different GPU components, we gathered data from over 90% of the GPUs in the clusters. In total, we collected over 100,000 hours of data. Regardless of application and cluster, our results show significant variance: 32% (max 72%) average performance variation, despite GPU architecture and vendor SKU being the same.
Posters
Research Posters
TP
XO/EX
DescriptionA new machine learning-based non-destructive testing (NDT) technique for the examination of conductive objects is presented. NDT of objects behind barriers utilizes the defect-induced distortions of electromagnetic (EM) fields to detect flaws in the structure of inspected targets. Such distortions are highly non-linear, requiring significant amounts of data for training neural networks. To this end, a massively parallelized data generation framework is proposed in conjunction with a multi-frequency hybrid neural network (MF-HNN) to create a physics-informed inversion AI model. The resulting inversion algorithm is applied to casings, where tubular pipes are inspected. For data generation, physics-based solvers are employed to simulate the EM field distribution resulting from pipes with defects. The large-scale distribution of this step leads to 43 times faster execution than on a single CPU. This allows the MF-HNN to achieve significantly improved generalization performance and to generate high-resolution cross-sectional images of the pipelines.
Student Cluster Competition
TP
XO/EX
DescriptionChan-Yu Mou is our team leader. He is a junior student from the Tsing Hua Experimental Education Program, whose mission is to discover extraordinary students in diverse areas by looking beyond grades. Chan-Yu has broad interests in computer science and astronomy, so he joined the SCC team training as a freshman. He is a member of the team that won first place in ASC20-21 (virtual) and second prize in HPC-AI 20. Chan-Yu brings diversity, experience, and leadership to our team.
En-Ming Huang is a sophomore majoring in Computer Science. He has a strong interest in HPC. He had already joined the team training and took the graduate-level parallel program course. In his freshman year under the guidance of Prof. Chou, he has already published a paper in The Journal of Supercomputing.
Fu-Chiang Chang is a transfer student from the National Chung Cheng University. He is interested in HPC and deep learning, and currently studies distributed and federated learning under the guidance of Prof. Chou. He is the person in charge of the DLRM application in the HPC-AI’21 competition, and he is looking forward to the new challenge in the SCC competition.
Pang-Ning Wu and Pin-Yi Kuo are sophomore students majoring in Computer Science. Both of them are passionate about all things related to Information Technology and are always eager to learn something new. They joined the team to touch and learn things that can’t be taught in class.
Hsu-Tzu Ting is a junior student majoring in Computer Science. As the only female student on the team, she wants to be a role model to encourage more female students to participate in this exciting competition.
Our team is led by Prof. Chou, who has served as the team advisor since 2011 and won several awards over the years, including the overall champion in ASC '19, 20-21, and highest Linpack in SC’14, ASC '18. The team is currently ranked second in the world on the HPC-AI leaderboard. He worked in LBNL before joining National Tsing Hua University. He has published over 50 papers in top journals and conferences, including 3 papers at the SC conference. He has developed several young talents in the HPC field who are either currently pursuing Ph.D. and M.S. degrees in the US or working in HPC-related companies in Taiwan. One of them is also a student volunteer in SC’22.
Overall, we are a team with an interdisciplinary, hard-working attitude, well-trained skills, and good teamwork. We have been training for the competition for over a year since last February. The training process has not only taught us the skills and knowledge needed for the competitions but also brought us together to develop trust and friendship between each other.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionCollective communication operations are fundamental cornerstones in many high-performance applications. MPI libraries typically implement a selection logic that attempts to make good algorithmic choices for specific collective communication problems. It has been shown in the literature that the hard-coded algorithm selection logic found in MPI libraries can be improved by prior offline tuning.
We take a fundamentally different approach to improving the algorithm selection for MPI collectives. We integrate the probing of different algorithms directly into the MPI library. Whenever an MPI application is started, the tuner, instead of the default selection logic, finds the next algorithm to complete an issued MPI collective call and records its runtime. With the recorded performance data, the tuner is able to build a performance model that allows selecting an efficient algorithm.
We show in a case study using miniAMR that our approach can effectively tune the performance of Allreduce.
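The probe-then-select loop can be sketched in a few lines; this toy tuner (not the authors' implementation) times each candidate "algorithm" once and thereafter dispatches to the fastest recorded one:

```python
import time

# Toy online tuner in the spirit of probing collective algorithms from
# inside the library (not the authors' implementation): each call runs
# the next untried "algorithm" and records its runtime; once all have
# been probed, the fastest recorded one is selected from then on.
class CollectiveTuner:
    def __init__(self, algorithms):
        self.algorithms = algorithms          # name -> callable
        self.timings = {}                     # name -> first recorded runtime

    def run(self, data):
        untried = [n for n in self.algorithms if n not in self.timings]
        name = untried[0] if untried else min(self.timings, key=self.timings.get)
        t0 = time.perf_counter()
        result = self.algorithms[name](data)
        self.timings.setdefault(name, time.perf_counter() - t0)
        return name, result

# Two stand-in "collective algorithms" with deliberately different costs.
def slow_sum(xs):
    time.sleep(0.01)                          # artificially slow variant
    return sum(xs)

def fast_sum(xs):
    return sum(xs)

tuner = CollectiveTuner({"ring": slow_sum, "tree": fast_sum})
calls = [tuner.run([1, 2, 3])[0] for _ in range(4)]
print(calls)
```

A real in-library tuner additionally has to key its timings on message size and communicator, and build a model across those dimensions rather than keeping a single measurement per algorithm.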
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionWe propose an interoperation mechanism to enable novel composability across pragma-based programming models. We study and propose a clear separation of duties and implement our approach by augmenting the OmpSs-2 programming model, compiler, and runtime system to support OmpSs-2 + OpenACC programming. To validate our proposal, we port ZPIC, a kinetic plasma simulator, to leverage our hybrid OmpSs-2 + OpenACC implementation. We compare our approach against OpenACC versions of ZPIC in terms of performance on a multi-GPU HPC system. We show that our approach manages to provide automatic asynchronous and multi-GPU execution, removing significant burden from the application’s developer, while also being able to outperform manually programmed versions, thanks to a better utilization of the hardware.
Workshop
Recorded
Correctness
Software Engineering
W
DescriptionMPI Remote Memory Access (RMA) provides a one-sided communication model for MPI applications. Ensuring consistency between RMA operations with synchronization calls is a key requirement when writing correct RMA codes. Wrong API usage may lead to concurrent modifications of the same memory location without proper synchronization resulting in data races across processes. Due to their non-deterministic nature, such data races are hard to detect. This paper presents MUST-RMA, an on-the-fly data race detector for RMA applications. MUST-RMA uses a race detection model based on happened-before and consistency analysis. It combines the MUST correctness tool with the race detector ThreadSanitizer to detect races across processes in RMA applications. A classification quality study on MUST-RMA with different test cases shows a precision and recall of 0.93. An overhead study on a stencil and a matrix transpose kernel shows runtime slowdowns of 3x to 20x for up to 192 processes.
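The happened-before reasoning underlying such race detection can be illustrated with vector clocks; this toy checker (far simpler than MUST-RMA) flags two accesses to the same location as racy when neither clock dominates the other and at least one access is a write:

```python
# Tiny happened-before race check sketch (illustrative only, far simpler
# than MUST-RMA): each memory access carries a vector clock; two accesses
# to the same location race if neither clock dominates the other and at
# least one of them is a write.
def happens_before(c1, c2):
    return all(a <= b for a, b in zip(c1, c2)) and c1 != c2

def races(accesses):
    found = []
    for i in range(len(accesses)):
        for j in range(i + 1, len(accesses)):
            (loc1, op1, c1), (loc2, op2, c2) = accesses[i], accesses[j]
            if loc1 != loc2 or "w" not in (op1, op2):
                continue                       # different location, or read-read
            if not happens_before(c1, c2) and not happens_before(c2, c1):
                found.append((i, j))           # concurrent conflicting accesses
    return found

# Two processes: P0 writes x at clock (1,0); P1 reads x concurrently at
# (0,1) -> race. A later P1 read at (2,1) is ordered after the write
# (e.g., by a synchronization call), so it does not race.
accesses = [
    ("x", "w", (1, 0)),   # P0 write
    ("x", "r", (0, 1)),   # P1 concurrent read
    ("x", "r", (2, 1)),   # P1 read after synchronizing with P0
]
print(races(accesses))
```

In the RMA setting, the clocks come from tracking synchronization calls such as fences and lock/unlock epochs, which is where the consistency analysis in the paper does its real work.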
Workshop
Recorded
W
DescriptionAs High Performance Computing (HPC) workflows increase in complexity, their designers seek to enable the automation and flexibility offered by cloud technologies. Container orchestration through Kubernetes enables highly desirable capabilities but does not satisfy the performance demands of HPC. Kubernetes tools that automate the lifecycle of Message Passing Interface (MPI)-based applications do not scale, and the Kubernetes scheduler does not provide crucial scheduling capabilities. In this work, we detail our efforts to port CORAL-2 benchmark codes to Kubernetes on IBM Cloud and AWS EKS. We describe contributions to the MPI Operator to achieve 3,000-rank scale, a two-orders-of-magnitude improvement over the state of the art. We discuss enhancements to KubeFlux, our scheduler plugin for Kubernetes based on the next-generation, cloud-ready Flux framework. Finally, we compare the placement decisions of KubeFlux with those of the Kubernetes scheduler and demonstrate that KubeFlux allows simulated scientific workflows to achieve up to 3x higher performance.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF is meant to be an open discussion to guide the future roadmap for Open OnDemand (openondemand.org), by getting feedback from the community on the prioritization of the various tasks planned for the next few years. OOD is extremely relevant to ongoing discussions within the HPC community about user interfaces and science gateways. The session leaders, all part of the OOD development team, will jointly develop the content for the presentation in advance to ensure a wide range of viewpoints and topics are presented. We will also consult with our user advisory group in advance for their suggestions.
Invited Talk
Recorded
TP
XO/EX
DescriptionLarge-scale data-parallel workloads demand ever-growing performance under an (almost) constant power envelope. To address this fundamental challenge in a regime of diminishing returns from technology scaling, we must minimize overhead associated with data transfer, data-flow management, as well as instruction fetching, decoding, while introducing "controlled doses" of domain specialization. The last decade has seen the rise of open instruction sets, open architectures, and open hardware as key enablers for designing more efficient computing systems. In this talk, I share insights gained in designing open-source RISC-V hardware and software for energy-efficient computing, moving from tiny, parallel ultra-low power chips to high-performance many-core chiplets, and provide a personal view on future directions.
Birds of a Feather
TP
XO/EX
DescriptionOpenACC is focused on helping the developer community advance by expanding their accelerated parallel computing skills, and supports a directive-based, high-level accelerated programming model on CPUs, GPUs, and other devices. OpenACC supports over 25 hackathons globally each year and has facilitated acceleration of over 200 applications on multiple platforms, e.g., Frontier, Perlmutter, JUWELS, Summit, Sunway Taihulight, and Piz Daint. This BoF invites scientists, programmers, and researchers to discuss their experiences in adopting OpenACC for scientific applications, learn about roadmaps from implementers, share best practices in community-facilitated training in software development, and hear about the latest developments in the language specification.
Birds of a Feather
TP
XO/EX
DescriptionOpenHPC provides a community-driven stack of common ingredients to deploy and manage Linux based HPC clusters. Formed in November 2015 and formalized as a Linux Foundation project in June 2016, OpenHPC continues to see rapid growth in its user community and has added new software components and supports multiple OSes/architectures. At this BoF, speakers from the OpenHPC Technical Steering Committee will provide technical updates from the project and near-term roadmaps. We then invite open discussion giving attendees an opportunity to provide feedback on OpenHPC conventions and packaging, request additional components and configurations, and to discuss general future trends.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionOpening remarks for the Seventh Annual HPC Systems Professionals Workshop.
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionOpening remarks of the Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022)
Birds of a Feather
TP
XO/EX
DescriptionIn this highly interactive BoF, attendees will get first-hand information from OpenMP implementors and language designers on the planned features of the upcoming OpenMP API version 6.0. Through a series of lightning talks and discussion rounds, BoF participants will have ample opportunity to interact with these different groups of OpenMP experts, ask questions, and provide their feedback.
The leaders of the OpenMP ARB will provide insight into the future of OpenMP, from the 5.2 specification released in Nov'21 and beyond to OpenMP 6.0. Vendor representatives will discuss support and timelines for OpenMP features and expert users will describe their journey.
Workshop
Recorded
Algorithms
Architectures
Compilers
Computational Science
Exascale Computing
Heterogeneous Systems
Hierarchical Parallelism
Memory Systems
Parallel Programming Languages and Models
Parallel Programming Systems
Resource Management and Scheduling
W
DescriptionOpenMP has become the de facto standard for shared memory parallel programming. OpenMP provides a directive, nowait, to enable asynchronous target offload from host to device. In this presentation, we identify best practices in using the asynchronous offload in OpenMP correctly and performantly. Through experimental evaluation on Summit and Crusher, we show how we use the nowait clause of OpenMP to improve performance of a graph algorithm, Floyd-Warshall, by up to 58.24% on Summit and 30.38% on Crusher. Such opportunities suggest the need for programmers to use the nowait features of OpenMP with care in order to achieve performance.
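For reference, Floyd-Warshall (the graph algorithm evaluated in this presentation) computes all-pairs shortest paths; a minimal serial Python sketch, independent of the OpenMP `nowait` offload machinery under discussion:

```python
import math

def floyd_warshall(dist):
    """All-pairs shortest paths; dist is an n x n matrix with
    math.inf where no direct edge exists. Modified in place."""
    n = len(dist)
    for k in range(n):          # intermediate vertex
        for i in range(n):      # the inner loops are the natural
            for j in range(n):  # candidates for device offload
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

INF = math.inf
g = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
print(floyd_warshall(g)[0][2])  # -> 5 (via vertex 1: 3 + 2)
```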
Workshop
Recorded
W
DescriptionThe Message Passing Interface (MPI) is a software platform that can utilize the parallel capabilities of most multiprocessors, making it useful for teaching students about parallel and distributed computing (PDC). MPI provides language bindings for Fortran and C/C++, but many university instructors lack expertise in these languages, preventing them from using MPI in their courses. OpenMPI is a free implementation of MPI that also provides Java bindings, allowing instructors who know Java but not C/C++ or Fortran to teach PDC. However, Java has a reputation as a “slow” language, so some say it is unsuitable for teaching PDC. This paper gives a head-to-head comparison of the performance of OpenMPI’s Java and C bindings. Our study shows that by default, Java can be faster than C unless one takes special measures, and it exhibits similar speedup, efficiency, and scalability. We conclude that Java is a suitable language for teaching PDC.
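The speedup and efficiency figures such a comparison rests on are the standard parallel metrics; a tiny illustrative sketch (the timings below are invented, not from the paper):

```python
def speedup(t_serial, t_parallel):
    """Classic speedup: serial time over parallel time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency on p processes (1.0 = ideal scaling)."""
    return speedup(t_serial, t_parallel) / p

# Hypothetical timings in seconds, purely for illustration:
t1, t8 = 64.0, 10.0
print(speedup(t1, t8))        # -> 6.4
print(efficiency(t1, t8, 8))  # -> 0.8
```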
Birds of a Feather
TP
XO/EX
DescriptionOperational Data Analytics (ODA) provides unique opportunities to analyze, understand, and optimize operations of HPC systems. Readily available open-source frameworks make the collection of monitoring data from different domains of the HPC system (infrastructure, system hardware, software, applications) increasingly easy. However, making the data work for HPC operations is not straightforward. AI-based methods seem interesting, but which tools and methods are suitable for this type of data is not obvious. This BoF aims to bring together practitioners in HPC operations to share use cases for ODA, discuss problems, and provide feedback.
Workshop
Recorded
W
DescriptionThe serverless computing model is being adopted and supported by different cloud computing vendors. However, this trend is not getting reflected in the HPC community because of some inherent difficulties and design limitations of the serverless model. In this talk, I'll demonstrate how HPC users and programmers can leverage the serverless computing model for their workflows, and overcome some of the demanding challenges using our group's open-source tools.
Paper
Recorded
Applications
Computational Science
Scientific Computing
TP
DescriptionNek5000/RS, a highly-performant open-source spectral element code, has recently achieved an unprecedented milestone in the simulation of nuclear reactors: the first full core computational fluid dynamics simulations of reactor cores, including pebble beds with 352,625 pebbles and 98M spectral elements (51 billion gridpoints), advanced in less than 0.25 seconds per Navier-Stokes timestep. The authors present performance and optimization considerations necessary to achieve this milestone when running on all of Summit. These optimizations led to a four-fold reduction in time-to-solution, making it possible to perform high-fidelity simulations of a single flow-through time in less than six hours for a full reactor core under prototypical conditions.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionLarge scale neural network training is challenging due to the high ratio of communication to computation. Recent work has shown that these large networks contain sparse subnetworks consisting of 10-20% of the parameters, which when trained in isolation reach comparable accuracy to the larger network. In this work, we propose a novel approach that exploits the existence of these sparse subnetworks to dramatically improve the efficiency of large scale neural network training. By storing in sparse and computing in dense, we are able to reduce the number of parameters drastically while matching the compute efficiency of the original network. We exploit this reduced parameter set to optimize the communication time of AxoNN, a state-of-the-art framework for parallel deep learning. Our approach yields a significant speedup of 17% when training a 2.7 billion parameter transformer model on 384 GPUs.
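The store-sparse/compute-dense idea can be shown in miniature on a plain weight vector (a toy sketch, not the AxoNN implementation):

```python
def to_sparse(dense, eps=0.0):
    """Keep only the significant parameters as index -> value pairs;
    this is the small set that gets stored and communicated."""
    return {i: w for i, w in enumerate(dense) if abs(w) > eps}

def to_dense(sparse, n):
    """Rebuild the dense vector so compute kernels (matmuls etc.)
    keep their regular, efficient access pattern."""
    out = [0.0] * n
    for i, w in sparse.items():
        out[i] = w
    return out

weights = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.25, 0.0]
s = to_sparse(weights)
print(len(s), "of", len(weights), "parameters stored")  # 3 of 10
assert to_dense(s, len(weights)) == weights
```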
Paper
Recorded
Reliability and Resiliency
TP
DescriptionGPU’s powerful computational capacity holds great potential for processing hierarchically-compressed data without decompression. Unfortunately, existing GPU approaches offer only traversal-based analytics; random access is extremely inefficient, substantially limiting their utility. To solve this problem, we develop a novel and broadly applicable optimization enabling efficient random access to hierarchically-compressed data in GPU memory. We address three major challenges. The first is designing GPU data structures that support random access. The second is efficiently generating data structures on GPUs. Generating data structures for random access is costly on the CPU, and the inefficiency increases dramatically when PCIe data transmission is incorporated. The third is query processing on compressed data in GPU memory. Random accesses result in severe conflicts between massive threads. We evaluate our solution on two GPU platforms using five real-world datasets. Experiments show that the random access operations on GPU can achieve 65.04x average speedup compared to the state-of-the-art method.
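As a deliberately simplified stand-in for the paper's hierarchical scheme, run-length encoding with prefix sums illustrates the core idea of random access without decompression (a binary search over cumulative run lengths):

```python
import bisect
from itertools import groupby

def rle_compress(seq):
    """Run-length encode, storing run values plus cumulative end
    offsets so random access needs only a binary search."""
    values, ends = [], []
    total = 0
    for v, run in groupby(seq):
        total += sum(1 for _ in run)
        values.append(v)
        ends.append(total)   # prefix sum of run lengths
    return values, ends

def rle_at(values, ends, i):
    """Random access seq[i] without decompressing the sequence."""
    return values[bisect.bisect_right(ends, i)]

data = list("aaabbbbbcddd")
vals, ends = rle_compress(data)
assert all(rle_at(vals, ends, i) == data[i] for i in range(len(data)))
```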
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
DescriptionIn this work, we focus on efficiently generalizing the Bruck algorithm to non-uniform all-to-all data exchange. We present two alternative techniques for extending the Bruck algorithm to support non-uniform data distributions: padded Bruck and two-phase Bruck. In padded Bruck, we convert the non-uniform communication pattern into a uniform one by padding data messages into equal-sized buffers. Our other implementation, two-phase Bruck, uses a metadata exchange phase and a monolithic working buffer to facilitate non-uniform all-to-all data exchange. We also performed an experimental investigation of the tunable Bruck algorithm with varying radix r, and demonstrated that the Bruck algorithm with r = sqrt(P) (P: total number of processes) is the most effective in most cases.
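The padded-Bruck conversion from non-uniform to uniform buffers can be sketched as follows (a single-process toy; the real implementation pads MPI message buffers):

```python
def pad_messages(msgs, fill=0):
    """Pad per-destination buffers to a common length so a uniform
    all-to-all algorithm (e.g. Bruck) can be applied; returns the
    padded buffers plus the original counts needed to unpad."""
    width = max(len(m) for m in msgs)
    counts = [len(m) for m in msgs]
    padded = [m + [fill] * (width - len(m)) for m in msgs]
    return padded, counts

def unpad(padded, counts):
    """Strip the padding after the exchange using the saved counts."""
    return [m[:c] for m, c in zip(padded, counts)]

msgs = [[1, 2], [3], [4, 5, 6]]
p, c = pad_messages(msgs)
assert all(len(m) == 3 for m in p)   # now uniform
assert unpad(p, c) == msgs           # round-trips losslessly
```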
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionWith exascale workloads growing more complex, optimizing the HPC environment, both hardware and software, is becoming much more challenging. On the hardware side, workloads still include a substantial computational component, as measured by Flops performance and memory bandwidth. They may also integrate a significant data analytics portion for applications such as digital twins. Recently, artificial intelligence optimization techniques have been introduced: AI-optimized surrogate models may replace classical computations with neural networks, boosting application performance, sometimes by an order of magnitude, when run on AI-optimized hardware components (GPUs, IPUs…). Other applications, e.g., gene sequencing, may make optimal use of dataflow components such as FPGAs. An exascale hardware platform tightly integrates a large number of heterogeneous nodes for computing, data processing, AI…. Since a denser system architecture is a more efficient one, HPC cabinet power consumption and heat dissipation requirements now far exceed the air-cooling capacity of classical IT systems. We must rely on liquid cooling, which also allows for “free” cooling with high-temperature water (up to 40°C). In addition, the software environment provides an extra level of optimization with machine-learning-trained runtime modules for job scheduling, data instantiation, and energy/performance control. Finally, there is a shift toward cloud computing, either for financial reasons (OPEX vs. CAPEX) or simply to provide extra resources on top of in-house systems. Once containerized (Docker) and managed by an orchestrator (Kubernetes), HPC applications run at bare-metal speed. The seamless integration of cloud and on-premises resources requires federated management.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Recorded
TP
DescriptionThe traceback phase of the Smith-Waterman (SW) algorithm requires significant memory and introduces an irregular memory access pattern, which makes it challenging to implement for GPU architectures. In this work, we introduce a novel strategy for implementing the traceback kernel for the SW algorithm on GPUs by restructuring the global memory access patterns and introducing a memory-efficient data structure for storing large dynamic programming matrices in the GPU’s limited memory. To demonstrate this kernel’s performance, we integrated it into the existing ADEPT library and Metahipmer2, a de novo metagenomic short read assembler. Our implementation is 3.6x faster than traceback in GASAL2, and 51x faster than traceback in Striped Smith-Waterman, the current state-of-the-art SW libraries on GPU and CPU respectively. It sped up the final alignment step in Metahipmer2 by an average of 44% and improved the overall execution time of Metahipmer2 by an average of 13%.
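For context, a minimal serial Smith-Waterman with traceback (not the GPU kernel itself) shows why the traceback phase needs the full dynamic programming matrix:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Minimal serial Smith-Waterman with traceback. Keeping the
    full DP matrix H is exactly the memory cost that makes a GPU
    traceback kernel challenging."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = (0, 0, 0)                       # (score, i, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i-1] == b[j-1] else mismatch
            H[i][j] = max(0, H[i-1][j-1] + s,
                          H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, (H[i][j], i, j))
    # Traceback from the maximum-scoring cell until the score hits 0.
    _, i, j = best
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i-1] == b[j-1] else mismatch
        if H[i][j] == H[i-1][j-1] + s:
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif H[i][j] == H[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
    return best[0], ''.join(reversed(out_a)), ''.join(reversed(out_b))

print(smith_waterman("ACGT", "ACGT"))  # (8, 'ACGT', 'ACGT')
```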
Paper
Recorded
Data Analytics
Performance
TP
DescriptionThis paper introduces Out of Hypervisor (OoH), a new virtualization research axis. Instead of emulating full virtual hardware inside a VM to support a hypervisor, the OoH principle is to individually expose hypervisor-oriented hardware virtualization features to the guest OS. This way, guest’s processes could take benefit from those features. We illustrate OoH with Intel PML (Page Modification Logging) which allows efficient dirty page tracking to improve VM live migration. Because dirty page tracking is at the heart of many essential tasks including checkpointing and garbage collection, OoH exposes PML to accelerate these tasks in the guest. We present two OoH PML designs namely Shadow (SPML) and Extended PML (EPML) that we integrated into CRIU and Boehm GC. Evaluations results show that EPML speeds up CRIU checkpointing (by 13×) and Boehm garbage collection (by 6×) times compared to SPML, /proc, and userfaultfd while leading to 16× less overhead on applications.
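The dirty-page-tracking idea that PML accelerates can be modeled in a few lines (a pure-software toy with an assumed 4 KiB page size and an invented API, not CRIU or PML itself): a checkpoint only copies pages written since the previous one.

```python
PAGE = 4096  # assumed page size for this toy model

class TrackedMemory:
    """Toy dirty-page tracking: every write records which pages it
    touched, so a checkpoint copies only modified pages -- the same
    idea PML provides in hardware."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.dirty = set()

    def write(self, addr, data):
        self.mem[addr:addr + len(data)] = data
        first = addr // PAGE
        last = (addr + len(data) - 1) // PAGE
        self.dirty.update(range(first, last + 1))

    def checkpoint(self, snapshot):
        """Copy only dirty pages into snapshot, then clear the log;
        returns how many pages were copied."""
        for page in self.dirty:
            off = page * PAGE
            snapshot[off:off + PAGE] = self.mem[off:off + PAGE]
        copied, self.dirty = len(self.dirty), set()
        return copied

m = TrackedMemory(16 * PAGE)
snap = bytearray(16 * PAGE)
m.write(3 * PAGE + 10, b"hello")
m.write(9 * PAGE, b"world")
assert m.checkpoint(snap) == 2       # only 2 of 16 pages copied
assert snap[3 * PAGE + 10:3 * PAGE + 15] == b"hello"
```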
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionWe outline four different tools developed to either solve a specific problem or streamline a workflow related to the configuration and administration of a HPC Cluster. The issues that prompted the creation of the tools were identified at the Stanford Research Computing Center in the context of managing the Sherlock HPC Cluster and Oak long-term data storage environments. The tools that were created to address the issues encountered have been used in multiple locations, both internal and external to Stanford University. In this paper, we describe the solutions developed for four different areas of system management: filesystem (Lustre), drives (SAS), interconnect (InfiniBand), and job scheduler (Slurm).
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionMetadata management matters. Beyond its creation, you need a long-term plan for managing metadata storage and performance to get the most out of your HPC environments.
This session will explore the metadata challenges and use cases associated with all phases of HPC workloads, including data ingest, pre-processing, training, inference, analysis, and validation. Join WEKA CTO Shimon Ben-David to examine what IO profiles look like at every stage in a data pipeline, why and how these challenges become amplified when multiple HPC workloads are consolidated onto a common platform, and learn about storage strategies that can help you to overcome these metadata challenges.
Paper
Recorded
Accelerator-based Architectures
Bioinformatics
File Systems and I/O
TP
Best Paper Finalist
Best Student Paper Finalists
DescriptionQueries of multi-TB Mass Spectrometry (MS) repositories provide deep insights into biological processes and pose challenging data processing problems. The key bottleneck for running these queries is the number of small random reads. Byte-addressable persistent main memory (PMEM) technologies enable real-time MS search systems by delivering low-latency, high-bandwidth storage.
This work presents P-MASSIVE, a real-time, multi-terabyte-scale MS search system. P-MASSIVE takes advantage of PMEM and the underlying nature of its data access patterns to maximize performance. We evaluate P-MASSIVE across various storage hierarchies and project forward over the next decade to understand how MS query systems might evolve.
Our evaluation shows that P-MASSIVE offers a cost-effective solution that achieves near-DRAM performance. A single query takes 1.7 seconds in P-MASSIVE, 69× faster than the state-of-the-art implementation. In an end-to-end, user-facing application, P-MASSIVE delivers a 90% shorter wait time than the latest MS search tool, returning results within seconds rather than minutes.
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
DescriptionVector gather instructions are available in various processors, which are essential for handling irregular memory accesses. Additionally, the processors support virtual memory that allows programmers not to consider the limitation of the physical memory space. To realize the virtual memory, the processors require address translation between virtual and physical addresses. When a vector gather instruction loads data elements distributed over the physical memory space, all virtual addresses must be translated one by one, causing many translations by accessing a Translation Lookaside Buffer (TLB). Hence, the TLB easily becomes a bottleneck in handling vector gather instructions. To relieve the bottleneck, this paper proposes an address coalescing method for the address translations of vector gather instructions by utilizing vector arithmetic units in the processor. The evaluation results show that the proposed method can achieve a 2x performance improvement in numerical and 1.08x in graph applications, which contain many vector gather instructions.
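The coalescing idea can be sketched as translating each distinct virtual page only once and reusing the result for every gathered element on that page (the `translate` function below is a hypothetical stand-in for a TLB/page-table lookup):

```python
PAGE = 4096  # assumed page size for this sketch

def translate(vpage):
    """Hypothetical stand-in for a TLB/page-table lookup; returns a
    pretend physical page number."""
    return vpage + 0x1000

def gather_translate(vaddrs):
    """Coalesce translations: addresses on the same virtual page
    share one lookup, mirroring the paper's idea of cutting TLB
    pressure for vector gather instructions."""
    cache = {}
    paddrs, lookups = [], 0
    for va in vaddrs:
        vp, off = va // PAGE, va % PAGE
        if vp not in cache:
            cache[vp] = translate(vp)
            lookups += 1
        paddrs.append(cache[vp] * PAGE + off)
    return paddrs, lookups

addrs = [0x10, 0x20, 0x1010, 0x30, 0x1020]   # 5 elements, 2 pages
_, lookups = gather_translate(addrs)
print(lookups)  # -> 2 translations instead of 5
```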
Panel
Recorded
Career Development
Emerging Technologies
State of the Practice
TP
XO/EX
DescriptionEvery year we see flashy new technology introduced by new and existing vendors promising new heights of performance, both in von Neumann architectures and beyond. However, without Dennard scaling or even Moore’s Law, much of the hard work required to benefit from these advances falls on the software programmer. Domain scientists are often put into a situation where they need to quickly become experts in emerging technologies with steep learning curves and their confidence is challenged. Should we be porting our codes to each new technology? How many code changes are “too many”? Is true performance portability even a realistic goal? In this panel, we will explore what we have learned from (successfully or unsuccessfully) porting to GPUs, many-core, FPGAs, and more, and discuss how both the scientific community and industry can lower the barrier to allow domain scientists to productively exploit emerging technologies.
Students@SC
DescriptionRegister here for the event by creating an account at https://www.designsafe-ci.org/ using your institutional email address prior to 7th November 2022.
Experience a slice of the computational research world: take on the role of a computational scientist tasked with understanding a pandemic currently spreading through a community and researching solutions to keep the community safe.
The SC22 Student Programming will host a workshop that will introduce students to a slice of that computational research cycle. The workshop will be taught by Charlie Dey and Je’aime Powell from the Texas Advanced Computing Center (TACC).
Computational research begins with an observation of a natural occurrence, then transitions to developing a model which mathematically describes that occurrence, to using advanced computing techniques to solve that model, then generating, verifying, and validating the data against observational data, and repeating the cycle: building and expanding, solving, generating, verifying and validating. The end goal is to build a system, accurately representing a scientific process, which you can run "what if" scenarios against when a real world experiment is not attainable.
It starts with a simple scientific process, using simple probability to get a "person" sick. That process is then expanded into a computational model that simulates a disease propagating through a set population. Students will be broken into teams and given a set of challenges that require them to update and expand their computational models.
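A simple probabilistic disease model of the kind described above might be sketched like this (one illustrative design; the states and parameters are our own choices, not the workshop's materials):

```python
import random

def step(population, p_transmit, p_recover):
    """One day of the epidemic: each infected person may infect one
    random contact and may recover. States: 'S' susceptible,
    'I' infected, 'R' recovered (immune)."""
    n = len(population)
    nxt = population[:]
    for i, state in enumerate(population):
        if state == 'I':
            contact = random.randrange(n)
            if population[contact] == 'S' and random.random() < p_transmit:
                nxt[contact] = 'I'
            if random.random() < p_recover:
                nxt[i] = 'R'
    return nxt

random.seed(42)
pop = ['I'] + ['S'] * 99          # one sick person in a population of 100
for day in range(60):
    pop = step(pop, p_transmit=0.4, p_recover=0.1)
print(pop.count('S'), pop.count('I'), pop.count('R'))
```

Teams could then run "what if" scenarios by varying `p_transmit` (masking, distancing) or `p_recover` (treatment) and validating the curves against observational data, as the description outlines.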
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionArtificial Intelligence (AI) enhances the speed, precision, and effectiveness of many applications and simulations of different fields, including scientific applications and large-scale HPC simulations and models. Recently, researchers have attempted to solve problems related to High-Performance Computing and Cyberinfrastructure, such as Scheduling and Resource Management, Device Mapping and Autotuning, Code Optimization and Compilers, Code Generation and Translation, etc., using AI and specifically Deep Learning. However, a major challenge of this type of research is that Deep Learning methods usually need large datasets, and unlike in other fields, comparatively fewer datasets are available for these tasks. Another major challenge of data-driven HPC research is the representation of the data or code. For example, some primary research questions on data-driven Code and Compiler Optimization remain unanswered: “Can there be a UNIVERSAL REPRESENTATION for code that will perform well for all tasks, or do we need to have different representations for multiple optimizations? Can DL models learn ENOUGH without any dynamic or profiling information? Can DL models learn from all the IMBALANCED and mostly UNLABELED data?”. This panel aims to identify and discuss the challenges and opportunities for applying Deep Learning to HPC. It presents a stimulating environment where the community can discuss topics relevant to HPC and AI. The panel intends to initiate research collaborations and provides an opportunity to receive feedback and opinions from domain experts and discover new ideas, directions, and potential solutions in data-driven HPC research.
Workshop
Recorded
Security
W
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionParaGraph is an open-source toolkit for use in co-designing hardware and software for supercomputer-scale systems. It bridges an infrastructure gap between an application target and existing high-fidelity computer-network simulators. The first component of ParaGraph is a high-level graph representation of a parallel program, which faithfully represents parallelism and communication, can be extracted automatically from a compiler, and is “tuned” for use with network simulators. The second is a runtime that can emulate the representation’s dynamic execution for a simulator. User-extensible mechanisms are available for modeling on-node performance and transforming high-level communication into operations that backend simulators understand. Case studies include deep learning workloads that are extracted automatically from programs written in JAX and TensorFlow and interfaced with several event-driven network simulators. These studies show how system designers can use ParaGraph to build flexible end-to-end software-hardware co-design workflows to tweak communication libraries, find future hardware bottlenecks, and validate simulations with traces.
Tutorial
Recorded
Algorithms
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Performance
TUT
DescriptionThis tutorial provides a comprehensive overview of parallel computing, emphasizing aspects most relevant to the user. It is suitable for new users, students, managers, and anyone seeking an overview of parallel computing. It discusses software and hardware/software interaction, with an emphasis on standards, portability, and systems that are widely available.
The tutorial surveys basic parallel computing concepts using examples from engineering, scientific computing, and machine learning. These examples illustrate the use of MPI on distributed memory systems; OpenMP on shared memory systems; MPI+OpenMP on hybrid systems; and CUDA and compiler directives on GPUs and accelerators. It discusses numerous parallelization and load balancing approaches, as well as software engineering and performance improvement aspects, including the use of state-of-the-art tools.
The tutorial helps attendees make intelligent decisions by covering the primary options that are available, explaining how the different components work together and what they are suitable for. Extensive pointers to web-based resources are provided for follow-up studies.
Posters
Research Posters
TP
XO/EX
DescriptionThis study employed deep learning and computing scalability to create a model that predicts the velocity of a straining turbulent flow. The turbulent flow was generated in a laboratory, with turbulence intensity controlled via impeller rotation speed. The mean strain rate is produced by two circular plates moving toward each other in the center of the measuring area, driven by an actuator. The dynamics of the particles are measured using high-speed Lagrangian Particle Tracking at 10,000 frames per second. The measured data were used to design a gated recurrent unit model, implemented on two powerful parallel computing machines, JUWELS and DEEP-EST. Velocity forecasting with the gated recurrent network yields promising results, and GPU scaling on these machines significantly accelerates the model's training, strengthening the ability to predict turbulent flow.
Tutorial
Recorded
Architectures
Benchmarking
Big Data
Data Management
Datacenter
Emerging Technologies
File Systems and I/O
Storage
TUT
DescriptionI/O on HPC systems is a black art. This tutorial sheds light on the state-of-the-art in parallel I/O and provides the knowledge necessary for attendees to best leverage I/O resources available to them. We cover the entire I/O software stack including storage and parallel file systems at the lowest layer, the role of burst buffers (NVRAM), intermediate layers (such as MPI-IO), and high-level I/O libraries (such as HDF-5). We emphasize ways to use these interfaces that result in high performance and tools for generating insight into these stacks.
The first third of the tutorial covers parallel I/O fundamentals. We discuss storage technologies, both present and near-future and the major parallel and distributed file systems. We focus on application in the second third, connecting storage to our examination of the upper library layers of the I/O stack, covering MPI-IO, Parallel netCDF, and HDF5. Finally we discuss tools for understanding I/O behavior.
Students@SC
DescriptionDo you love a challenge? Have you ever participated in coding competitions like the International Collegiate Programming Contest (ICPC)? Do you want to test your parallel and distributed programming skills, or develop them? Come join us in the first Parallel Programming Marathon at SC! You will receive a set of problem descriptions and sequential/serial solutions. You are challenged to optimize them, while keeping the output correct, using parallel and distributed programming techniques. Your aggregated speedup will determine your place in the rankings! The contest will be asynchronous and will stay open for three days, so you can explore all SC offers while having some fun coding. Are you ready?
Register here for the event: https://docs.google.com/forms/d/e/1FAIpQLSc6--QSYQawck4CqgVAoBTbjy3fFNUOL0CM4F6UVNvCaA1JOQ/viewform
Posters
Research Posters
TP
XO/EX
DescriptionThe standard implementation of MPI_Alltoall uses a combination of techniques, including the spread-out and Bruck algorithms. The existing Bruck algorithm implementation is limited to a radix of two, so the total number of communication steps is fixed at log2(P) (P: total number of processes). The spread-out algorithm, on the other hand, requires P-1 communication steps. A wide, unexplored, tunable parameter space lies between these two extremes of the communication spectrum. In this paper, we formalize a generalized formula and implementation of the Bruck algorithm whose radix can be varied from 2 to P-1. With this ability, both the total number of communication steps and the total amount of data transmitted can be tuned, enabling performance tuning. Our experimental investigation demonstrates that Bruck with the optimal radix is up to 57% faster than the vendor's optimized MPI_Alltoall on the Theta supercomputer.
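The radix-step tradeoff described above is easy to quantify: a radix-r Bruck exchange takes (r-1) * ceil(log_r P) steps, recovering log2(P) steps at r=2 and the spread-out algorithm's P-1 steps at r=P. A minimal sketch (function names are ours, not the paper's):

```python
def ceil_log(P, r):
    # smallest k with r**k >= P (integer arithmetic avoids float-log issues)
    k, x = 0, 1
    while x < P:
        x *= r
        k += 1
    return k

def bruck_steps(P, r):
    """Communication steps of a radix-r Bruck all-to-all over P processes."""
    return (r - 1) * ceil_log(P, r)

# Sweep the radix for P = 1024 to expose the tunable middle ground
# between the two extremes mentioned in the abstract.
P = 1024
for r in (2, 4, 32, P):
    print(r, bruck_steps(P, r))
```

At r=2 this gives 10 steps, at r=P it gives 1023, and intermediate radixes trade step count against per-step data volume.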
Paper
Recorded
Quantum Computing
Resource Management and Scheduling
System Software
TP
DescriptionPython’s ease of use and rich collection of numeric libraries make it an excellent choice for rapidly developing scientific applications. However, composing these libraries to take advantage of complex heterogeneous nodes is still difficult. To simplify writing multi-device code, we created Parla, a heterogeneous task-based programming framework that fully supports Python’s scientific programming stack. Parla’s API is based on Python decorators and allows users to wrap code in Parla tasks for parallel execution. Parla arrays enable automatic movement of data between devices. The Parla runtime handles resource-aware mapping, scheduling, and execution of tasks. Compared to other Python tasking systems, Parla is unique in its parallelization of tasks within a single process, its GPU context and resource-aware runtime, and its design around gradual adoption to provide easy migration of and integration into existing Python applications. We show that Parla can achieve performance competitive with hand-optimized code while improving ease of development.
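As a rough illustration of decorator-based tasking in pure Python, consider the following toy. It is not Parla's actual API: Parla additionally provides data movement via Parla arrays, resource-aware mapping, and GPU-context handling, none of which this sketch attempts.

```python
import threading
import queue

tasks = queue.Queue()

def task(fn):
    """Toy decorator: calling the wrapped function enqueues it for
    background execution instead of running it immediately."""
    def spawn(*args):
        tasks.put((fn, args))
    return spawn

def run_all(workers=4):
    """Drain the task queue with a pool of threads within one process."""
    def worker():
        while True:
            try:
                fn, args = tasks.get_nowait()
            except queue.Empty:
                return
            fn(*args)
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

results = []

@task
def axpy(a, x, y):
    results.append([a * xi + yi for xi, yi in zip(x, y)])

axpy(2.0, [1, 2], [3, 4])   # enqueues the task
run_all()
print(results)              # [[5.0, 8.0]]
```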
Workshop
Recorded
W
DescriptionFrequently occurring code and design patterns in scientific applications are often used for parallelizing serial code. However, identifying these patterns is difficult. We propose using Graph Neural Networks to model code flow graphs and identify patterns in such parallel code. Identifying the best runtime parameters for parallel code is also challenging, so we propose a pattern-guided, deep-learning-based tuning approach to identify the best runtime parameters for OpenMP loops. We validate our hypothesis on 20 different applications from the Polybench and STREAM benchmark suites. Our approach identifies patterns with an accuracy of 91%. We validate the usefulness of patterns for auto-tuning by tuning the number of threads, scheduling policies, and chunk size on a single-socket system, and the thread count and affinity on a multi-socket machine. We achieve geometric-mean speedups of 1.1X and 4.7X respectively over default OpenMP configurations, compared to brute-force speedups of 1.27X and 4.93X respectively.
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
DescriptionMitigating climate change while meeting the nation's transportation and power-generation needs is important to energy and environmental security. The shift to hydrogen as a clean energy carrier is one of the most promising strategies to reduce CO2 emissions in the face of increasing energy demand. While hydrogen has drawbacks as an energy carrier due to its low energy density, ammonia is simpler to transport and store for extended periods of time, making it an attractive carbon-free energy carrier for off-grid localized power generation and marine shipping. However, ammonia has poor reactivity and forms NOx and N2O emissions. Its poor reactivity can be circumvented by partial cracking of ammonia to form ammonia/hydrogen/nitrogen blends tailored to match conventional hydrocarbon fuel properties. However, combustion of ammonia/hydrogen/nitrogen blends at high pressure, and in particular the coupling between turbulence and fast hydrogen diffusion, remains poorly understood. Pre-exascale computing provides a unique opportunity for direct numerical simulation (DNS) of turbulent combustion with ammonia/hydrogen blends to investigate pressure effects on combustion rate, blow-off limits, and chemical pathways for NOx and N2O formation.
Exascale computing introduces challenges for data management and the need for reduced-order surrogate models (ROMs) for chemical-species dimension reduction, as well as for novel in situ analysis and visualization methods. A novel model-driven, on-the-fly ROM recently formulated and implemented in reactive-flow DNS to reduce the computational cost of chemistry will be described. Recent advances in topological segmentation, feature extraction, and statistical summarization for extreme-scale data will be discussed in the context of in situ analysis workflows that capture salient time-varying features.
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
DescriptionDifferent aspects of the workshop and other questions from the moderator and audience will be discussed in the panel.
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
Workshop
Recorded
W
DescriptionSplinters is a distributed system for sampling IO metadata in Google data centers. It has been deployed in production for several years and is the main engine for the analysis of storage systems and workloads at Google. Given the scale of the storage infrastructure, reliably collecting and processing the IO samples is a complex problem, and we explain how we design around the various challenges. We show how the collected IO samples are used for ad hoc queries and longitudinal analysis. We also outline several applications where we used the IO samples for the design and implementation of new systems.
Posters
Research Posters
TP
XO/EX
DescriptionData transformation tasks - such as encoding, decoding, parsing, and conversion between common data formats - are at the core of many data analytics, data processing, and scientific applications. This has led to the development of custom software libraries and hardware implementations targeting popular data transformations. By accelerating specific transformations, however, these solutions suffer from a lack of generality. On the other hand, a generic and programmable data processing engine might support a wide range of data transformations, but at the cost of reduced performance compared to custom, algorithm-specific solutions.
In this work, we aim to bridge this gap between generality and performance. To this end, we provide a compilation framework that transparently converts data transformation tasks expressed using pushdown transducers into efficient GPU code.
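To make the pushdown-transducer abstraction concrete, here is a toy single-state transducer that tags each payload symbol with its bracket-nesting depth. The paper's framework compiles such machines into efficient GPU code; this sketch only illustrates the machine model (stack plus output tape) and is entirely our own illustration.

```python
def depth_tagger(inp):
    """A toy pushdown transducer: one control state, a stack of '['
    symbols, and an output tape that tags every payload character
    with its current bracket-nesting depth."""
    stack, out = [], []
    for ch in inp:
        if ch == '[':
            stack.append(ch)              # push: enter a nesting level
        elif ch == ']':
            if not stack:
                raise ValueError("unbalanced input")
            stack.pop()                   # pop: leave a nesting level
        else:
            out.append((ch, len(stack)))  # emit (symbol, depth)
    if stack:
        raise ValueError("unbalanced input")
    return out

print(depth_tagger("a[b[c]d]e"))
# [('a', 0), ('b', 1), ('c', 2), ('d', 1), ('e', 0)]
```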
Workshop
Recorded
W
DescriptionSoftware, together with its design, development, and engineering, has become one of the cornerstones of computational science research. There is strong demand for scientists and engineers who can write well-designed, sustainable, and reproducible software. This is at odds with the traditional way of writing code just to explore and showcase ideas, mainly for scientific papers.
To provide students with practical experience in all aspects of research software engineering (RSE), we have adopted a peer-review-based assignment approach that helps students focus on writing efficient parallel algorithms by providing a build, test, and continuous-integration framework. In addition, students are required to submit a pull/merge request to the central repository. We have observed that this workflow improves the students' programming skills and introduces them to RSE practices while teaching them to program parallel numerical algorithms on high-performance machines.
Workshop
Recorded
W
DescriptionAccurate prediction of fluid flows remains an important field of research and engineering. To this end, computational fluid dynamics (CFD) is widely employed. Due to their high demands on computational resources, CFD applications profit from HPC systems. Continuous performance analysis and optimization is key to efficient utilization of HPC resources. This paper demonstrates the beneficial cooperation between developers of HPC software and performance tools in the context of the CFD solver CODA and the sparse linear system solver Spliss. We investigate the concepts used by CODA/Spliss to achieve high scalability, evaluate their effectiveness with performance analysis tools, and illustrate the benefits obtained from a close collaboration between HPC application and tool developers. Tools support developers in analyzing and tuning their applications; developer feedback and requests, in turn, inspire tool enhancements. We highlight these aspects with extended support for non-blocking collectives in performance tools and emphasize the need for sophisticated tool support of multi-threaded MPI applications.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionHardware performance counters provide detailed insight into the performance of applications running on modern systems, but they can be challenging to use without detailed knowledge of the computational and counter architectures. Our work addresses this challenge by identifying metrics that are common to many micro-architectures and can be directly related to the algorithms in question. These metrics, some long used and some being presented for the first time, are carefully designed to be easy to follow, informative, and portable to multiple systems. In this paper, we discuss the background of empirical performance analysis, describe our set of metrics, and demonstrate analysis on example benchmarks and mini-applications. The metrics and examples are presented on both an Intel Xeon Cascade Lake and an ARM-based Fujitsu A64FX. The significant differences in the ISAs, caches and hardware counters between these two systems demonstrate the portability of the proposed metrics.
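The spirit of such architecture-neutral derived metrics can be illustrated with a small sketch. The event names below are generic placeholders of our own choosing, not the vendor counter names or the specific metric set used in the paper:

```python
def derived_metrics(counters):
    """Turn raw event counts into metrics that are comparable across
    micro-architectures: instructions per cycle, L1 miss ratio, and
    DRAM traffic per instruction."""
    c = counters
    return {
        "ipc": c["instructions"] / c["cycles"],
        "l1_miss_ratio": c["l1_misses"] / c["l1_accesses"],
        "dram_bytes_per_instr": c["dram_bytes"] / c["instructions"],
    }

# Hypothetical counts from one kernel run.
sample = {"instructions": 8.0e9, "cycles": 4.0e9,
          "l1_misses": 2.0e7, "l1_accesses": 1.0e9,
          "dram_bytes": 1.6e9}
print(derived_metrics(sample))
# ipc = 2.0, l1_miss_ratio = 0.02, dram_bytes_per_instr = 0.2
```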
Workshop
Recorded
W
DescriptionHigh-performance object stores are an emerging technology offering an alternative solution in the field of HPC storage, with the potential to address long-standing scalability issues in traditional distributed POSIX file systems caused by excessive consistency assurance and metadata prescriptiveness.
In this presentation, we assess the performance of storing object-like data within a standard file system, where the configuration and access mechanisms have not been optimized for object access behavior, and compare it with an object storage system to investigate the latter's benefits. While this approach does not exploit the file system in a standard way, it allows us to investigate whether the performance of the underlying storage technology matters more or less than the software interface and infrastructure that a file system or object store provides.
Workshop
Recorded
W
DescriptionState-of-the-art multiphysics simulations running on large-scale leadership computing platforms have many variables contributing to their performance and scaling behavior. We recently encountered an interesting performance anomaly in Flash-X, a multiphysics, multicomponent simulation software, when characterizing its performance behavior on several large-scale platforms. The anomaly was tracked down to the interaction between the use of dynamically allocated scratch data and data locality in the cache hierarchy. In this paper, we present the details of the unexpected performance variability we encountered; the extensive analysis, using the performance measurement tool TAU to collect the data and Python data-analysis libraries to explore it; and our insights from this experience. In the process of performance tuning, we discovered and removed or mitigated two additional performance-limiting bottlenecks.
Posters
Research Posters
TP
XO/EX
DescriptionIn this work, we evaluate the performance of unroll and tile, two loop transformations introduced in OpenMP 5.1 with early implementations in Clang 13 for GPUs. Experiments on a common seismic computational kernel demonstrate performance gains on three GPU architectures.
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
DescriptionThe Nyx cosmology code is used to simulate the formation of large-scale structure in Lyα simulations of the universe. As the dark matter particles begin clustering, the cost of a single time step grows due to load imbalance. These highly clustered regions can also prevent the entire problem from fitting in GPU HBM. If the entire problem does not fit in HBM, Nyx must use managed memory, and the cost of each time step becomes dependent on the host-device memory bandwidth. This also imposes dynamic restrictions on the best domain decomposition for other physics components, such as the heating-cooling and the hydrodynamics solves.
In this talk, we will focus on different performance characteristics of Nyx when constrained by load imbalance and GPU memory capacity, as well as Nyx's current approach to optimizing for them.
Birds of a Feather
TP
XO/EX
DescriptionWith increasing heterogeneity in system deployments (CPUs, GPGPUs, AI accelerators, FPGAs, IPU/DPUs), HPC users face a daunting task of programming for such diverse architectures. This BoF, organized by the IXPUG, but not limited to Intel technology, will focus on sharing expertise in portable programming across a wide variety of architectures, running a diverse set of workloads. This BoF will explore current approaches and best practices for programming across heterogeneous systems and exotic architectures, with the goal of identifying a common set of principles and practices that can be leveraged to develop and maintain software across sites, architectures, and applications.
Workshop
Recorded
Performance Portability
W
DescriptionThe emergence of multiple accelerator-based computer architectures and programming models makes it challenging to achieve performance portability for large-scale scientific simulation software. In this paper, we focus on a sparse block-diagonal matrix multiple-vector multiplication (SpMM) computational kernel and discuss techniques that can be used to achieve performance portability on NVIDIA- and AMD-based accelerators using CUDA, HIP, OpenACC, and Kokkos. We show that performance portability can vary significantly, by up to 52x in the explored problems, across programming models, GPU architectures, and problem settings. Our study uses the performance-portability aggregation metric to guide the development and the selection of performance-portable variants.
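The aggregation metric referred to here is commonly computed, following Pennycook et al., as the harmonic mean of per-platform efficiencies, dropping to zero if the code fails to run anywhere. A minimal sketch, assuming that formulation:

```python
def perf_portability(efficiencies):
    """Harmonic-mean performance-portability metric over a platform set:
    |H| / sum(1/e_i). Returns 0.0 if the code runs on no platform or
    fails (efficiency 0) on any of them."""
    if not efficiencies or any(e == 0 for e in efficiencies):
        return 0.0
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)

# Hypothetical achieved fractions of best-known performance on three GPUs.
print(perf_portability([0.9, 0.5, 0.75]))   # harmonic mean, 0.675
print(perf_portability([0.9, 0.0, 0.75]))   # one failure -> 0.0
```

The harmonic mean deliberately punishes a single poor platform more than an arithmetic mean would, which is why it is favored for portability studies.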
Workshop
Recorded
Performance Portability
W
DescriptionThis paper presents a performance-portable implementation of a kinetic plasma simulation code with C++ parallel algorithms that runs across multiple CPUs and GPUs. Relying on the language-standard parallelism stdpar and the proposed language-standard multi-dimensional array support mdspan, we demonstrate that a performance-portable implementation is possible without harming readability or productivity. For a mini-application, we obtain good overall performance, within 20% of the Kokkos version, on Intel Ice Lake and on NVIDIA V100 and A100 GPUs. Our conclusion is that stdpar can be a good candidate for developing performance-portable and productive code targeting exascale-era platforms, assuming this approach becomes available on AMD and/or Intel GPUs in the future.
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionIncreasing workload fidelity and achieving faster time to solution have required the deployment of the world's first exascale systems. However, the scale of these systems presents programming challenges due to greatly increased parallelism and heterogeneity. This talk details early performance results at scale on systems such as Frontier, using a variety of techniques across HPC and machine learning, such as MPI and RCCL. We conclude with a discussion of the significance and impact of programming models on applications with GPUs/accelerators in the post-exascale era.
Tutorial
Recorded
Accelerator-based Architectures
Benchmarking
Heterogeneous Systems
Performance
Software Engineering
TUT
DescriptionThe Roofline performance model offers an insightful and intuitive method for extracting the key execution characteristics of HPC applications and comparing them against the performance bounds of modern CPUs and GPUs. Its ability to abstract the complexity of memory hierarchies and identify the most profitable optimization techniques has made Roofline-based analysis increasingly popular in the HPC community. This tutorial exposes the fundamental aspects behind different Roofline modeling principles on both CPUs and GPUs and provides several use cases that highlight their efficacy for application optimization. It presents a unique combination of instruction on Roofline by its creator, hands-on experience using Roofline within Intel's, NVIDIA's, and AMD's performance tools, and discussions of Roofline use cases at the ALCF, NERSC, and OLCF computing centers. The presenters have a long history of collaborating on the Roofline model and have presented several Roofline-based tutorials.
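The model itself reduces to one line: attainable performance is the minimum of the compute peak and arithmetic intensity times memory bandwidth. A sketch with hypothetical machine numbers:

```python
def roofline(ai, peak_flops, peak_bw):
    """Attainable performance (FLOP/s) under the Roofline model:
    min(compute peak, arithmetic intensity * memory bandwidth)."""
    return min(peak_flops, ai * peak_bw)

# Hypothetical machine: 10 TFLOP/s peak compute, 1 TB/s memory bandwidth.
peak, bw = 10e12, 1e12
for ai in (1, 5, 10, 50):            # FLOPs per byte moved
    print(ai, roofline(ai, peak, bw))
# the ridge point sits at ai = peak/bw = 10 FLOPs/byte: below it a
# kernel is memory-bound, above it compute-bound
```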
Workshop
Recorded
W
DescriptionConcurrent programming is used in all large and complex computer systems. However, concurrency errors and system failures (e.g., crashes and deadlocks) are common. We find that Petri nets can be used to model concurrent systems and to find and remove errors ahead of time. We introduce a novel generalization of Petri nets with nondeterministic transition nodes to match real systems. These allow for a compact way to construct, optimize, and prove computer programs at the concurrency level. Petri net programs can also be optimized by automatically solving for maximal concurrency, where the maximum number of valid threads is determined by the structure of the Petri net prior to execution. We discuss an algorithm to compute the state graph of a given Petri net and start state. We introduce our open-source software framework, which implements this theory as general-purpose, concurrency-focused middleware.
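A state-graph computation of the kind mentioned can be sketched as a breadth-first search over reachable markings. This toy handles only ordinary Petri nets, not the paper's nondeterministic transition nodes:

```python
from collections import deque

def state_graph(initial, transitions):
    """BFS over the reachable markings of a Petri net. A marking maps
    place -> token count; a transition (consume, produce) fires when
    every input place holds enough tokens."""
    def normalize(m):
        # hashable marking; drop empty places so equal states compare equal
        return frozenset((p, n) for p, n in m.items() if n)

    def fire(marking, cons, prod):
        m = dict(marking)
        for p, n in cons.items():
            m[p] = m.get(p, 0) - n
        for p, n in prod.items():
            m[p] = m.get(p, 0) + n
        return normalize(m)

    start = normalize(initial)
    seen, edges, todo = {start}, [], deque([start])
    while todo:
        cur = todo.popleft()
        mk = dict(cur)
        for i, (cons, prod) in enumerate(transitions):
            if all(mk.get(p, 0) >= n for p, n in cons.items()):
                nxt = fire(mk, cons, prod)
                edges.append((cur, i, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return seen, edges

# Two workers competing for one lock: a classic mutual-exclusion net.
init = {"idle": 2, "lock": 1}
ts = [({"idle": 1, "lock": 1}, {"busy": 1}),   # acquire
      ({"busy": 1}, {"idle": 1, "lock": 1})]   # release
states, edges = state_graph(init, ts)
print(len(states), len(edges))   # 2 reachable markings, 2 edges
```

Deadlocks show up in such a graph as markings with no enabled transition, which is what makes the state graph useful for finding errors ahead of time.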
Birds of a Feather
TP
XO/EX
DescriptionHigh-fidelity simulations are increasingly important in the design of complex systems. However, the computational cost of such models hinders their use for design space exploration, optimization, and uncertainty quantification. Alternative approaches, such as projection-based methods, often exhibit limited accuracy and call for collecting simulations at several data points, which is expensive in the first place. Recently, however, research institutions and industry have been collaborating to develop physics-informed neural network frameworks for simulations. This BoF seeks input from the machine learning and HPC communities as well as open participation in the development of useful tools to meet their needs.
Workshop
PhySRNet: Physics Informed Super-Resolution Network for Application in Computational Solid Mechanics
Recorded
W
Workshop
Recorded
Performance Portability
W
DescriptionOpenMP offload reduces the application-development complexity of HPC GPU codes and provides portability. A source of poor performance is the lockstep execution of data transfers and computation; overlapping these operations can provide significant performance gains. However, the developer must manually slice data transfers and kernel execution and efficiently schedule these operations, which is a hard and error-prone task.
We propose Piper, an automatic mechanism for OpenMP offload to perform overlapping. Piper statically analyzes offload kernels and associates computations with memory locations. The extended runtime system exploits this analysis information, divides a kernel into independent sub-tasks, and schedules them for pipelined execution for overlapping. At any point in time, Piper also controls the coarseness and number of sub-tasks executed. By doing so, Piper allows the execution of kernels whose memory requirements exceed the GPU device memory. Piper speeds up execution up to 2.67× compared to OpenMP-offload execution.
Students@SC
DescriptionEver find yourself tongue-tied at a conference when you bump into that researcher or engineer whom you have been wanting to meet and collaborate with? Ever go to introduce yourself at a job interview and find yourself opening with “Uhm uhh . . . “. Well, here is the opportunity for you! The “PitchIT” workshop provides guidance on how to develop a short speech that can be used to show who you are, present your ideas, break the ice, and make a quick connection. During the workshop you will work with professionals in the field and fellow student volunteers to develop, fine tune, and practice your personal professional sales pitch so that it becomes so easy to give, that you’ll be ready to win an opportunity with it the next time you find yourself riding on an elevator or bus with a possible collaborator.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionThe PMBS22 workshop is concerned with the comparison of high-performance computing systems through performance modeling, benchmarking, or the use of tools such as simulators. We are particularly interested in research that reports the ability to measure and make tradeoffs in software/hardware co-design to improve sustained application performance. We are also keen to capture assessments of future systems.
The aim of this workshop is to bring together researchers, from industry and academia, concerned with the qualitative and quantitative evaluation and modeling of high-performance computing systems. Authors are invited to submit novel research in all areas of performance modeling, benchmarking and simulation, and we welcome research that brings together current theory and practice. We recognize that the term 'performance' has broadened to include power consumption and reliability, and that performance modeling is practiced through analytical methods and approaches based on software tools and simulators.
Workshop
Recorded
W
DescriptionEfficient use of energy is essential for today's supercomputing systems, as energy cost is generally a major component of their operational cost. Research into "green computing" is needed to reduce the environmental impact of running these systems. As such, several scientific communities are evaluating the trade-off between time-to-solution and energy-to-solution. While the runtime of an application is typically easy to measure, its energy consumption is not. Therefore, we present the Power Measurement Toolkit (PMT), a high-level software library capable of collecting power-consumption measurements on various hardware. The library provides a standard interface to easily measure the energy use of devices such as CPUs and GPUs in critical application sections.
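Energy-to-solution is typically obtained by integrating sampled power over the runtime of the measured section. A generic sketch of that arithmetic (our own illustration, not PMT's actual interface):

```python
def energy_to_solution(samples):
    """Integrate (timestamp_s, watts) power samples into joules with
    the trapezoidal rule -- the kind of energy-to-solution figure a
    power-measurement library reports for a critical section."""
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        joules += 0.5 * (w0 + w1) * (t1 - t0)
    return joules

# A device drawing a steady 100 W for 2 seconds consumes 200 J.
print(energy_to_solution([(0.0, 100.0), (1.0, 100.0), (2.0, 100.0)]))
```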
Paper
Recorded
Architectures
Networks
TP
DescriptionIn this paper, we present PolarFly, a diameter-2 network topology based on the Erdős-Rényi family of polarity graphs from finite geometry. This is the first known diameter-2 topology that asymptotically reaches the Moore bound on the number of nodes for a given network degree and diameter.
PolarFly achieves high Moore-bound efficiency even for the moderate radixes commonly seen in current and near-future routers, reaching more than 96% of the theoretical peak. It also offers more feasible router degrees than state-of-the-art solutions, greatly adding to the selection of scalable diameter-2 networks. PolarFly enjoys many other topological properties that are highly relevant in practice, such as a modular design and expandability that allow incremental growth in network size without rewiring the whole network. Our evaluation shows that PolarFly outperforms competitive networks in terms of scalability, cost, and performance for various traffic patterns.
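The Moore-bound arithmetic behind the 96% figure can be checked directly: a diameter-2 graph of degree d has at most d^2 + 1 vertices, while the Erdős-Rényi polarity graph underlying PolarFly has q^2 + q + 1 vertices of maximum degree q + 1. A sketch, assuming that construction (see the paper for the exact claims):

```python
def moore_bound_d2(d):
    """Maximum vertices of a diameter-2 graph with degree d: 1 + d + d(d-1)."""
    return d * d + 1

def polarity_graph_efficiency(q):
    """Moore-bound efficiency of the Erdős-Rényi polarity graph ER_q:
    q^2 + q + 1 vertices against the bound for degree q + 1."""
    nodes = q * q + q + 1
    return nodes / moore_bound_d2(q + 1)

for q in (7, 31, 53):
    print(q, round(polarity_graph_efficiency(q), 4))
```

For q = 31, for example, this gives 993 nodes against a bound of 1025, i.e. roughly 97%, consistent with the "more than 96%" claim.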
Workshop
Recorded
Performance Portability
W
DescriptionThe SLATE project is implementing a distributed dense linear algebra library for highly-scalable distributed-memory accelerator-based computer systems. The goal is to provide a library that can easily be ported to different hardware (CPUs, GPUs, accelerators) and will provide high performance for machines into the future. Current ports include CPUs, CUDA, ROCm, and oneAPI. We achieve both performance and portability by leveraging several layers and abstractions, including OpenMP tasks to track data dependencies, MPI for distributed communication, and the BLAS++ and LAPACK++ libraries developed as a portable layer across vendor-optimized CPU and GPU BLAS and LAPACK functionality. We rely on the C++ standard library and templating to reduce code duplication for better maintainability. The few kernels not present in BLAS are implemented in CUDA, HIP, and OpenMP target offload, and are easily ported to new platforms.
Workshop
Recorded
Runtime Systems
System Software
W
DescriptionHardware design in high-performance computing (HPC) is often highly experimental. Exploring new designs is difficult and time-consuming, requiring lengthy vendor cooperation. RISC-V is an open-source processor ISA that improves the accessibility of chip design, including the ability to do hardware/software co-design using open-source hardware and tools. Conventional operating systems like Linux are massively complex and modification is time-prohibitive. In this paper, we describe our port of the Kitten lightweight kernel operating system to RISC-V in order to provide an alternative to Linux for conducting co-design research. Kitten’s small code base and simple resource management policies are well matched for quickly exploring new hardware ideas that may require radical operating system modifications and restructuring. Our evaluation shows that Kitten on RISC-V is functional and provides similar performance to Linux for single-core benchmarks. This provides a solid foundation for using Kitten in future co-design research involving RISC-V.
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionRecent trends have led to an increased reliance on more diverse and heterogeneous device technologies to continue performance scaling. As a result, many supercomputers now include memory systems with multiple types of memory storage, each with different power, performance, and capacity characteristics. The community urgently needs new strategies to adapt mission-critical applications to such complex memory architectures. In this talk, we will describe a quantitative approach that leverages lightweight application monitoring to derive and enforce effective runtime management for complex memory platforms, without requiring any developer effort or even recompilation of target programs. Additionally, we will present an evaluation showing that our approach can enable substantial performance benefits for a variety of memory-intensive applications on real and complex memory hardware.
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionThis work examines the challenges and opportunities of using Machine Learning (ML) for Monitoring and Operational Data Analytics (MODA) in the context of Quantitative Codesign of Supercomputers (QCS). MODA is employed to gain insights into the behavior of current High Performance Computing (HPC) systems to improve system efficiency, performance, and reliability (e.g. through optimizing cooling infrastructure, job scheduling, and application parameter tuning). In this work, we take the position that QCS in general, and MODA in particular, require close exchange with the ML community to realize the full potential of data-driven analysis for the benefit of existing and future HPC systems. This exchange will facilitate identifying the appropriate ML methods to gain insights into current HPC systems and to go beyond expert-based knowledge and rules of thumb.
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionThe goal of building HPC systems is to enable execution of large-scale user application workflows in a performant manner. This is a multi-dimensional problem that includes not just a particular application's or workflow's time-to-solution, but also the aggregate throughput of all applications submitted (the workload) and the energy spent on their execution. Beyond minimizing the overall energy used, the aggregate HPC system power draw must always remain within a contracted envelope. While in practice one often tunes an individual application's performance, we instead need to optimize the efficiency of the overall HPC ecosystem. Doing so requires optimizing the utilization of all resources and the overall performance of the workload while honoring constraints such as power and priority.
Paper
Recorded
Post-Moore Computing
Quantum Computing
TP
DescriptionQuantum-assisted sampling is a promising technique for training probabilistic ML models, which otherwise depend on slow-mixing classical sampling methods; for example, Quantum Annealing Processors (QAPs) can be used to train Boltzmann Machines (BMs). Previous work has shown that QAPs can sample from a Boltzmann distribution, albeit at an unknown, instance-dependent temperature. Due to this distribution divergence, existing training algorithms have resorted to negative-phase temperature scaling.
This method, although effective after arduous tuning, introduces unwanted noise into the sample set due to quantization errors caused by under-utilization of the QAP bias ranges, and it is prone to bias overflow. We introduce a change in the training algorithm that allows positive-phase temperature scaling, an approach that reduces the impact of quantization noise while still incorporating temperature scaling. As a result, we see an overall improvement in the convergence rate and testing accuracy compared to the state-of-the-art approach.
Posters
Research Posters
TP
XO/EX
DescriptionAppropriately adjusting the power draw of computational hardware plays a crucial role in its efficient use. While vendors have already implemented hardware-controlled power management, additional energy savings are available, depending on the state of the machine. We propose the online classification of such states based on computationally informed machine learning algorithms to adjust the power cap of the next time step. This research highlights that the overall energy consumption can be reduced significantly, often without a prohibitive penalty in the runtime of the applications.
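The classification-then-cap loop described above can be sketched as follows. This is an illustrative sketch only: the state labels, counter thresholds, and cap fractions are assumptions chosen for exposition, not the trained models or values from this work.

```python
# Hypothetical sketch: pick the next interval's power cap from a simple
# classification of the machine's state using monitored counters.
# Thresholds and cap fractions below are illustrative assumptions.

def classify_state(ipc, mem_bw_util):
    """Label the current interval from two monitored counters."""
    if mem_bw_util > 0.6:
        return "memory_bound"   # cores stall on memory; a lower cap costs little runtime
    if ipc > 1.0:
        return "compute_bound"  # keep the full power budget
    return "mixed"

def next_power_cap(ipc, mem_bw_util, tdp_watts=250):
    """Map the classified state to a power cap for the next time step."""
    caps = {"compute_bound": 1.0, "mixed": 0.85, "memory_bound": 0.7}
    return tdp_watts * caps[classify_state(ipc, mem_bw_util)]

print(next_power_cap(ipc=0.4, mem_bw_util=0.8))  # memory-bound interval gets a lower cap
```

A real system would replace the fixed thresholds with a trained online classifier and feed the chosen cap to the hardware's power-management interface.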
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionWhen transmitting image data from a deployed edge device, a high-bandwidth connection to a cloud system cannot be guaranteed. An early-warning system for an intersection crosswalk, for instance, would have to be able to compress and transmit data with enough quality to ensure prompt detection of danger through remote image processing. Adaptive lossy compression provides a potential solution, but it has yet to be evaluated on actual edge hardware. We attempt to evaluate the viability of this method under realistic conditions by separating the compression and detection pipelines between client and server processes, improving compression ratios by up to 4.95% via a unified lossless stage, demonstrating compression performance on an Arm-powered edge device, and benchmarking network performance under a range of realistic bandwidth conditions. This poster discusses our revised architecture and its performance, along with the relevance of our results to refining the method.
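The idea behind a unified lossless stage can be sketched with Python's zlib: instead of losslessly compressing each lossy-compressed chunk on its own, all chunks share one lossless pass (and thus one compression context). The chunk contents and parameters below are illustrative assumptions, not the poster's actual pipeline.

```python
import zlib

def unified_lossless_stage(chunks):
    """Concatenate already lossy-compressed chunks and apply one shared
    lossless pass, instead of compressing each chunk separately."""
    joined = b"".join(chunks)
    return zlib.compress(joined, level=9)

# stand-in for the output of a per-block lossy stage
chunks = [bytes([i % 7]) * 512 for i in range(8)]

separate = sum(len(zlib.compress(c, 9)) for c in chunks)  # per-chunk lossless
unified = len(unified_lossless_stage(chunks))             # shared lossless stage
# a single shared pass amortizes per-stream overhead and exploits
# redundancy across chunks, so `unified` is typically smaller
```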
Workshop
Recorded
W
DescriptionFederated learning (FL) is deemed a promising paradigm for privacy-preserving data analytics in collaborative scientific computing.
However, an effective and easy-to-use FL infrastructure for scientific computing and high-performance computing (HPC) environments is lacking. The objective of this presentation is two-fold. First, we identify three missing pieces of a scientific FL infrastructure:
(i) a native MPI programming interface that can be seamlessly integrated into existing scientific applications,
(ii) an independent data layer for the FL system, such that users can pick the persistent medium of their choice, such as parallel file systems and multidimensional databases, and
(iii) efficient encryption protocols that are optimized for scientific workflows.
The second objective is to present a work-in-progress FL infrastructure, namely MPI-FL, which is implemented with PyTorch and MPI4py. We deploy MPI-FL on 1,000 CPU cores and evaluate it with four standard benchmarks: MNIST, Fashion-MNIST, CIFAR-10, and SVHN-extra.
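At the heart of each FL round is a federated-averaging step. As a hedged sketch of that step only: MPI-FL would perform this with PyTorch tensors and mpi4py collectives (for instance, a reduction over model weights, an assumption about its internals), whereas here plain lists simulate the per-worker models so the logic runs without an MPI launch.

```python
# Sketch of the federated-averaging (FedAvg) step of an FL round.
# Plain lists stand in for per-rank model parameters; a real MPI-based
# system would average with a collective across ranks instead.

def federated_average(local_models):
    """Average parameter vectors contributed by all workers."""
    n = len(local_models)
    return [sum(w) / n for w in zip(*local_models)]

# three simulated workers, each holding locally updated weights
workers = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
print(federated_average(workers))  # -> [1.0, 2.0, 3.0]
```

After averaging, the global model is broadcast back to every worker for the next round of local training.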
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionParallel programming for extreme scale computing is hard. Couple that with heterogeneous processors across the system and it becomes even harder. Add to the mix that modern programmers are not being trained to understand how algorithms map onto the features of hardware, and it becomes harder still. Throw in that software outlives hardware so a single codebase must work across a wide range of different systems, and we arrive at programming challenges at an extreme scale. In this talk we will propose pragmatic solutions to these challenges; solutions that will support high programmer productivity to generate codebases that are performant and portable.
Posters
Research Posters
TP
XO/EX
DescriptionApplications can experience significant performance differences when run on different architectures. For example, GPUs are often utilized to accelerate an application over its CPU implementation. Understanding how performance changes across platforms is vital to the design of hardware, systems software, and performance-critical applications. However, modeling the relationship between systems and performance is difficult, as run time data needs to be collected on each platform. In this poster, we present a methodology for predicting the relative performance of an application across multiple systems using profiled performance counters and deep learning.
Paper
Recorded
Data Analytics
Performance
TP
DescriptionCaching techniques are widely used in the cloud computing era, from applications such as Web caches, to infrastructure such as Memcached, to memory caches in computer architectures. Prediction of future data accesses can greatly help improve cache management and hit rates. Recent advances in deep learning enable the design of novel intelligent cache replacement policies.
In this work, we propose a learning-aided approach to predict future data accesses. We find that a powerful LSTM-based recurrent neural network can provide high prediction accuracy based on only a cache trace as input. The high accuracy results from a carefully crafted locality-driven feature design. Inspired by the high prediction accuracy, we propose a pseudo-OPT policy and evaluate it on 13 real-world storage workloads from Microsoft Cloud. Results demonstrate that our new policy improves on the state-of-the-art by up to 19.2% and incurs only a 2.3% higher miss ratio than OPT on average.
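A pseudo-OPT policy replaces OPT's oracle with the model's predicted future accesses. As a reference point, Belady's OPT itself can be sketched as follows, with the (here, exact) future trace standing in for the LSTM's predictions:

```python
def opt_evict(cache, future):
    """Belady's OPT: evict the cached item whose next use is farthest away."""
    def next_use(item):
        try:
            return future.index(item)
        except ValueError:
            return float("inf")   # never used again: a perfect eviction victim
    return max(cache, key=next_use)

def simulate(trace, capacity):
    """Count hits of OPT over a trace with a fixed-capacity cache."""
    cache, hits = set(), 0
    for i, item in enumerate(trace):
        if item in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            cache.discard(opt_evict(cache, trace[i + 1:]))
        cache.add(item)
    return hits

trace = ["a", "b", "c", "a", "b", "d", "a", "b"]
print(simulate(trace, capacity=2))  # -> 2
```

In a pseudo-OPT policy, `future` would come from the learned predictor rather than the real trace, so its hit rate degrades gracefully with prediction accuracy.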
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Recorded
TP
DescriptionThe dCache installation is a storage management system that acts as a disk cache for high-energy physics (HEP) data. Storage space on dCache is limited relative to persistent storage devices; therefore, a heuristic is needed to determine what data should be kept in the cache. A good cache policy would keep frequently accessed data in the cache, but this requires knowledge of future dataset popularity. We present methods for forecasting the number of times a dataset stored on dCache will be accessed in the future. We present a deep neural network that can predict future dataset accesses accurately, reporting a final normalized loss of 4.6e-8. We also present a set of algorithms that can forecast future dataset accesses given an access sequence. Included are two novel algorithms, Backup Predictor and Last N Successors, which outperform other file prediction algorithms. Our findings suggest that it is possible to anticipate dataset popularity in advance.
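To give a feel for successor-based access prediction, here is a hedged sketch in the spirit of "Last N Successors": for each dataset, remember the last N datasets observed to follow it and predict the most frequent one. This is one plausible reading for illustration; the poster's exact algorithm may differ.

```python
from collections import deque, Counter

class LastNSuccessors:
    """Illustrative successor-based predictor (not the poster's exact algorithm):
    track the last N successors of each dataset; predict the most common one."""

    def __init__(self, n=3):
        self.n = n
        self.successors = {}   # dataset -> deque of its recent successors
        self.prev = None

    def observe(self, dataset):
        if self.prev is not None:
            self.successors.setdefault(self.prev, deque(maxlen=self.n)).append(dataset)
        self.prev = dataset

    def predict(self, dataset):
        seen = self.successors.get(dataset)
        return Counter(seen).most_common(1)[0][0] if seen else None

p = LastNSuccessors(n=3)
for d in ["A", "B", "A", "B", "A", "C"]:
    p.observe(d)
print(p.predict("A"))  # "B" followed "A" most often in the recent window
```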
Workshop
Recorded
Quantum Computing
W
DescriptionWith the rapid advancement of quantum technologies, the integration between classical and quantum computing systems is an active area of research critical to future development. The coupling between these systems requires both to be as efficient as possible. One of the key elements for increasing efficiency on the quantum side is circuit optimization. The goal is to execute the circuit on the desired hardware in less time and with less complexity, thereby reducing the impact of noise on the quantum system. However, the optimization process is not guaranteed to generate improved results, yet it is always a computationally complex task that can create significant load on the classical computing side. To mitigate this problem, we propose a novel approach that predicts the optimizability of any circuit using a machine learning-based algorithm within the decision workflow, so that only the most suitable circuits are optimized, thereby increasing the efficiency of the optimization process itself.
Paper
Recorded
Big Data
Computational Science
TP
Best Paper Finalist
DescriptionImportant graph mining problems such as Clustering are computationally demanding. To significantly accelerate these problems, we propose ProbGraph: a graph representation that enables simple and fast approximate parallel graph mining with strong theoretical guarantees on work, depth, and result accuracy. The key idea is to represent sets of vertices using probabilistic set representations such as Bloom filters. These representations are much faster to process than the original vertex sets thanks to vectorizability and small size. We use these representations as building blocks in important parallel graph mining algorithms such as Clique Counting or Clustering. When enhanced with ProbGraph, these algorithms significantly outperform tuned parallel exact baselines (up to nearly 50x on 32 cores) while ensuring accuracy of more than 90% for many input graph datasets. Our novel bounds and algorithms based on probabilistic set representations with desirable statistical properties are of separate interest for the data analytics community.
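To illustrate the key idea, here is a hedged sketch of estimating a set intersection through a Bloom filter, the kind of probabilistic set representation described above. The filter size `m`, hash count `k`, and hashing scheme are illustrative choices, not ProbGraph's tuned parameters.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for illustration (parameters are not tuned)."""

    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _hashes(self, item):
        # k positions derived from salted SHA-256 digests (illustrative scheme)
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._hashes(item))

def approx_intersection(bf, items):
    """Estimate |A ∩ B| by probing B's elements against A's filter.
    False positives can only overestimate; there are no false negatives."""
    return sum(1 for x in items if x in bf)

a, b = set(range(0, 60)), set(range(40, 100))
bf = BloomFilter()
for x in a:
    bf.add(x)
print(approx_intersection(bf, b))   # close to the true overlap of 20
```

In a ProbGraph-style algorithm, such intersection estimates over per-vertex neighbor filters replace exact set intersections in kernels like triangle or clique counting.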
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
DescriptionWe present OpenPME (Open Particle-Mesh Environment), a Problem Solving Environment (PSE) which provides a Domain Specific Language (DSL) built atop a domain model general enough to write numerical simulations in scientific computing using particle-mesh abstractions. This helps to close the productivity gap in HPC applications and effectively lowers the programming barrier to enable the smooth implementation of scalable simulations. OpenPME programs are lowered to generate high-performance C++ code through a sequence of compiler model-to-model transformations. We also introduce a model-based autotuning approach for discretization methods in the OpenPME compiler. We evaluate the autotuner on two diffusion test cases, and the results show that we consistently find configurations that outperform those found by state-of-the-art general-purpose autotuners.
Paper
Recorded
Applications
Computational Science
Scientific Computing
TP
DescriptionEarth system models are developed with a tight coupling to target hardware, often containing specialized code predicated on processor characteristics. This coupling stems from using imperative languages that hard-code computation schedules and layout.
We present a detailed account of optimizing the Finite Volume Cubed-Sphere Dynamical Core (FV3), improving productivity and performance. By using a declarative Python-embedded stencil domain-specific language and data-centric optimization, we abstract hardware-specific details and define a semi-automated workflow for analyzing and optimizing weather and climate applications. The workflow utilizes both local and full-program optimization, as well as user-guided fine-tuning. To prune the intractably large global optimization space, we automatically exploit repeating code motifs via a novel transfer tuning approach. On the Piz Daint supercomputer, we scale to 2,400 GPUs, achieving speedups of up to 3.92x over the tuned production implementation at a fraction of the original code.
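For intuition, the kind of stencil kernel such a DSL expresses can be written by hand in NumPy. A declarative DSL version would state only the neighborhood arithmetic and leave schedule, layout, and target hardware to the framework; the NumPy form below is illustrative only and is not the paper's DSL.

```python
import numpy as np

def laplacian(field):
    """Hand-written 2D five-point Laplacian stencil over interior points."""
    out = np.zeros_like(field)
    out[1:-1, 1:-1] = (field[:-2, 1:-1] + field[2:, 1:-1]
                       + field[1:-1, :-2] + field[1:-1, 2:]
                       - 4.0 * field[1:-1, 1:-1])
    return out

# a field that is linear in its indices has zero discrete Laplacian
f = np.arange(25, dtype=float).reshape(5, 5)
print(laplacian(f)[2, 2])  # -> 0.0
```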
Workshop
Recorded
W
DescriptionWith recent improvements in silicon fabrication technology, reconfigurable devices can now be applied to accelerate functions beyond the traditional computing domains. In-network processing and smart computational storage are just two of these approaches. We discuss both simple and more complex application examples for both of these domains, covering a network-attached ML inference appliance, a JOIN accelerator for distributed databases, and also look forward to using a cache-coherent interconnect, such as CCIX or CXL, to tackle a complex database acceleration scenario linking a computational storage unit using near-data processing to a full-scale PostgreSQL database system. Beyond these hardware architectures, the talk also examines improvements in programming tools specialized for the realization of reconfigurable computing systems. Using the open-source TaPaSCo framework as an example, advanced features such as on-chip dynamic parallelism, flexibly customizable inter-processing element communications, and host/accelerator shared virtual memory with physical page migration capabilities are discussed.
Tutorial
Recorded
Accelerator-based Architectures
AI-HPC Convergence
Architectures
Benchmarking
Emerging Technologies
Heterogeneous Systems
Machine Learning and Artificial Intelligence
TUT
DescriptionScientific applications are increasingly adopting Artificial Intelligence (AI) techniques to advance science. Specialized hardware accelerators have been designed and built to run these AI applications efficiently. With the diverse hardware and software stacks of these systems, it is challenging to comprehend their capabilities, programming approaches, and performance. In this tutorial, we will present an overview of novel AI accelerators, namely SambaNova, Cerebras, Graphcore, Groq, and Habana, including presentations on the hardware and software features of each system. We present steps for programming these systems by porting deep learning models implemented in standard DL frameworks, refactoring code where needed, and compiling and running on the accelerator hardware. Next, we conduct a hands-on session on the SambaNova and Cerebras systems at the ALCF AI Testbed. The tutorial will provide attendees with an understanding of the key capabilities of emerging AI accelerators and their performance implications for scientific applications.
Tutorial
Recorded
Accelerator-based Architectures
Directive Based Programming
Parallel Programming Languages and Models
Productivity Tools
TUT
DescriptionOpenMP 1.0 was released in 1997 when the primary concern was symmetric multiprocessors. Over time, hardware has evolved with more complex memory hierarchies forcing us to embrace NUMA machines and work to understand how OpenMP fits in with distributed memory systems.
Current trends in hardware bring co-processors and accelerators such as GPUs into the fold. A modern platform is often a heterogeneous system with CPU cores, GPU cores, and other specialized accelerators. OpenMP has responded by adding directives that map code and data onto a device. We refer to this family of directives as the target directives.
In this hands-on tutorial, we will explore these directives as they apply to programming GPUs. We assume attendees already know the fundamentals of OpenMP (perhaps by taking the OpenMP Common Core tutorial) so we can focus on deeply understanding the target directives and their use in complex applications. We expect students to use their own laptops (with Windows, Linux, or macOS) to connect to remote servers with GPUs, but the best option is for students to load an OpenMP compiler onto their laptops before the tutorial. Information about OpenMP compilers is available at www.openmp.org.
Workshop
Recorded
Correctness
Software Engineering
W
DescriptionNumerical exceptions, which may be caused by operations like sqrt(-1) or convergence failures, are often unavoidable, in particular when software is used on unforeseen inputs. As more aspects of society become automated (self-driving cars, and cyber-physical systems more generally), it becomes increasingly important to design software that is resilient to exceptions and responds to them consistently. Consistency is needed to build higher-level resilient and consistent software. We explore the design space of consistent exception handling for the BLAS and LAPACK, pointing out many current examples of inconsistent exception handling, and propose a new design balancing consistency, complexity, ease of use, and performance. Some compromises are needed because of preexisting inconsistencies, including in vendor BLAS implementations, different programming languages, and compilers, and because user requests from our surveys are quite diverse. We also propose our design as a possible model for other numerical software, and we welcome comments on our design choices.
Workshop
Recorded
W
DescriptionBorn of necessity while developing complex multi-physics HPC simulations at Lawrence Livermore National Laboratory, Thicket is the user-facing tool in a suite of performance analysis tools. HPC users run codes on many architectures (different CPUs, different GPUs, and we expect more architectural heterogeneity in the future) and collect metadata using Adiak as well as performance data using LLNL's Caliper and other measurement tools. Our users needed a programmatic way to analyze the data from these experiments. In this talk, we describe Thicket, the multi-dimensional performance data analysis tool, and showcase examples and use cases. We also describe the multi-year process of getting the buy-in of million-line code developers to integrate our performance analysis tool suite, and the leaps in performance engineering that were made as a result.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionFor decades, networks have been constructed with purpose-built devices designed for individual use cases. Every network role has had its own unique requirements, so organizational structures were created around these deployments, with each team solving for its specific needs. For the first time, the Cisco Silicon One family of chips frees IT teams to regain the operational agility they have lost through a converged architecture that reduces the hardware/software permutations in the network. By incorporating groundbreaking technologies such as advanced memory sub-systems, a P4 + RTC engine, and large-scale power savings, Cisco has created the world's first architecture that can be deployed throughout the entire network.
Learn how Cisco has built and implemented a new generation of ASICs to stretch the limits of high-performance and multi-purpose architectures while massively lowering network power consumption: best-in-class performance at the lowest watts per 100G. Customers experience an across-the-board increase in operational efficiency through software and automation development while achieving their sustainability goals. Consume hardware and software on your terms and innovate at the pace your business demands. You have never had more control over your own network infrastructure.
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
DescriptionWe present a first-of-its-kind mesh-refined (MR) massively parallel Particle-In-Cell (PIC) code for kinetic plasma simulations, optimized on the Frontier, Fugaku, Summit, and Perlmutter supercomputers. Major innovations, implemented in the WarpX PIC code, include: (i) a three-level parallelization strategy that demonstrated performance portability and scaling on millions of A64FX cores and tens of thousands of AMD and NVIDIA GPUs; (ii) a groundbreaking mesh refinement capability that provides between 1.5× and 4× savings in computing requirements for the science case reported in this paper; and (iii) an efficient load balancing strategy between multiple MR levels. The MR PIC code enabled 3D simulations of laser-matter interactions on Frontier, Fugaku, and Summit that have so far been out of reach of standard codes. These simulations helped remove a major limitation of compact laser-based electron accelerators, which are promising candidates for next-generation high-energy physics experiments and ultra-high dose rate FLASH radiotherapy.
Workshop
Recorded
Reliability and Resiliency
W
DescriptionSupercomputers have become increasingly important due to the growing demand for computational power and the amount of available data. As supercomputing systems become larger and serve many users simultaneously, the costs of building and maintaining such systems increase, and the probability of faults increases. Therefore, the efficiency and resilience of such systems are essential for providers and users. One primary tool that provides system resilience is DMTCP, a system-level Checkpoint/Restart (C/R) library that allows performing C/R operations seamlessly without any source code modifications. Meanwhile, Python has become one of the major languages for application programming; hence, providing it with C/R capabilities is desirable in many systems. Accordingly, previous work has brought C/R to Python by supporting DMTCP C/R programmatically from within a Python program. Nevertheless, a particular class of Python codes is not self-contained but rather designed to support other applications by scheduling, managing, and analyzing their results; examples include execution wrappers, pipelining, and parameter sweeping. This class of Python code is widespread on HPC systems that use the SLURM job scheduler, across all types of users. In this work, we extend the previous integration of DMTCP with Python programs and introduce pyDMTCP. This Python module enables Python wrappers of scientific applications to easily utilize DMTCP checkpointing via a Python interface, and externally to applications via SLURM. The interface also maps the entire HPC system according to several main parameters to allow fault-free and optimized C/R executions across different nodes.
The source code of pyDMTCP will be available at https://github.com/Scientific-Computing-Lab-NRCN/pyDMTCP.
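For readers unfamiliar with DMTCP's external interface, a minimal sketch of wrapping an application launch from Python is shown below. The helper names are hypothetical and pyDMTCP's actual API may differ; `dmtcp_launch` with checkpoint-directory and interval options is, to our knowledge, part of DMTCP's standard CLI, and is assumed to be on PATH here.

```python
import subprocess

# Hypothetical sketch of a Python wrapper launching an application under
# DMTCP checkpointing. Function names are illustrative, not pyDMTCP's API.

def dmtcp_launch_cmd(app_argv, ckpt_dir, interval_s=300):
    """Build a dmtcp_launch command line: checkpoint into ckpt_dir
    every interval_s seconds while running the wrapped application."""
    return (["dmtcp_launch", "--ckptdir", ckpt_dir, "-i", str(interval_s)]
            + list(app_argv))

def run_under_dmtcp(app_argv, ckpt_dir):
    # assumption: dmtcp_launch is installed and on PATH on the HPC system
    return subprocess.run(dmtcp_launch_cmd(app_argv, ckpt_dir))

print(dmtcp_launch_cmd(["./solver", "--steps", "1000"], "/tmp/ckpt"))
```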
Workshop
Recorded
W
DescriptionPYNQ is an open-source project from AMD that aims to help HPC applications achieve performance goals more quickly by lowering software complexity. PYNQ, with its runtime Python APIs, has its roots in Zynq SoCs (ARM processors plus programmable logic) and has expanded across both datacenter and RFSoC adaptive computing platforms. With that expansion, the PYNQ community now numbers thousands of active users across tens of thousands of shipped platforms. In this short talk, PYNQ will be revisited in the context of cloud and quantum computing, two areas where adaptive computing is appearing in larger HPC frameworks. PYNQ has provided a scalable API for cloud deployments for some time, and more recently it has been deployed within new quantum computing control systems. Lastly, our newest project, PYNQ-Metadata, will be introduced, as this work gives users new levels of hardware introspection into existing (and new) hardware designs, all from within Jupyter notebooks.
Workshop
Recorded
W
DescriptionIt is becoming increasingly common for laboratories and universities to share computing resources, and as cloud usage and applications continue to expand, a hybrid cloud working model is fast becoming standard practice. In line with these present-day trends, we present an open-source Python library that provides information on high performance computing (HPC) clusters and systems that are available to a user via a peer-to-peer (P2P) infrastructure. These metrics include the size of each system and the availability of nodes, along with the speed of connections between clusters. We will present the benefits of using a P2P model compared to traditional client-server models and look at the ease with which this can be implemented. We will also look at the benefits and uses of gathering this data in one location in order to assist with managing complex workloads in heterogeneous environments.
Paper
Recorded
Post-Moore Computing
Quantum Computing
TP
DescriptionWith collaborative DNN inference, part of each query runs on its source edge device to reduce latency. Because edge devices show diverse performance and network conditions, different layers should run on different devices, and the queries reaching the datacenter have irregular structures. However, emerging schemes are not able to process such irregular queries. We propose ICE, a collaborative inference service scheme that effectively supports irregular queries. ICE comprises a query slicer, a query manager, and a lag enhancer. The query slicer maps the execution of queries based on the edges' performance and network conditions. The query manager batches irregular queries adaptively and schedules them based on their progress. The lag enhancer reduces QoS violations when queries run slower due to interference on the edge. Experiments show that ICE improves the supported peak load of the datacenter by 43.2% on average while guaranteeing the required 99%-ile latencies, compared with state-of-the-art techniques.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionComputational advances in high-performance computing lead to increased data generation by applications, creating a bottleneck in the system due to I/O limitations. One solution is the spatio-temporal sampling method, which takes advantage of both spatial and temporal data-reduction methods to produce higher post-reconstruction quality. User input parameters such as the number of bins or the histogram intersection threshold limit the performance of spatio-temporal sampling. This poster focuses on determining the effect of the histogram intersection threshold in the spatio-temporal sampling method. Results indicate that, as long as a dataset is not identical across adjacent time-steps, reducing the histogram intersection percentage increases the sampling bandwidth until the set of reused blocks becomes static. The ExaAM dataset shows an increase of 100-130% in sampling bandwidth, with only about a 5% decrease in PSNR, at a 60% histogram intersection threshold or lower.
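As a rough sketch of the temporal-reuse test the abstract describes, the histogram intersection of a data block across two adjacent time-steps can be computed as the summed bin-wise minimum of the normalized histograms; a block is only re-sampled when the intersection falls below the threshold. This is a hedged illustration, not the poster's actual implementation:

```python
def histogram_intersection(h1, h2):
    """Overlap between two histograms, normalized to [0, 1]."""
    s1, s2 = sum(h1), sum(h2)
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))

def needs_resample(h_prev, h_curr, threshold=0.6):
    """Re-sample a block only when its distribution drifted below the threshold."""
    return histogram_intersection(h_prev, h_curr) < threshold
```

Lowering `threshold` lets more blocks be reused from the previous time-step, which is why sampling bandwidth rises (at some cost in PSNR) as the intersection percentage is reduced.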
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
Panel
Recorded
Accelerator-based Architectures
Quantum Computing
TP
XO/EX
DescriptionQuantum computing is a quickly growing field and we expect viable systems to be available relatively soon. However, it’s becoming clear that quantum computing systems alone are only of limited use unless integrated with HPC systems to enable both efficient pre- and post-processing and to allow for truly hybrid applications. Many questions remain: which applications will benefit from such acceleration? How widely will quantum acceleration be efficient and cost-effective? How large do quantum systems have to be to make an impact and when will this be the case? What is the impact on the operation in HPC centers? Will quantum acceleration augment or replace other accelerators, e.g., for AI? Or will the quantum computing hype bubble just pop? In this panel, we will discuss these questions with experts from the quantum computing community, with HPC architects, representatives from HPC centers as well as vendors of traditional and emerging accelerators.
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
DescriptionLarge-scale parallel file systems with multiple storage tiers require policy-driven data management to facilitate efficient storage and access of large-scale data. However, management of data across the tiers is challenging due to the massive scale of data being stored. In this talk, we present our initial work on QuickSilver, a lightweight, flexible distributed policy engine. QuickSilver is composed of many single-purpose agents that handle tasks such as gathering file metadata, enforcing policy decisions, and executing policy actions like purging or data migration. These agents are designed to communicate using distributed message queues while maintaining minimal state information. We will discuss the architectural details of the policy engine and its use of message queues to enable scaling. Examples of the initial implementation will be shown with preliminary performance numbers. Since this project is in its infancy, we will also discuss our plans for future work and areas of improvement.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF explores the experience of Black Americans in HPC from the standpoint of prominent leaders' personal histories, with the goal of highlighting what we can do as a community to attract and promote greater racial diversity in our industry. A moderator-led Q&A will establish a basis for discussion, yielding to open questions and discussion from the audience. This BoF is relevant to anyone who supports broader racial representation in HPC and seeks to contribute at an individual or organizational level.
Student Cluster Competition
TP
XO/EX
DescriptionTeam RACKlette is a student team that is active all year round, which allows us to form strong team spirit and friendships. Our team currently consists of 15 members at varying stages of their studies, of whom 12 are still eligible to compete in the SCC. We make sure that all members get to join at least one competition a year, and even those who aren't part of the competing team give their support and assistance. In addition, our alumni members form the backbone of the team, offering knowledge and handy tricks that are useful in any situation. This is remarkable since, although ETHZ supports us with knowledge, advice, and funding, we do not receive any credits towards our degrees for our commitment to Team RACKlette.
Nonetheless, all team members are eager to compete and gain new experiences and insights into the rapidly expanding fields of HPC and AI. These competitions give us unique insights into ongoing research and cutting-edge business, and give us relevant life experience that helps us decide on our future master's programs and careers. The experience we gain from competitions like SC also complements what we learn in lectures and lets us consolidate what we have learned so far. It also prepares us for future lectures by teaching us how to interact with servers and run code efficiently.
Our advisors are Torsten Hoefler and Hussein Harake.
Professor Hoefler inspired the founding of Team RACKlette and has supported us ever since. He has prior experience with Student Cluster Competitions as a participant and as an advisor of winning teams (e.g. SC'08, ISC'19).
Mr. Harake is an HPC System Manager at CSCS and our contact for everything related to our cluster hardware. He has also supported Team RACKlette since its founding and is the main reason we have been able to get so many cutting-edge hardware sponsorships from so many vendors.
Our team for this year's SC competition is built with balance and diversity in mind. One team member participated in last year's IndySCC and ISC22, another member participated in ISC22, and four members will have their first chance to join an HPC competition at SC22. With this mix, we strike a balance between experienced and new members: our plan is to lay a foundation with the more experienced members on which the newer participants can build. Adding to the difference in HPC experience is the spread in academic experience and interests. The team consists of students from «Computer Science» (CS) and «Computational Science & Engineering» (CSE), which gives us a theoretical as well as a practical background for HPC. In addition to the computer science background, members have taken lectures in, e.g., Physics, Robotics, and Biology. ETHZ also emphasizes taking lectures outside the standard curriculum; everybody has to take courses in «Humanities, Social and Political Sciences», which can be in, e.g., History, Economics, or Philosophy.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionWorkflow applications are becoming increasingly important to support scientific discovery. That is leading to a proliferation of workflow management systems and, thus, to a fragmented software ecosystem. Integration among existing workflow tools is a way to improve development efficiency and, ultimately, support the sustainability of the scientific workflow community. We describe our experience with integrating RADICAL-Pilot (RP) and Parsl as a way to enable users to develop and execute workflow applications with heterogeneous tasks on heterogeneous high-performance computing resources. We describe our approach to the integration of the two systems and detail the development of RPEX, a Parsl executor which uses RP as its workload manager. We develop an RP executor that enables executing heterogeneous MPI Python functions on CPUs and GPUs, and we measure the weak and strong scaling of RPEX, RP, and Parsl when providing new capabilities to two paradigmatic use cases: Colmena and Ice Wedge Polygons.
Student Cluster Competition
TP
XO/EX
DescriptionHPC is a priority and a topic of research across several departments and colleges at Clemson University, and Clemson students and faculty regularly use HPC to revolutionize their fields. Last year, Clemson put together a diverse and competitive team and competed for the second time. This year, Clemson's Death Valley Computing is a strong, diverse team whose new candidates are applying their strengths and collaborating to build a formidable squad. Each member brings a strong foundation in traditional computer science and engineering, along with individual experiences and skill sets for various aspects of the competition. Participation in the SCC provides us with an opportunity unlike any other to combine our knowledge and creativity to evaluate and expand our understanding of HPC and, more importantly, to lay the foundation for future opportunities in graduate school and industry.
Cooper Sanders, a senior (CPE), is interested in GPU architecture and has worked on many research projects at Clemson concerning HPC. He has contributed to several codebases by porting scientific workflows to GPUs and optimizing existing kernels. He is working with Los Alamos National Lab this summer on optimizing LANL research software.
David Krasowska is a senior (CPE) who is interested in hardware design and architecture. He has experience, including a published paper, in HPC research involving lossy data compression in collaboration with Argonne National Lab and Los Alamos National Lab.
Ethan Gindlesperger is a senior (CPE) with a minor in mathematics and a focus on computer architecture. Ethan has a background in video game design and robotics and spent time interning with Intel last year. Ethan will parlay his HPC experience in this competition to become a strong candidate for graduate school and/or industry jobs.
Logan Durham is a sophomore (CS). He works in laptop support for Clemson IT. In his free time, he works on older desktops and enterprise hardware, such as Dell PowerEdge servers and HP thin clients, to learn how systems are set up and managed. He has participated in the HackHPC@PEARC21 hackathon and is working with Los Alamos National Laboratory on a data compression project.
Moises Martinez Herrera is a freshman (CS). He works for Clemson IT, helping customers with software and basic hardware issues. He is a first-generation Hispanic student at Clemson University. Moises has set up a personal storage server in his home and has participated in the Hello World hackathon hosted at Clemson.
Benjamin Schlueter is a freshman (CPE) minoring in math and business who is interested in artificial intelligence and HPC. He has a passion for learning as well as for working with computers in events such as hackathons and creative inquiries. Benjamin has already built a cluster computer and is excited to learn more via participation in this competition.
The team's mentor is Dr. Jon C. Calhoun, a tenure-track Assistant Professor of Electrical and Computer Engineering who researches fault tolerance and lossy data compression. He is a strong advocate of HPC education and research for undergraduates, mentoring eight undergraduates in his research group in 2021-2022.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionScientists construct scientific workflows in Scientific Workflow Management Systems (SWfMSs) to analyze scientific data. However, these workflows can be complex and challenging to create for both new and expert users due to the significant growth in the number of tools, the heterogeneous nature of data, and the complexity of the tasks. To overcome these obstacles, scientists have started to share their workflows with the community in the interest of open science, and many researchers have built tool/workflow recommendation systems. But we identified several challenges: many shared scientific workflows contain errors, outdated tools, invalid tool connections, improper tagging, etc. Moreover, many workflow tools will become obsolete in the future, and existing recommendation systems will then fail to recommend appropriate tools, eventually producing less optimal, error-containing workflows. Considering all these challenges, we propose a recommendation system that suggests tools/sub-workflows using machine learning approaches to help scientists create optimal, error-free, and efficient workflows.
Workshop
Recorded
W
DescriptionHPC systems are a critical resource for scientific research. The increased demand for computational power and memory ushers in the exascale era, in which complex supercomputers consist of numerous compute nodes and are consequently expected to experience frequent faults and crashes.
Exact state reconstruction (ESR) was proposed as an alternative mechanism to alleviate the impact of frequent failures on long-term computations. ESR has been shown to provide exact reconstruction of iterative solvers while avoiding the need for costly checkpointing. However, ESR currently relies on volatile memory for fault tolerance, and must therefore maintain redundancies in the RAM of multiple nodes, incurring high memory and network overheads.
Recent supercomputer designs feature emerging non-volatile RAM (NVRAM) technology. This paper investigates how NVRAM can be utilized to devise an enhanced ESR-based recovery mechanism that is more efficient and provides full resilience, based on a novel MPI implementation of One-Sided Communication (OSC) over RDMA.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionThis abstract presents a conceptual framework and methods to extract and share (meta)data (data and metadata) necessary for reproducibility in the context of complex hybrid workflows (workflows that include numerical simulations and data-intensive applications) executed at extreme scale. We target Digital Objects required to reproduce results and performance: we capture, fuse, and analyze (meta)data to select parameters influencing reproducibility, and make them FAIR Digital Objects for re-use.
Workshop
Recorded
W
DescriptionEnergy consumption is a major concern in high-performance computing. One important contributing factor is the number of times the wires are charged and discharged, i.e., how often they switch from '0' to '1' and vice versa. We describe a software technique to minimize this switching activity in GPUs, thereby lowering the energy usage. Our technique targets the memory bus, which comprises many high-capacitance wires that are frequently used. Our approach is to strategically change data values in the source code such that loading and storing them yields fewer bit flips. The new values are guaranteed to produce the same control flow and program output. Measurements on GPUs from two generations show that our technique allows programmers to save up to 9.3% of the whole-GPU energy consumption and 1.2% on average across eight graph-analytics CUDA codes without impacting performance.
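The switching activity the technique minimizes can be counted directly: each bit that differs between two consecutive words on the memory bus is one wire toggle. A minimal counting sketch, for illustration only (not the paper's tooling):

```python
def bit_flips(a, b):
    """Number of bus wires that toggle when word `a` is followed by word `b`."""
    return bin(a ^ b).count("1")

def bus_switching(words):
    """Total bit transitions for a sequence of words sent over one bus."""
    return sum(bit_flips(x, y) for x, y in zip(words, words[1:]))
```

Rewriting a data value to one that shares more bits with its neighbors on the bus reduces this count; as long as the new value is guaranteed to produce the same control flow and output, the program behaves identically while the wires toggle less often.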
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
DescriptionAfter my first work experience as an RSE at the University of Plymouth (UK) on a single-topic project (porting a scientific code to GPUs), I joined a team of RSEs at the University of Durham (UK) at the end of 2021. In this talk, I share my thoughts on the challenges of RSE work from an early-career perspective, related to the multidisciplinary variety of projects and range of tasks as well as to the hybrid mode of work.
Workshop
Recorded
W
DescriptionWe present a Reinforcement Learning (RL) based approach to efficiently perform loop distribution, optimizing for vectorization and locality. We generate the SCC Dependence Graph for each loop of the program. Our RL model learns to predict the distribution order of the loop by performing a topological walk of the graph. The RL reward is computed using instruction cost and the number of cache misses. For training purposes, we also propose a novel strategy to extend the training set by generating new loops.
We show results on the x86 architecture on various benchmarks: TSVC, LLVM-Test-Suite, PolyBench, and PolyBenchNN. Our framework achieves average performance improvements of 3.63% on TSVC, 4.61% on the LLVM-Test-Suite MicroBenchmarks, 1.78% on PolyBench, and 1.95% on PolyBenchNN, with the LLVM -O3 flag as the baseline. We also show improvements on other performance metrics, such as instructions per cycle (IPC), the number of loops distributed and vectorized, and L1 cache performance.
Workshop
Recorded
W
DescriptionHigh-Level Synthesis (HLS) offers a possible programmability solution for FPGAs but currently delivers far lower hardware quality than circuits written in Hardware Description Languages (HDLs). One reason is that the standard set of code optimizations used by CPU compilers, such as LLVM, is not well suited for an FPGA backend.
While much work has been done employing reinforcement learning for compilers in general, work directed toward HLS is limited and conservative. We expand both the number of learning strategies for HLS compiler tuning and the metrics used to evaluate their impact. Our results show improvements over the state of the art for every standard benchmark evaluated and every learning-quality metric investigated. Choosing just the right strategy can give an improvement of 23x in learning speed, 4x in performance potential, and 3x in speedup over -O3, and has the potential to largely eliminate the fluctuation band from the final results.
Panel
Recorded
HPC Community Collaboration
State of the Practice
TP
XO/EX
DescriptionToday, most HPC systems on the TOP500 are examples of a commodity monoculture, built from nodes containing server-class microprocessors and GPU accelerators. With the end of Dennard scaling, the slowing of Moore's Law, and exponentially rising costs for semiconductor fabrication facilities, high-performance computing (HPC) is at an important inflection point. In another profound shift, computing economics are now dominated by cloud hyperscalers and smartphone vendors, who are increasingly building with custom semiconductors. Concurrently, AI advances are reshaping how we think about the nature of scientific computation and pursue scientific breakthroughs.
How can the HPC community best adapt? Our thesis is that current approaches to designing and constructing leading-edge high-performance computing systems must change in fundamental ways. This panel will explore possible approaches based on end-to-end co-design; custom hardware configurations and packaging; large-scale prototyping, which was once common; and collaborative partnerships, motivated in part by a recent position paper: https://arxiv.org/abs/2203.02544
Panel
Recorded
Scientific Computing
Software Engineering
TP
XO/EX
DescriptionWith the end of Dennard scaling and the slowdown of Moore's law, approximate computing (AC) has emerged as an attractive option to improve performance and energy efficiency by relaxing correctness and allowing errors. Several AC techniques have been proposed, from hardware-level techniques (e.g., voltage scaling) to software-level techniques (e.g., memoization and mixed precision). These methods have proven helpful in workloads that are not traditional in HPC, such as image and video processing, which can naturally tolerate error. But how feasible is it to apply AC methods to HPC scientific applications? This panel gathers experts in different AC fields to address that question. Drawing on their vast experience, panelists will share views on the most crucial problems of adopting AC in HPC. Attendees will benefit from discussions with the panelists and can provide feedback.
Workshop
Recorded
Reliability and Resiliency
W
DescriptionDebugging massively parallel applications remains a highly challenging task. With trends towards larger and more complex supercomputers, remarkably increasing degrees of parallelism, more parallelism options (e.g., heterogeneity), and emerging programming models, applications gain higher performance and scalability by using more asynchronous algorithms. However, these come at a productivity cost: they introduce non-determinism in parallel program execution—i.e., the applications do not produce the same output in different runs—which makes debugging an even greater challenge. A particularly well-known source of non-determinism at large scale is the message-passing interface (MPI). Since network and system noise can affect the order of received messages, applications can take different computation paths depending on that order. This complicates debugging because computation paths and associated computational results may vary between the original run (where a bug manifested itself) and the debugged runs. In this lightning talk, we introduce ReMPI (MPI Record-and-Replay Tool, https://github.com/PRUNERS/ReMPI), which facilitates debugging non-deterministic MPI applications. ReMPI records the execution of each MPI process as trace data, which includes the order of the message receives. Then, during debugging, a replay mechanism uses these recorded traces to ensure that every MPI process observes the same message exchanges as the recorded run.
Birds of a Feather
TP
XO/EX
DescriptionThe advancement of scientific knowledge is driven by the ability to reproduce research findings. While there is agreement about the importance of reproducibility and we continue to see a growing number of related initiatives, most researchers still do not incorporate reproducibility practices in their work. It is also not uncommon to lose years of research progress when a researcher leaves a team or graduates. Building on PEARC22 BoF discussions, this BoF will aim at democratizing reproducibility and discussing opportunities and challenges for developing active community-driven services and shared training resources for the reproducibility and trustworthiness of scientific research.
Workshop
Recorded
AI-HPC Convergence
Emerging Technologies
Memory Systems
Networks
Resource Management and Scheduling
W
Workshop
Recorded
AI-HPC Convergence
Emerging Technologies
Memory Systems
Networks
Resource Management and Scheduling
W
DescriptionCompute Express Link (CXL) has recently attracted significant attention thanks to its excellent hardware-heterogeneity management and resource-disaggregation capabilities. Even though no commercially available product or platform yet integrates CXL 2.0/3.0 into memory pooling, CXL is expected to make memory-resource disaggregation far more practical and efficient than ever before.
In this lecture, we will argue why existing computing and memory resources require a new interface for cache coherency and demonstrate how CXL can put the different types of resources into a disaggregated pool. As use-case scenarios, this lecture will show two real system examples: a CXL 2.0-based end-to-end system that directly connects a host processor complex to remote memory resources over CXL's memory protocol, and a CXL-integrated storage expansion system prototype. At the end of the lecture, we introduce a set of hardware prototypes designed to support future CXL 3.0 systems as part of our ongoing project.
Workshop
Recorded
AI-HPC Convergence
Emerging Technologies
Memory Systems
Networks
Resource Management and Scheduling
W
Workshop
Recorded
AI-HPC Convergence
Emerging Technologies
Memory Systems
Networks
Resource Management and Scheduling
W
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
DescriptionIn situ approaches enable performing data analysis/visualization (ana/vis) close to the data source and running them on the same system. However, variations in the simulation data and the diversity of underlying HPC environments increase the difficulty of adjusting the in situ processing configurations adaptively. Triggers are an emerging strategy that follows the autonomic computing paradigm to optimize when and how to execute in situ ana/vis tasks. By inspecting indicators, the trigger can flexibly issue customized control instructions to optimize the execution of in situ ana/vis tasks in real-time. This position paper formalizes the elements of the trigger mechanism according to the definition of autonomic computing. It uses the formalization as a guideline to summarize the research status of different aspects of the trigger mechanism for in situ processing, including (1) where to execute ana/vis tasks, (2) resource allocation of ana/vis tasks, and (3) when to execute ana/vis tasks.
Birds of a Feather
TP
XO/EX
DescriptionAt SC21, over 50 attendees participated in robust discussion of strategies for managing storage in advanced computing environments. At the urging of many, we plan to continue the conversation this year, focusing on two themes: (1) progress made on creating unified storage environments for research computing and (2) strategies and tactics for dealing with the end of “unlimited free” cloud storage. In particular, policy and pricing model changes in cloud storage offerings have placed substantial pressure on RCD organizations to migrate to alternative storage solutions—a task that can be daunting, given the scale and diversity of data involved.
Paper
Recorded
Architectures
Machine Learning and Artificial Intelligence
TP
DescriptionData prefetching hides memory latency by predicting and loading necessary data into the cache beforehand. Most prefetchers in the literature are efficient only for specific memory address patterns, which restricts their utility to specialized applications: they do not perform well on hybrid applications with multifarious access patterns. Therefore, we propose ReSemble: a reinforcement learning (RL) based adaptive ensemble framework that enables multiple prefetchers to complement each other on hybrid applications. Our RL-trained ensemble controller takes prefetch suggestions from all prefetchers as input, selects the best suggestion dynamically, and learns online toward higher cumulative rewards, which are collected from prefetch hits/misses. Our ensemble framework, using a simple multilayer perceptron as the controller, achieves averages of 85.27% (accuracy) and 44.22% (coverage), leading to a 31.02% IPC improvement, which outperforms state-of-the-art individual prefetchers by 8.35%-26.11%, while also outperforming SBP, a state-of-the-art (non-RL) ensemble prefetcher, by 5.69%.
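As a hedged illustration of the ensemble idea (the paper trains an MLP controller with RL; this toy uses a plain online bandit, and all names are ours), the controller below learns from hit/miss rewards which prefetcher's suggestion to select:

```python
# Toy sketch of an ensemble prefetch controller (NOT the paper's MLP/RL
# design; this is a simple bandit with forced initial exploration). A choice
# is rewarded when its predicted address is the next one actually accessed.

def next_line(addr):          # "next-line" prefetcher: predicts addr + 1
    return addr + 1

def stride2(addr):            # stride-2 prefetcher: predicts addr + 2
    return addr + 2

prefetchers = [next_line, stride2]
value = [0.0, 0.0]            # running mean reward per prefetcher
counts = [0, 0]

def choose():
    # try each prefetcher once, then exploit the best running estimate
    for i, c in enumerate(counts):
        if c == 0:
            return i
    return max(range(len(prefetchers)), key=lambda i: value[i])

trace = list(range(0, 200, 2))           # a pure stride-2 access pattern
hits = 0
for addr, nxt in zip(trace, trace[1:]):
    i = choose()
    reward = 1.0 if prefetchers[i](addr) == nxt else 0.0
    hits += int(reward)
    counts[i] += 1
    value[i] += (reward - value[i]) / counts[i]   # incremental mean update
```

On this trace the controller locks onto the stride-2 prefetcher after one trial of each, hitting on 98 of the 99 transitions.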
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
DescriptionWe extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-rank matrix approximations. We exploit the mathematical structure of the dense covariance matrix whose inverse action and determinant are repeatedly required in Gaussian log-likelihood optimization. Geostatistics augments first-principles modeling approaches for the prediction of environmental phenomena given the availability of measurements at a large number of locations; however, traditional Cholesky-based approaches grow cubically in complexity, gating practical extension to continental and global datasets now available. We combine the linear algebraic contributions of mixed-precision and low-rank computations within a tile-based Cholesky solver with on-demand casting of precisions and dynamic runtime support from PaRSEC to orchestrate tasks and data movement. Our adaptive approach scales on various systems and leverages the Fujitsu A64FX nodes of Fugaku to achieve up to 12X performance speedup against the highly optimized dense Cholesky implementation.
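A minimal sketch of the mixed-precision idea, under our own simplifications (a 2x2-tile Cholesky in pure Python, with the trailing update cast to emulated float32 via `struct`; the actual work uses a tile-based solver with on-demand casting under the PaRSEC runtime): the factor still reconstructs the matrix to roughly single-precision accuracy.

```python
import struct

# Our toy mixed-precision tile Cholesky: the panel factorization and the
# triangular solve run in float64, while the trailing update runs in
# emulated float32. This is an illustration, not the paper's implementation.

def f32(x):
    # round a Python float to IEEE binary32 to emulate a low-precision tile
    return struct.unpack('f', struct.pack('f', x))[0]

def chol(a):
    # dense float64 Cholesky, returning the lower-triangular factor
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = (a[j][j] - sum(L[j][k] ** 2 for k in range(j))) ** 0.5
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def trsm(L11, B):
    # solve X * L11^T = B row by row (forward substitution)
    n, m = len(L11), len(B)
    X = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            s = B[i][j] - sum(X[i][k] * L11[j][k] for k in range(j))
            X[i][j] = s / L11[j][j]
    return X

def update(T, X, cast):
    # trailing update T - X X^T with every product and sum cast to float32
    n = len(T)
    return [[T[i][j] - cast(sum(cast(X[i][k] * X[j][k]) for k in range(len(X[0]))))
             for j in range(n)] for i in range(n)]

A = [[8.0, 2.0, 1.0, 0.0],           # a small SPD matrix, split into 2x2 tiles
     [2.0, 9.0, 0.0, 1.0],
     [1.0, 0.0, 7.0, 2.0],
     [0.0, 1.0, 2.0, 6.0]]
A11 = [row[:2] for row in A[:2]]
A21 = [row[:2] for row in A[2:]]
A22 = [row[2:] for row in A[2:]]

L11 = chol(A11)                      # panel factorization (float64)
L21 = trsm(L11, A21)                 # triangular solve (float64)
L22 = chol(update(A22, L21, f32))    # trailing update in emulated float32

L = [[L11[0][0], 0.0, 0.0, 0.0],
     [L11[1][0], L11[1][1], 0.0, 0.0],
     [L21[0][0], L21[0][1], L22[0][0], 0.0],
     [L21[1][0], L21[1][1], L22[1][0], L22[1][1]]]
err = max(abs(sum(L[i][k] * L[j][k] for k in range(4)) - A[i][j])
          for i in range(4) for j in range(4))
```

The reconstruction error `err` stays near float32 rounding (well below 1e-5) even though the trailing update ran in low precision.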
Workshop
Recorded
W
DescriptionFault-tolerant applications need to recover data lost after process failures. Since it is typically impractical to request replacement resources after a failure, applications have to continue with the remaining resources, which requires redistributing the workload. We present an algorithmic framework and its C++ implementation, ReStore, that enables recovery of data after process failures. Because all required data is stored in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. Our experiments show loading times for lost input data in the range of milliseconds on up to 24,576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.
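The replication scheme can be sketched as follows; the round-robin placement and function names are our own illustration, not ReStore's C++ API. Each block lives on r ranks, so any single failure leaves a survivor, and shrinking recovery reassigns blocks over the remaining ranks:

```python
# Illustration of the in-memory replication idea. Each block is stored on
# `r` consecutive ranks, so after a failure every block survives on some
# replica, and "shrinking recovery" spreads blocks over the remaining ranks.

def place(num_blocks, num_ranks, r=2):
    # block b is replicated on ranks b, b+1, ..., b+r-1 (mod num_ranks)
    return {b: [(b + k) % num_ranks for k in range(r)]
            for b in range(num_blocks)}

def recover(placement, failed, num_ranks):
    alive = [p for p in range(num_ranks) if p not in failed]
    new_owner = {}
    for blk, replicas in placement.items():
        if all(p in failed for p in replicas):
            raise RuntimeError(f"block {blk} lost: all replicas failed")
        # shrinking recovery: round-robin over the surviving ranks
        new_owner[blk] = alive[blk % len(alive)]
    return new_owner

placement = place(num_blocks=8, num_ranks=4, r=2)
owners = recover(placement, failed={2}, num_ranks=4)   # rank 2 dies
```

With replication factor r = 2 any single failure is survivable; losing two adjacent ranks (e.g., {1, 2}) would raise, because some block's replicas all failed.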
Workshop
Recorded
W
DescriptionIn this work in progress, we showcase a comprehensive analysis of the current state-of-the-art solutions for data skew mitigation in several environments. Our experiments and evaluation comprise several data-intensive workflows running on Spark using the Grid’5000 testbed. The data-intensive workflows range from a highly optimized WordCount application and an iterative application like PageRank to an SQL-based decision-support system benchmark, TPC-H, with various sizes and configurations. Going forward, we will discuss our current efforts toward heterogeneity-aware, multi-stage data partitioning.
Student Cluster Competition
TP
XO/EX
DescriptionThis team, "Revontuli" (northern lights in Finnish), is Finland's first-ever entry to the SCC, and all the team members are first-timers. The team is coordinated and managed by CSC, the national supercomputing center of Finland and the host of the 550-Pflops LUMI pre-exascale supercomputer. The team is unique in that its members come from four universities (Aalto University, University of Helsinki, Tampere University, and LUT University) in four different cities (Helsinki, Espoo, Tampere, and Lappeenranta) in Finland. That is, it is not a university team but a true all-Finnish team.
CSC started looking for team members in the fall of 2021 by distributing flyers and contacting universities with computational science programs. We received a number of applications, interviewed the most promising candidates, and finalized the team before the Christmas holidays. In addition to technical skills, we emphasized the motivation to learn HPC and to work as a team. The team members are at different stages of their BSc studies in computer science, but all are experienced Linux users and hobbyist or even professional programmers with experience in various programming languages, including Fortran, C/C++, and Python. CSC's role is to bring in the HPC skills. One of the team members, Roope Salmi, won a bronze medal at the International Olympiad in Informatics 2020; however, none of the members had HPC experience before. Thus, this competition provides the team members with a unique opportunity to learn HPC. As the importance of HPC is increasing in a multitude of scientific disciplines, the competition is very likely to benefit the participants whatever their exact academic paths turn out to be.
The advisor, Dr. Jussi Enkovaara, has a background in computational physics and has been at CSC since 2005. He has experience in developing and optimizing large scientific applications and in international HPC projects. Currently, he works in HPC support, helping customers optimize and parallelize scientific applications and contributing to CSC's user training.
Workshop
Recorded
Runtime Systems
System Software
W
DescriptionI report on experiences developing and deploying the funcX distributed function-as-a-service (FaaS) platform and on employing this platform to support distributed computing pipelines that link instruments, computers (e.g., for analysis, simulation, AI model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. Both funcX and the Globus Flows system used to implement these pipelines combine cloud-hosted management, for reliability, with edge-hosted execution, for flexible and scalable execution. I discuss, in particular, the funcX and Globus Flows architectures; the container management strategies used in funcX to execute functions with high performance and efficiency on diverse funcX endpoints; and funcX’s integration with an in-memory data store and Globus for managing data that spans endpoints.
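As a toy sketch of the FaaS pattern described (this is NOT the funcX API; every name here is invented for illustration), a cloud-hosted manager owns the function registry and routes invocations to an edge endpoint:

```python
# Toy FaaS registry/dispatch pattern. In the real system the endpoint
# executes the function remotely inside a managed container; here we simply
# call it in-process to show the control flow.

class Manager:
    def __init__(self):
        self.functions = {}
        self.endpoints = {}

    def register_function(self, fn):
        fid = f"fn-{len(self.functions)}"
        self.functions[fid] = fn
        return fid

    def register_endpoint(self, name):
        self.endpoints[name] = []             # completed results per endpoint
        return name

    def run(self, fid, endpoint, *args):
        # a real platform would ship the function to the endpoint
        result = self.functions[fid](*args)
        self.endpoints[endpoint].append(result)
        return result

mgr = Manager()
fid = mgr.register_function(lambda x: x * x)
ep = mgr.register_endpoint("edge-gpu-01")
out = mgr.run(fid, ep, 12)
```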
Workshop
Recorded
Runtime Systems
System Software
W
DescriptionHigh-performance computing is evolving rapidly, shaped by the confluence of three trends: a) traditional simulation and modeling workloads are converging with massive data analytics and AI/ML workflows; b) the efficiency of special-purpose heterogeneous hardware is increasing; and c) the demand for flexible delivery models that blend traditional on-premises deployments with cloud-like as-a-service models continues to grow. Heterogeneity is driven by the end of Moore's Law, the growth of data, and the emergence of broad AI adoption that is well suited to special-purpose hardware. To date, serverless computing has abstracted the complexity of the underlying infrastructure by leveraging homogeneity, motivated by a simplified DevOps experience for new composable and scalable applications. Delivering the efficiency of heterogeneity, the productivity of serverless, and the granularity of Functions-as-a-Service demands a new architecture.
Heterogeneous Serverless Computing (HSC) aims to enable the development and delivery of HPC, HPDA, and AI (H2A) workloads with the ease and efficiency of the cloud, and with higher scale and more fluidity than supercomputers. HSC is a software-hardware co-designed infrastructure supporting H2A workflow execution economically and securely at fine granularity using Functions as a Service (FaaS). HSC targets the evolution to H2A workflows with flexible consumption models and edge-to-exascale deployment, and embraces a more maintainable, scalable, and reusable development model. We focus on innovative uses of accelerators, such as SmartNICs and Fabric Attached Memories, to improve the performance of H2A applications and the efficiency of the hardware, without compromising ease of development.
Workshop
Recorded
Quantum Computing
W
DescriptionRecent works have demonstrated that large quantum circuits can be cut and decomposed into smaller clusters of quantum circuits with fewer qubits that can be executed independently on a small quantum computer. Classical post-processing then combines the results from each cluster to reconstruct the output of the original quantum circuit. However, the runtime of such hybrid quantum-classical algorithms is exponential in the number of cuts on a circuit. We propose Rotation-Inspired Circuit Cut Optimization (RICCO), an alternative method that reduces the post-processing overhead of circuit cutting at the cost of solving an optimization problem. RICCO introduces unitary rotations at cut locations to rotate the quantum state such that expectation values with respect to one set of observables are maximized and the others are set to zero. We demonstrate a practical application of RICCO to VQE by classically simulating a small instance of VQE and comparing it to an existing circuit-cutting method.
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
ACM Gordon Bell COVID Finalist
ACM Gordon Bell Finalist
Awards Presentation
Running Ahead of Evolution - AI Based Simulation for Predicting Future High-Risk SARS-CoV-2 Variants
Recorded
Awards
TP
DescriptionThe never-ending emergence of SARS-CoV-2 variants of concern (VOCs) has challenged pandemic control across the whole world. To develop effective drugs and vaccines, one needs to efficiently simulate SARS-CoV-2 spike receptor-binding domain (RBD) mutations and identify high-risk variants. We pretrain a large protein language model on approximately 408 million protein sequences and construct a high-throughput screening pipeline for the prediction of binding affinity and antibody escape. As the first work on SARS-CoV-2 RBD mutation simulation, we successfully identify mutations in the RBD regions of 5 VOCs and can screen millions of potential variants in seconds. Our workflow scales to 4,096 NPUs with 96.5% scalability and a 493.9× speedup in mixed-precision computing, while achieving a peak performance of 366.8 PFLOPS (34.9% of the theoretical peak) on Pengcheng Cloudbrain-II. Our method paves the way for simulating coronavirus evolution in order to prepare for a future pandemic that will inevitably take place.
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
DescriptionScientific HPC applications can use MPI together with another runtime in the same execution to communicate between processes and to orchestrate work within a node. However, with these developments it can be difficult to use all the resources of a supercomputer, as load imbalance may appear during execution. This abstract presents methods to use resources more efficiently in the presence of two types of load imbalance, with minimal impact on the application's code. The first method dynamically detects load imbalance and balances the computation by redistributing OpenMP threads between the MPI processes local to a node. With this method, an improvement of up to 30% can be observed for MiniFE. The second method aims at harnessing the cores left unused during execution to run another application at the same time. Preliminary results show a gain with this second method, but with some impact on the first application.
Panel
Recorded
Data Analytics
Scientific Computing
TP
XO/EX
DescriptionThe goal of this panel is to discuss the latest runtime evolution and its impact on applications. Advances in this matter are key to executing science workflows and understanding their results, enabling efficient execution on diverse platforms, ensuring scalability of high-level descriptions of analytics workflows, and increasing user productivity and system utilization. In other words, they determine how easily and rapidly a science team can develop or port a workflow to a new platform, and how well the resulting implementation makes use of the platform and its resources.
Our panel includes representatives of a large number of different runtimes, among them OpenMP, OpenACC, SYCL, COMPSs, PaRSEC, OmpSs, and StarPU. This is a great opportunity to bring together some of the most important and widely used runtimes and programming models, present and discuss the latest efforts on each of them, and compare the different perspectives on facing the challenges of the upcoming extreme heterogeneity era.
Workshop
Recorded
W
DescriptionScientific parallel applications often use MPI for inter-node communication and OpenMP for intra-node orchestration. Parallel applications such as particle transport, seismic wave propagation simulators, or finite-element codes often exhibit workload imbalance due to their ongoing data movement. These applications usually implement software balancing strategies, triggered when imbalance thresholds are detected, to reduce this imbalance. Such strategies are complex to implement and impact the entire distributed application's performance by synchronizing and exchanging load information over the network. This presentation proposes a method to dynamically detect load imbalance and balance the computation by redistributing OpenMP threads between the MPI processes local to a node. With minimal impact on the applications' code, we demonstrate how this technique can improve overall application performance by up to 28% on MiniFE, 17% on Quicksilver, and 3% on Ondes3D. We also present its impact when executing on multiple nodes and our approach's limitations.
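The redistribution rule can be illustrated with a small sketch (our own formulation of a proportional policy, not the presentation's exact algorithm):

```python
# Proportional thread-rebalancing sketch: give each MPI rank on the node at
# least one OpenMP thread, split the rest by measured load share, and hand
# rounding leftovers to the most loaded ranks.

def redistribute(loads, total_threads):
    total = sum(loads)
    threads = [1] * len(loads)                  # every rank keeps one thread
    rest = total_threads - len(loads)
    alloc = [int(rest * (l / total)) for l in loads]
    for i, a in enumerate(alloc):
        threads[i] += a
    leftover = total_threads - sum(threads)     # lost to int() rounding
    for i in sorted(range(len(loads)), key=lambda i: -loads[i])[:leftover]:
        threads[i] += 1
    return threads

# four ranks on one node, 16 hardware threads; rank 2 is twice as loaded
threads = redistribute([1.0, 1.0, 2.0, 1.0], 16)
```

The most loaded rank ends up with the largest thread count while the total stays fixed at the node's thread budget.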
Birds of a Feather
TP
XO/EX
DescriptionSAGE3, the Smart Amplified Group Environment, is the next-generation, “human-in-the-loop” collaboration platform, providing HPC users with the tools to access, explore, discover, publish, share, integrate, and reuse complex datasets. SAGE3 supports today’s hybrid work/home environments (laptops, single monitors and display walls), and interfaces with a variety of computational infrastructures, workflows, notebooks, and analytics software through Artificial Intelligence-enabled services and orchestration services. It lowers the barrier of entry into AI for non-expert users, democratizing access to disruptive technologies for those with varying skills. The BoF presents SAGE3 features, highlights user stories, describes future design plans, and encourages attendee feedback and interaction.
Awards Presentation
Recorded
Awards
TP
W
TUT
XO/EX
DescriptionThe SC22 conference awards, as well as selected ACM, IEEE and SigHPC awards, will be presented.
The awards include Student Cluster Competition Winners, Best Student Paper, Best Paper, Test of Time, Best Poster, Best Scientific Visualization, Best Reproducibility Advancement Award, IEEE TCHPC Award for Excellence for Early Career Researchers in High Performance Computing, ACM Student Research Competition, ACM/IEEE-CS George Michael Memorial HPC Fellowship, ACM Gordon Bell Prize and Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research, ACM SIGHPC Computational & Data Science Fellowships, ACM SIGHPC Outstanding Doctoral Dissertation, SIGHPC Emerging Woman Leader in Technical Computing Award, and Education Award.
Everyone with an SC22 badge is welcome to attend.
SC23
Recorded
TP
XO/EX
DescriptionJoin us for a preview of the next SC conference. A look to the future of SC is a great way to start your Thursday morning.
Doctoral Showcase
Posters
Recorded
TP
DescriptionEfficiently and accurately simulating partial differential equations (PDEs) in and around arbitrarily defined geometries, especially with high levels of adaptivity, has significant implications for many application domains. In this work, we develop a fast construction of a ‘good’ adaptively refined, incomplete-octree-based mesh capable of carving out arbitrarily shaped void regions from the parent domain: an essential requirement for fluid simulations around complex objects. Further, we integrate the mesh generation with PETSc to solve several multiphysics and multiphase phenomena. We showcase the applicability of the algorithms to large-scale problems. The algorithms developed have enabled us to run the most resolved jet-atomization simulations to date and have demonstrated scaling to O(100K) processors on TACC Frontera.
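A hedged, 2-D quadtree analogue of the carving construction (the actual work builds a 3-D incomplete octree in parallel; the geometry test here is a deliberate simplification):

```python
# 2-D quadtree analogue of the carved-mesh idea: cells fully inside a
# circular void are dropped, cells crossing its boundary are refined, and
# everything else is kept as a leaf of the mesh.

R, CX, CY = 0.25, 0.5, 0.5            # the void: a circle in the unit square

def nearest_dist(x, y, size):
    # distance from the void's center to the closest point of the cell
    px = min(max(CX, x), x + size)
    py = min(max(CY, y), y + size)
    return ((px - CX) ** 2 + (py - CY) ** 2) ** 0.5

def farthest_dist(x, y, size):
    # distance from the void's center to the farthest cell corner
    return max(((x + dx * size - CX) ** 2 + (y + dy * size - CY) ** 2) ** 0.5
               for dx in (0, 1) for dy in (0, 1))

def build(x, y, size, depth, max_depth, cells):
    if farthest_dist(x, y, size) < R:
        return                        # fully inside the void: carve out
    if nearest_dist(x, y, size) < R and depth < max_depth:
        h = size / 2                  # crosses the void boundary: refine
        for dx in (0, h):
            for dy in (0, h):
                build(x + dx, y + dy, h, depth + 1, max_depth, cells)
    else:
        cells.append((x, y, size))    # leaf kept in the mesh

cells = []
build(0.0, 0.0, 1.0, 0, 4, cells)     # refine to depth 4 around the boundary
```

The resulting leaf list covers the unit square minus the carved interior, with refinement concentrated along the void boundary.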
Paper
Recorded
System Software
TP
Best Paper Finalist
Best Student Paper Finalists
DescriptionDerivatives are key to numerous science, engineering, and machine learning applications. While existing tools generate derivatives of programs in a single language, modern parallel applications combine a set of frameworks and languages to leverage available performance and function in an evolving hardware landscape.
We propose a scheme for differentiating arbitrary DAG-based parallelism that preserves scalability and efficiency, implemented into the LLVM-based Enzyme automatic differentiation framework. By integrating with a full-fledged compiler backend, Enzyme can differentiate numerous parallel frameworks and directly control code generation. This flexibility permits Enzyme to leverage parallel and differentiation-specific optimizations far beyond existing tools.
We differentiate nine distinct versions of the LULESH and miniBUDE applications, written in different programming languages (C++, Julia) and parallel frameworks (OpenMP, MPI, RAJA, Julia tasks, MPI.jl), demonstrating scalability similar to that of the original programs and a differentiation overhead of 1.0-11.7x on 64 threads/nodes.
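Enzyme differentiates programs at the LLVM IR level, which a short sketch cannot reproduce; as a language-level illustration of the underlying reverse-mode idea, here is a tiny tape-based AD in Python (the `Var` class and tape layout are our own toy construction, not Enzyme's):

```python
import math

# Minimal reverse-mode automatic differentiation: record each operation and
# its local partial derivatives on a tape, then sweep the tape backwards
# applying the chain rule.

class Var:
    def __init__(self, value, tape=None):
        self.value = value
        self.grad = 0.0
        self.parents = []                     # (parent Var, local partial)
        self.tape = tape if tape is not None else []
        self.tape.append(self)

    def _new(self, value, parents):
        out = Var(value, self.tape)
        out.parents = parents
        return out

    def __add__(self, other):
        return self._new(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return self._new(self.value * other.value,
                         [(self, other.value), (other, self.value)])

    def sin(self):
        return self._new(math.sin(self.value), [(self, math.cos(self.value))])

def backward(out):
    out.grad = 1.0
    for v in reversed(out.tape):              # sweep in reverse program order
        for parent, partial in v.parents:
            parent.grad += partial * v.grad   # chain-rule accumulation

x = Var(2.0)
y = Var(3.0, x.tape)
f = x * y + x.sin()                           # f = x*y + sin(x)
backward(f)
# df/dx = y + cos(x), df/dy = x
```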
Paper
Recorded
Architectures
Machine Learning and Artificial Intelligence
TP
DescriptionCycle-accurate microarchitecture simulators are essential tools for architecting new processors, but they are often replaced by alternative methodologies, such as statistical or analytical modeling, for shorter turnaround time. There have also been attempts to employ ML to perform architecture simulations, such as Ithemal and SimNet, but existing solutions may be even slower due to intrinsic computational intensity and memory-traffic challenges.
This paper proposes the first GPU-based microarchitecture simulator that unleashes the GPU's potential to accelerate the state-of-the-art ML-based simulators. First, we introduce an efficient GPU implementation that minimizes data movement and customizes state-of-the-art ML inference engines to achieve rapid single instruction simulation for SimNet. Second, we propose a parallel simulation paradigm that partitions a trace into sub-traces to simulate them in parallel with rigorous error analysis and effective error correction mechanisms. Combined, our GPU-based simulator outperforms traditional CPU-based simulators significantly, i.e., up to 1014x speedup over gem5 detailed simulation.
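The sub-trace idea can be illustrated with a toy simulator whose output depends only on a bounded window of recent history (our own stand-in, much simpler than SimNet's learned model): overlapping each sub-trace with a warm-up region makes the stitched parallel result match the sequential run.

```python
from collections import deque

# Toy history-dependent simulator: an access "hits" if it appeared within
# the last `window` accesses. Sub-traces overlap by `warmup` accesses so
# each partition starts from the exact simulator state.

def simulate(trace, window=4):
    recent, hits = deque(maxlen=window), []
    for a in trace:
        hits.append(a in recent)
        recent.append(a)
    return hits

def simulate_parallel(trace, chunk, warmup=4):
    results = []
    for start in range(0, len(trace), chunk):   # each chunk is independent
        lo = max(0, start - warmup)
        part = simulate(trace[lo:start + chunk])
        results.extend(part[start - lo:])       # discard warm-up outputs
    return results

trace = [1, 2, 1, 3, 4, 1, 2, 5, 2, 6, 1, 3, 3, 7, 1, 2]
seq = simulate(trace)
par = simulate_parallel(trace, chunk=5)
```

Because this toy state is fully determined by the warm-up window, the stitched result is exact; the paper's contribution is handling models where warm-up is only approximate, via error analysis and correction.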
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
DescriptionStencil computations lie at the heart of many scientific and industrial applications. Stencil algorithms pose several challenges on machines with a cache-based memory hierarchy, owing to the low re-use of memory accesses if special care is not taken to optimize them. This work shows that, for stencil computation, a novel algorithm leveraging a localized communication strategy effectively exploits the second-generation Cerebras Wafer-Scale Engine (WSE-2), which has no cache hierarchy. This study focuses on a 25-point stencil finite-difference method for the 3D wave equation, a kernel frequently used in numerical simulations for earth modeling. In essence, the algorithm trades memory accesses for data communication and takes advantage of the fast communication fabric provided by the architecture. The algorithm, historically memory-bound, becomes compute-bound. This allows the implementation to achieve near-perfect weak scaling, reaching up to 503 TFLOPs on a single WSE-2, a figure that only full clusters could otherwise yield.
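A 25-point 3-D stencil is too large for a short sketch, but the same time-stepping idea can be shown for the 1-D wave equation u_tt = c^2 u_xx with a 3-point stencil; each point is updated from its immediate neighbors only, which is the locality that maps onto the WSE-2's communication fabric. The discretization below is a standard textbook leapfrog scheme, not the authors' code:

```python
import math

# Leapfrog scheme for the 1-D wave equation with a 3-point stencil and
# periodic boundaries. Each update reads only nearest neighbors, the
# locality that suits a mesh of tiny cores with fast local links.

n, c = 64, 1.0
dx, dt = 1.0 / n, 0.5 / n                     # CFL number c*dt/dx = 0.5
u_prev = [math.sin(2 * math.pi * i * dx) for i in range(n)]
u = u_prev[:]                                 # zero-velocity start: u(-dt) = u(0)
r2 = (c * dt / dx) ** 2

for _ in range(100):
    u_next = [0.0] * n
    for i in range(n):
        left, right = u[(i - 1) % n], u[(i + 1) % n]
        u_next[i] = 2 * u[i] - u_prev[i] + r2 * (left - 2 * u[i] + right)
    u_prev, u = u, u_next

amp = max(abs(v) for v in u)                  # stays bounded for CFL <= 1
```

With the CFL number at 0.5, the scheme is stable and the standing wave's amplitude remains bounded by its initial value.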
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionTargeting viscoelastic crustal deformation analysis, we develop a scalable unstructured implicit finite-element solver accelerated by a data-driven method. Here, we combine a data-driven predictor, which uses past time-step data to estimate high-accuracy initial solutions, with a multi-grid-based conjugate gradient solver for efficiently solving the remaining errors. Compared to using a standard initial-solution predictor on a block Jacobi-preconditioned conjugate gradient solver, a 3.19-fold speedup was attained by using the data-driven predictor, and the combination with a multi-grid solver attained a total speedup of 76.8-fold on Fugaku. Furthermore, as the computation of the data-driven predictor is localized and can be conducted without communication between compute nodes, the solver attained a high weak-scaling efficiency of 78.5% up to 73,728 compute nodes of Fugaku, leading to 6.88% of FP64 peak efficiency for the whole application. Such development is also expected to be useful for accelerating other PDE-based time-evolution problems.
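The benefit of a data-driven initial guess can be illustrated with a toy stand-in for the predictor (linear extrapolation of the two previous solutions, with a Jacobi loop instead of the paper's multi-grid conjugate gradient):

```python
# For a sequence of slowly varying systems A x = b(t), extrapolating past
# solutions gives a near-exact initial guess, so the iterative solver needs
# far fewer iterations than starting from zero.

def jacobi(A, b, x0, tol=1e-10, max_iter=10000):
    n = len(b)
    x = x0[:]
    for it in range(max_iter):
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, it + 1
        x = x_new
    return x, max_iter

A = [[4.0, 1.0, 0.0],                 # diagonally dominant model problem
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]

def b(t):                             # right-hand side drifting linearly in time
    return [1.0 + 0.1 * t, 2.0 + 0.1 * t, 3.0 - 0.05 * t]

history, iters_pred, iters_zero = [], 0, 0
for t in range(10):
    if len(history) >= 2:             # data-driven predictor: extrapolate
        guess = [2 * history[-1][i] - history[-2][i] for i in range(3)]
    else:
        guess = [0.0, 0.0, 0.0]
    x, k = jacobi(A, b(t), guess)
    iters_pred += k
    _, k0 = jacobi(A, b(t), [0.0] * 3)   # baseline: standard zero guess
    iters_zero += k0
    history.append(x)
```

Since b(t) is linear in t, the extrapolated guess is nearly exact and the predictor path uses a small fraction of the baseline's iterations.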
Posters
Research Posters
TP
XO/EX
DescriptionWe present a strategy for GPU acceleration of a multiphase compressible flow solver that brings us closer to exascale computing. Given the memory-bound nature of most CFD problems, one must be prudent in implementing algorithms and offloading work to accelerators for efficient use of resources. Through careful choice of OpenACC decorations, we achieve 46% of peak GPU FLOPS on the most expensive kernel, leading to a 500-times speedup on an NVIDIA A100 compared to a single modern Intel CPU core. The implementation also demonstrates ideal weak scaling for up to 13,824 GPUs on OLCF Summit. Strong-scaling behavior is typical but is improved by reducing communication times via CUDA-aware MPI.
Workshop
Recorded
W
DescriptionIntegration of machine learning with simulation is part of a growing trend; however, augmenting codes in a highly performant, distributed manner poses a software development challenge. In this work, we explore the question of how to easily augment legacy simulation codes on high-performance computers (HPCs) with machine-learned surrogate models in a fast, scalable manner. Initial naive augmentation attempts required significant code modification and resulted in significant slowdown. This led us to explore inference-server techniques, which allow for model calls through drop-in functions. In this work, we investigated TensorFlow Serving with gRPC and RedisAI with SmartRedis for server-client implementations, where the deep learning platform runs as a persistent process on HPC compute-node GPUs and the simulation makes client calls while running on CPUs. We evaluated inference performance for real-gas equations of state, machine-learned boundary conditions for rotorcraft aerodynamics, and super-resolution techniques on a POWER9 supercomputer.
Paper
Recorded
Quantum Computing
Resource Management and Scheduling
System Software
TP
DescriptionWe present Atos, a dynamic scheduling framework for multi-node GPU systems that supports PGAS-style lightweight one-sided memory operations within and between nodes.
Atos's lightweight GPU-to-GPU communication enables latency hiding and can smooth interconnect usage for bisection-limited problems. These benefits are significant for dynamic, irregular applications that often involve fine-grained communication at unpredictable times and without predetermined patterns. We follow three principles for high performance: (1) do not involve the CPU in the communication control path; (2) allow GPU communication within kernels, addressing memory consistency directly rather than relying on synchronization with the CPU; and (3) perform dynamic communication aggregation when interconnects have limited bandwidth. By lowering the overhead of communication and allowing it within GPU kernels, we support large, high-utilization GPU kernels but with more frequent communication. We evaluate Atos on two irregular problems: breadth-first search and PageRank. Atos outperforms the state-of-the-art graph libraries Gunrock, Groute, and Galois in both single-node multi-GPU and multi-node GPU settings.
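Dynamic communication aggregation can be sketched as a per-destination buffer (our own toy queue, not Atos's runtime):

```python
# Fine-grained messages are buffered per destination and flushed as one
# batch, trading a little latency for far fewer messages on a
# bandwidth-limited interconnect.

class Aggregator:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffers = {}                     # destination -> pending payloads
        self.sent = []                        # (destination, batch) on the wire

    def push(self, dest, payload):
        buf = self.buffers.setdefault(dest, [])
        buf.append(payload)
        if len(buf) >= self.batch_size:
            self.flush(dest)

    def flush(self, dest):
        if self.buffers.get(dest):
            self.sent.append((dest, self.buffers.pop(dest)))

agg = Aggregator(batch_size=4)
for i in range(10):
    agg.push(i % 2, i)                        # 10 tiny messages, 2 destinations
for dest in list(agg.buffers):
    agg.flush(dest)                           # drain the remainders
```

Ten fine-grained sends become four batched messages while delivering every payload.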
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
DescriptionHSS and H^2-matrices are hierarchical low-rank matrix formats that can reduce the complexity of factorizing dense matrices from O(N^3) to O(N). For HSS matrices, it is possible to remove the dependency on the diagonal blocks during Cholesky/LU factorization, which results in a highly parallel algorithm. However, the weak admissibility of HSS limits its applicability to simple problems in 1-D and 2-D geometries. On the other hand, the strong admissibility of H^2-matrices allows them to handle actual 3-D problems, but introduces a dependency on the diagonal blocks during factorization. In the present work, we propose decoupling the low-rank basis and the Schur complement basis in H^2-matrices, which allows us to remove the dependency on the diagonal blocks. This results in a highly parallel H^2-matrix factorization. We compare with other scalable approximate dense matrix factorization codes such as Lorapo.
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
Paper
Recorded
Accelerator-based Architectures
Bioinformatics
File Systems and I/O
TP
DescriptionCorrelated electronic structure calculations enable accurate prediction of the physicochemical properties of complex molecular systems; however, the scale of these calculations is limited by their extremely high computational cost. The Fragment Molecular Orbital (FMO) method is arguably one of the most effective ways to lower this computational cost while retaining predictive accuracy. In this paper, a novel distributed many-GPU algorithm and implementation of the FMO method are presented. When applied in tandem with the Hartree-Fock and RI-MP2 methods, the new implementation enables correlated calculations on 623,016 electrons and 146,592 atoms in less than 45 minutes using 99.8% of the Summit supercomputer (27,600 GPUs). The implementation demonstrates remarkable speedups with respect to other current GPU and CPU codes, and excellent strong scalability on Summit, achieving 94.6% parallel efficiency on 4,600 nodes. This work makes correlated quantum chemistry calculations feasible on significantly larger molecular systems than before, and with higher accuracy.
Paper
Recorded
Networks
Performance
Visualization
TP
DescriptionThe SSSP kernel was introduced into the Graph 500 benchmark in 2017. However, there has been no result from a full-scale run on a world-leading supercomputer, primarily because existing algorithms are work-inefficient at large scales.
We propose an SSSP implementation for extreme-scale machines, including an SSSP algorithm that achieves work efficiency, along with an adaptive dense/sparse-mode selection approach that achieves communication efficiency. Our implementation reaches 7,638 GTEPS with 103,158 processors (over 40 million cores), achieving 3.7x the performance on a 512x larger graph compared with the current leader on the Graph 500 SSSP list. Based on our experience of running extreme-scale SSSP, we uncover the root cause of its poor scalability: the weight distribution allows edges with weights close to zero, making the SSSP tree deeper on larger graphs. We further explore a scalability-friendly weight distribution by setting a non-zero lower bound on the edge weights.
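The effect of a non-zero lower bound on edge weights can be illustrated in miniature (this toy graph and plain Dijkstra are our own illustration, not the paper's implementation): near-zero weights let a very long chain of edges undercut a direct edge, producing a deep SSSP tree, while a modest lower bound keeps the tree shallow.

```python
import heapq

def dijkstra(n, adj, src=0):
    """Plain Dijkstra returning distances and parent pointers (the SSSP tree)."""
    dist = [float("inf")] * n
    parent = [-1] * n
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(pq, (d + w, v))
    return dist, parent

def tree_depth(parent, v):
    """Number of hops from v back to the root along parent pointers."""
    depth = 0
    while parent[v] != -1:
        v = parent[v]
        depth += 1
    return depth

def chain_graph(n, chain_w):
    # Node 0 reaches node n-1 either via a direct edge of weight 1.0
    # or via a chain of n-1 edges, each of weight chain_w.
    adj = [[] for _ in range(n)]
    for u in range(n - 1):
        adj[u].append((u + 1, chain_w))
    adj[0].append((n - 1, 1.0))
    return adj

n = 1000
# Near-zero weights: the 999-hop chain (total 0.999) beats the direct edge,
# so the SSSP tree is deep.
_, p_deep = dijkstra(n, chain_graph(n, 0.001))
# With a non-zero lower bound (0.01), the chain costs 9.99 > 1.0, so the
# direct edge wins and the tree stays shallow.
_, p_shallow = dijkstra(n, chain_graph(n, 0.01))
print(tree_depth(p_deep, n - 1), tree_depth(p_shallow, n - 1))  # 999 1
```

A deep tree forces many more propagation rounds in a level-synchronized distributed SSSP, which is exactly the scalability problem the abstract describes.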
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionWe have seen an increase in the need for large-scale training of ML models as more and more startups and established companies seek to gain an edge with increasingly large and powerful models. These models require hundreds or thousands of GPUs for extended periods of time. Performance is crucial both at the level of individual GPUs and in scaling efficiently across the network. A well-known example is Aleph Alpha, whose five-language GPT-3-like model has over 300 billion machine-learning parameters and even offers visual understanding in full multimodality, significantly extending the range of established possibilities. Scaling these large training models can be very complex and difficult to tune, requiring a cost-effective infrastructure with NVIDIA A100 GPUs and high-throughput, ultra-low-latency RDMA that can provide availability, resiliency, and performance at scale.
In this talk we will discuss our approach to support the needs of these large-scale ML models for training and inference on Oracle Cloud and showcase the full-stack foundation of the transformative business potential in this industrial revolution. We will show examples of use from various companies and will discuss the challenges that were addressed to run these models at such scale in a modern enterprise architecture. We will finish the presentation with a discussion of some of the open research problems that still need to be addressed in this area.
Workshop
Recorded
W
DescriptionContainers have provided a popular new paradigm for managing software and services. However, in HPC the use of containers has historically been more difficult due to multi-tenancy, security, and performance requirements; consequently several custom HPC container runtimes have emerged from the community. The resulting fractured ecosystem presents challenges both for HPC container framework maintainers and for users. In this paper, we describe work at NERSC to adapt Podman, a popular OCI-compliant container framework developed by Red Hat, Inc., for use in HPC. Podman has several key features which make it appealing for use in an HPC environment: its rootless container mode addresses many security concerns, it has a standardized command interface which will be familiar to users of established popular container runtimes, it is daemon-less, and it is open-source and community supported. Additional innovations at NERSC have enabled Podman to achieve the good scaling behavior required by HPC applications.
Birds of a Feather
TP
XO/EX
DescriptionThis session will explore the state of the art in scientific code coupling, paying particular attention to enabling software currently in development. Consideration will be paid to the state of existing libraries in the context of exascale computing, mathematical rigour, and corresponding workflows, with highlighted examples of applications drawn from several scientific areas. We will look at whether the current trajectory of coupling technologies is the right one and, if so, what we can do to improve core performance, portability, and applicability to enable massive, coupled simulations on supercomputers.
Tutorial
Recorded
Applications
Cloud and Distributed Computing
Computational Science
Containers
Emerging Technologies
Productivity Tools
Resource Management and Scheduling
Workflows
TUT
DescriptionKubernetes has become the leading container orchestration solution over the past few years. It can work anywhere from on-prem clusters to commercial clouds, abstracting both the computational resources and the workloads that run on them.
The main compute paradigm for large-scale distributed computing has long been the batch system. Kubernetes doesn't directly present a traditional batch interface, but the concepts are similar enough to allow easy porting of existing batch-focused workloads. Kubernetes additionally provides significantly richer semantics, including explicit storage and network provisioning, that allow for compute workloads previously not feasible on traditional batch systems.
In this tutorial, you will learn how to run your software in Kubernetes clusters. The program includes both a Kubernetes architectural overview and an overview of job and workflow submission procedures. Theoretical information is paired with hands-on sessions operating on the PRP production Kubernetes cluster, with federation exercises accessing the SDSC Expanse system.
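The batch-to-Kubernetes mapping the tutorial describes can be sketched as a Job manifest: the fields that a batch script expresses as directives (job name, resources, retry count) map onto `batch/v1` Job fields. The helper below, and all concrete values in it, are illustrative, not part of the tutorial's materials.

```python
import json

def batch_to_job(name, image, command, cpus, memory_gb, retries=3):
    """Map familiar batch-script fields onto a Kubernetes batch/v1 Job
    manifest. All concrete values here are illustrative."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": retries,            # ~ batch requeue limit
            "template": {
                "spec": {
                    "restartPolicy": "Never",   # let the Job control retries
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": command,
                        "resources": {          # ~ batch resource request
                            "requests": {
                                "cpu": str(cpus),
                                "memory": f"{memory_gb}Gi",
                            },
                        },
                    }],
                },
            },
        },
    }

job = batch_to_job("pi-estimate", "python:3.11",
                   ["python", "-c", "print(355/113)"], cpus=2, memory_gb=4)
print(json.dumps(job, indent=2))
```

Submitting such a manifest (e.g. via `kubectl apply -f`) queues the pod until resources are available, much as a batch scheduler queues a job.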
Birds of a Feather
TP
XO/EX
DescriptionDigital Twins, virtual representations of objects providing actionable information in actionable time by combining sensor data with surrogate models, have a long, successful history in industry. The recent shift in HPC combining simulation, AI and edge computing is not only an opportunity to apply Digital Twin technology in science, but also to apply massive compute power for digital twins. The goal of this BoF is to bring together digital twin practitioners, computational scientists, middleware developers and HPC resource providers to identify opportunities and challenges in building Digital Twins for science and discuss the impact of HPC in this space.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionThere is a tremendous amount of excitement around the potential that quantum computers offer to the way that we approach hard problems presently unsolvable by today’s most sophisticated supercomputers. Breakthroughs in materials design, medical research and other advancements will provide major societal benefits. At the same time, the arrival of quantum computers will threaten current cryptographic systems and will require a major overhaul of cryptographic tools we depend on today. Optical networks are not immune to the threat posed by an increasing number of shady players. Optical link encryption has been used for years and in many networks. But the encryption engine alone is only part of transport security.
To provide communication security in optical (and other) networks, we rely on cryptographic algorithms as the building blocks in our secure protocols. The impact from quantum computers varies; some primitives will be weakened, and some will be completely broken. The research and standards communities are working tirelessly to define replacement candidates and standardize them, so that the great migration can begin. In the meantime, some networks need our attention now because they are vulnerable to a harvest and decrypt attack, where secure communications sessions can be harvested today and then decrypted later when a practical quantum computer is available. There are ways to mitigate this threat today. We will explore all options available to us today and in the future.
Posters
Research Posters
TP
XO/EX
DescriptionWith modern technology and high-performance computing (HPC), molecular dynamics (MD) simulations can be task- and data-parallel: they can be decomposed into multiple independent tasks (i.e., trajectories) with their own data, which can be processed in parallel. Analysis of MD simulations includes finding specific molecular events and the conformational changes that a protein undergoes. However, traditional analysis relies on the global decomposition of all trajectories for a specific molecular system, which can be performed only in a centralized way. We propose a lightweight self-supervised machine learning technique to analyze MD simulations in situ; that is, we aim to speed up finding molecular events in the protein trajectory at run time, without having to wait for the entire simulation to finish. This allows us to scale the analysis with the simulation.
Posters
Research Posters
TP
XO/EX
DescriptionBy detecting different animal species reliably at scale, we can protect biodiversity. Yet, traditionally, biodiversity data has been collected by expert observers, which is prohibitively expensive and neither reliable nor scalable. Automated species detection via machine learning is promising, but it is constrained by the need for large training data sets labeled entirely by human experts. Here, we propose to use self-supervised learning to learn semantic features from passively collected acoustic data. We used a joint-embedding configuration to extract features from spectrograms, processing roughly 190 hours of audio recordings. To process these volumes of data, we used an HPC cluster provided by the Argonne Leadership Computing Facility. We analyzed the output space of a trained backbone, which highlights important semantic attributes of the spectrograms. We regard these preliminary results as a compelling step toward automated assistance for biologists, serving as a pre-processing stage for labeling very large data sets.
Paper
Recorded
File Systems and I/O
Storage
TP
DescriptionDistributed locks are used to guarantee distributed client-cache coherence in parallel file systems. However, they lead to poor performance for parallel writes under high-contention workloads. We analyze the distributed lock manager and find that lock conflict resolution is the root cause of the poor performance: it involves frequent lock revocations and slow data flushing from client caches to data servers. We design a distributed lock manager named SeqDLM by exploiting the sequencer mechanism. SeqDLM mitigates lock conflict resolution overhead using early grant and early revocation while keeping the same semantics as traditional distributed locks. To evaluate SeqDLM, we implemented a parallel file system called ccPFS with both SeqDLM and traditional distributed locks. Evaluations on 96 nodes show that SeqDLM outperforms traditional distributed locks by up to 10.3x for high-contention parallel writes on a shared file with multiple stripes.
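The general sequencer idea, independent of SeqDLM's actual protocol, can be illustrated as a ticket discipline: clients draw globally ordered tickets and enter the critical section strictly in ticket order, so ordering is decided up front rather than through repeated revocation rounds. The sketch below is a single-process toy with threads standing in for clients.

```python
import itertools
import threading

class Sequencer:
    """Toy sequencer: hands out globally ordered tickets; holders proceed
    strictly in ticket order. Illustration only, not SeqDLM's protocol."""
    def __init__(self):
        self._tickets = itertools.count()
        self._serving = 0
        self._cond = threading.Condition()

    def acquire(self):
        with self._cond:
            ticket = next(self._tickets)      # globally ordered ticket
            while self._serving != ticket:    # wait for our turn
                self._cond.wait()
        return ticket

    def release(self):
        with self._cond:
            self._serving += 1                # advance to the next ticket
            self._cond.notify_all()

seq = Sequencer()
order = []

def writer(i):
    seq.acquire()
    order.append(i)   # the "write" happens in ticket order, one at a time
    seq.release()

threads = [threading.Thread(target=writer, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(order) == list(range(8)))  # True: every writer ran exactly once
```

Because each writer knows its position in the global order at acquire time, no lock ever needs to be revoked from a peer mid-flight, which is the kind of overhead the paper attributes to conventional conflict resolution.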
Workshop
Recorded
Runtime Systems
System Software
W
DescriptionThe sequential task flow (STF) model is the mainstream approach for interacting with task-based runtime systems. Compared with other approaches to submitting tasks into a runtime system, STF has interesting advantages centered around an easy-to-use API that allows users to express algorithms as a sequence of tasks, while allowing the runtime to automatically identify task dependencies and handle scheduling.
We focus on the DTD interface in PaRSEC, highlight some of its lesser-known limitations, and implement two optimization techniques for DTD: support for user-level graph trimming, and a new API for broadcasting read-only data to remote tasks. We then analyze the benefits and limitations of these optimizations with benchmarks as well as with Cholesky and QR matrix factorizations, on two different systems, Shaheen-II and Fugaku. We point out some potential for further improvement and provide valuable insights into the strengths and weaknesses of the STF model.
Paper
Recorded
Networks
Performance
Visualization
TP
DescriptionInline and in transit visualization are popular in situ visualization models for high performance computing (HPC) applications. Inline visualization is invoked through a library call on the HPC application (simulation), while in transit methods invoke a visualization module running on in transit resources. In transit methods can offer better efficiency than inline by running the visualization at a lower concurrency level than the simulation. State-of-the-art in transit schemes are limited to employing a dedicated in transit resource for every simulation. The resulting idle time on the in transit resource can severely limit the cost savings over inline methods. This research proposes SERVIZ, an in transit visualization service that can be shared amongst multiple simulations to reduce idle time, thereby efficiently using in transit resources. SERVIZ achieves cost savings of up to 26% over inline and up to 4x reduction in idle time compared to a dedicated in transit implementation.
Student Cluster Competition
TP
XO/EX
DescriptionAlthough none of the team members has participated in a supercomputing competition before, we all have rich HPC experience and knowledge of computer science. Four of the team members have participated in a national education project on distributed training on GPU clusters and conducted research on LAMMPS in cooperation with the Shenzhen research institution of Peking University. The two new members are now writing a paper about pipeline training optimization under heterogeneous GPU clusters and expect to publish it soon. The members were rigorously selected through the intramural competition of Southeast University: only the top six participants qualified for the SC22 team. The intramural competition of Southeast University uses HPL, HPCG, IO500, and the DeepMD-kit as benchmarks, so the team members have done in-depth research on HPC and have some innovative ideas about optimizing SC's benchmarks. The team has created an open-source supercomputing learning platform (CSWU-Challenge.github.io) for college students from scratch; many students interested in supercomputing are learning on our website. Additionally, to share HPC resources conveniently, we also created a series of cloud services based on our skills. Students of Southeast University can upload and download resources via the service.
When it comes to interdisciplinarity, only one of our team members majors in computer science; the others major in physics, chemistry, biology, and artificial intelligence. One thing all of the team members have in common is high-performance computing research training in their own domains, which sparked their interest in supercomputing.
Despite the broad backgrounds our team members have, only the members who major in chemistry have rich practical experience with DeepMD. All the members will go through club training for SCC.
Specific reasons why HPC will help team members in their academic careers are as follows: 1. Our team members either plan to enter the computer science area or are working on projects about supercomputing. Participation in SC will significantly help us accumulate related knowledge and experience, laying the foundation for our future careers. 2. SC has a significant influence on the world. The competition experience will bring us valuable credentials and add a nice touch to our resumes when applying to Ph.D. programs.
Our advisor, Jinghui Zhang, is the director of the High-Resolution Remote Sensing Data Research Center of Southeast University and the Secretary of the International Steering Committee of the IEEE CSCWD Conference. As a representative of Southeast University participating in the AMS-02 experiment, he worked at the European Organization for Nuclear Research (CERN) in Switzerland, directly participated in the data processing of the AMS experiment for three and a half years, and participated in the establishment of the Southeast University AMS Supercomputing Center as a core member. He has published more than 20 papers in well-known international journals and conferences. As project leader, he has presided over two projects of the National Natural Science Foundation of China and one subproject of the National Science and Technology Innovation 2030 "New Generation Artificial Intelligence" major project.
Paper
Recorded
Cloud and Distributed Computing
TP
Best Student Paper Finalists
DescriptionServerless computing enables a new way of building and scaling cloud applications by allowing developers to write fine-grained cloud functions. However, under resource contention, function execution duration may be prolonged and fail to accurately account for the true resource usage. Our experiments show that the OS scheduling policy of servers can have a crucial impact on performance. The default Linux scheduler, CFS, frequently context-switches short functions, causing a much longer turnaround time.
We propose SFS (Smarter Function Scheduler), which works entirely in user space and carefully orchestrates the existing Linux FIFO and CFS (Completely Fair Scheduler) schedulers to approximate Shortest Remaining Time First (SRTF). SFS seamlessly combines a new FILTER policy with Linux CFS, trading a slightly increased duration for long functions for significant performance improvements for short functions. Evaluation results show that SFS significantly improves short functions' duration with a small impact on relatively longer functions, compared to CFS.
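Why approximating SRTF helps short functions can be seen with a toy simulation (our own illustration, not SFS itself): under fair round-robin sharing, short jobs are repeatedly preempted by long ones and their turnaround time balloons, while SRTF lets them finish almost immediately.

```python
def fair_share(jobs, quantum=1):
    """Round-robin with a small quantum (a crude stand-in for fair
    scheduling): every runnable job advances in turn."""
    remaining = dict(enumerate(jobs))
    t, finish = 0, {}
    while remaining:
        for j in list(remaining):
            run = min(quantum, remaining[j])
            t += run
            remaining[j] -= run
            if remaining[j] == 0:
                finish[j] = t
                del remaining[j]
    return finish

def srtf(jobs):
    """Shortest Remaining Time First: with all jobs arriving at t=0 and
    known lengths, this is just shortest-job-first, run to completion."""
    t, finish = 0, {}
    for j in sorted(range(len(jobs)), key=lambda j: jobs[j]):
        t += jobs[j]
        finish[j] = t
    return finish

# Four long functions (20 units) and four short ones (2 units), all at t=0.
jobs = [20, 20, 20, 20, 2, 2, 2, 2]
fair = fair_share(jobs)
best = srtf(jobs)
short_ids = range(4, 8)
avg_fair = sum(fair[j] for j in short_ids) / 4
avg_srtf = sum(best[j] for j in short_ids) / 4
print(avg_fair, avg_srtf)  # 14.5 5.0
```

The short functions' average turnaround drops from 14.5 to 5.0 time units, while the long functions pay only a modest delay, which mirrors the trade-off the paper reports.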
Birds of a Feather
SIGHPC
TP
XO/EX
DescriptionThe annual business meeting of SIGHPC is your opportunity to hear about and discuss the status of SIGHPC and its chapters. We will also be discussing upcoming plans for the year. All of the elected officers and many of the other volunteers will be present to answer your questions about SIGHPC. Representatives from our chapters will also be available.
Workshop
Recorded
W
DescriptionThis assignment provides students with hands-on experience with engineering software for performance. Students are tasked with optimizing a graphical n-body simulation to run fast on a modern shared-memory multi-core using serial and parallel program optimizations. This open-ended assignment invites students to develop and test optimizations to make the program run as fast as possible while still producing the same results. Students learn how to diagnose performance bottlenecks, develop serial and parallel program optimizations, and evaluate the correctness and performance of their changes. We found that students in 6.172, MIT's undergraduate course on performance engineering of software systems, are excited by this project, especially seeing their optimizations improve the visual smoothness of the simulation and enable it to handle larger problem sizes within fixed time constraints. The materials for the assignment are publicly available at https://github.com/ailiop/EduHPC-22-Peachy-Sphere-Simulation.
Panel
Recorded
AI-HPC Convergence
Data Management
TP
XO/EX
DescriptionAI is driving heterogeneous compute, with specific workflows needing radically different hardware configurations. Composable infrastructure purports to eliminate the restrictions imposed by traditional static architectures by allowing hardware resources to be dynamically assigned, rather than being tied to physical servers. The promise is higher efficiency of high cost components, and the building of otherwise “impossible servers”. But can it scale, and is the added cost really recouped through increased flexibility and utilization?
In this fast-paced, lively, and highly interactive format, two HPC industry analysts will lead “Pro” and “Con” teams, with the audience deciding who wins. A typical debate format will feature opening statements, rebuttals, and tricky questions aimed at tripping up the other team. We will also include plenty of time for audience questions and comments. This will be an informative, compelling and fun event, with a clear outcome that will be issued in press release format during SC22.
Birds of a Feather
TP
XO/EX
DescriptionThe number and diversity of intelligent network devices have recently exploded. In particular, a variety of network interface cards (NICs) and data processing units (DPUs) that incorporate computational resources have recently become widely available. Examples of these new devices include Nvidia's BlueField DPUs, Xilinx's SmartNICs, and the Fungible DPU. The proliferation of these new devices has raised a number of questions regarding how best to exploit them to accelerate HPC workloads, including scientific simulations. This BoF will provide the community with an important opportunity to gather and share ideas about these promising new devices.
Workshop
Recorded
Correctness
Software Engineering
W
DescriptionAnnouncement of the software correctness competition.
Birds of a Feather
TP
XO/EX
DescriptionSoftware engineering (SWE) for modeling, simulation, and data analytics for computational science and engineering (CSE) is challenging, with ever-more sophisticated, higher fidelity simulations of ever-larger, more complex problems involving larger data volumes, more domains, and more researchers. Targeting both commodity and high-end computers multiplies these challenges. We invest significantly in creating these codes, but rarely talk about that experience; we just focus on the results. We seek to raise awareness of SWE for CSE, and provide an opportunity for discussion and community building. Presentations and discussion notes will be made available through the BoF series website, http://bit.ly/swe-cse-bof.
Workshop
Recorded
W
DescriptionAggregated HPC resources have rigid allocation systems and programming models and struggle to adapt to diverse and changing workloads. Thus, HPC systems fail to efficiently use the large pools of unused memory and to increase the utilization of idle computing resources. Prior work attempted to increase the throughput and efficiency of supercomputing systems through workload co-location and resource disaggregation. However, these methods fail to provide a solution that can be applied to existing systems without major hardware modifications and performance losses. In this presentation, we propose using the new cloud paradigm of serverless computing to improve the utilization of supercomputers. We show that the FaaS programming model can satisfy the requirements of high-performance applications. We demonstrate a software resource disaggregation approach where the co-location of functions allows idle cores and accelerators to be utilized while retaining near-native performance.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Recorded
TP
DescriptionAggregated HPC resources have rigid allocation systems and programming models and struggle to adapt to diverse and changing workloads. Thus, HPC systems fail to efficiently use the large pools of unused memory and increase the utilization of idle computing resources. Prior work attempted to increase the throughput and efficiency of supercomputing systems through workload co-location and resource disaggregation. However, these methods fall short of providing a solution that can be applied to existing systems without major hardware modifications and performance losses.
In this project, we use the new cloud paradigm of serverless computing to improve the utilization of supercomputers. We show that the FaaS programming model satisfies the requirements of high-performance applications and how idle memory helps resolve cold startup issues. We demonstrate a software resource disaggregation approach where the co-location of functions allows idle cores and accelerators to be utilized while retaining near-native performance.
Posters
Research Posters
TP
XO/EX
DescriptionThe SOLLVE V&V suite tests new OpenMP features and visualizes their support across compilers and systems. Systems include Oak Ridge National Laboratory's (ORNL) Summit and Crusher systems, as well as the National Energy Research Scientific Computing Center's (NERSC) Perlmutter system.
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
DescriptionWe are interested in solving linear systems arising from three applications: (1) kernel methods in machine learning, (2) discretization of boundary integral equations from mathematical physics, and (3) Schur complements formed in the factorization of many large sparse matrices. The coefficient matrices are often data-sparse in the sense that their off-diagonal blocks have low numerical ranks; specifically, we focus on "hierarchically off-diagonal low-rank (HODLR)" matrices. We introduce algorithms for factorizing HODLR matrices and for applying the factorizations on a GPU. The algorithms leverage the efficiency of batched dense linear algebra, and they scale nearly linearly with the matrix size when the numerical ranks are fixed. The accuracy of the HODLR-matrix approximation is a tunable parameter, so we can construct high-accuracy fast direct solvers or low-accuracy robust preconditioners. Numerical results show that we can solve problems with several million unknowns in a couple of seconds on a single GPU.
Birds of a Feather
TP
XO/EX
DescriptionSpack is a package manager for scientific computing, with a rapidly growing open-source community. Spack has over 1,000 contributors from academia, industry, and laboratories across the world, and is used to manage software releases for the U.S. Exascale Computing Project. At this BoF, Spack developers will give updates on the community, new features, and the roadmap for future development. We will poll the audience to gather valuable information on how Spack is being used, and will open the floor for questions. All are invited to provide feedback, request features, and discuss future directions. Help us make installing HPC software simple!
Workshop
Recorded
Security
W
DescriptionSecurity models for Linux distro package security and interoperability have traditionally emphasized the use of more recent (more secure) versions at the occasional expense of execution reproducibility. A complementary approach (e.g., Lmod) allows access to multiple sysadmin-approved package versions. Another approach (e.g., Spack) enables a pure user space approach for package selection without system administrator oversight. While maximizing reproducibility, there is no user feedback regarding potential security vulnerabilities. We introduce a general security model for package management and our implementation of SpackNVD, a security auditing tool for Spack. Users may query reported vulnerabilities for specific package versions and can prevent installation where the severity score exceeds a threshold. We emphasize this is a tool, not a solution: Spack users are not expected to be security professionals. However, this information may influence Spack concretizer decisions, and enable users to ask support staff about whether specific package versions are appropriate for use.
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
DescriptionDecomposing sparse matrices into lower and upper triangular matrices (sparse LU factorization) is a key operation in many computational scientific applications. We developed SparseLU, a sparse linear algebra library that implements a new algorithm for LU factorization of general sparse matrices. The new algorithm divides the input matrix into tiles and creates OpenMP tasks for the factorization computation, where only tiles that contain nonzero elements are computed. For comparative performance analysis, we used the reference library SuperLU. Testing was performed on synthetically generated matrices that replicate the conditions of real-world matrices. SparseLU reaches a mean speedup of ~29× compared to SuperLU.
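The tile-skipping idea can be sketched in plain Python. This is a sequential toy, not SparseLU's API: the names (`lu_inplace`, `tiled_lu`) and the unpivoted LU are our assumptions, and each tile operation below would become an OpenMP task in the real library.

```python
def lu_inplace(a):
    # Dense, unpivoted LU on a small square matrix (list of lists), in place:
    # the strictly-lower part holds L (unit diagonal implicit), the rest holds U.
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a

def is_zero(t):
    return all(v == 0 for row in t for v in row)

def trsm_left_lower_unit(l, b):
    # Solve L X = B with L unit lower triangular; overwrite B with X.
    for j in range(len(b[0])):
        for i in range(len(b)):
            for p in range(i):
                b[i][j] -= l[i][p] * b[p][j]

def trsm_right_upper(u, b):
    # Solve X U = B with U upper triangular; overwrite B with X.
    for r in range(len(b)):
        for j in range(len(u)):
            for p in range(j):
                b[r][j] -= b[r][p] * u[p][j]
            b[r][j] /= u[j][j]

def gemm_sub(c, a, b):
    # C -= A @ B on small dense tiles.
    for i in range(len(c)):
        for j in range(len(c[0])):
            c[i][j] -= sum(a[i][p] * b[p][j] for p in range(len(b)))

def tiled_lu(tiles):
    # Right-looking tiled LU: all-zero tiles are never touched, which is the
    # source of SparseLU's savings on sparse inputs.
    nt = len(tiles)
    skipped = 0
    for k in range(nt):
        lu_inplace(tiles[k][k])
        for i in range(k + 1, nt):
            if is_zero(tiles[i][k]):
                skipped += 1
            else:
                trsm_right_upper(tiles[k][k], tiles[i][k])
        for j in range(k + 1, nt):
            if is_zero(tiles[k][j]):
                skipped += 1
            else:
                trsm_left_lower_unit(tiles[k][k], tiles[k][j])
        for i in range(k + 1, nt):
            for j in range(k + 1, nt):
                if is_zero(tiles[i][k]) or is_zero(tiles[k][j]):
                    skipped += 1
                else:
                    gemm_sub(tiles[i][j], tiles[i][k], tiles[k][j])
    return skipped
```

Because unpivoted LU is unique, the tiled factorization agrees with the dense one, while skipping every operation on an all-zero tile.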
Paper
Recorded
Machine Learning and Artificial Intelligence
Software Engineering
State of the Practice
TP
DescriptionFederated learning (FL) facilitates training and deploying AI models on edge devices. Preserving user data privacy in FL introduces several challenges, including expensive communication costs, limited resources, and data heterogeneity. In this paper, we propose SPATL, an FL method that addresses these issues by: (a) introducing a salient parameter selection agent and communicating selected parameters only; and (b) splitting a model into a shared encoder and a local predictor, transferring knowledge to heterogeneous clients via the locally customized predictor. Additionally, we leverage a gradient control mechanism to further speed up model convergence and increase the robustness of the training process. Experiments demonstrate that SPATL reduces communication overhead, accelerates model inference, and enables stable training with better results compared to state-of-the-art methods. Our approach reduces communication cost by up to 86.45%, accelerates local inference by reducing up to 39.7% of FLOPs on VGG-11, and requires 7.4× less communication overhead when training ResNet-20.
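The "communicate selected parameters only" idea in (a) can be sketched as follows. This is a magnitude-based stand-in for SPATL's learned selection agent, with illustrative names, not the paper's implementation:

```python
def select_salient(update, k):
    # Stand-in for SPATL's selection agent: rank parameter updates by
    # magnitude and keep only the top-k (index, value) pairs to communicate.
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    return sorted((i, update[i]) for i in ranked[:k])

def apply_sparse_update(model, sparse):
    # Server side: apply only the communicated entries; everything else is
    # left untouched, which is where the communication savings come from.
    merged = list(model)
    for i, v in sparse:
        merged[i] += v
    return merged
```

Only `k` (index, value) pairs cross the network per round instead of the full parameter vector.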
Paper
Recorded
System Software
TP
DescriptionWe introduce SpDISTAL, a compiler for sparse tensor algebra that targets distributed systems. SpDISTAL combines separate descriptions of tensor algebra expressions, sparse data structures, data distribution, and computation distribution. Thus, it enables distributed execution of sparse tensor algebra expressions with a wide variety of sparse data structures and data distributions. SpDISTAL is implemented as a C++ library that targets a distributed task-based runtime system and can generate code for nodes with both multi-core CPUs and multiple GPUs. SpDISTAL generates distributed code that achieves performance competitive with hand-written distributed functions for specific sparse tensor algebra expressions and that outperforms general interpretation-based systems by one to two orders of magnitude.
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionAs most high-end computers adopt hybrid architectures, porting large-scale scientific codes onto accelerators is necessary. This paper presents a generic method for porting large-scale scientific code onto accelerators using compiler directives within a modularized function unit test platform. We have implemented the method and designed a software tool (SPEL) to port the E3SM Land Model (ELM) onto the GPUs of the Summit computer. SPEL automatically generates GPU-ready test modules for all ELM functions, such as CanopyFlux, SoilTemperature, and EcosystemDynamics. SPEL breaks the ELM into a collection of standalone unit test programs for easy code verification and further performance improvement. We further optimize several ELM test modules with advanced techniques, including memory reduction, DeepCopy, reconstructed parallel loops, and asynchronous GPU kernel launches. We hope our study will inspire new toolkit developments that expedite the porting of large-scale scientific codes with compiler directives.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionError-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. For ever-emerging heterogeneous high-performance computing (HPC) architectures, GPU-accelerated error-bounded compressors such as cuSZ have been developed. To improve the data quality and the compression ratio while maintaining high throughput, an interpolation-based spline method is introduced, inspired by the existing CPU prototype. In this work, we present (1) an efficient GPU implementation of the 3D interpolative spline prediction method, (2) a finer-grained data chunking approach using anchor points to leverage modern GPU architectures, (3) an in-depth analysis of how such anchor points affect the error formation and the compression ratio, and (4) preliminary performance results on state-of-the-art GPUs. Our solution achieves (1) a higher compression ratio than the previous default prediction method in cuSZ, and (2) data quality and compression ratio comparable overall to the CPU prototype.
Workshop
Spot-On: A Checkpointing Framework for Fault-Tolerant Long-Running Workloads on Cloud Spot Instances
Recorded
Reliability and Resiliency
W
DescriptionSpot instances offer a cost-effective solution for applications running in the cloud computing environment. However, it is challenging to run long-running jobs on spot instances because they are subject to unpredictable evictions. Here, we present Spot-on, a generic software framework that supports fault-tolerant long-running workloads on spot instances through checkpoint and restart. Spot-on leverages existing checkpointing packages and is compatible with the major cloud vendors. Using a genomics application as a test case, we demonstrated that Spot-on supports both application-specific and transparent checkpointing methods. Compared to running applications on on-demand instances, it allows these workloads to complete at a significantly reduced computing cost. Compared to running applications with application-specific checkpoint mechanisms, transparently checkpoint-protected applications take less time to complete, leading to further cost reductions.
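A minimal application-specific checkpoint/restart loop of the kind such a framework builds on might look like this (a pickle-based Python sketch with hypothetical names, not the Spot-on API):

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.mkdtemp(), "state.pkl")

def save_checkpoint(state):
    # Write to a temp file and rename, so an eviction mid-write never
    # corrupts the last good checkpoint.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0}  # fresh start

def run(n, interval=10, die_at=None):
    # A long-running job that checkpoints every `interval` steps; `die_at`
    # simulates a spot-instance eviction partway through the work.
    state = load_checkpoint()
    while state["i"] < n:
        if die_at is not None and state["i"] == die_at:
            raise RuntimeError("spot instance evicted")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(state)
    return state["total"]
```

After a simulated eviction, rerunning the job resumes from the last checkpoint instead of step zero, which is what makes spot instances viable for long-running work.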
Invited Talk
Recorded
TP
XO/EX
DescriptionQubits, or "quantum bits", have the potential to exceed the abilities of their classical counterparts for applications such as sensing, secure communication, simulation, and ultimately computing. These efforts are collectively known as Quantum Information Science (QIS). There are already many applications of such systems, for example in the development of atomic clocks used in modern GPS systems. The ultimate goal of QIS is the development of a universal quantum computer (QC), a device which can theoretically approximate any unitary operation on its constituent qubits. Such a computer could be used to solve specific problems exponentially faster than a traditional computing system. The set of such problems known to date is, however, quite limited. A primary example is Shor's factoring algorithm, discovered in 1994, which is credited with fueling interest in quantum computing research. Other applications, still under development, include quantum chemistry and quantum machine learning, which play important roles in recent private-sector activities. The core technologies, however, are still in a nascent stage of development, and it remains unclear which technological approach will prove most effective in the long term. This presentation will begin with a brief overview of the field of QIS, then focus on an update of the leading state-of-the-art QC technologies, and will finish with a deep dive into the status of quantum error correction, a long-term essential element for running quantum algorithms.
Workshop
Recorded
Correctness
Software Engineering
W
DescriptionCommunications are a critical part of HPC simulations and one of the main focuses of application developers when scaling on supercomputers. While classical message passing (also called two-sided communication) is the dominant communication paradigm, one-sided communications are often praised as efficient for overlapping communications with computations, but challenging to program. Their usage is generally abstracted through languages and memory abstractions to ease programming. Therefore, little work has been done to help programmers use intermediate runtime layers, such as MPI-RMA, which are often reserved for expert programmers. Indeed, programming with MPI-RMA presents several challenges that require handling the asynchronous nature of one-sided communications to ensure the proper semantics of the program while ensuring its memory consistency. This presentation proposes a new static analysis of MPI-RMA codes that shows the programmer the errors that can be detected at compile time.
Posters
Research Posters
TP
XO/EX
DescriptionPerformance data are collected to establish how well exascale applications are doing with executing their code or workflow as efficiently as possible. Chimbuko, a tool specifically focused on the analysis of performance data in real time, looks through these data and collects performance anomalies that are detected. These anomalies are saved into the Chimbuko Provenance Database, together with as much contextual information as needed. The goal of our work is to perform statistical analysis on the Chimbuko Provenance Database by presenting simple visualizations and determining whether the information collected for each anomaly is sufficient to conduct a causal analysis. Statistical methods such as Theil's U correlation analysis, logistic regression, and k-prototype clustering were used to identify associations between variables. Furthermore, feature selection was conducted with decision trees and random forests. We identified associations between call_stack and several variables, which reveals that call_stack is a very important feature of the dataset.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Recorded
TP
DescriptionIn the fields of science and engineering, lossy compression plays a growing role in running scientific simulations, as output data is on the scale of terabytes. Using error-bounded lossy compression reduces the amount of storage for each simulation; however, there is no known bound for the upper limit of lossy compressibility. Data correlation structures, compressors, and error bounds are factors allowing larger compression ratios and improved quality metrics. This provides one direction towards quantifying lossy compressibility. Our previous work explored 2D statistical methods to characterize the data correlation structures and their relationships, through functional models, to compression ratios and quality metrics for 2D scientific data. In this poster, we explore the extension of our statistical methods to 3D scientific data. The results were comparable to the 2D case. Our work is the next step towards evaluating the theoretical limits of lossy compressibility, used to predict compression performance and optimally adapt compressors.
Birds of a Feather
TP
XO/EX
DescriptionWe explore the possibilities of a hybrid system capable of solving both HPC and AI scientific problems. Such a hybrid architecture exploits the synergy between classical HPC platforms and dedicated AI chip systems, which is important due to the computational challenges brought to the fore by massively parallel exascale systems.
We discuss the system functionality, the algorithmic software adaptations, and performance considerations. We present efforts in supporting AI/ML applications in addition to seismic imaging, climate/weather prediction, and computational astronomy on hybrid systems. In particular, we investigate how Graphcore’s IPU can accelerate hybrid HPC applications, beyond the originally intended AI workloads.
Workshop
Recorded
W
DescriptionWe review some recent work on designing high-performance structured-mesh stencil accelerators on FPGAs, building on a performance model that helps guide the design space exploration. We then review an accelerator design for tridiagonal system solvers applied to complex applications. For both applications we compare against GPU implementations on an NVIDIA V100, and show significant energy benefits for the FPGA. Finally, we present some recent work on accelerating Sphere Decoding for Massive MIMO on FPGAs.
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
DescriptionGraph pattern matching is a fundamental task in many graph analytics and graph mining applications. As an NP-hard problem, it is often a performance bottleneck in these applications. Previous work has proposed to use GPUs to accelerate the computation. However, we find that the existing GPU solutions fail to show a performance advantage over the state-of-the-art CPU implementation, due to their subgraph-centric design. In this work, we propose a novel stack-based graph pattern matching system on GPU that avoids the synchronization and memory consumption issues of the previous subgraph-centric systems. We also propose a two-level work-stealing and a loop-unrolling technique to improve the inter-warp and intra-warp GPU resource utilization of our system. The experiments show that our system significantly advances the state-of-the-art for graph pattern matching on GPU.
Workshop
Recorded
Quantum Computing
W
DescriptionNoisy quantum simulation is challenging since one has to take into account the stochastic nature of the process. The dominating method for it is the density matrix approach. In this paper, we evaluate conditions for which this method is inferior to a substantially simpler way of simulation. Our approach uses stochastic ensembles of quantum circuits, where random Kraus operators are applied to original quantum gates to represent random errors for modeling quantum channels. We show that our stochastic simulation error is relatively low, even for large numbers of qubits. We implemented this approach as a part of the QTensor package. While usual density-matrix simulations on average hardware are challenging at n > 15, we show that for n < 30 it is possible to run embarrassingly parallel simulations with < 1% error. By using the tensor slicing technique, we can simulate up to 100-qubit QAOA circuits with high depth using supercomputers.
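For a single qubit and a bit-flip channel, the contrast between the density-matrix method and a stochastic ensemble of circuits can be sketched as follows (a plain-Python toy with illustrative names, not the QTensor implementation):

```python
import random

def outer(psi):
    # |psi><psi| as a 2x2 list of complex numbers.
    return [[psi[i] * psi[j].conjugate() for j in range(2)] for i in range(2)]

def bitflip_exact(rho, p):
    # Density-matrix (Kraus) form of the bit-flip channel:
    # rho -> (1-p) rho + p X rho X, where (X rho X)[i][j] = rho[1-i][1-j].
    x_rho_x = [[rho[1][1], rho[1][0]], [rho[0][1], rho[0][0]]]
    return [[(1 - p) * rho[i][j] + p * x_rho_x[i][j] for j in range(2)]
            for i in range(2)]

def bitflip_ensemble(psi, p, shots, seed=1):
    # Stochastic-ensemble alternative: per shot, randomly insert the error
    # gate X (with probability p) into the circuit, then average the
    # resulting pure-state projectors over the ensemble.
    rng = random.Random(seed)
    acc = [[0j, 0j], [0j, 0j]]
    for _ in range(shots):
        s = psi if rng.random() >= p else (psi[1], psi[0])  # X swaps amplitudes
        o = outer(s)
        for i in range(2):
            for j in range(2):
                acc[i][j] += o[i][j]
    return [[acc[i][j] / shots for j in range(2)] for i in range(2)]
```

Each ensemble member is just a pure-state circuit, so the memory cost per shot is that of a state vector rather than a density matrix, at the price of Monte Carlo sampling error.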
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionDeep neural networks (DNNs) with billion-scale parameters have demonstrated impressive performance in solving many tasks. Unfortunately, training a billion-scale DNN is out of the reach of many data scientists because it requires high-performance GPU servers that are too expensive to purchase and maintain. We present STRONGHOLD, a novel approach for enabling large DNN model training with no change to the user code. STRONGHOLD scales up the largest trainable model size by dynamically offloading data to the CPU RAM and enabling the use of secondary storage. It automatically determines the minimum amount of data to be kept in the GPU memory to minimize GPU memory usage. Compared to state-of-the-art offloading-based solutions, STRONGHOLD improves the trainable model size by 1.9x∼6.5x on a 32GB V100 GPU, with 1.2x∼3.7x improvement on the training throughput. It has been deployed into production to successfully support large-scale DNN training.
Students@SC
DescriptionThe two-part event is designed to be beginner friendly, and open to anyone who wants to learn more about HPC. The first session will take place virtually before SC and will provide lessons in foundational skills. The second session will take place in person at SC22 and will introduce common parallel and accelerated computing methods. You can attend one or both sessions.
To speed access to the training cluster, registration prior to 7th November 2022 is requested. The following link can be used to register for one or both sessions:
Register here for sessions: https://www.olcf.ornl.gov/sc22-students-hpc-crash-course/
Virtual Pre-Conference Event:
11/10/22
12:00 pm - 3:30 pm CDT
Zoom (provided at registration)
The pre-conference virtual Day 1 will be delivered via Zoom/Slack/Git. During Day 1, we will explain how and why HPC can be useful to you, and help you get set up with an "ssh" client that will allow you to log in to a remote UNIX environment. Then we will cover the foundational skills needed to participate in hands-on HPC exercises, including UNIX, command-line text editors, and an introduction to C and Python programming. Participants will be supported by OLCF staff through a combination of Zoom and Slack. Students will have access to a UNIX environment. This session is recommended but not required for the in-conference Hands-on HPC session.
In-Person at SC Session:
11/13/22
9:00 am - 1:00 pm CDT
D227 -- Kay Bailey Hutchison Convention Center, Dallas
During the in-person sessions at the conference, we will give an overview of HPC programming environments, parallel programming models, job schedulers, and job launchers, before directing participants to a set of self-guided HPC challenges that cover basic parallel programming and GPU programming topics. These self-guided challenges will be performed on OLCF's Ascent training cluster, which has an architecture identical to one cabinet of the Summit supercomputer. Students will have access to Ascent until November 30 to complete all the exercises. Students who complete a select number of the exercises and challenges by November 30 will receive a certificate for completing an Introduction to HPC.
Paper
Recorded
Networks
Performance
Visualization
TP
DescriptionThe Dragonfly interconnect is a crucial network technology for supercomputers. To support exascale systems, network resources are shared such that links and routers are not dedicated to any node pair. While link utilization is increased, workload performance is often offset by network contention. Recently, intelligent routing built on reinforcement learning has demonstrated higher network throughput with lower packet latency. However, its effectiveness in reducing workload interference is unknown. In this work, we present extensive network simulations to study multi-workload contention under two routing mechanisms, intelligent routing and adaptive routing, on a large-scale Dragonfly system.
We develop an enhanced network simulation toolkit, along with a suite of workloads with distinctive communication patterns. We also present two metrics to characterize application communication intensity. Our analysis focuses on examining how different workloads interfere with each other under different routing mechanisms by inspecting both application-level and network-level metrics. Several key insights are made from the analysis.
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
DescriptionAs Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One major challenge is to accurately measure GPGPU application error resilience. A typical GPGPU application spawns a huge number of threads and utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Application resilience is instead evaluated via extensive fault injection campaigns based on sampling of the huge fault-site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. We devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing.
Workshop
Recorded
Reliability and Resiliency
W
DescriptionSuperCheck-SC22 opening remarks from the workshop organizers.
Invited Talk
Recorded
TP
XO/EX
DescriptionBiological systems present some of the most demanding, compute-intensive high-performance computing applications. Mechanistic understanding of viruses and molecular machines, as well as computational drug design efforts, often require calculations of free energies. Free energy calculations require enormous amounts of conformational sampling to achieve equilibrium thermodynamics. Even modest amounts of sampling (e.g., 1 millisecond of physiological time) require 10^12 time steps. Due to the electrostatic charges present, long-range electrostatic forces play important roles. Thus, biological simulations are often much more intensive than materials science applications, which typically do not include long-range electrostatic interactions. Additional factors of complexity, such as the fact that many processes are far from equilibrium and that chemical reactions can be critical (requiring quantum mechanical calculations), further complicate these systems. If we neglect chemical reactions and non-equilibrium effects, we estimate that simulating 1 second of physiological time for the human genome (in the case of 23 chromosomes) would require at least 10 YF (1 YF = 10^24 FLOPs). While these calculations are far beyond the scope of current platforms, they provide a roadmap for the way forward in biomolecular simulation. To strive toward this vision, we perform large-scale explicit solvent molecular dynamics simulations feasible on current platforms and also scope out much larger systems with coarse-grained approaches using a multiresolution strategy. Such simulations play an important role in integrating disparate forms of experimental data into a single coherent picture. We used explicit solvent MD simulations (2.64 million atoms) to identify the accommodation corridor in the ribosome, critical for tRNA selection during protein synthesis (Sanbonmatsu, et al., PNAS, 2005).
Microsecond explicit solvent simulations of the ribosome (2.2 million atoms) also laid the foundations for our energy landscape calculations using all-atom structure-based simulations of spontaneous accommodation events (Whitford, et al., PLoS Comput. Biol., 2013; Whitford, et al., RNA, 2010). We are applying a similar strategy to chromatin architecture, which plays a key role in embryo development, brain function and cancer. As a first step, we have performed the first explicit solvent simulation of an entire gene locus (GATA4), consisting of 427 nucleosomes and over one billion atoms (the first published billion atom biomolecular simulation) (Jung, et al., J. Comp. Chem. 2019). We will also describe coarse-grained simulations of the X-chromosome consistent with high throughput capture sequencing data (Lappala, et al., PNAS, 2021), which help us to scope more detailed and more intensive simulations.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionHigh-performance computing and artificial intelligence are the most essential tools fueling the advancement of science. The NVIDIA accelerated computing platform brings together GPUs, CPUs, DPUs, and full-stack software to tackle challenges that otherwise can't be solved.
We will take a deep dive into the heterogeneous, large-scale NVIDIA coherent platform, which builds on the high-throughput Hopper H100 GPU, the general-purpose Grace CPU, and the BlueField-3 Data Processing Unit (DPU), enabling cloud-native supercomputing, digital twins, and the datacenters of the future.
A coherent platform provides immense value to the community by offering a unified developer experience while bringing together the best of GPU, CPU, and networking performance.
Student Cluster Competition
TP
XO/EX
DescriptionTeam NTU was formed as part of the HPC club at NTU, a student-run club that promotes HPC adoption and awareness in the university. Its diverse membership consists of students spanning a wide range of schools, including the schools of Electrical and Electronic Engineering, Computer Science and Engineering, and Physical and Mathematical Sciences.
By organizing weekly training sessions, we were able to build up attendees’ skills and observe their competencies. From there, we were able to draw on that diverse pool to select club members with the right set of skills required to tackle this competition’s benchmarks and applications.
Our multidisciplinary team consists of members from the School of Electrical and Electronic Engineering, School of Computer Science and Engineering, and School of Physical and Mathematical Science.
Further, our members also have a wide breadth of practical experience in administering and designing small-scale HPC systems, as some team members are also in charge of managing the club’s own clusters within the University’s Parallel and Distributed Computing laboratory. Beyond maintenance, they also liaise with sponsors and lab administrators to procure hardware to keep the clusters at the bleeding edge of HPC technology. Working in HPC has also deepened our members’ knowledge of computer science and the sciences more broadly, which benefits their academic work.
We also have team members with prior participation in HPC-AI APAC, ISC, and SC events. We believe our experience in those competitions will prove invaluable to our success at SC22.
Our advisor, Dr. Bu-Sung Lee, has also given us great support in learning HPC and joining these events. Dr. Lee is a faculty member of the School of Computer Science and Engineering and a member of the Policy & Resource Allocation Committee at Singapore’s National Supercomputing Center (NSCC), the team’s primary sponsor. He has been involved in many Asia-Pacific research and education networks, including the Singapore Advanced Research and Education Network (SingAREN), as its founding president, and the Trans-Eurasia Information Network (TEIN-2).
We are also looking forward to learning how other teams optimize the applications and work on their clusters, making new friends, exploring the SC conference for state-of-the-art HPC technologies, and enhancing our HPC skills.
Birds of a Feather
TP
XO/EX
DescriptionThe Open Storage Network (OSN) provides a performant and cost-efficient distributed data sharing and transfer service for active scientific data sets, providing easy access and high-bandwidth delivery of large data sets to researchers and compute resources. Following its inception in 2017, the OSN transitioned to a production-level pilot and welcomed others to utilize the network. Today, users can request storage allocations of 1TB-50TB via the ACCESS allocation process. This BoF provides an update to the community, opens a discussion to address questions about joining the network, and gathers input on user requirements.
Posters
Research Posters
TP
XO/EX
DescriptionFederated Learning (FL) is a distributed Machine Learning paradigm aiming to collaboratively learn a shared model while considering privacy preservation by letting the clients process their private data locally. In the Computing Continuum context (edge-fog-cloud ecosystem), FL raises several challenges such as supporting very heterogeneous devices and optimizing massively distributed applications.
We propose a workflow to better support and optimize FL systems across the Computing Continuum by relying on formal descriptions of the infrastructure, hyperparameter optimization and model retraining in case of performance degradation. We motivate our approach by providing preliminary results using a human activity recognition dataset. The next objective will be to implement and deploy our solution on the Grid’5000 testbed.
During the poster session, I will start by presenting the main problems in applying FL in the Computing Continuum and how our approach tackles them. Next, I will present preliminary results and discuss the remaining challenges.
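As background for the model-aggregation step that such a workflow optimizes, the canonical FL update is federated averaging (FedAvg). The following is a generic, self-contained sketch for illustration only; the function name and flat-list model representation are simplifications, not the poster's actual implementation.

```python
# Minimal sketch of federated averaging (FedAvg), the canonical FL
# aggregation step: clients train locally on private data, then the
# server averages their models weighted by local dataset size.
# Illustrative only -- not the poster's workflow or retraining logic.

def fedavg(client_updates):
    """Aggregate client models by data-size-weighted average.

    client_updates: list of (weights, n_samples) pairs, where weights
    is a flat list of floats representing a client's model parameters.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    global_weights = [0.0] * dim
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            global_weights[i] += w * n / total
    return global_weights

# Two clients with 3-parameter models and different data sizes;
# the result is weighted toward the larger client.
clients = [([1.0, 2.0, 3.0], 100), ([3.0, 4.0, 5.0], 300)]
print(fedavg(clients))  # [2.5, 3.5, 4.5]
```

In a heterogeneous edge-fog-cloud setting, the interesting part is everything around this step (client selection, stragglers, retraining triggers), which is what the poster's workflow addresses.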
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionDeep learning surrogate models have drawn much attention in large-scale scientific simulations because they can provide similar results to simulations at lower computational costs. To process large amounts of scientific data, distributed training on high-performance computing (HPC) clusters is often used. Training a surrogate model with data parallelism consists of three major steps: (1) each device loads a subset of the dataset from the parallel filesystem; (2) each device computes its model update; (3) the devices communicate to synchronize the model update. During these steps, we observe that data loading is the main performance bottleneck for training surrogate models. To this end, we propose SurrogateTrain, an efficient data-loading approach for training surrogate models, including offline scheduling and on-demand buffering. Our evaluation on a scientific surrogate model demonstrates that SurrogateTrain reduces the amount of data loaded by 6.7× and achieves up to 4.7× speedup in data loading.
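The buffering idea behind step (1) — overlapping data loading with computation so devices are not stalled on the filesystem — can be illustrated with a generic background-prefetching loader. This is a hedged sketch of the general technique only; SurrogateTrain's offline scheduling and buffering are not reproduced here, and the function name is hypothetical.

```python
# Generic sketch of overlapping data loading with computation via a
# background prefetch thread and a bounded buffer. Illustrative of
# the on-demand buffering idea only, not SurrogateTrain's code.
import queue
import threading

def prefetching_loader(batches, buffer_size=2):
    """Yield batches while a background thread loads ahead."""
    buf = queue.Queue(maxsize=buffer_size)
    DONE = object()  # sentinel marking the end of the dataset

    def worker():
        for b in batches:       # in practice: read from the filesystem
            buf.put(b)          # blocks when the buffer is full
        buf.put(DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is DONE:
            break
        yield item

# The training loop consumes batches while the next ones load in parallel.
out = [b * 2 for b in prefetching_loader(range(5))]
print(out)  # [0, 2, 4, 6, 8]
```

The bounded queue caps memory use while hiding I/O latency behind compute; the paper's contribution is deciding, ahead of time and on demand, which data each device should buffer.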
Workshop
Recorded
HPC Training and Education
W
DescriptionAs more students pursue careers in big data analytics and data science, big data education has become a focal point in many college and university curricula. There are many challenges when it comes to teaching and learning big data in a classroom setting. One of the biggest is preparing big data infrastructure that provides meaningful hands-on experience to students. Setting up the necessary distributed computing resources is a delicate act for instructors and system administrators because there is no one-size-fits-all solution. In this presentation, we propose an approach that facilitates the creation of the computing environment on both personal computers and public cloud resources. This combined approach meets different needs and can be used in an educational setting to facilitate different big data learning activities. We discuss and reflect on our experience using these systems in teaching undergraduate and graduate courses.
Student Cluster Competition
TP
XO/EX
Description- We are from the SUSTech Supercomputing Team (SST). The team is composed of SUSTech undergraduates who take a great interest in computational science and engineering. SST serves the large HPC user community at SUSTech and practices HPC skills in real-world scenarios.
- Team Captain, Yingwei Zheng
- Participant in SC21 SCC, 2021 APAC HPC-AI Competition, and ISC22 SCC
- Has more than 6 years of programming experience in modern C++ and CUDA
- Working on high-level optimization for high-performance applications with LLVM and MLIR
- Interested in computer graphics
- Bingzhen Wang
- Participant in SC21 SCC and ASC22
- Maintainer of SUSTech Open Source Mirrors
- Interested in programming languages and computer graphics
- We have four talented freshmen this year. (New Participant Points = 10) They are:
- Jixiao Zhang
- Participant of the 2021 APAC HPC-AI Competition and ASC22
- Working on recommendation tasks in social networks with Graph Neural Network
- Junfeng Chen
- Experienced software developer
- Maintainer of SUSTech Open Source Mirrors
- Interested in hardware and operating system
- Tingzhen Dong
- Participant in ICPC Contest, Gold Medal of The 2020 ICPC Asia-East Continent Final
- Former captain of the SUSTech Collegiate Programming Contest Team
- Student assistant at the Department of Computer Science and Engineering
- Working on the research of system security and confidential computing
- Jia'nan Zhu
- Participant in ICPC and NOIP Contest
- Member of SUSTech CTF Team
- Interested in computer security
- As for our advisor: Dr. Fan is the chief engineer (Senior Engineer) of the SUSTech CCSE. He has published more than twenty papers in high-level academic journals, with hundreds of citations. His research proposal was supported by the Youth Program of the National Natural Science Foundation of China, and he has participated in several other National Natural Science Foundation programs.
- Some members of SST have interdisciplinary research experience. Yingwei Zheng and Bingzhen Wang have years of experience in online/offline physically-based rendering, which involves numerical analysis, probability theory, statistics, and radiometry. CCSE Director Lianping Wang is a chair professor in the Department of Mechanics and Astronautics. Our team's experience in interdisciplinary cooperation will help us work with experts in other disciplines to find breakthroughs in performance.
- We created a team with a broad background of experience relevant to the competition. Yingwei Zheng has rich knowledge of compiler optimization technologies and CPU/GPU microarchitecture, and Tingzhen Dong won an ICPC gold medal; they can tune code and parameters skillfully. Junfeng Chen, Bingzhen Wang, and Jia'nan Zhu have rich experience maintaining clusters in SUSTech's Center for Computational Science and Engineering and will support the others during environment setup. Jixiao Zhang knows machine learning well and will be assigned to the mystery application or the reproducibility challenge.
- Using HPC will improve students' ability to maintain computers and develop/debug programs. It will also strengthen students' understanding of computer architecture. The purpose of our participation in the SC competition includes, but is not limited to, training students' abilities, making new friends, participating in top conferences, and understanding cutting-edge trends.
Workshop
Recorded
W
DescriptionIn this talk we will demonstrate our RISC-V accelerated computing stack with SYCL and oneAPI for HPC and AI. We will also demo it on an FPGA, showing attendees how they can get it working themselves and how they can customize their own hardware design while still running HPC applications.
Birds of a Feather
TP
XO/EX
DescriptionSYCL is a powerful way to enable multi-vendor support for high performance libraries, languages, and packages, while still allowing the originally desired programmer productivity and performance. While many BoFs/Presentations will focus on the end results of a port, this BoF is meant to share lessons learned in porting a diverse array of previously vendor-specific implementations to SYCL, how they enforced numerical reproducibility, and then added flexible vectorization for portable performance. This BoF will place strong emphasis on sharing cross architecture debugging techniques.
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
Best Paper Finalist
DescriptionWe consider the distributed Cholesky factorization on homogeneous nodes. Inspired by recent progress on asymptotic lower bounds on the total communication volume required to perform Cholesky factorization, we present an original data distribution, Symmetric Block Cyclic (SBC), designed to take advantage of the symmetry of the matrix. We prove that SBC reduces the overall communication volume between nodes by a factor of square root of 2 compared to the standard 2D block-cyclic distribution. SBC can easily be implemented within the paradigm of task-based runtime systems. Experiments using the Chameleon library over the StarPU runtime system demonstrate that the SBC distribution reduces the communication volume as expected, and also achieves better performance and scalability than the classical 2D block-cyclic allocation scheme in all configurations. We also propose a 2.5D variant of SBC and prove that it further improves the communication and performance benefits.
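For reference, the baseline the paper improves on — the standard 2D block-cyclic distribution — maps tile (i, j) to a node on a p × q process grid as sketched below. This is only the classical mapping for illustration; the SBC scheme itself, which exploits the symmetry of the matrix to cut communication by √2, is not reproduced here.

```python
# Sketch of the baseline 2D block-cyclic tile-to-node mapping that
# SBC is compared against. For a symmetric matrix, Cholesky only
# touches the lower triangle, which SBC exploits; this standard
# mapping ignores that symmetry.

def block_cyclic_owner(i, j, p, q):
    """Owner rank of tile (i, j) on a p x q process grid."""
    return (i % p) * q + (j % q)

# On a 2x2 grid, tiles cycle over the 4 ranks in both dimensions.
grid = [[block_cyclic_owner(i, j, 2, 2) for j in range(4)]
        for i in range(4)]
for row in grid:
    print(row)
# [0, 1, 0, 1]
# [2, 3, 2, 3]
# [0, 1, 0, 1]
# [2, 3, 2, 3]
```

Note that tiles (i, j) and (j, i) generally land on different ranks here, so the symmetric halves of the matrix are communicated independently; a symmetry-aware distribution can avoid part of that traffic.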
Student Cluster Competition
TP
XO/EX
DescriptionThe Student Cluster Competition (SCC) has a long tradition at Sun Yat-sen University (SYSU), dating back even before the installation of Tianhe-2, the once-fastest supercomputer. Over the last decade, ours has been one of the most competitive SCC teams. At ASC, we won 4th place from 2012 to 2017, 3rd place with an e-prize in 2019, and 3rd place with the Highest Linpack award in 2021. At ISC, we won 4th place in both 2019 and 2021. In addition, we were the champion of IndySCC last year. Our application to SCC@SC reflects our persistent efforts to explore the frontier of HPC. Our motivations and strengths are as follows:
First, most of our team members have HPC skills and rich SCC experience: four of six members have participated in SCC events, including ASC'22 and ISC'22. We have invited experts from academia and industry to deliver lectures on HPC-related topics. Through regular training designed by senior members, team members familiarized themselves with system setup, software management, and optimization using MPI, OpenMP, and CUDA. While training cluster-management skills and preparing for the Mystery Applications, the team members have, on their own, deployed and tuned a wide range of benchmark suites and applications (e.g., HPL, WRF, ICON). Therefore, we are confident in both managing our cluster and optimizing parallel programs on it.
Second, our members' disciplinary diversity gives them particular interests in architecture, algorithms, AI, databases, and more. Since our team members are in different majors, we can investigate problems from different perspectives. For example, Siran Liu has participated in many interdisciplinary competitions (e.g., iGEM), which lets us approach new applications from a higher perspective, and Tianxing Yang, who previously majored in mathematics, will bring us distinctive ideas from his point of view. Zhe Tang and Yang Ye are former contestants in physics and chemistry contests and will apply their experience to solving problems. Han Huang and Tianxing Yang have participated in the SYSU ACM-ICPC team and are capable of optimizing the algorithms of HPC applications.
Last but not least, we have valuable guidance from our advisor, Dr. Dan Huang, a PC member of SC22. Dr. Huang is currently an associate professor in the School of Computer Science and Engineering, Sun Yat-sen University. He received his Ph.D. in computer engineering from the University of Central Florida. His research interests are scientific data management, in-memory computing, parallel programming models, and distributed storage systems. In addition, he worked at Oak Ridge National Laboratory (ORNL) as a short-term researcher for about ten months. His research has been published in many top-tier conferences and journals, including TC, TPDS, ICDCS, IPDPS, and DAC.
Workshop
Recorded
W
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
DescriptionWe present distributed task fusion, a run-time optimization for task-based runtimes operating on parallel and heterogeneous systems. Distributed task fusion dynamically performs an efficient buffering, analysis, and fusion of asynchronously-evaluated distributed operations, reducing the overheads inherent to scheduling distributed tasks in implicitly parallel frameworks and runtimes. We identify the constraints under which distributed task fusion is permissible and describe an implementation in Legate, a domain-agnostic library for constructing portable and scalable task-based libraries. We present performance results using cuNumeric, a Legate library that enables scalable execution of NumPy pipelines on parallel and heterogeneous systems. We realize speedups up to 1.5x with task fusion enabled on up to 32 P100 GPUs, thus demonstrating efficient execution of pipelines involving many successive fine-grained tasks. Finally, we discuss potential future work, including complementary optimizations that could result in additional performance improvements.
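The core idea — buffering deferred operations and executing them as one fused task instead of launching each separately — can be shown with a toy deferred-execution class. This is a hedged illustration of the general technique only; Legate's actual buffering, legality analysis, and distributed execution are far more involved, and the class and method names here are invented for the sketch.

```python
# Toy sketch of the task-fusion idea: buffer deferred elementwise
# operations, then execute them as a single fused pass instead of
# one full pass (and one task launch) per operation.

class Deferred:
    def __init__(self, data):
        self.data = list(data)
        self.pending = []          # buffered elementwise ops

    def map(self, fn):
        self.pending.append(fn)    # buffer instead of executing
        return self

    def force(self):
        # "Fuse": apply all buffered ops per element in one pass.
        ops = self.pending
        self.pending = []
        self.data = [self._apply(ops, x) for x in self.data]
        return self.data

    @staticmethod
    def _apply(ops, x):
        for fn in ops:
            x = fn(x)
        return x

x = Deferred([1, 2, 3]).map(lambda v: v + 1).map(lambda v: v * 10)
print(x.force())  # [20, 30, 40]
```

In a distributed runtime, each un-fused `map` would be a separately scheduled task; fusion amortizes that scheduling overhead across many fine-grained operations, which is where the reported speedups come from.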
Doctoral Showcase
Posters
Recorded
TP
DescriptionIn contrast to conventional integrated circuits, Field Programmable Gate Arrays (FPGAs) can be reconfigured dynamically. This flexibility unlocks potential for FPGA-based accelerators to offload tasks in HPC. Scheduling tasks on FPGAs is equivalent to allocating chip resources: each offloaded task occupies chip area during its execution. Hence, task scheduling on FPGAs is typically done with Partial Reconfiguration (PR). However, PR carries high development overhead, requires expert knowledge, and has limited portability, making it difficult to apply existing research and lowering the adoption of FPGAs in HPC. We want to help software developers and vendors integrate FPGA-based accelerators without these issues, and we ask: how can we optimize task scheduling on FPGAs without relying on PR?
We answer this question with three key contributions. First, we introduce an abstraction-agnostic methodology to analyze and compare scheduling strategies for FPGAs. Central to our method is the derivation of scheduling constraints from a machine model representing a target FPGA. The schedules generated for HPC applications are compared for two models, and we show that the overhead of avoiding PR is feasible. Second, we propose algorithms that generate recommendations for minimal program changes that affect the quality of possible schedules, and we show that effective recommendations can be generated for HPC applications. Third, we contribute two polynomial-time scheduling algorithms. Our results can help vendors provide significantly more streamlined workflows for programming FPGAs, making the platform more appealing and aiding the adoption of high-level programming environments like OpenCL for FPGAs.
Birds of a Feather
TP
XO/EX
DescriptionThe career panel will consist of representatives from the industry and academia with a background in HPC. The panel will share advice on different career options in HPC, and their experiences in their respective career trajectories. The primary audience for this event is current, preferably ABD, graduate students and post-doctoral researchers. The format will include a brief introduction by each speaker, followed by a moderated discussion based on a set of previously submitted questions and ending with further questions from the audience.
Paper
Recorded
Architectures
Machine Learning and Artificial Intelligence
TP
DescriptionIn high performance processors, the design of on-chip memory hierarchies is crucial for performance and energy efficiency. Current processors rely on large shared Non-Uniform Cache Architectures (NUCA) to improve performance and reduce data movement. Multiple solutions exploit information available at the microarchitecture level or in the operating system to optimize NUCA performance. However, existing methods have not taken advantage of the information captured by task dataflow programming models to guide the management of NUCA caches.
In this paper, we propose TD-NUCA, a hardware/software co-designed approach that leverages information present in the runtime system of task dataflow programming models to efficiently manage NUCA caches. TD-NUCA identifies the data access and reuse patterns of parallel applications in the runtime system and guides the operation of the NUCA caches in the hardware. As a result, TD-NUCA achieves a 1.18x average speedup over the baseline S-NUCA while requiring only 0.62x the data movement.
Workshop
Recorded
HPC Training and Education
W
DescriptionResearchers and developers in a variety of fields have benefited from the massively parallel processing paradigm. Numerous tasks are facilitated by accelerated computing, such as graphics, simulations, visualizations, cryptography, data science, and machine learning. In recent years, machine learning, and in particular deep learning, has received much attention. The development of such solutions requires a different level of expertise and insight than traditional software engineering. Therefore, there is a need for novel approaches to teaching people about these topics.
This presentation outlines the primary challenges of accelerated computing and deep learning education, discusses the methodology and content of the NVIDIA Deep Learning Institute, presents the results of a quantitative survey conducted after full-day workshops, and demonstrates a sample adoption of DLI teaching kits for teaching heterogeneous parallel computing.
Student Cluster Competition
TP
XO/EX
DescriptionThe members of Team Phoenix are Jack Hurst, Tracey Li, Braden Hester, Patrick Sliwinski, Samuel Henderson, and Aditya Kaushik. Our team is brought together by a shared interest in high-performance computing, and we hope to develop our experience and involvement in HPC through this competition.
All of us are part of a Vertically Integrated Project (VIP) HPC class, which is designed to train SCC teams, teach students about HPC fundamentals, and give them awareness of the industry. From this class, we learn about various HPC topics, such as schedulers, the Linpack benchmark, and some examples of building and running applications.
Our members have varied experience across academic coursework and industry. Collectively, we study five concentrations in computer science which are offered by Georgia Tech’s College of Computing: devices, information internetworks, theory, systems and architecture, and artificial intelligence. Through our coursework, we have learned about algorithms, computer architecture, operating systems and other computer science fundamentals. We have also gained industry experience through internships. Jack and Braden have worked on front- and back-end web development, and Aditya has worked with databases, networking protocols, and cloud technologies.
Patrick and Samuel are interested in HPC because they are involved in fields where HPC is applied to run simulations and model complex problems. Patrick is a member of the Yellow Jacket Space Program at Georgia Tech, where he works on embedded software for the flight computer and sensor boards. Patrick’s interest in HPC comes from the flight dynamics simulations used to plan flight trajectories by the Yellow Jacket Space Program. Before computer science, Samuel studied mechanical engineering. One reason he is interested in HPC is its applications in fluid dynamics and thermodynamics. Aditya is a TA for the systems and networks class, where students are given exposure to computing systems and networking, including software abstractions for utilizing compute resources. Aditya is enjoying the HPC VIP class because many of the operating systems concepts that he has learned about through coursework and readings – such as scheduling and distributed systems – get put into full use in an HPC environment. Braden’s interest in HPC stems from a desire to understand how we can use computers to their fullest potential. He has experience participating in team-based academic competitions, such as Lockheed Martin’s CodeQuest programming competition and the GHSA State Math competition. Tracey’s interest in math drew her to HPC and this competition, and she is looking forward to being exposed to the techniques used in HPC to solve problems. Jack was previously a computer engineering student, so he has a background in hardware. He is an instructor for the 3D printing tech area at Georgia Tech’s Electrical and Computer Engineering makerspace. By participating in this competition, he hopes to gain more experience in Linux, scripting, and parallel programming.
Sahit, our advisor, is a graduate student at Georgia Tech. He has worked for Nvidia in software security, competed in Nvidia’s GPU Hackathon, and is currently competing in ISC22, making him a wonderful resource as our team advisor.
All of us are part of a Vertically Integrated Project (VIP) HPC class, which is designed to train SCC teams, teach students HPC fundamentals, and build their awareness of the industry. In this class, we learn about various HPC topics, such as schedulers, the LINPACK benchmark, and examples of building and running applications.
Student Cluster Competition
TP
XO/EX
DescriptionAdvisors included, this will be the team's third cluster computing competition (although its first among the "major" competitions). The primary advisor and logistics coordinator are invested graduate EE students who placed 2nd and 4th in the last two national-level Winter Classic Invitational Student Cluster Competitions. They submitted competitive HPCG, HPL, NAS Parallel Benchmarks, OpenFOAM motorcycle simulation, and machine learning application scores using a variety of supercomputers and clusters provided by Google, Cray, NASA (Pleiades), Oak Ridge National Laboratory (Summit), and AWS.
The logistics coordinator has a second undergraduate degree in biology, in addition to Electrical Engineering. The primary advisor has studied bioinformatics as well as Electrical Engineering.
Regarding the undergraduate team itself, each competitor is trained in relevant areas but new to cluster computing. Dante Uriostegui and Miguel Payan are assisting the primary advisor with a radiation-hardened GPU project. Over the spring they learned to use CMake, Make, and Linux for GPU programming and even compiler design, and they learned much about GPU architecture through their work with open-source GPU RTL designs. Expertise gained from this competition will help them use state-funded supercomputers to decrease synthesis times and increase emulation performance for complex designs, a critical verification step that their laptops will soon struggle with as design complexity increases. Throughout the rest of their bachelor's and master's education, this skillset will make them invaluable to the department.
Juan Muller is proficient with Linux and computer vision. He successfully programmed a drone to land on QR codes using a Raspberry Pi with an Intel Compute Stick 2 VPU, which required build troubleshooting on his part. HPC training from this competition will empower him to train new models with more data in future machine learning work.
Daniel Alvarado was a star student in the University's rigorous microprocessor systems course and will complete an internship working with microprocessors at Sandia National Labs over the summer. As he searches for a future graduate school research topic, HPC has caught his attention. Participation in this competition will clarify what HPC really is and reveal new possibilities for a senior design project and future research.
Finally, Michelle Lara and Jose Granados are earlier in their studies, with programming proficiency and bright futures. The HPC skillset gained from this competition will allow them not only to participate in future student cluster competitions, but also to help the department with simulation-centered research.
Thus, our team's advisors are invested and trained in HPC, and our student competitors are talented and trained in relevant areas; this competition will bring many of their skills together into a new, useful one. Our department has a high need for HPC experts to help professors use UTEP-owned and state-owned clusters and supercomputers for their research, so undergraduates with graduate school aspirations, several of them still early in their studies, will be an enormous help to our school as well.
Posters
Research Posters
TP
XO/EX
DescriptionMany HPC and certainly AI/DL applications are composed, at their core, of small linear algebra operations that are then used to build large and more complicated tensor operations. Especially in the field of AI/DL, portability among different hardware platforms is essential due to an extensive reliance on Python and the high-level nature of many frontends. However, scientists are often faced with the challenge of running their codes in vastly different environments. They therefore have to restrict themselves to high-level languages and hope for good compiler optimizations; especially for complicated linear algebra operators, as they arise in high-order methods in the computational sciences, this is a huge leap of faith. In this work we demonstrate how Tensor Processing Primitives, a low-dimensional SIMD abstraction for various CPU architectures, can be used to obtain very high fractions of floating-point peak on seven different CPU micro-architectures offering four different ISAs.
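The core idea, composing larger tensor operations from small, hardware-tuned linear algebra primitives, can be sketched as follows. This is our own illustration in plain C, not the actual TPP API; in the real work the micro-kernel body would be lowered to an ISA-specific SIMD sequence while the composition logic stays portable:

```c
#include <string.h>

#define BLK 4  /* tile size of the micro-kernel (illustrative) */

/* Fixed-size micro-kernel: C_tile += A_tile * B_tile, row-major,
 * with leading dimensions lda/ldb/ldc. This tiny, fixed-shape loop
 * nest is the kind of primitive a TPP-style library would JIT to
 * the best SIMD sequence for the target ISA. */
static void micro_gemm(const double *A, const double *B, double *C,
                       int lda, int ldb, int ldc)
{
    for (int i = 0; i < BLK; i++)
        for (int k = 0; k < BLK; k++)
            for (int j = 0; j < BLK; j++)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

/* n x n GEMM (C = A * B) composed purely from micro-kernel calls;
 * n must be a multiple of BLK in this sketch. */
void gemm(const double *A, const double *B, double *C, int n)
{
    memset(C, 0, (size_t)n * (size_t)n * sizeof *C);
    for (int i = 0; i < n; i += BLK)
        for (int k = 0; k < n; k += BLK)
            for (int j = 0; j < n; j += BLK)
                micro_gemm(&A[i * n + k], &B[k * n + j], &C[i * n + j],
                           n, n, n);
}
```

Only the micro-kernel is hardware-specific; everything above it is plain, portable loop code, which is the portability argument the poster makes.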
Workshop
Recorded
W
DescriptionNext-generation X-ray free electron lasers will be capable of delivering X-rays at repetition rates approaching 1 MHz continuously. This will require the development of data systems able to handle experiments at these types of facilities, especially for high-throughput applications such as femtosecond X-ray crystallography and X-ray photon fluctuation spectroscopy. Here, we demonstrate a framework that captures single-shot X-ray data at the LCLS and implements a machine-learning algorithm to automatically extract the contrast parameter from the collected data. We measure the time required to return the results and assess the feasibility of using this framework at high data volume. We use this experiment to determine the feasibility of solutions for ‘live’ data analysis at the MHz repetition rate.
Student Cluster Competition
TP
XO/EX
DescriptionOur team consists of four students from California State University Channel Islands and two students from Prairie View A&M University. Both schools competed in the 2022 Winter Classic Invitational Student Cluster Competition which involved running benchmarks and HPC simulations on hardware provided by four mentor organizations (HPE, NASA, Oak Ridge, AWS).
Out of the 12 teams that participated, team Channel Islands finished strong despite having no prior HPC experience, and team Prairie View A&M was declared the winner, finishing in first place. With A&M’s experience and Channel Islands’ desire to learn, we will make a formidable team at the SC22 event.
Some of our academic disciplines include mathematics, cybersecurity, software engineering, electrical engineering, and mechatronics.
HPC and this competition will help the team members find careers in any of a wide variety of fields.
Dan Olds is the team's main advisor. He has worked with student cluster competitions since 2010 and organized the 2021 and 2022 Winter Classic cluster competitions.
Our team is sponsored by Penguin Computing, the first time this organization has taken part in a student cluster competition.
We will also receive training from a variety of sources, including the Stanford High Performance Computing Center, various application experts, and personnel from Penguin Computing. From all of these sources, we expect to learn a great deal over the summer about HPC and about the hardware that has been provided.
The students are enthusiastic about this competition and all are looking to learn more about HPC with the hopes of making it their career.
Birds of a Feather
TP
XO/EX
DescriptionThe BeeGFS Community BoF at SC, “Where Performance Matters,” will be guided by the BeeGFS research and development team. This interactive BoF session will bring together the HPC and BeeGFS communities to openly discuss the challenges, future goals, opportunities, and industry requirements for file systems, along with the general product direction and product feature requests.
Attendees will also hear from BeeGFS users, who will provide an overview of their BeeGFS use cases, a comparison with other parallel file systems, and their installation and configuration experience.
Birds of a Feather
TP
XO/EX
DescriptionThe explosion in scientific data volume, distributed from the edge to on-premise storage to storage on the cloud, is creating acute challenges and unique solutions within the High Performance Computing (HPC) community. Problems span workflows that produce and/or ingest distributed data, edge-to-cloud architectures, data distribution, data representation, and beyond. Join this BoF to meet individuals looking to address this challenge; hear from leaders from within this community; and be part of the discussion about your challenges, what you have done, and identifying the solutions that are needed.
Birds of a Feather
TP
XO/EX
DescriptionAs a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of HPC communities. To help the community develop portable C/R codes and harness C/R benefits, which go far beyond resilience, the C/R Standard Forum will release the first version of the C/R interface standard at SC22. In this session, the C/R Standard Forum will present the first release of the C/R interface standard specification, inviting feedback from the HPC community on both the features included in the specification and the roadmap for future efforts.
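The specification itself is not reproduced here, but the general shape of a portable, application-level C/R interface can be sketched in a few lines of C. All names below (`cr_register`, `cr_checkpoint`, `cr_restart`) are hypothetical illustrations of the pattern, not the Forum's actual API:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical minimal C/R-style interface: the application registers
 * the memory regions that constitute its state, then asks the library
 * to save or restore all of them in one call. */
typedef struct { void *addr; size_t len; } cr_region_t;

#define CR_MAX_REGIONS 16
static cr_region_t cr_regions[CR_MAX_REGIONS];
static int cr_nregions = 0;

void cr_register(void *addr, size_t len)
{
    if (cr_nregions < CR_MAX_REGIONS)
        cr_regions[cr_nregions++] = (cr_region_t){ addr, len };
}

/* Write every registered region to a checkpoint file. */
int cr_checkpoint(const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    for (int i = 0; i < cr_nregions; i++)
        if (fwrite(cr_regions[i].addr, 1, cr_regions[i].len, f)
                != cr_regions[i].len) { fclose(f); return -1; }
    return fclose(f);
}

/* Restore every registered region from a checkpoint file. */
int cr_restart(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    for (int i = 0; i < cr_nregions; i++)
        if (fread(cr_regions[i].addr, 1, cr_regions[i].len, f)
                != cr_regions[i].len) { fclose(f); return -1; }
    return fclose(f);
}
```

A code written against such an interface is portable across C/R back ends, which is the point of standardizing the interface rather than any one implementation.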
Workshop
Recorded
W
DescriptionThe COVID-19 pandemic has presented a clear and present need for urgent decision making. Set in an environment of uncertain and unreliable data and a diverse range of possible interventions, there is an obvious need for integrating HPC into workflows that include model calibration and the exploration of the decision space. We present the design of PanSim, a portable, performant, and productive agent-based simulator, which has been used extensively to model and forecast the pandemic in Hungary. We show its performance and scalability on CPUs and GPUs, then discuss the workflows PanSim integrates into. We describe the heterogeneous, resource-constrained HPC environment available to us and formulate a scheduling optimization problem, as well as heuristics to solve it, to either minimize the execution time of a given number of simulations or maximize the number of simulations executed in a given time frame.
Workshop
Recorded
W
DescriptionThe Ecosystem for Research Networking (ERN) CryoEM Remote Instrument Pilot Project was launched in response to feedback and survey data collected from hundreds of participants in the ERN series of NSF-funded (OAC-2018927) community outreach events, which revealed that structural biology's instrument-driven science is being forced to transition from self-contained islands to federated, wide-area, internet-accessible instruments. Its goal is to facilitate multi-institutional collaboration at the interface of computing and electron microscopy through the implementation of the ERN Federated OpenCI Lab’s Instrument CI Cloudlet design. The project will culminate in a web-based portal leveraging federated access to the instrument, workflows utilizing edge computing in conjunction with cloud computing, and real-time monitoring for experimental parameter adjustments and decisions. The intention is to foster team science and scientific innovation, with emphasis on under-represented and under-resourced institutions, through the democratization of these scientific instruments. We discuss the latest Phase 1 deployment efforts.
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
DescriptionThe conventional model of parallel programming today involves either copying data (and then having to track its most recent value), or not copying it and requiring deep software stacks to do even the simplest operation on data that is "over there", out of the range of loads and stores from the current core. As applications require larger data sets, with more irregular access to them, both models begin to exhibit severe scaling problems. This presentation reviews growing evidence of the potential value of a model of computation that skirts between the two: data does not move (i.e., is not copied), and computation instead moves to the data. Several different applications have been coded for a novel platform where thread movement is handled invisibly by the hardware. The evidence to date indicates that parallel scaling for this paradigm may well be significantly better than any mix of conventional models.
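The contrast the presentation draws can be caricatured in a few lines of C: instead of copying a remote value back to the requesting core, the operation itself is shipped to the partition that owns the value. All names here are hypothetical illustrations; on real migrating-thread hardware the "hop to the owner" happens invisibly rather than through an explicit dispatch function:

```c
#include <stddef.h>

/* Toy model: a global address space partitioned across "nodes". */
#define NODES 4
#define PER_NODE 256

typedef struct { double cells[PER_NODE]; } node_mem_t;
static node_mem_t nodes[NODES];          /* zero-initialized */

/* The "moving computation": a small operation plus one argument. */
typedef double (*op_fn)(double cell, double arg);

static int    owner_of(size_t gidx) { return (int)(gidx / PER_NODE); }
static size_t local_of(size_t gidx) { return gidx % PER_NODE; }

/* Apply op in place at the data's owner and return the new value.
 * The only thing that "travels" is the (op, arg) pair; the cell
 * itself is never copied out of its home partition. */
double apply_at_owner(size_t gidx, op_fn op, double arg)
{
    node_mem_t *home = &nodes[owner_of(gidx)];   /* hop to owner */
    double *cell = &home->cells[local_of(gidx)];
    *cell = op(*cell, arg);                      /* compute locally */
    return *cell;
}

/* Example operation to ship. */
double op_add(double cell, double arg) { return cell + arg; }
```

With irregular access patterns, shipping a tiny (op, arg) pair per access can beat moving cache lines back and forth, which is the scaling argument the presentation makes.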
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionIn both HPC and enterprise environments, average rack density has been steadily increasing over the past 10 years, driven by the proliferation of compute-intensive workloads, significant increases in chip power dissipation, and the growing demand for reducing data center footprints. In 2022, racks exceeding 100 kW are being deployed, and average rack density ranges from 10 to 15 kW. Looking forward to 2032, we should anticipate that the trend of increasing rack densities will continue, requiring current cutting-edge power and cooling technologies to transition to mainstream use and several technological advancements to support peak-density applications. This presentation will summarize the current state-of-the-art power and cooling technologies for high-density racks; discuss the trends of increasing thermal design power and decreasing case temperatures; share our best practices in the design and validation of direct liquid cooling (DLC) systems to ensure quality and reliability; and speculate on the technologies that may be seen in high-density racks ten years from now.
Birds of a Feather
TP
XO/EX
DescriptionThe National Science Foundation's vision and investment plans for cyberinfrastructure (CI) are designed to address the evolving needs of the science and engineering research community. Senior leadership and Program staff from NSF’s Office of Advanced Cyberinfrastructure (OAC) will discuss strategic priorities as well as latest funding opportunities across all aspects of the research CI ecosystem. OAC will also present updates on OAC’s vision for democratizing access to CI and include a focus on the importance of cyberinfrastructure professionals across science and engineering. Substantial time will be devoted to Q&A between attendees and NSF staff.
Birds of a Feather
TP
XO/EX
DescriptionWith power being a first-order design constraint on par with performance, it is important to measure and analyze energy-efficiency trends in supercomputing. To raise the awareness of greenness as a first-order design constraint, the Green500 seeks to characterize the energy-efficiency of supercomputers for different metrics, workloads, and methodologies. This BoF discusses trends across the Green500 and highlights from the current Green500 list. In addition, the Green500, Top500, and Energy-Efficient HPC Working Group have been working together on improving power-measurement methodology, and this BoF presents case studies from sites that have made submissions that meet the highest quality of measurement methodology.
Birds of a Feather
TP
XO/EX
DescriptionEffective data and storage management are crucial for efficient HPC workflows and can accelerate research while achieving reproducibility and preserving data for future reference. Join us as we discuss the impact of data management on HPC workflows and explore real-world use cases and best practices from organizations optimizing their data management in support of breakthrough research. The session will explore data management for immediate computational needs as well as alternatives for long-term data access, management, and preservation. This is an interactive session where we invite the audience to share best practices.
Birds of a Feather
TP
XO/EX
DescriptionThe Message Passing Interface (MPI) API is the most dominant programming approach for HPC environments. Its specification is driven by the MPI Forum, an open forum consisting of MPI developers, vendors, and users. In this BoF at SC22, we will first cover the recently released MPI 4.0, its features, and its state of adoption. Following this, we will look beyond MPI 4.0 and discuss the next planned steps for MPI currently under discussion, already shifting the focus to MPI 5.0, and we will solicit feedback on new ideas.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionThe footprint of Arm and Arm's silicon partners in HPC is rising and set to accelerate. The four-time TOP500 #1 Fujitsu A64FX-based Fugaku system delivers scientific results daily. Arm instances are taking a growing share of HPC and AI in the cloud: AWS Graviton3, based on the Neoverse V1 core, delivers HPC and machine learning (ML) performance, while Ampere Altra-powered Google GCP and Microsoft Azure Arm instances provide competitive-cost HPC capacity. Arm's partners are bringing more products based on Arm IP: NVIDIA's Grace CPU and Grace Hopper Superchips are poised to redefine AI and ML processing in the datacenter, and SiPearl is building high-performance SoCs around its Rhea CPU.
These CPUs are built upon a powerful formula: Arm-based compute foundations plus specialized processing from the Arm partner ecosystem deliver performance, efficiency and outcomes.
At the compute sub-system level, Arm Neoverse V-series cores advance the state of the art in scalable vector processing for HPC and AI. At the platform level, Arm is enabling CXL and UCIe-ready solutions for high-bandwidth, low-latency die-to-die and chip-to-chip accelerated solutions. These build upon Arm's AMBA CHI standard that today enables leadership HBM memory systems and cross-sectional die bandwidth. This same formula is supercharging Cloud and HPC networks, with Arm-based DPUs and SmartNICs like NVIDIA BlueField, Intel "Mt Evans", Marvell OCTEON, and AMD Pensando underpinning all of today's modern clouds.
In this talk, we will highlight the software ecosystem development along with the latest core, interconnect and subsystem technology and how this will transform the future.
Panel
Recorded
Accelerator-based Architectures
Parallel Programming Languages and Models
Parallel Programming Systems
TP
XO/EX
DescriptiononeAPI is a cross-industry, open, standards-based unified programming model for heterogeneous systems. The oneAPI specification extends existing developer programming models to enable a diverse set of hardware through language, a set of library APIs, and a low-level hardware interface to support cross-architecture programming. It builds upon industry standards and provides an open, cross-platform developer stack to improve productivity and innovation. At the core of oneAPI is the SYCL programming language developed by the Khronos Group, which builds on the ISO C++ standard. SYCL provides explicit parallel constructs and offload interfaces to support a broad range of accelerators. In addition to direct accelerator programming with SYCL, oneAPI also provides libraries for compute- and data-intensive domains, e.g.: deep learning, scientific computing, video analytics, and media processing. Finally, a low-level hardware interface defines a set of capabilities and services to allow a language runtime system to effectively utilize a hardware accelerator.
Tutorial
Recorded
Algorithms
Directive Based Programming
Parallel Programming Languages and Models
Productivity Tools
TUT
DescriptionOpenMP is the de facto standard for writing parallel applications for shared memory computers. Born 25 years ago in 1997, it runs on just about every shared memory platform on the market. It’s also very complicated. We created OpenMP to be the “simple API” for application programmers. With a specification running to over 450 pages, OpenMP has grown into an intimidating API viewed by many as for “experts only”.
Most OpenMP programmers, however, use around 21 items from the specification. We call these 21 items the “OpenMP Common Core”. By focusing on the common core, we make OpenMP what it was always meant to be: a simple API for parallel application programmers.
In this hands-on tutorial, we explore the Common Core of OpenMP. We utilize active learning through a carefully selected set of exercises, so students will master the Common Core and learn to apply it to their own problems. Students will use their own laptops (with Windows, Linux, or macOS) to access remote systems that support OpenMP (a remote SMP server). Alternatively, students can load an OpenMP compiler onto their laptops before the tutorial. Information about OpenMP compilers is available at www.openmp.org.
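To give a flavor of what "Common Core" means, here is an illustrative sketch (our own, not taken from the tutorial materials) using two of its most frequently used items, a worksharing loop and a reduction:

```c
/* Two of OpenMP's most-used "Common Core" constructs: a parallel
 * worksharing loop and a reduction. Compile with an OpenMP flag
 * (e.g. gcc -fopenmp); without it the pragma is ignored and the
 * code still runs correctly in serial. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    /* Each thread gets a chunk of iterations and a private partial
     * sum; the reduction clause combines the partials safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

The same serial loop with one pragma added is the "simple API" idea the tutorial description refers to: correctness does not depend on the thread count, and removing the pragma recovers the sequential program.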
Birds of a Feather
TP
XO/EX
DescriptionThe blog Codinghorror wrote in 2007: “Choose good defaults, and users will sing the praises of your system and how easy it is to use. Choose poor defaults, and you'll face down user angst over configuration, and a host of tech support calls as well.”
We will discuss how choosing good defaults can help us increase the productivity of our less sophisticated users. Ultimately, we want HPC to be more inclusive and welcoming toward an increasingly diverse user community, for example from life and social sciences backgrounds. We will track the progress of this community in the GitHub repository dirkpetersen/power-of-defaults.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionWind and solar are the least expensive means to produce electricity. Tens of gigawatts of new renewable energy have come online in the last decade, and tens of gigawatts will come online in the next decade. Wind and solar energy have two significant challenges that must be overcome: variability and congestion. Variability is obvious. Congestion refers to the fact that electricity has to “travel” from generators to consumers, and that often the wires are “full”. This leaves terawatt-hours of energy stranded due to insufficient transmission capacity.
What is needed are industries that can rapidly vary their electrical load and operate in the sparsely populated regions where the energy is produced. HPC/HTC, with the right additions, can be that industry. Doing so requires that we invert our model of data center power from many nines of availability to a single nine.
The Lancium Compute HPC PaaS combines the benefits of Singularity containerization with the ability to checkpoint/restart arbitrary programs. This allows us to provide user-selectable QoS levels that determine the percentage of time an application will be running (on processor) versus persisted. We can thus manage our data center electrical load and “dance with the grid”, increasing load when there is plenty of renewable energy and decreasing load when needed.
This talk begins with an introduction to the renewable transformation and the combination of Singularity and checkpoint/restart that enables our QoS model. We then present our HPC-as-a-service interface, with particular attention to the QoS model.
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
DescriptionWe review the Science Gateways Community Institute's Embedded Technical Support program, which connected 20 research software engineers at four universities to 59 client science gateway projects across the United States from 2015 to 2021. We review how the program worked and summarize lessons learned from both evaluations and anecdotal observations. Our conclusions may be valuable for other organizations that supervise teams of research software engineers to provide in-depth technical consulting.
Birds of a Feather
TP
XO/EX
DescriptionStorage protocols have been used to share data storage resources since the “prehistory” of computer science. The most widely used ones are closely tied to file system semantics and are entangled with the design of distributed and parallel file systems. As the forthcoming exascale supercomputers raise new challenges for storage systems, object storage may be a game changer. What are the relevant storage protocols to address object stores? In this BoF, we will discuss and collect ideas on how to adapt old concepts and storage protocols, or whether specific protocols such as S3 will be more appropriate.
Workshop
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W
DescriptionLU factorization is a key approach for solving large, dense systems of linear equations. Partial row pivoting is commonly used to ensure numerical stability; however, the data movement needed for the row interchanges can reduce performance. To improve this, we propose using threshold pivoting to find pivots almost as good as those selected by partial pivoting but that result in less data movement. Our theoretical analysis bounds the element growth similarly to partial pivoting; however, it also shows that the growth of threshold pivoting for a given matrix cannot be bounded by that of partial pivoting and vice versa. Additionally, we experimentally tested the approach on the Summit supercomputer. Threshold pivoting improved performance by up to 32% without a significant effect on accuracy. For a more aggressive configuration with up to one digit of accuracy lost, the improvement was as high as 44%.
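The pivot selection rule described in this abstract can be sketched as follows. This is an illustrative reading of threshold pivoting, not the authors' Summit implementation; the function name and the default `tau` value are hypothetical.

```python
def lu_threshold(A, tau=0.1):
    """LU factorization with threshold pivoting (illustrative sketch).

    At step k, the current diagonal entry is kept as the pivot whenever its
    magnitude is at least tau times the largest entry in the column below it,
    skipping the row interchange (and its data movement); tau = 1.0 recovers
    classic partial pivoting. Returns the packed LU factors and the row
    permutation.
    """
    A = [row[:] for row in A]          # work on a copy
    n = len(A)
    perm = list(range(n))
    for k in range(n - 1):
        m = max(range(k, n), key=lambda i: abs(A[i][k]))   # partial-pivot row
        if abs(A[k][k]) < tau * abs(A[m][k]):              # threshold test fails
            A[k], A[m] = A[m], A[k]                        # interchange rows
            perm[k], perm[m] = perm[m], perm[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                             # multiplier l_ik
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]               # Schur complement update
    return A, perm
```

With a larger `tau` the selected pivots approach those of partial pivoting; a smaller `tau` accepts weaker pivots in exchange for fewer interchanges, which is the performance/accuracy trade-off the paper quantifies.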
Workshop
Recorded
Quantum Computing
W
DescriptionWe present Tierkreis, a higher-order dataflow graph program representation and runtime designed for compositional, quantum-classical hybrid algorithms. The design of the system is motivated by the remote nature of quantum computers, the need for hybrid algorithms to involve cloud and distributed computing, and the long-running nature of these algorithms. The graph-based representation reflects how designers reason about and visualize algorithms, and allows automatic parallelism and asynchronicity. A strong, static type system and higher-order semantics allow for high expressivity and compositionality in the program. The flexible runtime protocol enables third-party developers to add functionality using any language or environment. With Tierkreis, quantum software developers can easily build, visualize, verify, test, and debug complex hybrid workflows, and immediately deploy them to the cloud or a custom distributed environment.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionWe compare the ML-training performance of a Graphcore IPU-M2000-based system with an Nvidia A100 GPU-based system on the Perlmutter HPC machine at NERSC/LBL. The multivariate regression of time series data from a simulated biological neuron was the scientific benchmark problem. The ML model consisted of several convolutional, batch normalization, and fully connected layers. The training data were kept in CPU memory to eliminate the system-dependent IO cost. The data-parallel training runs resulted in the same sample throughput on both GC200 IPUs and A100 GPUs for any number of accelerators between 1 and 256. The best MSE validation loss achieved on IPUs was only 10% to 20% larger. The aggregated energy use per training epoch was 2.5 to 3 times smaller for the Graphcore system than for the Nvidia system. This paper also discusses aspects of software-hardware co-design to achieve the highest efficiency on the IPU using PopTorch.
Birds of a Feather
Recorded
TP
XO/EX
DescriptionThe TOP500 list of supercomputers serves as a “Who’s Who” in the field of High Performance Computing (HPC). It started as a list of the most powerful supercomputers in the world and has evolved to a major source of information about trends in HPC. The 60th TOP500 list will be published in November 2022 just in time for SC22.
This BoF will present detailed analyses of the TOP500 and discuss the changes in the HPC marketplace during the past years. The BoF is meant as an open forum for discussion and feedback between the TOP500 authors and the user community.
Workshop
Recorded
W
DescriptionRecently, the EUMaster4HPC project was launched with the ambition of boosting the education of HPC experts at universities throughout Europe. Within this project, a future European curriculum for a Master's in HPC is being developed. Here, we report on first efforts toward establishing a set of necessary skills.
Workshop
Recorded
W
DescriptionThe objective of cyberinfrastructures is to make the use of streaming data a common practice in the scientific community by integrating knowledge obtained from a range of data sources with large-scale computer models. However, the ability to determine what, where, and how data is gathered and processed along the edge-to-cloud/HPC computing continuum is limited by the lack of abstractions that can support data-driven reactive behaviors. This creates a difficulty for integrating this heterogeneous data with time-sensitive systems that tackle global challenges. In this paper, we present a methodology for incorporating contextual information into the application logic while taking into consideration the heterogeneity of the underlying platform and the unpredictability of the data. An example of a fire science scenario spanning numerous states serves as the driver for our discussion of research issues in resource management and programming models.
Workshop
Recorded
Performance Portability
W
DescriptionTensor contractions form the fundamental computational operation of computational chemistry; notably, these contractions dictate the performance of the widely used coupled-cluster (CC) methods. In this work, we study SYCL, a single-source, cross-platform C++ abstraction-layer programming model, for computational chemistry methods such as the CCSD(T) coupled-cluster formalism. An existing optimized CUDA implementation was migrated to SYCL to make use of a novel algorithm that provides tractable GPU memory needs for solving the high-dimensional tensor contractions that accelerate CCSD(T). We present the cross-platform performance achieved using SYCL implementations for the non-iterative triples contribution of the CCSD(T) formalism, which is considered the performance bottleneck, on Nvidia A100 and AMD Instinct MI250X. Additionally, we draw comparisons with similar performance metrics from vendor-native programming models such as CUDA and ROCm HIP.
Doctoral Showcase
Posters
Recorded
TP
DescriptionModern HPC workloads produce massive amounts of distributed intermediate data that needs to be checkpointed concurrently in real-time at scale. One such popular scenario is the use of checkpoint-restore for revisiting previous states (intermediate data) to advance computations, such as adjoint methods. In this context, GPUs have shown tremendous performance improvements during computations but demonstrate I/O limitations while managing high-frequency large-volume data movement across heterogeneous memory tiers. Existing data movement runtimes are not well suited for such I/O because of factors such as imbalance in checkpoint distribution across fast memory tiers, slow memory allocation, and restore-oblivious cache eviction and prefetching strategies. We address these challenges by designing a set of transparent, asynchronous checkpoint-restore techniques that minimize the blocking time of the application during I/O using three novel contributions. First, we design techniques to evenly distribute checkpoints across fast memory tiers (e.g., peer GPUs) using collaborative checkpointing that leverages fast interconnects such as NVLinks and NVSwitches for load balancing. Second, we mitigate the slow cache allocation for storing checkpoints on both GPU and host by leveraging techniques such as CUDA's virtual memory management functions, eager memory mapping, and lazy pinning. Third, we design a restore-order-aware eviction and prefetching approach that is coordinated by a finite state machine based on a unified checkpoint-restore abstraction for optimal evictions. Our evaluations across real-world and synthetic benchmarks demonstrate significant speedup in both checkpoint and restore phases of the application compared to the current state-of-the-art data movement engines.
Workshop
Recorded
Runtime Systems
System Software
W
DescriptionContemporary HPC systems run compute jobs on exclusively assigned hardware resources. During communication, polling for progress is used for minimal latency. Previous work on oversubscription and event-based communication shows these techniques can improve overall system utilization and reduce energy consumption. Despite these findings, neither of the techniques is commonly used in HPC systems today. We believe that the current lack of detailed studies of the low-level effects of event-based communication, a key enabler of efficient oversubscription, is a major obstacle to a wider adoption.
We demonstrate that the sched_yield system call, often used for oversubscription, is not best suited for this purpose on modern Linux systems. Furthermore, we incorporate event-based communication into Open MPI and evaluate the effects on latency and energy consumption using an MPI micro-benchmark. Our results indicate that event-based communication incurs significant latency overhead but also saves energy. Both effects grow with the imbalance of the application.
Workshop
Recorded
W
DescriptionIn recent decades, High Performance Computing (HPC) and simulations have become determinant in many areas of engineering and science. Since many HPC applications rely extensively on floating-point arithmetic operations, many kinds of numerical errors can be introduced during program execution, leading to instability or reproducibility problems. One of these error sources is cancellation, which produces inaccurate results when two nearby numbers are subtracted. In this article, we present Candy, a new dynamic library that detects cancellations in numerical codes. Our method computes the number of significant bits of floating-point numbers by attaching a shadow value in higher precision to each number. This helps to detect accurately whether a program suffers from cancellation problems and thus to increase the trust in large-scale HPC applications and exascale simulations. We evaluate Candy over a set of real-world numerical applications. We also compare Candy against the state-of-the-art tool FPChecker.
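The cancellation phenomenon the abstract describes can be made concrete with a small sketch. This estimates lost bits from the drop in binary exponent between the operands and their difference; it is only illustrative, not the shadow-value instrumentation that Candy actually performs, and the function name is hypothetical.

```python
import math

def cancellation_bits(a, b):
    """Estimate how many leading significant bits are lost computing a - b.

    Subtracting two nearby numbers zeroes out their shared leading bits, so
    the loss can be approximated by the drop in binary exponent between the
    larger operand and the result (math.frexp returns (mantissa, exponent)
    with value = mantissa * 2**exponent and mantissa in [0.5, 1)).
    """
    diff = a - b
    if diff == 0.0 or a == 0.0 or b == 0.0:
        return 0                                  # exact or trivial case
    e_in = max(math.frexp(abs(a))[1], math.frexp(abs(b))[1])
    e_out = math.frexp(abs(diff))[1]
    return max(0, e_in - e_out)                   # bits cancelled
```

For example, `cancellation_bits(1.0 + 2**-40, 1.0)` reports 40 cancelled bits: almost the entire double-precision mantissa of the result is meaningless.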
Workshop
Recorded
Performance Portability
W
DescriptionThe wide adoption of Deep Neural Networks (DNN) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from Edge devices to Cloud and supercomputers.
While ONNX, proposed as a de facto standard for AI model description, provides portability of AI models across various AI frameworks, supporting DNN models on various hardware architectures remains challenging.
SYCL provides a C++-based portable parallel programming model to target various devices. Thus, enabling a SYCL backend for an AI framework can lead to a hardware-agnostic model for heterogeneous systems.
This paper proposes a SYCL backend for ONNXRuntime as a possible solution towards the performance portability of deep learning algorithms. The proposed backend uses existing state-of-the-art SYCL-DNN and SYCL-BLAS libraries to invoke tuned SYCL kernels for DNN operations. Our performance evaluation shows that the proposed approach can achieve comparable performance with respect to the state-of-the-art optimized vendor-specific libraries.
Workshop
Recorded
W
DescriptionGraphics Processing Units (GPUs), the dominantly adopted accelerators in HPC systems, are susceptible to transient hardware faults. A new generation of GPUs features mixed-precision architectures such as NVIDIA Tensor Cores to accelerate matrix multiplications. While widely adopted, how they behave under transient hardware faults remains unclear. In this study, we conduct large-scale fault injection experiments on GEMM kernels implemented with different floating-point data types on the V100 and A100 Tensor Cores and show distinct error resilience characteristics for GEMMs with different formats. We plan to explore this space in the future by building precision-aware floating-point fault tolerance techniques for applications, such as DNNs, that exercise low-precision computations.
Doctoral Showcase
Posters
Recorded
TP
DescriptionAs computational resources scale larger, applications often need to be refactored to deal with bottlenecks that arise to gain the advantages of strong scaling. When not properly addressed, legacy workloads can make inefficient use of available hardware, which leads to poor throughput. One solution is to allow multiple tasks to share a system to provide multi-tenancy. Multi-tenant environments fall into two categories: time-sharing and space-sharing. Time-sharing has been an effective technique to deal with multiple applications sharing the CPU and GPU at the node level. However, time-sharing can have a heavy performance cost, such as saving and restoring architectural state (context switch overhead), which is very costly on GPUs. While space-sharing can avoid this overhead and improve throughput, current hardware and software systems lack full isolation to provide the necessary quality of service. In this work, we identify key challenges that arise when sharing resources in an HPC context. We evaluate real-world scenarios both at the node level and the cluster level. Using these insights, we propose middleware to mitigate these problems and improve quality of service. We introduce a runtime CUDA middleware that improves QoS for GPUs. We also introduce and study two new features of HDF5, GDS VFD and Async I/O. The former improves I/O latency while the latter improves and hides variability in I/O latency.
Paper
Recorded
Resource Management and Scheduling
System Software
TP
DescriptionToday's supercomputers offer massive computation resources to execute a large number of user jobs. Effectively managing such large-scale hardware parallelism and workloads is essential for supercomputers. However, existing HPC resource management (RM) systems fail to capitalize on the hardware parallelism by following a centralized design used decades ago. They give poor scalability and inefficient performance on today's supercomputers, which will worsen in exascale computing. We present ESlurm, a better RM for supercomputers. As a departure from existing HPC RMs, ESlurm implements a distributed communication structure. It employs a new communication tree strategy and uses job runtime estimation to improve communications and job scheduling efficiency. ESlurm is deployed into production in a real supercomputer. We evaluate ESlurm on up to 100K nodes. Compared to state-of-the-art RM solutions, ESlurm exhibits better scalability, significantly reducing the resource usage of master nodes and improving data transfer and job scheduling efficiency by a large margin.
Posters
Research Posters
Recorded
TP
DescriptionThe poster presents a scalable approach that converts the results of large-scale Computational Fluid Dynamics (CFD) simulations into a volumetric representation used by volume rendering-based visualization. Although this functionality is provided by common post-processing tools, its efficient parallelization requires appropriate load balancing. Unfortunately, load balancing according to the number of cells does not scale for unstructured meshes with the high growth rates common in CFD. In the poster, we show that with an appropriate redistribution of data among available resources, it is possible to perform the operation in just a few seconds with significantly improved scalability.
Workshop
Recorded
Correctness
Software Engineering
W
DescriptionIterative methods for solving linear systems serve as a basic building block for computational science. The computational cost of these methods can be significantly influenced by the round-off errors that accumulate as a result of their implementation in finite precision. In the extreme case, round-off errors that occur in practice can completely prevent an implementation from satisfying the accuracy and convergence behavior prescribed by its underlying algorithm. In the exascale era, where cost is paramount, a thorough and rigorous analysis of the delay of convergence due to round-off should not be ignored. In this paper, we use a small model problem and the Jacobi iterative method to demonstrate how the Coq proof assistant can be used to formally specify the floating-point behavior of iterative methods, and to rigorously prove the accuracy of these methods.
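The algorithm whose floating-point behavior the paper formalizes can be stated in a few lines. This is a plain textbook Jacobi iteration as a point of reference, not the Coq development itself; the function name and iteration count are illustrative.

```python
def jacobi(A, b, iters=60):
    """Jacobi iteration: x_{k+1}[i] = (b[i] - sum_{j != i} A[i][j] x_k[j]) / A[i][i].

    Textbook sketch of the method analyzed in the paper; each step is an
    exact mathematical update, whereas the paper's Coq proofs bound the
    round-off error these floating-point operations accumulate. Converges
    when A is strictly diagonally dominant.
    """
    n = len(A)
    x = [0.0] * n                                  # initial guess
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x
```

Every update divides by a diagonal entry and accumulates a dot product, so each iteration both contributes new round-off and damps earlier error, which is exactly the interplay a formal convergence-delay analysis has to capture.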
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionWe present a novel parallel framework for large scale network alignment. Network alignment has applications in many disciplines including bioinformatics and social sciences. Our algorithm is one of the first network alignment tools that can not only identify similar networks, but also identify the differences between nearly similar networks. It is particularly useful in finding regions of non-determinism in event graphs, arising in large HPC simulations.
Our algorithm compares similarity between vertices based on the number of graphlets (or motifs) to which each vertex belongs. Thus, it can also be used to find motifs in a graph. However, compared to the state-of-the-art algorithms, our algorithm can (i) compute multiple motifs in one execution and (ii) be tuned to graph structure and user specification. We will present the algorithm, showcase the scalability results, and compare its performance and accuracy with other state-of-the-art software.
Workshop
Recorded
AI-HPC Convergence
Emerging Technologies
Memory Systems
Networks
Resource Management and Scheduling
W
Posters
Research Posters
TP
XO/EX
DescriptionThe poster presents the usage of a deterministic traffic simulator for optimizing traffic flow within a city. The simulator is one part of a traffic modeling framework for intelligent transportation in smart cities. In contrast to standard navigation systems where the navigation is optimized for drivers, we aim to optimize a distribution of the global traffic flow. We utilize HPC resources for the simulator’s parameters exploration for which EVEREST SDK is used.
The EVEREST project aims at developing a holistic design environment that simplifies the programmability of heterogeneous and distributed architectures for Big Data applications. The project uses a “data-driven” design approach with domain-specific language extensions, hardware-accelerated AI, and efficient monitoring of the execution with a unified hardware/software paradigm. During the presentation, the distribution of traffic flow in a selected city will be shown in the form of a short video to demonstrate the dynamicity of the system.
Birds of a Feather
TP
XO/EX
DescriptionThe goal of this BoF is to provide a forum to discuss approaches and needs for training the CI workforce in developing, supporting and using large scale CI effectively for AI workloads. The organizers of this BoF have experience developing training activities ranging from conference tutorials, user workshops, summer institutes, hackathons, and bootcamp style training for AI and CI professionals. We’d like to share our experiences in offering such training and promote discussion in the SC community on AI training more generally. The discussion topics will range from learning outcomes, training delivery, experiential activities, to current gaps and future needs.
Posters
Research Posters
TP
XO/EX
DescriptionGPU matrix chain multiplication serves as a basis for a wide range of scientific domains like computer graphics, physics, and machine learning. While its time performance has been studied for years, there has been significantly less effort in optimizing its energy efficiency. GPU power consumption is heavily impacted by the number of data transfers performed. In fact, a data transfer from global memory needs a thousand times more energy than a double-precision arithmetic operation. Thus, minimizing data transfers is key to reducing energy consumption. We present an energy-efficient solution for matrix chain multiplication on GPUs that minimizes computation as well as off-chip data transfers. For this, optimizations at three different levels are provided. For a single matrix multiplication, we use a large tile blocking strategy. Then, we extend our approach to three matrices. Finally, we propose a solution for a sequence of matrices.
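The "minimize computation" part of a matrix chain problem is classically captured by a dynamic program over parenthesizations. The sketch below counts scalar multiplications only; it is a standard textbook baseline, not the poster's GPU- and energy-aware algorithm, which additionally models off-chip data transfers.

```python
def matrix_chain_cost(dims):
    """Minimum scalar multiplications to evaluate a matrix chain.

    dims has length n+1 for n matrices: matrix i has shape dims[i] x dims[i+1].
    cost[i][j] is the cheapest way to multiply matrices i..j; each split point
    k pays for the two sub-chains plus the final dims[i] x dims[k+1] x dims[j+1]
    product. Classic O(n^3) dynamic program (CLRS-style).
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for span in range(1, n):                       # sub-chain length minus one
        for i in range(n - span):
            j = i + span
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

For dims `[10, 30, 5, 60]`, multiplying `(A*B)*C` costs 4500 scalar multiplications versus 27000 for `A*(B*C)`: ordering alone changes the arithmetic (and hence energy) by a factor of six.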
Birds of a Feather
TP
XO/EX
DescriptionTransformers and other large language models have shown impressive capabilities as 'foundation' models for domains such as natural language processing and computer vision. Requiring huge amounts of scalable compute, self-supervised training on large datasets is leveraged to develop models applicable to a variety of specialized tasks. Recent efforts in areas such as bioinformatics and protein folding indicate the significant potential for Transformer models in domain science applications. In this session, presenters and attendees have the opportunity to discuss new algorithms, software, or hardware for training Transformers on large domain science datasets and novel ideas for applying Transformers in this space.
Panel
Recorded
Emerging Technologies
Networks
TP
XO/EX
DescriptionPhotonic components have been the technology just around the corner for a decade now. While optical network fibers are rapidly gaining traction in system networks, the question now shifts to shorter distances. Emerging silicon photonics make large promises in drastically increasing chip escape bandwidth density, reducing latency, and in many cases reducing energy per bit. But will error rates wipe out any potential benefits? Do we have to redesign our system or re-write parts of our applications? This panel brings together experts from industry and academia to share their views on how we can design systems around emerging silicon photonics, what their current level of maturity is, and what the future holds for silicon photonics in HPC.
ACM Gordon Bell COVID Finalist
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
DescriptionWe describe our development of ab initio protein-ligand binding pose prediction models based on transformers and binding affinity prediction models based on the neural tangent kernel (NTK). Folding both protein and ligand, the TwoFold models achieve efficient and quality predictions matching state-of-the-art implementations while additionally reconstructing protein structures. NTK and Gaussian Process models are demonstrated to be a worthy use of HPC resources for AI, and the advantages of adapting highly-optimized linear solver benchmarking codes to solve the large dense linear systems required by these models are shown.
Workshop
Recorded
W
DescriptionToday’s scientific simulations produce extremely large amounts of data every day, which induces grand challenges in transferring and storing the data efficiently. Error-bounded lossy compression has been regarded as the most promising solution to this big data issue; however, it causes data distortion that has to be controlled carefully for users’ post-hoc analysis. Recently, the preservation of quantities of interest has become a priority. Derivative-related metrics are critical quantities of interest for many applications across domains, yet no prior research has explored the impact of lossy compression on derivative-related metrics in particular. In this paper, we focus on understanding the impact of various error-controlled lossy compressors on multiple derivative-related metrics commonly of concern to users. We perform solid experiments involving 5 state-of-the-art lossy compressors and 4 real-world application datasets. We summarize 5 valuable takeaways, which can shed light on the impact of lossy compression on derivative-related metrics.
Workshop
Recorded
Performance Portability
W
DescriptionThe roofline model provides a concise overview of the maximum performance capabilities of a given computer system through a combination of peak memory bandwidth and compute performance rates. The increasing complexity of scheduling and caches in recent GPUs, however, has introduced complicated performance variability that is not captured by arithmetic intensity alone. This work examines the effect of problem size and GPU launch configurations on roofline performance for V100, A100, MI100, and MI250X graphics processing units. We introduce an extended roofline model that takes problem size into account, and find that strong scaling on GPUs can be characterized by “saturation problem sizes” as additional key metrics. Saturation problem sizes break up a plot of GPU performance vs. problem size into three distinct performance regimes: size-limited, cache-bound, and DRAM-bound. With our extended roofline model, we are able to provide a robust view of these performance regimes across recent GPU architectures.
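The classic roofline bound, and one way a problem-size term could enter it, can be sketched as follows. The first function is the standard model; the second is a toy extension under the assumption that performance ramps roughly linearly up to a measured saturation size, which is our simplification and not necessarily the paper's actual fitted form.

```python
def roofline(intensity, peak_flops, peak_bw):
    """Classic roofline: attainable FLOP/s at a given arithmetic intensity
    (FLOPs per byte), bounded by compute peak and by bandwidth * intensity."""
    return min(peak_flops, peak_bw * intensity)

def extended_roofline(intensity, problem_bytes, peak_flops, peak_bw, saturation_bytes):
    """Toy size-aware roofline: below a 'saturation problem size' the device
    is underutilized, so attainable performance is scaled by a linear
    utilization factor. saturation_bytes would be measured per architecture
    and regime; the linear ramp is an illustrative assumption."""
    utilization = min(1.0, problem_bytes / saturation_bytes)
    return utilization * roofline(intensity, peak_flops, peak_bw)
```

In this picture, problems smaller than `saturation_bytes` sit in the size-limited regime, while larger problems recover the ordinary cache- or DRAM-bound roofline ceiling.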
Workshop
Recorded
W
DescriptionThe Community Earth Science Model (CESM) is an important tool in climate modeling that produces a large volume of data on each simulation. Researchers have increasingly been turning to both lossless and lossy compression as an approach to reduce the volume of data for CESM climate applications. However, it is non-trivial for users to choose the best-qualified compressor, especially because of the advent of many modern lossless and lossy compressors and the complicated scientific integrity assessment of the climate data model. In this paper, we evaluate 11 state-of-the-art compressors using the quality metrics developed by climate scientists to understand the effectiveness of the compressors on CESM climate datasets with four different models. Our work also discloses the best compression ratio that can be reasonably achieved while meeting these strict quality requirements.
Birds of a Feather
TP
XO/EX
DescriptionIn order to exploit the capabilities of new HPC systems and to meet their demands in scalability, communication software needs to scale to millions of cores and support applications with adequate functionality. UCX is a collaboration between industry, national labs, and academia that provides a unified open-source communication framework.
The UCX project is managed by the UCF consortium (http://www.ucfconsortium.org/) and includes members from LANL, ANL, Ohio State University, AMD, ARM, IBM, NVIDIA and more. The session will serve as the UCX community meeting and will introduce the latest developments to HPC developers and the broader user community.
Paper
Recorded
Quantum Computing
Resource Management and Scheduling
System Software
TP
DescriptionQuantum circuit simulation is critical for verifying quantum computers. Given the exponential complexity of the simulation, existing simulators use different architectures to accelerate it. However, due to the variety of both simulation methods and modern architectures, it is challenging to design a high-performance yet portable simulator.
In this work, we propose UniQ, a unified programming model for multiple simulation methods on various hardware architectures. We provide a unified application abstraction to describe different applications, and a unified hierarchical hardware abstraction upon different hardware.
Based on these abstractions, UniQ can perform various circuit transformations without being aware of either concrete application or architecture details, and generate high-performance execution schedules on different platforms without much human effort. Evaluations on CPU, GPU, and Sunway platforms show that UniQ can accelerate quantum circuit simulation by up to 28.59× (4.47× on average) over state-of-the-art frameworks, and successfully scale to 399,360 cores on 1,024 nodes.
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
DescriptionResearch Software Engineers have a key role in supporting research in the HPC and Computational Science domains. Their specialist skills have been essential to the development and growth of these areas over many years. Significant effort is directed into raising awareness of the RSE role and developing proper career development pathways. One area requiring better understanding is training pathways. What training courses should RSEs take, at what career stage, to gain skills required at different expertise levels? What materials already exist within the community and which are missing? How do we navigate this largely undefined landscape? In short: how does one train to become an RSE? The UNIVERSE-HPC (Understanding and Nurturing an Integrated Vision for Education in RSE and HPC) project is working to enable people from a wide diversity of disciplines and backgrounds to have well-defined paths for obtaining the skills and experience required for a successful RSE career.
Workshop
Recorded
W
DescriptiononeAPI is a major initiative by Intel aimed at making it easier to program heterogeneous architectures for high-performance computing using a unified API. In addition to raising the abstraction level via an API, we argue that a curriculum of well-developed software engineering methods with exemplars will be necessary to ensure interest in HPC by current students and educators. To this end, our UnoAPI curriculum takes a holistic approach based on language and the broader development ecosystem. Our curriculum, based on a systems foundation, integrates essential principles of distributed systems, programming languages, and software engineering. We argue that a curriculum should cover these topics to attract students to HPC and enable them to confidently solve computational problems using oneAPI. We have shared our materials with a small group of undergraduate sophomores and plan a follow-up study with a larger cohort by incorporating some of our materials in our existing HPC course.
Birds of a Feather
TP
XO/EX
DescriptionThe Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse standardized benchmarks and tools to evaluate performance and energy efficiency for the newest generation of computing systems. The SPEC High Performance Group (HPG) focuses specifically on developing industry standard benchmarks for HPC systems and has a track record of producing high-quality benchmarks serving both academia and industry. This BoF invites HPC center operators, developers, and researchers to discuss their experiences using application benchmarks and learn about the roadmap for future SPEC benchmark developments.
Paper
Recorded
Machine Learning and Artificial Intelligence
Software Engineering
State of the Practice
TP
DescriptionModern scientific software stacks have become extremely complex, making use of many programming models and libraries to exploit a growing variety of GPUs and accelerators. Package managers can mitigate this complexity using dependency solvers, but they are reaching their limits. Finding compatible dependency versions is NP-complete, and modeling the semantics of package compatibility modulo build-time options, GPU runtimes, flags, and other parameters is extremely difficult. Within this enormous configuration space, defining a "good" configuration is daunting.
We tackle this problem using Answer Set Programming (ASP), a declarative model for combinatorial search problems. We show, using the Spack package manager, that ASP programs can concisely express the compatibility rules of HPC software stacks and provide strong quality-of-solution guarantees. Using ASP, we can mix new builds with preinstalled binaries, and solver performance is acceptable even when considering tens of thousands of packages.
Birds of a Feather
TP
XO/EX
DescriptionChapel is a parallel programming language designed to improve the programmability, portability, and scalability of HPC applications relative to conventional approaches. This BoF focuses on building the community of users developing real-world applications in Chapel and on discussing the benefits of using Chapel in terms of scalability, performance, and time-to-science. Some key Chapel concepts will be briefly summarized, and then users will present highlights from applications in data science, aeronautics, geoscience, and astronomy. The last portion of the BoF will be spent in an open discussion of Chapel usage and collecting feedback on the Chapel roadmap.
Tutorial
Recorded
Cloud and Distributed Computing
Containers
Productivity Tools
Resource Management and Scheduling
Software Engineering
Workflows
TUT
DescriptionWithin just the past few years, the use of containers has revolutionized the way in which industries and enterprises have developed and deployed computational software and distributed systems. The containerization model has gained traction within the HPC community as well with the promise of improved reliability, reproducibility, portability, and levels of customization that were not previously possible on supercomputers. This adoption has been enabled by a number of HPC Container runtimes that have emerged including Singularity, Shifter, Enroot, Charliecloud, and others.
This hands-on tutorial looks to train users on the usability of containers on HPC resources. We will provide a detailed background on Linux containers, along with introductory hands-on experience building a container image, sharing the container, and running it on an HPC cluster. Furthermore, the tutorial will provide more advanced information on how to run MPI-based and GPU-enabled HPC applications, how to optimize I/O-intensive workflows, and how to set up GUI-enabled interactive sessions. Cutting-edge examples will include machine learning and bioinformatics. Users will leave the tutorial with a solid foundational understanding of how to utilize containers on HPC resources using Podman, Shifter, and Singularity, and in-depth knowledge to deploy custom containers on their own resources.
Posters
Research Posters
TP
XO/EX
DescriptionMemory management APIs like Umpire were created to solve the memory constraints for applications running on heterogeneous HPC systems. At Lawrence Livermore National Laboratory (LLNL), many application codes utilize the memory management capabilities of Umpire. This study focuses on one such code, a high explosive equation of state chemistry application from LLNL. This code uses Umpire’s memory pools in order to allocate all required memory at once instead of many times throughout the code. The performance of memory pools varies widely and depends upon how the blocks of memory within the pool are managed. We conducted several experiments that tested different strategies to manage allocations within a memory pool in order to study the impact on performance. Our experiments demonstrate how this performance varies, from causing an application to run out of memory prematurely to reducing peak memory usage by 64%, depending upon that management strategy.
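A toy first-fit pool with free-block coalescing illustrates one of the block-management strategies whose performance impact the study measures (this is a generic sketch, not Umpire's API):

```python
class Pool:
    """Toy memory pool: one upfront slab, blocks carved on demand by first fit,
    freed blocks coalesced with free neighbors. The choice of fit and coalescing
    strategy is exactly what drives the performance variation described above."""
    def __init__(self, size):
        self.free = [(0, size)]   # sorted (offset, length) free blocks
        self.used = {}            # offset -> length
    def alloc(self, n):
        for i, (off, length) in enumerate(self.free):
            if length >= n:
                self.free[i:i+1] = [(off + n, length - n)] if length > n else []
                self.used[off] = n
                return off
        raise MemoryError("pool exhausted")
    def release(self, off):
        n = self.used.pop(off)
        self.free.append((off, n))
        self.free.sort()
        merged = []
        for o, l in self.free:          # coalesce adjacent free blocks
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + l)
            else:
                merged.append((o, l))
        self.free = merged
```

Without coalescing, the same allocate/release pattern fragments the slab and can exhaust the pool prematurely, mirroring the out-of-memory behavior the study observed.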
Paper
Recorded
Cloud and Distributed Computing
TP
DescriptionModern HPC workload managers and their careful tuning contribute to the high utilization of HPC clusters. However, due to inevitable uncertainty, it is impossible to completely avoid node idleness. Although such idle slots are usually too short for any HPC job, they are too long to ignore. The Function-as-a-Service (FaaS) paradigm is a promising fit for this gap, as typical FaaS functions last seconds, not hours. Here we show how to build a FaaS infrastructure on idle nodes in an HPC cluster in such a way that it does not significantly affect the performance of the HPC jobs. We dynamically adapt to a changing set of idle physical machines by integrating the open-source software Slurm and OpenWhisk.
We designed and implemented a prototype solution that allowed us to cover up to 90% of the idle time slots on a 50k-core cluster that runs production workloads.
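The headline coverage figure can be understood with a simple back-of-the-envelope model: each idle slot fits some whole number of fixed-length function invocations. This is an illustrative model of my own, not the paper's measurement methodology:

```python
def coverage(idle_slots, fn_duration):
    """Fraction of idle node-seconds fillable by back-to-back FaaS invocations:
    each slot accommodates floor(slot / fn_duration) whole invocations."""
    total = sum(idle_slots)
    used = sum((slot // fn_duration) * fn_duration for slot in idle_slots)
    return used / total if total else 0.0
```

Shorter functions waste less of each slot's tail, which is why second-scale FaaS functions are a good match for sub-job-length idle windows.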
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionPenn State ICDS has deployed a new HPC environment, Roar Collab, using xCAT as the bare-metal manager. By combining features of objects and script definitions from xCAT with a Git repository, ICDS maintains node software consistency that is accessible to a broader set of administrators with a low learning curve. This proposal will present a review of how we used open-source software to implement DevOps to stand up and manage bare-metal nodes and deploy stateless OS images. We will discuss in detail how we use Git to maintain the nodes' firmware, drivers, and software. We will discuss how these have enabled administration without in-depth familiarity with the xCAT deployment, thus allowing more team members to contribute. We know of other deployments using a similar method to maintain software consistency across nodes. We will detail the differences and discuss the pros and cons of the methods.
Paper
Recorded
Correctness
System Software
TP
DescriptionThe compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize strided computation regions of a sparse code. In this work, we propose a locality-based codelet mining (LCM) algorithm that efficiently searches for strided and partially strided regions in sparse matrix computations for vectorization. We also present a classification of partially strided codelets and a differentiation-based approach to generate codelets from memory accesses in the sparse computation. LCM is implemented as an inspector-executor framework called LCM I/E that generates vectorized code for the sparse matrix-vector multiplication (SpMV), sparse matrix times dense matrix (SpMM), and sparse triangular solver (SpTRSV). LCM I/E outperforms the MKL library with an average speedup of 1.67X, 4.1X, and 1.75X for SpMV, SpTRSV, and SpMM, respectively. It is also faster than the state-of-the-art inspector-executor framework Sympiler for the SpTRSV kernel with an average speedup of 1.9X.
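For reference, the baseline scalar CSR SpMV whose irregular inner loop LCM mines for strided and partially strided codelets looks like this (a generic sketch, not the paper's generated code):

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Scalar CSR sparse matrix-vector multiply, y = A @ x.
    The gather x[col_idx[k]] is the irregular access; runs of contiguous or
    fixed-stride col_idx entries are the vectorizable codelets LCM searches for."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y
```

An inspector-executor framework like LCM I/E analyzes the `col_idx` pattern once at inspection time, then executes specialized vector code for each mined region.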
Student Cluster Competition
TP
XO/EX
DescriptionOur team members have never participated in a competition like this, but are eager to rise to the challenge of IndySCC and gain hands-on experience with HPC. The IndySCC competition will create connections, build resumes, and develop cutting-edge skills. With an unparalleled commitment to student success, TAMU-CC creates life-changing opportunities for students. Our diverse team brings a variety of backgrounds, experiences, and studies.
Veronica never imagined working in technology, but has been in TAMU-CC IT for eleven years. Coming from a non-traditional background in Communication, her love for IT grew as a student employee. She moved through the ranks from the service desk to a systems administrator.
Chanelle is a Junior in Management Information Systems with a Minor in Finance. She obtained her Associate's in Business Administration with a 4.0 GPA. Chanelle challenges herself as a student employee in our Application Administration department. She was also Team Leader for a Center for Creative Land Recycling project that developed a working application for the Texas State Aquarium.
Savannah is a Sophomore in Computer Science. She has learned the basics of her future field, yet was one of the first students to volunteer for this opportunity. When asked about her computational experience, she said, “… minor computer science experience, but I had to Google what HPC stood for, if that tells you anything.” She is fierce and fearless.
Hunter Carver is inspiring. As a Junior in Computer Science, Hunter focuses on Systems Programming. He earned an Associate’s Degree in Computer Programming from Del Mar College, founded their first Computer Science Club, and remains a member of the College Hall of Fame. As a growing R2 institution, TAMU-CC is home to one of seven test sites for Unmanned Aircraft Systems (UAS) in the US. Hunter interns at our UAS Center where the competition experience will be directly beneficial. He has also developed a 240Mh/s mining rig to validate the Ethereum Blockchain.
Orlando Gomez is a Sophomore in Computer Science focusing on Cybersecurity. Orlando has a Linux certification and is pursuing a student employee position in TAMU-CC’s Division of IT. He has the drive to excel in this competition, in order to build his knowledge and experience.
Nabil and Amine are both international students from Algeria. In speaking with Amine, he discussed how his home country lags in the technology field and struggles with limited access to educational resources. Even so, Nabil and Amine have awe-inspiring resumes. Both competed in a 48-hour competition by designing a robot to assist the visually impaired with navigation. They won first place regionally and third place globally. They are also both tutors on campus, which will be a great asset when training less experienced members of the team.
Our mentor, Dr. Xin Yang, is an Assistant Research Scientist at TAMU's High Performance Research Computing (HPRC). Her expertise is in the area of computational chemistry and molecular modeling. She joined IndySCC 21 as a co-advisor for the TAMU-PVAMU joint Gig ‘em bytes team.
Paper
Recorded
Data Mangement
Storage
TP
DescriptionTo lower the monetary/energy cost, single-machine multicore graph processing is gaining increasing attention for a wide range of traversal-centric graph algorithms such as BFS, SSSP, CC, and PageRank, of which the processing is relatively simple and the topology data (vertices and edges) dominates the memory footprint. This paper presents vGRAPH, a NUMA-aware, memory-efficient multicore graph processing system for traversal-centric algorithms. vGRAPH proposes an ultralight NUMA-aware graph preprocessing scheme which eliminates almost all complex preprocessing steps and pipelines per-NUMA graph loading and compressing, to effectively reduce inter-NUMA memory accesses while keeping both preprocessing cost and peak memory footprint low. We further optimize vGRAPH with effective HPC techniques including prefetching and work-stealing. Evaluation on a 384GB-memory, four-NUMA machine shows that compared to the state-of-the-art NUMA-aware/-unaware systems, vGRAPH can process much larger real-world and synthetic graphs with various traversal-centric algorithms, achieving significantly higher memory efficiency and lower processing time.
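A minimal level-synchronous BFS over a CSR graph shows the traversal-centric access pattern such systems optimize; in a NUMA-aware design like the one described, `row_ptr`/`col_idx` would additionally be partitioned per NUMA node (this sketch is single-threaded):

```python
from collections import deque

def bfs_csr(row_ptr, col_idx, src):
    """Breadth-first search over a CSR graph; returns distance from src
    for every vertex (-1 if unreachable). The random accesses into dist
    and col_idx are what NUMA placement and prefetching try to tame."""
    n = len(row_ptr) - 1
    dist = [-1] * n
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for k in range(row_ptr[u], row_ptr[u + 1]):
            v = col_idx[k]
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist
```

Because the topology arrays dominate the memory footprint, compressing them and keeping each partition's accesses NUMA-local is where most of the system's savings come from.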
Posters
Scientific Visualization & Data Analytics Showcase
Recorded
TP
DescriptionHigh Performance Computing (HPC) critically underpins the design of aero-engines. With global emissions targets, engine designs require fundamental change, including designs utilizing sustainable aviation fuels and electric/hybrid flight. Virtual certification of designs with HPC is recognized as a key technology to meet these challenges, but requires analysis of higher-fidelity models using ultra-large-scale executions. In this explanatory SC-SciVis showcase, we present results from time-accurate simulations of a 4.6B-element full 360-degree model of a production-representative gas turbine engine compressor, the Rig250 at DLR. This represents a grand challenge problem at the fidelity required for virtual certification standards. The results are achieved through Rolls-Royce's Hydra CFD suite on ARCHER2. The compressor is visualized under off-design conditions, demonstrating flow contours of velocity, Mach number, and iso-surfaces of vorticity. The level of detail and the HPC simulations leading to the visualizations demonstrate a step-change towards achieving virtual certification objectives under production settings.
Workshop
Recorded
W
DescriptionWith their widespread availability, FPGA-based accelerator cards have become an alternative to GPUs and CPUs for accelerating applications with certain requirements (like energy efficiency) or properties (like fixed-point computations). In this paper, we show results and experiences from mapping an industrial application used for drug discovery onto several types of accelerators. We especially highlight the effort versus benefit of FPGAs compared to CPUs and GPUs in terms of performance and energy efficiency. For this application, even with extensive use of FPGA-specific features and different optimizations, results on GPUs are still better, both in terms of energy and performance.
Posters
Research Posters
TP
XO/EX
DescriptionA large body of approaches has been proposed to analyze the resilience of HPC applications. However, existing studies rarely address the challenge of perceiving the analysis results. Specifically, resilience analysis techniques produce a massive volume of unstructured data, making the analysis difficult to conduct. Furthermore, different analysis models produce diverse results with multiple levels of detail, which creates hurdles in comparing and exploring the resilience of HPC program executions. To this end, we present VISILIENCE, an interactive VISual resILIENCE analysis framework that facilitates the resilience analysis of HPC applications. In particular, VISILIENCE leverages an effective visualization approach, the Control Flow Graph (CFG), to present a function execution. In addition, three widely used models for resilience analysis (i.e., Y-Branch, IPAS, and TRIDENT) are seamlessly embedded into the framework for resilience analysis and result comparison. Case studies have been conducted to demonstrate the effectiveness of our proposed framework VISILIENCE.
Posters
Scientific Visualization & Data Analytics Showcase
Recorded
TP
DescriptionIn the United States, fossil-fuel-related industrial processes account for approximately half of all greenhouse gas emissions. Chemical Looping Reactors (CLRs) provide a promising path to reducing carbon emissions; however, scale-up and testing of these systems is expensive and time-consuming. In our video, we focus on understanding bubble dynamics in the fluidized beds of a Chemical Looping Reactor as simulated by the MFIX-Exa code, including the importance of Los Alamos National Laboratory’s in situ feature detection algorithm and the use of the Cinema visualization tool in the post hoc workflow. MFIX-Exa provides new computing capabilities needed to combine CFD-DEM simulation with computing at the exascale via an adaptive mesh refinement (AMReX) framework.
Posters
Scientific Visualization & Data Analytics Showcase
Recorded
TP
DescriptionThis explanatory visualization shows the results of a state-of-the-art 3D simulation of supernova explosion and neutron-star birth. It is a rare instance where the full stellar evolution of an object, including the physics of the convection and the radiation, has been simulated in three dimensions. Among the highlights is the deep core that is shrinking after explosion due to neutrino cooling and deleptonization on its way to becoming a cold, compact neutron star. There is also evidence of inner proto-neutron star convection, perhaps the site of magnetic dynamo action that can turn a pulsar into a magnetar. An exterior view shows the blast wave, which cocoons the newly-birthed neutron star, moving at ∼10,000 km/s. Additionally, a reusable pipeline was developed, which leverages state-of-the-art tools for scientific data analysis and visualization resulting in high-quality renderings.
Posters
Research Posters
TP
XO/EX
DescriptionThe Fast Fourier Transform is an essential algorithm of modern computational science. The highly parallel structure of the FFT allows for its efficient implementation on graphics processing units (GPUs), which are now widely used for general-purpose computing. This poster presents VkFFT, an efficient GPU-accelerated multidimensional Fast Fourier Transform library for Vulkan/CUDA/HIP/OpenCL/Level Zero projects. VkFFT aims to provide the community with a cross-platform open-source alternative to vendor-specific solutions while achieving comparable or better performance. This poster presents the optimizations implemented in VkFFT and compares its performance and precision against Nvidia's cuFFT and AMD's rocFFT libraries on their latest HPC GPUs. It also presents the first performant implementation of Discrete Cosine Transforms on GPUs. VkFFT is released under the MIT license.
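The parallel structure the abstract refers to is visible even in a textbook radix-2 Cooley–Tukey FFT: the butterflies at each level are independent of one another, which is what GPU libraries exploit (pure-Python sketch, unrelated to VkFFT's actual kernels):

```python
import cmath

def fft(a):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    At each level, the n/2 butterfly updates are mutually independent,
    so a GPU can execute them as one parallel pass per level."""
    n = len(a)
    if n == 1:
        return a[:]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

A production GPU FFT replaces the recursion with an iterative, batched, shared-memory-blocked schedule, but the butterfly dependency structure is the same.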
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
Best Student Paper Finalists
Best Reproducibility Advancement Finalist
DescriptionSubgraph matching is a fundamental building block in graph analytics. Due to its high time complexity, GPU-based solutions have been proposed for subgraph matching. Most existing GPU-based works can only cope with relatively small graphs that fit in GPU memory. To support efficient subgraph matching on large graphs, we propose a view-based method to hide communication overhead and improve GPU utilization. We develop VSGM, a subgraph matching framework that supports efficient pipelined execution and multi-GPU architecture. Extensive experimental evaluation shows that VSGM significantly outperforms the state-of-the-art solutions.
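The exponential search that GPU frameworks accelerate is easiest to see in a naive backtracking matcher (a generic sketch; VSGM's view-based, pipelined, multi-GPU execution is far more involved):

```python
def subgraph_match(pattern, data):
    """Naive backtracking subgraph matching. Graphs are adjacency sets:
    {vertex: {neighbors}}. Returns every injective mapping of pattern
    vertices to data vertices that preserves pattern edges."""
    p_verts = list(pattern)
    def extend(mapping):
        if len(mapping) == len(p_verts):
            return [dict(mapping)]
        u = p_verts[len(mapping)]
        results = []
        for v in data:
            if v in mapping.values():
                continue  # keep the mapping injective
            # every already-mapped pattern neighbor of u must map to a data neighbor of v
            if all(mapping[w] in data[v] for w in pattern[u] if w in mapping):
                mapping[u] = v
                results += extend(mapping)
                del mapping[u]
        return results
    return extend({})
```

The branching factor of `extend` is what blows up on large data graphs; VSGM's views hide the cost of streaming candidate data from host memory while the GPU works.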
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
DescriptionAs a basic matrix factorization operation, Singular Value Decomposition (SVD) is widely used in diverse domains. In real-world applications, the computational bottleneck of matrix factorization is on small matrices, and many GPU-accelerated batched SVD algorithms have been developed recently for higher performance. However, these algorithms fail to achieve both high data locality and fast convergence because they are size-sensitive. In this work, we propose a novel W-cycle SVD to accelerate the batched one-sided Jacobi SVD on GPUs. The W-cycle SVD, which is size-oblivious, successfully exploits data reuse and ensures the optimal convergence speed for batched SVD. Further, we present an efficient batched kernel design and propose a tailoring strategy based on auto-tuning to improve the batched matrix multiplication in SVDs. The evaluation demonstrates that the proposed algorithm achieves 2.6–10.2× speedup over the state-of-the-art cuSOLVER. In a real-world data assimilation application, our algorithm achieves 2.73–3.09× speedup compared with MAGMA.
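The one-sided Jacobi iteration being batched can be sketched in a few lines: rotate column pairs until they are mutually orthogonal, then read the singular values off the column norms (scalar sketch for one matrix; the paper's contribution is the batched, size-oblivious GPU formulation):

```python
import math

def jacobi_svd_values(A, sweeps=30):
    """One-sided Jacobi SVD: apply plane rotations to column pairs of A until
    all columns are mutually orthogonal; singular values are the column norms."""
    m, n = len(A), len(A[0])
    U = [row[:] for row in A]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                app = sum(U[i][p] * U[i][p] for i in range(m))
                aqq = sum(U[i][q] * U[i][q] for i in range(m))
                apq = sum(U[i][p] * U[i][q] for i in range(m))
                if abs(apq) < 1e-15:
                    continue  # columns already orthogonal
                # rotation angle that zeroes the off-diagonal inner product
                theta = 0.5 * math.atan2(2 * apq, app - aqq)
                c, s = math.cos(theta), math.sin(theta)
                for i in range(m):
                    up, uq = U[i][p], U[i][q]
                    U[i][p] = c * up + s * uq
                    U[i][q] = -s * up + c * uq
    return sorted((math.sqrt(sum(U[i][j] ** 2 for i in range(m)))
                   for j in range(n)), reverse=True)
```

Each (p, q) pair's rotation is a small dense update, which is why batching thousands of small matrices and tuning the underlying matrix multiplies pays off on GPUs.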
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionDesigns of integrated stacks on, e.g., x86 systems often build on a fiction: that of a 1994-era Pentium over which the software has complete control. As we have worked to point out since 1999, this is indeed a fiction: in fact, kernels run on a virtual system over which they have little control. This would be fine if it did not affect performance, but it can in the end have significant throughput impacts; e.g., on some modern systems, entire sockets can stop for half a second at a time.
I will discuss some of the cases in which these Potemkin Villages have caused real trouble and suggest possible ways to deal with them on old (x86) and new (RISC-V) systems.
Workshop
Recorded
W
DescriptionUrgent computing workloads are time-critical, unpredictable, and highly dynamic. While efforts are ongoing to run these on traditional HPC machines, another option is to leverage computing power donated by volunteers. Volunteer computing, where members of the public donate some of their CPU time, is a powerful way of delivering compute. However, volunteer computing has required users to install specialist software, which is a barrier to entry.
We believe that an alternative approach, where visitors to websites donate some of their CPU time, is beneficial. This is an immature field and there are numerous questions that must be answered to understand the viability of leveraging website visitors' compute. In this presentation, we describe our web-based distributed computing framework, Panther, and perform performance experiments using real world hardware and browsing habits for the first time. We demonstrate this is viable for urgent workloads, but there are numerous caveats to be considered.
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionWorkshop welcome
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionThe prevalence of scientific workflows with high computational demands calls for their execution on various distributed computing platforms, including large-scale leadership-class HPC clusters. To handle deployment, monitoring, and optimization of workflow executions, many workflow management systems (WMSs) have been developed over the past decade, creating a need for workflow benchmarks to evaluate the performance of these WMSs on current and future software stacks and hardware platforms.
We present a generator of realistic workflow benchmark specifications that can be translated into benchmark code and executed with current workflow systems. Our approach generates workflow tasks with arbitrary performance characteristics (CPU, memory, and I/O usage) and with realistic task dependency structures based on those seen in production workflows. Our experimental results show that our approach generates benchmarks representative of production workflows, and we conduct a case study to demonstrate the usefulness of the generated benchmarks for evaluating the performance of workflow systems under different configuration scenarios.
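The core idea sketched in the abstract above — tasks with synthetic resource demands wired into a dependency structure that is guaranteed to be acyclic — can be illustrated with a toy generator. This is purely illustrative, not the authors' tool: the function name, record fields, and value distributions are all assumptions.

```python
import random

def generate_workflow(num_tasks=10, max_parents=2, seed=0):
    """Toy benchmark-spec generator: each task gets synthetic CPU, memory,
    and I/O demands, and may depend only on earlier tasks, so the
    resulting dependency structure is always a DAG."""
    rng = random.Random(seed)
    tasks = []
    for i in range(num_tasks):
        # choose up to max_parents dependencies among already-created tasks
        parents = rng.sample(range(i), k=min(i, rng.randint(0, max_parents)))
        tasks.append({
            "id": f"task_{i}",
            "parents": [f"task_{p}" for p in parents],
            "cpu_seconds": round(rng.uniform(1, 60), 1),
            "memory_mb": rng.choice([256, 512, 1024, 2048]),
            "io_read_mb": round(rng.uniform(0, 100), 1),
        })
    return tasks

spec = generate_workflow()
```

A real generator, as the abstract notes, would draw the dependency structure from patterns observed in production workflows rather than uniformly at random.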
Workshop
Recorded
W
DescriptionThe massive data volumes produced by climate simulation models create an urgent need for data reduction. Lossy compression is one solution that can significantly reduce storage requirements; however, as the amount of compression increases, the scientific integrity of the data decreases. One metric for gauging the quality of compression is the percentage of the real information present in the original data that is preserved in the compressed data. We compute bitwise real information content for several climate variables from the Community Earth System Model Large Ensemble provided by the National Center for Atmospheric Research, and we investigate how much compression can be applied to each of these variables with two popular compression algorithms designed for floating-point data while preserving 99% of the real information content. Finally, we demonstrate how the real information content can be used in a straightforward manner to determine compressor settings for our data.
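The bitwise real-information idea can be illustrated with a small sketch (not the authors' implementation): a crude proxy estimates, for each of the 32 bit positions of a float32 value, the mutual information between that bit and the same bit of a neighboring grid point. Bit positions carrying essentially no shared information are candidates for rounding away before lossy compression. The function name and the neighbor-along-last-axis choice are illustrative assumptions.

```python
import numpy as np

def bitwise_real_information(field):
    """Per-bit-position mutual information between each value and its
    neighbor along the last axis -- a crude proxy for real information."""
    bits = np.ascontiguousarray(field, dtype=np.float32).view(np.uint32)
    a, b = bits[..., :-1], bits[..., 1:]
    mi = np.zeros(32)
    for k in range(32):
        x, y = (a >> k) & 1, (b >> k) & 1
        # joint and marginal probabilities of the neighboring bit pair
        pxy = np.array([[np.mean((x == i) & (y == j)) for j in (0, 1)]
                        for i in (0, 1)])
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = pxy * np.log2(pxy / np.outer(px, py))
        mi[k] = np.nansum(terms)  # zero-probability cells contribute 0
    return mi

# a smooth synthetic field: neighboring values are strongly correlated
field = np.cumsum(np.random.default_rng(0).standard_normal((8, 1024)), axis=-1)
info = bitwise_real_information(field)
# fraction of total information kept if the 8 lowest mantissa bits are dropped
retained = info[8:].sum() / info.sum()
```

On smooth fields the high (sign, exponent, leading mantissa) bits carry nearly all of the mutual information, which is why aggressive mantissa truncation can preserve 99% of the real information content.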
Birds of a Feather
TP
XO/EX
DescriptionAfter three years of working through the pandemic, ISO C++ continues its drive toward C++23. C++ remains among the top four languages in the TIOBE index, and C/C++ is used in 79.4% of parallel programming according to Hyperion Research's 2021 HPC briefing at ISC 2021. After the last five years' successful ISO C++ for HPC BoFs, and with the increasing use of C++ in exascale computing, there has been popular demand for continuing updates on the main C++20 features, including Concepts, ML, mdspan, and library and concurrency features. This BoF will provide updates on C++23 and C++26.
Workshop
Recorded
W
DescriptionHeterogeneity is becoming the norm in modern computer architectures. How to optimize the designs of such architectures, and how to make programs run efficiently on such systems, are two fundamental problems that have drawn much attention in the last decade. Most of these studies, however, have assumed a clean single-program execution scenario. The reality is often different: multiple programs commonly share a single machine, each competing for all kinds of resources, from computing units to memory and bandwidth. As a result, many careful designs and optimizations end up delivering poor performance in practice. This talk will examine the challenges, present some promising findings, and point out future directions.
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
DescriptionGraph neural networks (GNNs) are widely used for graph-structured datasets, encoding graph data into low-dimensional vectors. In this paper, we present WholeGraph, a fast-training graph neural network framework based on a multi-GPU distributed shared-memory architecture. WholeGraph partitions the graph and the corresponding node or edge features across multiple GPUs, eliminating the CPU–GPU communication bottleneck during training; communication between GPUs is implemented with GPUDirect Peer-to-Peer (P2P) memory access. Furthermore, WholeGraph provides several optimized computing operators. Our evaluations show that on large-scale graphs WholeGraph outperforms state-of-the-art GNN frameworks such as Deep Graph Library (DGL) and PyTorch Geometric (PyG), with speedups of up to 57.32x and 242.98x over DGL and PyG, respectively, on a single multi-GPU machine, while sustaining above 95% GPU utilization during training.
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Workshop
Recorded
Quantum Computing
W
DescriptionUnitary synthesis is an optimization technique that can achieve optimal gate counts while mapping quantum circuits to restrictive qubit topologies. Synthesis algorithms are limited in scalability by their exponentially growing run times, so application to wide circuits requires partitioning them into smaller components. In this work, we explore methods to reduce the depth and multi-qubit gate count of wide, mapped quantum circuits using synthesis. We present TopAS, a topology-aware synthesis tool that preconditions quantum circuits before mapping. Partitioned subcircuits are optimized and fitted to sparse subtopologies to balance the opposing demands of synthesis and mapping algorithms. Compared to state-of-the-art wide-circuit synthesis algorithms, TopAS reduces depth on average by 35.2% and CNOT count by 11.5% for mesh topologies. Compared to the optimization and mapping algorithms of Qiskit and Tket, TopAS reduces CNOT counts by 30.3% and depth by 38.2% on average.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionReliable execution of scientific workflows is a fundamental concern in computational campaigns. Therefore, detecting and diagnosing anomalies are both important and challenging for workflow executions that span complex, distributed computing infrastructures. We model the scientific workflow as a directed acyclic graph and apply graph neural networks (GNNs) to identify anomalies at both the workflow and individual job levels. In addition, we generalize our GNN model to consider a set of workflows together for anomaly detection rather than a single workflow. By learning hidden representations not only from job features but also from the topological information of the workflow, our GNN models demonstrate higher accuracy and better runtime efficiency than conventional machine learning models and other convolutional neural network approaches.
Birds of a Feather
TP
XO/EX
DescriptionThe interplay of workflow technologies and HPC has been challenged by the fast rise of AI and ML technologies. Workflows empowered with ML techniques largely differ from traditional workflows running on HPC machines. In this BoF, we will bring together researchers from the workflows (https://workflows.community), HPC, and AI/ML communities that work on scientific research questions that require large-scale, distributed, and AI-heavy computing. The session will present an update on challenges, opportunities, new research directions, and future pathways, and will seek input for updating a community roadmap on HPC and AI workflows research and development.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionWhile distributed computing infrastructures are becoming increasingly complex, the user community is building ever more complex application workflows to leverage them. In addition, current trends aim to combine data analytics and artificial intelligence with HPC modeling and simulation. However, the programming models and tools differ across these fields, and there is a need for methodologies that enable the development of workflows combining HPC software, data analytics, and artificial intelligence. The eFlows4HPC project aims to provide a workflow software stack that fulfills this need.
The project is also developing the HPC Workflows as a Service (HPCWaaS) methodology that aims at providing tools to simplify the development, deployment, execution, and reuse of workflows. The project showcases its advances with three application Pillars with industrial and social relevance: manufacturing, climate, and urgent computing for natural hazards. The talk will present the actual progress and findings of the project.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
Posters
Research Posters
TP
XO/EX
DescriptionWriting software that can exploit supercomputers is difficult, and this is going to get much harder as we move toward exascale, where the scale and heterogeneity of our machines will increase significantly. A potential solution is the use of Domain Specific Languages (DSLs), which separate the programmer's logic from the mechanisms of parallelism. However, while these have shown promise, a major challenge is that DSL toolchains are often siloed, sharing little or no infrastructure between DSLs.
In this poster, we present xDSL, an ecosystem for DSL development. Built upon the hugely popular LLVM and MLIR, xDSL provides a Python-based toolbox to ease integration with MLIR, along with a series of IR dialects and transformations that DSL developers can apply. The result is that DSLs become a thin layer of abstraction atop a common, well-supported, mature, and maintained ecosystem that targets a variety of hardware architectures.
Student Cluster Competition
TP
XO/EX
DescriptionZihan Yang, Yi Chen, and Yun Pan are junior students majoring in Computer Science and have participated in ASC twice. Kaiqi Chen, from Chu Kochen Honors College, Xingjian Qian, majoring in Computer Science, and Shaojun Xu, majoring in Control Science and Engineering, are sophomore students with one ASC experience. In addition, three of the students are on the ISC22 team.
All of us are members of the Zhejiang University Supercomputing Team (ZJUSCT), the HPC interest group at ZJU. Each year, new members from various majors join the team, bringing the vigor of youth. Here we exchange ideas and hone our HPC skills. Our common interests bring us together: exchanging, learning from each other, and making progress together are the main themes of our daily life. New members inherit the team from the old, growing more talented year after year. In competitions, challenges from different domain science disciplines help us learn how to apply HPC skills in interdisciplinary work. As one of the biggest supercomputing competitions in the world, SCC has always been our dream. We fully acknowledge the fame and success of SCC, yet we have never participated in this huge event.
We are eager to take on the most challenging HPC problems. We hope to analyze them, try different ways to solve them, and develop our own approaches. The experience in the competition will help us realize our potential and leap beyond our limits. Having participated in various HPC competitions, we found that each has a different focus. SCC will be a brand-new challenge and platform to improve our academic skills.
# Team Advisor
Jianhai Chen is currently an Associate Professor in the College of Computer Science and Technology, Zhejiang University, the supervisor of the ZJU Supercomputing Team, and the leader of the Intelligent Computing and System Lab. He is a member of IEEE, ACM, and CCF, and of the CCF Blockchain Professional Committee.
Zeke Wang is a ZJU100 Young Professor at Zhejiang University in Computer Science. His research interest is to use various heterogeneous devices, e.g., FPGAs and GPUs, to build Deep Learning training systems, with a focus on giant model training.
Shuibing He is currently a ZJU100 Young Professor. His research areas include intelligent computing, parallel and distributed computing, file and storage systems, non-volatile memory, in-memory computing, operating systems, and big data processing.
Yin Zhang is an associate professor at the College of Computer Science and Technology at Zhejiang University, Hangzhou, China. His research interests lie in compilers for AI chips, code representation learning, and textual and visual content analysis.
Weiwei Xu is currently a researcher at the state key lab of CAD&CG at Zhejiang University. His research interests lie in computer graphics, especially physical simulation and computer animation, computational geometry, 3D printing, and virtual reality.
Chong Zeng is a senior undergraduate student studying Computer Science and Technology at Zhejiang University. He is the team leader of our school's supercomputing team. His research interests include computer vision, computer graphics, deep learning, and high-performance computing.
Sessions
Workshop
Recorded
Runtime Systems
System Software
W
Workshop
Recorded
AI-HPC Convergence
Emerging Technologies
Memory Systems
Networks
Resource Management and Scheduling
W
Workshop
Recorded
W
Workshop
Recorded
Correctness
Software Engineering
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Workshop
Recorded
Architectures
Cloud and Distributed Computing
Emerging Technologies
Networks
Scientific Computing
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Recorded
TP
Paper
Recorded
Accelerator-based Architectures
Bioinformatics
File Systems and I/O
TP
Paper
Recorded
Machine Learning and Artificial Intelligence
Software Engineering
State of the Practice
TP
Workshop
Eighth International Workshop on Heterogeneous High-Performance Reconfigurable Computing (H2RC 2022)
Recorded
W
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
Exhibits
Exhibit Floor Ribbon Cutting
TP
XO/EX
Workshop
Recorded
W
Workshop
Recorded
Security
W
ACM Gordon Bell COVID Finalist
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
Workshop
Recorded
Algorithms
Architectures
Compilers
Computational Science
Exascale Computing
Heterogeneous Systems
Hierarchical Parallelism
Memory Systems
Parallel Programming Languages and Models
Parallel Programming Systems
Resource Management and Scheduling
W
Paper
Recorded
Architectures
Machine Learning and Artificial Intelligence
TP
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
Paper
Recorded
Networks
Performance
Visualization
TP
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Reception
Research Posters
Scientific Visualization & Data Analytics Showcase
TP
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
Students@SC
Awards Presentation
SC22 Opening Session & Turing Lecture
Recorded
Awards
Keynote
Turing
TP
W
TUT
XO/EX
Press Briefing
Workshop
Recorded
W
Posters
Scientific Visualization & Data Analytics Showcase
Recorded
TP
Posters
Scientific Visualization & Data Analytics Showcase
TP
XO/EX
Posters
Scientific Visualization & Data Analytics Showcase
TP
XO/EX
Posters
Scientific Visualization & Data Analytics Showcase
TP
XO/EX
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
Paper
Recorded
Extreme Scale Computing
Memory Systems
Parallel Programming Systems
State of the Practice
TP
Student Cluster Competition
Student Cluster Competition
TP
XO/EX
Student Cluster Competition
Student Cluster Competition
TP
XO/EX
Student Cluster Competition
Student Cluster Competition
TP
XO/EX
Student Cluster Competition
Student Cluster Competition Kick-Off
TP
XO/EX
Student Cluster Competition
Student Cluster Competition Lightning Talks
TP
XO/EX
Student Cluster Competition
Student Cluster Competition Poster Session
TP
XO/EX
Student Cluster Competition
Student Cluster Competition Safety Briefing
TP
XO/EX
Student Cluster Competition
Student Cluster Competition Wrapup
TP
XO/EX
Students@SC
Students@SC
Students@SC
Paper
Recorded
Quantum Computing
Resource Management and Scheduling
System Software
TP
Break
Tech Program Afternoon Break
TP
Break
Tech Program Afternoon Break
TP
Break
Tech Program Afternoon Break
TP
Break
Tech Program Morning Break
TP
Break
Tech Program Morning Break
TP
Break
Tech Program Morning Break
TP
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
Workshop
Recorded
W
Workshop
Recorded
Reliability and Resiliency
W
Break
Tutorial and Workshop Afternoon Break
W
TUT
Break
Tutorial and Workshop Afternoon Break
W
TUT
Break
Tutorial and Workshop Afternoon Break
W
TUT
Break
Tutorial and Workshop Afternoon Break
W
TUT
Break
Tutorial and Workshop Morning Break
W
TUT
Break
Tutorial and Workshop Morning Break
W
TUT
Break
Tutorial and Workshop Morning Break
W
TUT
Break
Tutorial and Workshop Morning Break
W
TUT
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
Break
Workshop Morning Break
W
Break
Workshop Morning Break
W
Workshop
Recorded
W
Workshop
Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems (ScalAH'22)
Recorded
Algorithms
Exascale Computing
Extreme Scale Computing
Heterogeneous Systems
Post-Moore Computing
Quantum Computing
W