Posters
Scientific Visualization & Data Analytics Showcase
TP
XO/EX
Description: The Advanced Visualization Lab at the National Center for Supercomputing Applications created a cinematic scientific visualization of the ArcticDEM survey and the Vavilov ice cap collapse for the documentary film "Atlas of a Changing Earth", in both digital fulldome and flatscreen television formats. While the ArcticDEM dataset is the main one featured here, this visualization fills in gaps using other datasets, including a climate simulation by Bates et al. and Landsat imagery. The visualization required a number of steps, including manual and algorithmic data cleaning, processing, and alignment; data fusion; virtual scene design; morphing interpolation; lighting design; camera choreography; compositing; and rendering on the Blue Waters supercomputer.
ACM Gordon Bell Finalist
Awards Presentation
Recorded
Awards
TP
Description: Over the past three decades, ab initio electronic structure calculations of large, complex, and metallic systems have been limited to tens of thousands of atoms in computational accuracy and efficiency on leadership supercomputers. We present a massively parallel discontinuous Galerkin density functional theory (DGDFT) implementation, which adopts adaptive local basis functions to discretize the Kohn-Sham equation, resulting in a block-sparse Hamiltonian matrix. A highly efficient pole expansion and selected inversion (PEXSI) sparse direct solver is implemented in DGDFT to achieve O(N^1.5) scaling for quasi-two-dimensional systems. DGDFT allows us to compute the electronic structures of complex metallic heterostructures with 2.5 million atoms (17.2 million electrons) using 35.9 million cores on the new Sunway supercomputer. The peak performance of PEXSI reaches 64 PFLOPS (5% of the theoretical peak), which is unprecedented for sparse direct solvers. This accomplishment paves the way for quantum mechanical simulations at the mesoscopic scale for designing next-generation electronic devices.
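As a back-of-the-envelope check of the stated efficiency figure (derived purely from the numbers in the abstract, not from the paper itself), reaching 64 PFLOPS at 5% of peak implies a theoretical machine peak of

\[
P_{\text{peak}} = \frac{64\ \text{PFLOPS}}{0.05} = 1280\ \text{PFLOPS} \approx 1.28\ \text{EFLOPS}.
\]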
Awards Presentation
Recorded
Awards
TP
Description: Linking scientific instruments and computation: Patterns, technologies, experiences
Powerful detectors at modern experimental facilities that collect data at multiple GB/s require online computing to process the resulting data flows. I review common patterns associated with such online analyses, and present new methods for configuring and running the resulting distributed computing pipelines. I present experiences with the application of these methods to the processing of data from five scientific instruments, each of which engages powerful computers for data inversion, model training, or other purposes. I also discuss implications of such methods for operators and users of scientific facilities.
Awards Presentation
Recorded
Awards
TP
Description: From two strong oxen to billions of fleas: orchestrating computation and data in modern high-performance computing
Following Sidney Fernbach's legacy, we will explore how massively parallel distributed supercomputers are designed, programmed, and operated today. We focus on aspects of distributed-memory parallelism using Remote Direct Memory Access through the Message Passing Interface. We will close with an outlook on where technology will lead us and the new problems for the HPC community to tackle in the coming years.
Awards Presentation
Recorded
Awards
TP
Description: Quotes from Seymour Cray—Are we living up to his legacy?
Seymour Cray, often regarded as the “father of supercomputing”, endowed us with valuable quotes during his stellar career, and many of those quotes can now be found online. One can say that HPC in general has made massive progress since his unfortunate passing, but has it advanced in a way that lives up to his ideals and legacy, and is it moving forward properly? Moreover, it is difficult even for a genius to predict the future accurately, so do the ideals in his quotes hold up in present-day HPC? We review his quotes against some historical supercomputing developments I have been involved in to address these questions.
Birds of a Feather
TP
XO/EX
Description: Data-intensive supercomputer applications are increasingly important workloads, especially for “Big Data” problems, but are ill-suited for most of today’s computing platforms (at any scale!). The Graph500 list has grown to over 328 entries and has demonstrated the challenges of even simple analytics. The new SSSP kernel introduced at SC17 has increased the benchmark’s overall difficulty. This BoF will unveil the latest Graph500 lists, provide in-depth analysis of the kernels and machines, and present the new energy metrics of the Green Graph500. It will offer a forum for the community and provide a rallying point for data-intensive supercomputing problems.
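For readers unfamiliar with the benchmark's kernels, the sketch below shows the essence of the SSSP kernel (Dijkstra's algorithm) on a toy edge list. This is an illustrative single-threaded sketch, not the reference Graph500 implementation, which runs distributed on massive generated graphs.

```python
import heapq
from collections import defaultdict

def sssp(edges, source):
    """Single-source shortest paths (Dijkstra) over a weighted edge list."""
    graph = defaultdict(list)
    for u, v, w in edges:
        graph[u].append((v, w))
        graph[v].append((u, w))  # Graph500 graphs are undirected

    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, already improved
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

print(sssp([(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0)], source=0))
# {0: 0.0, 1: 1.0, 2: 3.0}
```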
Student Cluster Competition
TP
XO/EX
Description: The SDSC/UCSD SCC22 team is enthusiasm-driven, technically capable, fast-learning, and deeply experienced across the computer hardware and software stacks. Each team member is uniquely qualified and committed to using HPC to advance their field. We have one returning team member from the SCC21 virtual cluster competition, one team member graduating from previous competition training to the competition team, three former team members serving as team mentors, and four new students joining the competition team. We are confident in our team’s ability to tackle expected and unexpected challenges in the competition, using a combination of rigorous preparation, strong communication, robust planning, detailed learning, and efficient teamwork. Our team training activities are fully supported by SDSC through the HPC Students Program, and we are engaging directly with each of our sponsors for expert sessions on computer architecture, optimizing compilers, HPC in the cloud, containerization, and more.
Our team members exploit the full flexibility of the UCSD computer science, cognitive science, and computer engineering majors. Our technical stack includes: major programming languages (C, C++, Java, Python, Fortran), system administration, firmware engineering, parallel programming (MPI, OpenMP, CUDA), hardware design (SystemVerilog, Tcl, Cadence, Synopsys), scientific applications (LAMMPS, Quantum Espresso, Avogadro, VMD), full-stack web development (Node.js, React, HTML), scripting and batch processing, and machine learning. Many team members have both undergraduate research and industry internship experience.
Edward Burns previously interned at SDSC, and he brings image processing, software engineering, and batch scheduler optimization experience to the team. He hopes that HPC experience will help him build highly scalable computer vision software throughout his career.
Davit Margarian brings a VLSI chip design and firmware background to the team. He hopes to use his HPC experience to accelerate computer-aided design tools for billion-gate integrated circuits.
Stefanie Dao is experienced across operating systems, computer vision, and high-performance software. She plans to apply her HPC experience to server-side processing and updating of augmented reality experiences in real time.
Longtian Bao has strong scripting, software engineering, and web development skills, and he participated in last year’s team training. He is excited to apply his skills to resource budgeting and performance monitoring during the competition.
Yuchen Jing has extensive networking and Linux system administration experience from hosting network proxies, file transfer servers, and version control systems. He is looking forward to strengthening his skills in developing, deploying, and maintaining high performance software.
Matthew Mikhailov competed at SCC21, and is the go-to person for the team. He specializes in VLSI chip design and computational materials science, and he uses the LAMMPS code for his research. He hopes to learn from his SCC experience to design the next generation of supercomputer chips.
Team advisor, Dr. Mary Thomas, SDSC HPC Training Lead, holds degrees in physics, computer science, and computational science, and taught parallel computing for 16 years. She has a personal commitment to the SCC program -- she has led 4 teams: SCC16 and 17 (San Diego State University) and SCC20-21 (UCSD). Her enthusiasm, knowledge, and practical experience will benefit the team.
Posters
Research Posters
TP
XO/EX
Description: Radio-frequency cavities are key components of high-energy particle accelerators, quantum computers, and other devices. Designing cavities poses many computational challenges, such as multi-objective optimization and the high-performance computing (HPC) resources required to handle large cavities. In particular, the multi-objective optimization requires an efficient 3D full-wave electromagnetic simulator, for which we rely on the integral equation (IE) method; this in turn requires a fast solver with HPC and ML algorithms to search for resonance modes.
We propose an HPC-based fast direct matrix solver for the IE, combined with hybrid optimization algorithms, to attain an efficient simulator for accelerator cavity modeling. First, we solve the linear eigenproblem for each trial frequency with a distributed-memory parallel fast direct solver. Second, we propose combining the global optimizer (Gaussian process) with the local optimizer (downhill simplex method) to generate the trial frequency samples, which successfully optimizes the corresponding 1D objective function with multiple sharp minima.
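A minimal sketch of the hybrid idea (a global Gaussian-process surrogate proposing start points, the local downhill simplex polishing them), using scikit-learn and SciPy on a stand-in 1D objective. The objective, sample counts, and library choices here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(f):
    # Stand-in 1D objective with several sharp minima, mimicking a
    # resonance search over a trial frequency f (purely illustrative).
    return 1.0 - np.exp(-200 * (f - 0.31) ** 2) - 0.8 * np.exp(-200 * (f - 0.77) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 20).reshape(-1, 1)           # initial trial frequencies
y = np.array([objective(f) for f in X.ravel()])

gp = GaussianProcessRegressor().fit(X, y)          # global surrogate model
grid = np.linspace(0, 1, 1001).reshape(-1, 1)
starts = grid[np.argsort(gp.predict(grid))[:5]]    # most promising regions

# Polish each candidate with the local downhill simplex (Nelder-Mead).
best = min((minimize(lambda f: objective(f[0]), s, method="Nelder-Mead")
            for s in starts), key=lambda r: r.fun)
print(best.x, best.fun)
```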
Posters
Research Posters
TP
XO/EX
Description: We present a modern C++20 interface for MPI 4.0. The interface utilizes recent language features to ease the development of MPI applications. An aggregate reflection system enables automatic generation of MPI datatypes from user-defined classes. Immediate and persistent operations are mapped to futures, which can be chained to describe sequential asynchronous operations and task graphs in a concise way. This work introduces the prominent features of the interface with examples. We further measure its performance overhead with respect to the raw C interface.
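The paper's interface is C++, but the future-based style can be previewed with mpi4py's futures module; the sketch below is a Python analogy of mapping operations to chainable futures, not the proposed C++20 API.

```python
# Run with: mpiexec -n 3 python -m mpi4py.futures demo.py
from mpi4py.futures import MPIPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":
    with MPIPoolExecutor() as pool:
        fut = pool.submit(square, 7)          # asynchronous operation -> future
        # Chain a dependent step once the first result is available,
        # describing a tiny sequential task graph.
        chained = pool.submit(square, fut.result())
        print(chained.result())               # 2401
```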
Workshop
Recorded
W
Description: In high-performance computing, new use cases are emerging in which classical numerical simulations are coupled with machine learning as a surrogate for complex physical models that are expensive to compute. In the context of simulating reactive thermo-fluid systems, replacing current state-of-the-art tabulated chemistry with machine learning inference is an active field of research. For this purpose, a simplified OpenFOAM application is coupled with an artificial neural network. In this work, we present a case study focusing solely on the performance of the coupled OpenFOAM-ML application. Our coupling approach features a heterogeneous cluster architecture combining pure CPU nodes and nodes equipped with two NVIDIA V100 GPUs. We evaluate our approach by comparing the inference performance, and the communication our approach induces, across various machine learning frameworks. Additionally, we compare the GPUs with the NEC Vector Engine Type 10B regarding inference performance.
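A minimal sketch of such a coupling loop, with a stand-in CFD step and a surrogate inference call. The names `cfd_step` and `surrogate` are hypothetical placeholders; the actual study couples OpenFOAM with an external ML framework across heterogeneous nodes.

```python
import numpy as np

def cfd_step(state, source_terms):
    # Stand-in for one OpenFOAM solver iteration (placeholder physics).
    return state + 0.01 * (source_terms - state)

def surrogate(state):
    # Stand-in for neural-network inference replacing tabulated
    # chemistry: maps the thermo-fluid state to chemical source terms.
    return np.tanh(state)

state = np.random.default_rng(1).normal(size=100_000)
for step in range(10):
    # In the coupled setup this call would ship `state` to a GPU (or
    # vector-engine) node, run inference there, and return the result;
    # the induced communication is part of what the study measures.
    source_terms = surrogate(state)
    state = cfd_step(state, source_terms)
print(state.mean())
```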
Workshop
Recorded
Quantum Computing
W
Description: Photons are natural resources in quantum information, and the last decade has shown significant progress in high-quality single-photon generation and detection. Furthermore, photonic qubits are easy to manipulate and do not require strongly isolated environments, making them an appealing platform for QC. With the one-way model, the vision of universal and large-scale QCs based on photonics becomes feasible. In one-way computing, the input state is not an initial product state |0>^n, but a so-called cluster state. A series of measurements on the cluster state's individual qubits and their temporal order, together with a feed-forward procedure, determine the quantum circuit to be executed. We propose a pipeline to convert a QASM circuit into a graph representation named measurement-graph (m-graph), which can be directly translated to hardware instructions on an optical one-way QC. Additionally, we optimize the graph using ZX-calculus before evaluating the execution on an experimental discrete-variable photonic platform.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
Description: Scientific applications are increasingly adopting Artificial Intelligence (AI) techniques to advance science. High-performance computing centers are evaluating emerging novel hardware accelerators to efficiently run AI-driven science applications. With a wide diversity in the hardware architectures and software stacks of these systems, it is challenging to understand how these accelerators perform. The state-of-the-art in the evaluation of deep learning workloads primarily focuses on CPUs and GPUs. In this paper, we present an overview of dataflow-based novel AI accelerators from SambaNova, Cerebras, Graphcore, and Groq.
We present a first-of-a-kind evaluation of these accelerators with a diverse set of workloads, such as deep learning (DL) primitives, benchmark models, and scientific machine learning applications. We also evaluate the performance of collective communication, which is key for distributed DL implementation, along with a study of scaling efficiency. We then discuss key insights, challenges, and opportunities in integrating these novel AI accelerators in supercomputing systems.
Invited Talk
Recorded
TP
XO/EX
Description: Predictive understanding and actionable insights in sustainability in the modern era require an effective blend of theory and data-driven sciences. Relevant theory includes physics, biogeochemistry, and ecology within the natural sciences, and engineering principles, economics, and social and governance principles in human-engineered systems and the social sciences. The data-driven sciences need to consider Big Data, such as archived numerical model simulations along with remotely sensed observations, and relatively small data, such as historical observations or even prehistorical proxy records, as well as prior domain knowledge and lessons learned from rare events and extremes. The underlying spatiotemporal data generation processes may be nonlinear dynamical, even chaotic, while the variability may be low frequency, even 1/f noise. Data may be sparse or incomplete, prior knowledge and physics may be incomplete or over-parameterized, while falsifiability and comprehensive uncertainty characterization are critical to inform decisions and add to our collective knowledge. Understanding the implications for domain-aware high-performance computing may be critical both for the sciences and engineering and for investments or research directions in supercomputing. The first part of the presentation will describe these challenges and discuss how next-generation Artificial Intelligence may be able to provide solutions and where further developments may be necessary. The second part of the presentation will discuss recent research at my Sustainability and Data Sciences Laboratory, specifically on the impacts of climate variability and weather extremes on ecology and biodiversity and on urban or regional critical lifeline infrastructures, with an emphasis on the associated challenges and opportunities in processing earth science data.
Doctoral Showcase
Posters
TP
XO/EX
Description: Python's extensive software ecosystem leads to high productivity, rendering it the language of choice for scientific computing. However, executing Python code is often slow or impossible in emerging architectures and accelerators. To complement Python's productivity with the performance and portability required in high-performance computing (HPC), we introduce a workflow based on data-centric (DaCe) parallel programming. Python code with HPC-oriented extensions is parsed into a dataflow-based intermediate representation, facilitating analysis of the program's data movement. The representation is optimized via graph transformations driven by the users, performance models, and automatic heuristics. Subsequently, hardware-specific code is generated for supported architectures, including CPU, GPU, and FPGA. We evaluate the above workflow through three case studies. First, to compare our work to other Python-accelerating solutions, we introduce NPBench, a collection of over 50 Python microbenchmarks across a wide range of scientific domains. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer. DaCe runs 10x faster than the reference Python execution and achieves 2.47x and 3.75x speedups over previous-best solutions and up to 93.16% scaling efficiency. Second, we re-implement in Python and optimize the Quantum Transport Simulator OMEN. The application's DaCe version executes one to two orders of magnitude faster than the original code written in C++, achieving 42.55% of the Summit supercomputer's peak performance. Last, we utilize our workflow to build Deinsum, an automated framework for distributed multilinear algebra computations expressed in Einstein notation. Deinsum performs up to 19x faster over state-of-the-art solutions on the Piz Daint supercomputer.
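As a flavor of the workflow's front end, here is a minimal data-centric Python program in DaCe's documented style (a sketch assuming the `dace` package; the kernel and sizes are illustrative):

```python
import numpy as np
import dace

N = dace.symbol("N")

@dace.program
def saxpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    # Parsed into a dataflow IR, optimized via graph transformations,
    # then code-generated for the target (CPU, GPU, or FPGA).
    y[:] = a * x + y

x = np.ones(1024)
y = np.ones(1024)
saxpy(2.0, x, y)   # JIT-compiles the dataflow program on first call
print(y[:3])       # [3. 3. 3.]
```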
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
Description: Scientific Workflow Management Systems (SWfMS) systematically capture and store diverse provenance information at various phases. Scientists compose a multitude of queries over this information. Support for integrated query composition and visualization in existing SWfMS is limited; most systems do not support any custom query composition. VisTrails and Taverna introduced the custom query languages vtPQL and TriQL to support limited workflow monitoring. Galaxy only tracks histories of operations and displays them in lists. No SWfMS supports a scientist-friendly user interface for provenance query composition and visualization. We propose a domain-specific composition environment for provenance queries of scientific workflows. As a proof of concept, we developed a provenance system for a bioinformatics workflow management system and evaluated it along multiple dimensions: one measuring participants' subjective perception of its usability using the NASA-TLX and SUS survey instruments, and the other measuring its flexibility through plugin integration using NASA-TLX.
Workshop
Recorded
W
Description: There is an increasing demand to incorporate hybrid environments as part of workflows across edge, cloud, and HPC systems. In such a converging environment of cloud and HPC, containers are starting to play a more prominent role, bringing their networking infrastructure along with them. However, the current body of work shows that container overlay networks, which are often used to connect containers across physical hosts, are ill-suited for the HPC environment. They tend to impose significant overhead and noise, resulting in degraded performance and disturbance to co-located processes on the same host.
This presentation focuses on utilizing a novel class of hardware, the Data Processing Unit (DPU), to offload the networking stack of overlay networks from the host onto the DPU. We intend to show that such ancillary offload is possible and that it will result in decreased overhead on host nodes, which in turn will improve the performance of running processes.
Workshop
Recorded
W
Description: Version 4.0 of the Message Passing Interface standard introduced the concept of Partitioned Communication, which adds support for multiple contributions to a communication buffer. Although initially targeted at multithreaded MPI applications, Partitioned Communication is currently attracting attention in the context of accelerators, especially GPUs. In this publication, we demonstrate that this communication concept can be implemented for SYCL-programmed FPGAs. This includes a discussion of the design space and the presentation of a prototype implementation. Experimental results show that a lightweight implementation on top of an existing MPI library is possible. The presented approach also reveals issues in both the SYCL and the MPI standards, which need to be addressed for improved support of the intended communication style.
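For context, MPI 4.0 partitioned communication lets multiple workers fill disjoint partitions of a single send buffer and mark each partition ready independently (MPI_Psend_init / MPI_Pready in the C API). The sketch below mimics only that buffer-partitioning idea with Python threads; it is a conceptual stand-in, not MPI and not the SYCL/FPGA implementation.

```python
import threading
import numpy as np

NPART, PART_LEN = 4, 1024
buf = np.empty(NPART * PART_LEN)
ready = [threading.Event() for _ in range(NPART)]  # like per-partition MPI_Pready flags

def fill(p):
    # Each worker contributes one partition of the shared send buffer.
    buf[p * PART_LEN:(p + 1) * PART_LEN] = p
    ready[p].set()  # mark this partition ready for transfer

workers = [threading.Thread(target=fill, args=(p,)) for p in range(NPART)]
for w in workers:
    w.start()

# A "transfer" can start on each partition as soon as it is ready,
# instead of waiting for the whole buffer (the point of partitioning).
for p in range(NPART):
    ready[p].wait()
    print(f"partition {p} sent, mean={buf[p*PART_LEN:(p+1)*PART_LEN].mean()}")

for w in workers:
    w.join()
```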
Workshop
Recorded
W
Description: Background. Automated breast tumor segmentation for dynamic contrast-enhanced magnetic resonance (DCE-MR) imaging is a crucial step to advance the implementation of radiomics for image-based, quantitative assessment of breast tumors and cancer phenotyping. Current studies focus on developing tumor segmentation, which often requires initial seed points from expert radiologists or atlas-based segmentation methods. We develop a robust, fully automated end-to-end segmentation pipeline for breast cancers on bilateral breast MR studies.
Methods. On IRB-approved, diverse breast cancer MR cases, a deep learning segmentation algorithm was created and trained. The model’s backbone is UNet++, which consists of U-Nets of varying depths whose decoders are densely connected at the same resolution via skip connections; all the constituent U-Nets are trained simultaneously to learn a shared image representation. This design not only improves the overall segmentation performance, but also enables model pruning at inference time. The model was trained on breast tumors located independently by a radiologist, with consensus review by a second radiologist with at least five years of experience. MRI was performed using a 3.0-T imaging system in the prone position with a dedicated 16-channel breast coil, and T1-weighted DCE-MR images were analyzed for the study. We used an 80:20 random split for training and validation of the model.
Results. A total of 124 breast cancer patients had pre-treatment MR imaging before the start of NST; the cohort comprised 49 HR+HER2-, 37 HR+HER2+, 11 HR-HER2+, and 27 TNBC cases (mean tumor size 2.3 cm (+/- 3.1 mm)). The model was tested on 2,571 individual images. Overall, the model scored a 0.85 [0.84–0.86, 95% CI] Dice score and a 0.8 [0.79–0.81, 95% CI] IoU score. TNBC tumors scored Dice [0.88–0.89, 95% CI], HER2-negative and ER/PR-positive Dice [0.84–0.85, 95% CI], and HER2-positive Dice [0.84–0.85, 95% CI]. We observed that the model performed equally well on solid tumors and irregular shapes, and we did not observe any difference in segmentation performance between residual and non-residual tumor types: Dice scores of [0.85–0.86, 95% CI] and [0.83–0.84, 95% CI], respectively.
Conclusion. The proposed segmentation model performs equally well on various clinical breast cancer subtypes. The model has a high false-positive rate around biopsy clips and high background enhancement, which we plan to address by adding annotations of the clip and of high non-cancer enhancement to future training data. We will release the trained model under an open-source license to increase the scalability of radiomics studies with fully automated segmentation. Given the importance of breast cancer subtypes as prognostic factors in women with operable breast cancer, automated segmentation of varying breast tumor subtypes will help analyze imaging biomarkers embedded within standard-of-care imaging studies at larger scale, which will potentially help radiologists, pathologists, surgeons, and clinicians understand features driving breast cancer phenotypes and pave the way for developing digital twins for breast cancer patients.
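For reference, the two reported overlap metrics can be computed from binary masks as below (a generic NumPy sketch, not the study's evaluation code):

```python
import numpy as np

def dice_and_iou(pred, truth):
    """Dice and IoU scores for binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    dice = 2 * inter / (pred.sum() + truth.sum())
    iou = inter / union
    return dice, iou

pred = np.zeros((4, 4)); pred[1:3, 1:3] = 1   # predicted tumor mask
truth = np.zeros((4, 4)); truth[1:3, 1:4] = 1  # radiologist mask
print(dice_and_iou(pred, truth))  # (0.8, 0.666...)
```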
Workshop
Recorded
W
Description: Graphics Processing Units are nowadays used to accelerate applications in multiple scientific domains, and it is therefore necessary even for researchers outside of computer science to learn how to use them. However, traditional GPU programming courses are often aimed at people with a computer science or high-performance computing background.
To address this challenge, we developed a GPU programming course following the Carpentries pedagogical style, centered around live coding and the teaching of actionable skills. The course is open source, freely available online in the Carpentries Incubator, and has been successfully taught both online and in person.
Paper
Recorded
Applications
Computational Science
Scientific Computing
TP
Description: Simulations to calculate a single gravitational waveform (GW) can take several weeks. Yet, thousands of such simulations are needed for the detection and interpretation of gravitational waves, and future detectors will require even more accurate waveforms. Here we present the first large-scale, adaptive-mesh, multi-GPU numerical relativity (NR) code, along with performance analysis and benchmarking. While comparisons are difficult to make, our GPU extension of the Dendro-GR NR code achieves a 6x speedup over existing state-of-the-art codes. We achieve 800 GFlops/s on a single NVIDIA A100 GPU, with an overall 2.5x speedup over a two-socket, 128-core AMD EPYC 7763 CPU node running an equivalent CPU implementation. We present detailed performance analyses, parallel scalability results, and accuracy assessments for GWs computed for mass ratios q=1,2,4. We also present strong scalability up to 8 A100s and weak scaling up to 229,376 x86 cores on the Texas Advanced Computing Center's Frontera system.
Workshop
Recorded
Algorithms
Architectures
Compilers
Computational Science
Exascale Computing
Heterogeneous Systems
Hierarchical Parallelism
Memory Systems
Parallel Programming Languages and Models
Parallel Programming Systems
Resource Management and Scheduling
W
Description: We introduce a new high-performance design for parallelism within the Quantum Monte Carlo code QMCPACK. We demonstrate that the new design is better able to exploit the hierarchical parallelism of heterogeneous architectures compared to the previous GPU implementation. The new version is able to achieve higher GPU occupancy via the new concept of crowds of Monte Carlo walkers, and by enabling more host CPU threads to effectively offload to the GPU. The higher performance is expected to be achieved independent of the underlying hardware, significantly improving developer productivity and reducing code maintenance costs. Scientific productivity is also improved with full support for fallback to CPU execution when GPU implementations are not available or CPU execution is more optimal.
Posters
Research Posters
TP
XO/EX
Description: HPC systems are at risk of being underutilized due to the varying resource requirements of applications and the imbalance of utilization among subsystems. This work provides a holistic analysis and view of memory utilization on a leadership computing facility, the Perlmutter system at NERSC, through which we gain insights into the resource usage patterns of the memory subsystem. The results of the analysis can help evaluate current system configurations, offer recommendations for future procurement, provide feedback to users on code efficiency, and motivate research in new architecture and system designs.
Posters
Research Posters
TP
XO/EX
Description: Monitoring the status of large computing systems is essential to identify unexpected behavior and improve their performance and uptime. However, due to the large-scale and distributed design of such computing systems, as well as the large number of monitoring parameters, automated monitoring methods should be applied. Such automatic monitoring methods should also have the ability to adapt themselves to the continuous changes in the computing system. In addition, they should be able to identify behavioral anomalies quickly enough to allow appropriate reactions. This work proposes a general, lightweight, and unsupervised method for near-real-time anomaly detection using operational data measurements on large computing systems. The proposed model requires as little as 4 hours of data and 50 epochs per training process to accurately capture the behavioral pattern of the computing system.
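One common lightweight, unsupervised pattern in this class of methods (illustrative, not necessarily the authors' model) is to learn a compact reconstruction of normal telemetry and flag samples with large reconstruction error:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (5000, 16))      # toy stand-in for ~4h of telemetry
model = PCA(n_components=4).fit(normal)    # learn the "normal" pattern

def anomaly_score(x):
    # Reconstruction error: a large error means the sample deviates
    # from the learned behavioral pattern of the system.
    recon = model.inverse_transform(model.transform(x))
    return np.linalg.norm(x - recon, axis=1)

threshold = np.percentile(anomaly_score(normal), 99)
spike = normal[:5] + 8.0                   # inject an obvious fault
print(anomaly_score(spike) > threshold)    # [ True  True  True  True  True]
```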
Birds of a Feather
TP
XO/EX
Description: Compute Express Link™ (CXL™) maintains memory coherency between the CPU memory space and memory on CXL-attached devices. CXL enables a high-speed, efficient interconnect between the CPU, platform enhancements, and workload accelerators such as GPUs, FPGAs, and other purpose-built accelerator solutions.
This BoF session will feature a panel of experts from the CXL Consortium to discuss available CXL devices and what devices the industry can expect to see in the next year. The experts will also explore the new features in the CXL 3.0 specification and the new usage models it will enable.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
Description: Tighter integration of computational resources can foster superior application performance by mitigating communication bottlenecks. Unfortunately, not every application can use every compute element or accelerator all the time; as a result, co-locating resources often leads to under-utilization. In the next five years, HPC system architects will be presented with a spectrum of accelerated solutions, ranging from tightly coupled, single-package APUs to a sea of disaggregated GPUs interconnected by a global network. In this paper, we detail NEthing, our methodology and tool for evaluating the potential performance implications of such diverse architectural paradigms. We demonstrate our methodology on today’s and projected 2026 technologies for three distinct workloads: a compute-intensive kernel, a tightly-coupled HPC simulation, and an ensemble of loosely-coupled HPC simulations. We use NEthing to quantify the increased utilization disaggregated systems must achieve to match the superior performance of APUs and on-board GPUs.
Posters
Research Posters
TP
XO/EX
Description: Real-world HPC workloads place heavy pressure on storage systems, as they are highly data-dependent. On the other hand, as a result of recent developments in storage hardware, the storage diversity of upcoming HPC systems is expected to grow. This growing complexity in the storage system presents challenges to users and often results in I/O bottlenecks due to inefficient usage. There have been several studies on reducing I/O bottlenecks. The earliest attempts addressed the problem by combining I/O characteristics with expert insight; more recent attempts rely on performance analysis from I/O characterization tools. However, the problem is multifaceted, with many metrics to consider, and is hence difficult to tackle manually, even for experts. In this work, we develop a methodology that produces a multifaceted view of the I/O behavior of a workload to identify potential I/O bottlenecks automatically.
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
Description: This paper shares a perspective to help the research software engineering (RSE) community navigate the National Laboratory landscape. The RSE role is a recent concept, which has led to organizational challenges in placing RSEs and evaluating their impact, costs, and benefits. The premise is that RSEs are a natural fit into the current landscape and can use traditional career growth strategies in science: publications, community engagement, and proposals. Projects funding RSEs can benefit from this synergy and include these traditional activities. Still, a great deal of introspection is needed to close the gaps between the rapidly evolving RSE landscape and the well-established communication patterns in science. This perspective is built upon interactions in industry, academia, and government in high-performance computing (HPC) environments. The goal is to contribute to the conversation around RSE career growth and to understand the return on investment for scientific projects and sponsors.
Awards Presentation
Test of Time
Recorded
Awards
TP
Description: For decades, the high-performance computing (HPC) community has focused on performance, where performance is defined as speed. To achieve better performance per compute node, microprocessor vendors have not only doubled the number of transistors (and speed) every 18-24 months, they have also doubled the power densities. Consequently, keeping a large-scale HPC system functioning properly requires continual cooling in a large machine room, resulting in substantial operational costs. Furthermore, the increase in power densities has led (in part) to a decrease in system reliability, leading to lost productivity.
To address these problems, we propose a power-aware algorithm that automatically and transparently adapts its voltage and frequency settings to achieve significant power reduction and energy savings with minimal impact on performance. Specifically, we leverage a commodity technology called “dynamic voltage and frequency scaling” to implement our power-aware algorithm in the run-time system of commodity HPC systems.
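A schematic of the underlying idea (toy thresholds and workload model; real implementations interact with hardware DVFS interfaces such as cpufreq): pick the lowest frequency whose projected slowdown stays within a user-set bound, exploiting the fact that memory- and communication-bound phases barely stretch when the clock drops.

```python
# Schematic power-aware DVFS policy; frequencies in GHz (toy values).
FREQS = [1.2, 1.6, 2.0, 2.4]

def pick_frequency(cpu_bound_fraction, max_slowdown=0.05):
    f_max = FREQS[-1]
    for f in FREQS:  # try the lowest (most power-saving) setting first
        # Only the CPU-bound fraction of the phase stretches when the
        # frequency drops; the rest is memory/communication-bound.
        slowdown = cpu_bound_fraction * (f_max / f - 1.0)
        if slowdown <= max_slowdown:
            return f
    return f_max

print(pick_frequency(0.9))   # compute-heavy phase -> 2.4 (stay fast)
print(pick_frequency(0.05))  # communication phase -> 1.2 (save power)
```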
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
Description: Although in situ visualization can reduce the amount of data written to storage, it can still generate a large amount of data for subsequent analysis, for instance, images rendered from different viewpoints at every visualization time step. Considering that some of these images can be similar, appropriately selecting images to reduce their total number would help minimize the analysis time for understanding the underlying simulation phenomena without missing important features. As an approach to such smart in situ visualization, we have worked on adaptive time step selection, which skips time steps with little change between them. In this lightning talk, focusing on the set of images generated from different viewpoints at every time step, we present a PSNR-based image selection approach that eliminates similar images to further reduce the total number of images, targeting smarter in situ visualization.
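The selection criterion can be as simple as the sketch below: compute PSNR between a candidate image and the last kept image, and drop the candidate when PSNR exceeds a similarity threshold (threshold and images here are illustrative, not the talk's parameters).

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio between two images, in dB."""
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

def select_images(images, threshold_db=40.0):
    # Keep an image only if it differs enough (low PSNR) from the
    # previously kept one; high PSNR means "too similar, skip it".
    kept = [images[0]]
    for img in images[1:]:
        if psnr(img, kept[-1]) < threshold_db:
            kept.append(img)
    return kept

rng = np.random.default_rng(0)
base = rng.integers(0, 256, (64, 64)).astype(np.uint8)
frames = [base, base.copy(), rng.integers(0, 256, (64, 64)).astype(np.uint8)]
print(len(select_images(frames)))  # 2: the duplicate frame is dropped
```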
Workshop
Recorded
Quantum Computing
W
Description: We present Q# implementations for arbitrary fixed-point arithmetic operations for a gate-based quantum computer based on lookup tables (LUT). In general, this is an inefficient way of implementing a function since the number of inputs can be large or even infinite. However, if the input domain can be bounded and there can be some error tolerance in the output (both of which are often the case in practical use-cases), the quantum LUT implementation of certain quantum arithmetic functions can be more efficient than their corresponding reversible arithmetic implementations. We discuss the implementation of the LUT using Q#, show examples of how to use the LUT to implement quantum arithmetic functions, and compare the resources required for the implementation with the current state-of-the-art bespoke implementations of exponential and Gaussian functions.
Workshop
Recorded
Career Development
Professional Development
Software Engineering
Workforce
W
Description: Research Software Engineering (RSE) provides methodological tools to develop software deployed on High-Performance Computing (HPC) infrastructures, follow good practices, and achieve good software quality. RSE also supports the actors involved in development, from developers to users, across development, deployment, interaction, and training. The oil and gas community is one of the most important contexts for scientific applications, from exploration to econometrics and market analysis. In this contribution, following RSE principles, we present a development path to build robust research software.
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
Description: Sparse linear algebra routines are fundamental building blocks of a large variety of scientific applications. Direct solvers, which solve linear systems via the factorization of matrices into products of triangular matrices, are commonly used in many contexts. The Cholesky factorization is the fastest direct method for symmetric positive definite matrices.
This presentation introduces selective nesting, a method to determine the optimal task granularity for the parallel Cholesky factorization based on the structure of sparse matrices. We propose the OPT-D algorithm, which automatically and dynamically applies selective nesting. OPT-D leverages matrix sparsity to drive complex task-based parallel workloads in the context of direct solvers. We run an extensive evaluation campaign considering a heterogeneous set of 35 sparse matrices and a parallel machine featuring the A64FX processor. OPT-D delivers an average performance speedup of 1.46x with respect to the best state-of-the-art parallel method for running direct solvers.
Workshop
Recorded
W
Description: High Performance Computing (HPC) applications must be containerized to run in a Kubernetes (K8s) environment. The traditional model for running HPC applications in a K8s environment requires the Application Container (APP) to include the runtime environment and the launch support mechanisms, in addition to the application. This requirement can increase the APP size and introduce security vulnerabilities. The separated model presented here detaches the runtime from the APP. This allows system administrators to define, maintain, and secure the Runtime Environment Container (REC). A PMIx library connects the APP and REC. The PMIx library serves as a runtime communication conduit for HPC parallel libraries (like MPI) to perform necessary functions like inter-process wire-up. The APP is nested within the REC using unprivileged, rootless Podman. The separated model is demonstrated by running a set of HPC applications in an off-the-shelf K8s system.
Paper
Recorded
Reliability and Resiliency
TP
Description: I/O efficiency is crucial to productivity in scientific computing, but the growing complexity of HPC systems and applications complicates efforts to understand and optimize I/O behavior at scale. Data-driven, machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and under-perform after being deployed.
We analyze four years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models under-perform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application modeling, poor system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
Description: We contribute a new approach for in situ automation of camera placement over time. Our approach incorporates triggers, regularly evaluating the current camera placement and searching for a new camera placement when a trigger fires. We evaluate our approach running in situ with five data sets from two simulation codes, considering camera placement quality (evaluated using a viewpoint quality metric) and overhead (number of camera positions evaluated). We find that our approach provides a significant benefit (reduced overhead with similar quality) compared to the naive approach of searching for a new camera placement every cycle.
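The trigger pattern itself is compact; below is a sketch of the in situ loop, with a hypothetical `viewpoint_quality` metric and threshold standing in for the paper's actual metric and trigger conditions.

```python
import numpy as np

rng = np.random.default_rng(2)

def viewpoint_quality(camera, state):
    # Hypothetical stand-in for a viewpoint quality metric (toy math).
    return float(np.exp(-np.linalg.norm(camera - state)))

def search_camera(state, n_candidates=32):
    # Evaluate candidate placements and keep the best; the number of
    # candidates evaluated is the overhead the paper measures.
    candidates = rng.normal(size=(n_candidates, 3)) + state
    return max(candidates, key=lambda c: viewpoint_quality(c, state))

camera = np.zeros(3)
state = np.zeros(3)
for cycle in range(20):
    state = state + rng.normal(scale=0.2, size=3)   # simulation advances
    # Trigger: re-search only when the current placement degrades,
    # instead of paying the search cost every cycle.
    if viewpoint_quality(camera, state) < 0.5:
        camera = search_camera(state)
print(camera)
```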
Posters
Research Posters
TP
XO/EX
Description: In this work, we accelerate DD-Net, a deep learning model designed to enhance CT images of COVID-19 chest scans, using sparsity techniques. The model follows an encoder-decoder architecture in the deep learning paradigm and has high dimensionality, and thus takes many compute-hours to train. We propose a set of techniques that target these two aspects of the model: dimensionality and training time. We implement techniques to prune neurons, making the model sparse and reducing its effective dimensionality, with an accuracy loss of no more than 5% and minimal additional retraining overhead. We then propose a set of techniques tailored to the underlying hardware to better utilize existing hardware components (such as tensor cores) and thus reduce the time and cost required to train this model.
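Magnitude-based weight pruning is the simplest instance of the sparsification idea (a generic sketch; DD-Net's actual pruning strategy may differ):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.8):
    """Zero out the smallest-magnitude fraction of the weights."""
    threshold = np.quantile(np.abs(weights).ravel(), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W_sparse, mask = magnitude_prune(W, sparsity=0.8)
print(1 - mask.mean())  # ~0.8 of the weights are now zero

# During retraining, gradients are multiplied by `mask` so pruned
# weights stay zero; structured variants prune whole neurons/filters
# so that hardware such as tensor cores can exploit the sparsity.
```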
Workshop
Recorded
W
Description: Kinetic equilibria are a fundamental aspect of tokamak plasma analysis but are often highly specialized and labor-intensive to produce. This has become a bottleneck to both deeper physics understanding and more sophisticated experiment controls. This project aims to remove these barriers by developing a rapid, fully automated workflow to produce better-than-human, high-precision, whole-discharge kinetic equilibria. The required elements of this workflow now exist separately; what is missing is the coupling of the different aspects and overall performance optimization. We have designed this workflow for the DIII-D national fusion facility with the goal of producing results quickly enough to be used for experiment planning in the 15-20 minute window between subsequent discharges. The results will also be stored in a database for follow-up analysis and as the foundation for AI/ML surrogate models. Initial results suggest that it may be possible to achieve our goal within a target 10-minute window.
Workshop
Recorded
W
Description: Efficient data communication is a major goal for scalable and cost-effective use of datacenter and HPC system resources. To let applications communicate efficiently, exchanged data must be serialized at the source and deserialized at the destination. The serialization/deserialization process enables exchanging data in a language- and machine-independent format. However, serialization/deserialization overheads can negatively impact application performance. For example, a server within a microservice framework must deserialize all incoming requests before invoking the respective microservices. We show how data deserialization can be offloaded to fully programmable SmartNICs and performed on the data path, on a per-packet basis. This solution avoids intermediate memory copies, enabling on-the-fly deserialization. We showcase our approach by offloading Google Protocol Buffers, a widely used framework to serialize/deserialize data. We show through microservice throughput modeling how we can improve the overall throughput by pipelining the deserialization and actual application activities with PsPIN.
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
Description: Datalog, a bottom-up declarative logic programming language, has a wide variety of uses for deduction, modeling, and data analysis across application domains. Datalog can be efficiently implemented using relational algebra primitives such as join, projection, and union. While there exist several multi-threaded and multi-core implementations of Datalog that target CPU-based systems, our work makes inroads towards developing a Datalog implementation for GPUs. We demonstrate the feasibility of a high-performance relational algebra backend for a small subset of Datalog applications that can effectively leverage the parallelism of GPUs using cuDF. cuDF is a library from the RAPIDS suite that uses the NVIDIA CUDA programming model for GPU parallelism; it provides functionality similar to Pandas, a popular data analysis engine. In this presentation, we analyze and evaluate the performance of cuDF versus Pandas for two graph mining problems implemented in Datalog: (1) triangle counting and (2) transitive closure computation.
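The relational-algebra flavor of this approach can be seen in a toy Pandas version of triangle counting; since cuDF mirrors the Pandas API, the GPU version is near-identical. This sketch is illustrative, not the presenters' code.

```python
import pandas as pd

# Edge list with u < v so each triangle is counted exactly once.
edges = pd.DataFrame({"u": [1, 1, 2, 2, 3], "v": [2, 3, 3, 4, 4]})

# Datalog rule: triangle(a,b,c) :- edge(a,b), edge(b,c), edge(a,c).
ab = edges.rename(columns={"u": "a", "v": "b"})
bc = edges.rename(columns={"u": "b", "v": "c"})
ac = edges.rename(columns={"u": "a", "v": "c"})

paths = ab.merge(bc, on="b")                 # join edge(a,b) with edge(b,c)
triangles = paths.merge(ac, on=["a", "c"])   # close the triangle
print(len(triangles))  # 2 triangles: (1,2,3) and (2,3,4)
```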
Exhibitor Forum
Recorded
TP
XO/EX
Description: As supercomputing infrastructures become increasingly distributed, centralizing the management of data that may span multiple data centers, public cloud providers, and edge locations is key to accelerating research. Whether organizations are looking to expand information sharing for science teams; enhance data management practices across collaborative platforms; or unlock access to cloud services for data practitioners, centralizing multi-cloud data management addresses these challenges by seamlessly integrating multiple public clouds and on-premises storage under a single namespace. Modern technologies improve the responsiveness of data workflows by supporting constant movement of data and applications across systems and automating data placement and lifecycle rules. This means that both on-premises and cloud applications can use the same data without negatively impacting performance. It also means that the right data is placed where and when it’s needed for the most effective, agile workflows. Finally, by synchronizing data across multiple cloud-based repositories, multi-cloud data management software enables data to be accessed independent of its physical location to eliminate vendor lock-in and minimize cloud egress fees.
Join us for this technical deep dive into how centralizing multi-cloud data management can maximize the value of cloud initiatives by creating new opportunities for collaboration and innovation across platforms and data-driven ecosystems.
Paper
Recorded
Applications
Numerical Algorithms
Security
TP
DescriptionThe Elliptic Curve Digital Signature Algorithm (ECDSA) is an essential building block of various cryptographic protocols. In particular, most blockchain systems adopt it to ensure transaction integrity. However, due to its high computational intensity, ECDSA is often the performance bottleneck in blockchain transaction processing. Recent work has accelerated ECDSA algorithms on the CPU; in contrast, success has been limited on the GPU, which has great potential for parallelization but is challenging for implementing elliptic curve functions. In this paper, we propose RapidEC, a GPU-based ECDSA implementation for SM2, a popular elliptic curve. Specifically, we design architecture-aware parallel primitives for elliptic curve point operations, and parallelize the processing of a single SM2 request as well as batches of requests. Consequently, our GPU-based RapidEC outperformed the state-of-the-art CPU-based algorithm by orders of magnitude. Additionally, our GPU-based modular arithmetic functions as well as point operation primitives can be applied to other computation tasks.
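To make the point operations concrete, here is a toy Python sketch of affine point addition and doubling over a small prime field; the curve parameters are made up for illustration and are not the SM2 parameters, and production implementations such as RapidEC use big-integer, constant-time arithmetic rather than this naive form.

```python
P = 97                       # small field prime (SM2 uses a 256-bit prime)
A, B = 2, 3                  # toy curve y^2 = x^3 + A*x + B over GF(P)

def inv(x):                  # modular inverse via Fermat's little theorem
    return pow(x, P - 2, P)

def point_add(p, q):
    """Add two distinct affine points on the curve."""
    (x1, y1), (x2, y2) = p, q
    s = (y2 - y1) * inv(x2 - x1) % P           # slope of the chord through p, q
    x3 = (s * s - x1 - x2) % P
    return x3, (s * (x1 - x3) - y1) % P

def point_double(p):
    x1, y1 = p
    s = (3 * x1 * x1 + A) * inv(2 * y1) % P    # slope of the tangent at p
    x3 = (s * s - 2 * x1) % P
    return x3, (s * (x1 - x3) - y1) % P

G = (3, 6)                   # on the curve: 6^2 = 36 = 27 + 6 + 3 (mod 97)
print(point_double(G), point_add(G, point_double(G)))
```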
Workshop
Recorded
W
DescriptionMost high-fidelity physics simulation codes, such as Flash-X, need to save intermediate results (checkpoint files) to restart or gain insights into the evolution of the simulation. These simulation codes save such intermediate files synchronously, where computation is stalled while the data is written to storage. Depending on the problem size and computational requirements, this file write time can be a substantial portion of the total simulation time. In this paper, we evaluate the overheads and the overall benefit to simulations of asynchronous I/O in HDF5. Results from real-world high-fidelity simulations on the Summit supercomputer show that I/O operations are overlapped with application communication, computation, or both, effectively hiding some or all of the I/O latency. Our evaluation shows that while using asynchronous I/O adds overhead to the application, the I/O time reduction is more significant, resulting in an overall performance speedup of up to 1.5X.
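The overlap idea can be sketched in a few lines of Python: hand each checkpoint to a background thread so the write proceeds while the next compute step runs. This is only an illustration of the pattern, using h5py in place of HDF5's actual asynchronous VOL connector; the file and dataset names and the toy compute step are made up.

```python
# Overlap checkpoint writes with computation instead of stalling on them.
import threading
import numpy as np
import h5py

def write_checkpoint(step, data):
    with h5py.File(f"checkpoint_{step}.h5", "w") as f:
        f.create_dataset("state", data=data)

state = np.random.rand(512, 512)
writer = None
for step in range(5):
    if writer is not None:
        writer.join()                      # ensure the previous write finished
    writer = threading.Thread(target=write_checkpoint,
                              args=(step, state.copy()))  # snapshot, then write
    writer.start()
    state = np.sqrt(state * state + 1.0)   # "compute" overlapped with the write
writer.join()
```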
Workshop
Recorded
W
DescriptionIn this work, we accelerate the Kernel Ridge Regression algorithm on an adaptive computing platform, achieving higher performance within a shorter development time by employing a design approach based on high-level synthesis. To avoid storing the potentially huge kernel matrix in external memory, the designed accelerator computes the matrix on the fly in each iteration. Moreover, we overcome the memory bandwidth limitation by partitioning the kernel matrix into smaller tiles that are pre-fetched to small local memories and reused multiple times. The design is also parallelized and fully pipelined to achieve the highest performance. The final accelerator can be used for any large-scale data without kernel matrix storage limitations and with an arbitrary number of features. This work is an important first step toward a library that accelerates different kernel methods for machine learning applications on FPGA platforms and that can be used conveniently from Python with a NumPy interface.
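The on-the-fly tiling idea is easy to see in NumPy: the sketch below (with a made-up tile size and RBF kernel parameter) applies the kernel matrix to a vector one tile at a time, without ever materializing the full n-by-n matrix; such a matrix-vector product is the building block an iterative KRR solver would call repeatedly.

```python
# Tiled, on-the-fly kernel matvec: memory stays O(tile^2) instead of O(n^2).
import numpy as np

def kernel_matvec(X, v, gamma=0.5, tile=256):
    """Compute K @ v with K[i, j] = exp(-gamma * ||x_i - x_j||^2), tile by tile."""
    n = X.shape[0]
    out = np.zeros(n)
    for i in range(0, n, tile):
        Xi = X[i:i + tile]
        for j in range(0, n, tile):
            Xj = X[j:j + tile]
            # compute one kernel tile on the fly and apply it immediately
            d2 = ((Xi[:, None, :] - Xj[None, :, :]) ** 2).sum(axis=2)
            out[i:i + tile] += np.exp(-gamma * d2) @ v[j:j + tile]
    return out

X = np.random.rand(1000, 8)
v = np.random.rand(1000)
print(kernel_matvec(X, v)[:4])   # one step of an iterative KRR solve
```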
Paper
Recorded
Data Management
Storage
TP
DescriptionLossy compression is one of the most efficient solutions to reduce storage overhead and improve I/O performance for HPC applications. However, existing parallel I/O libraries cannot fully utilize lossy compression to accelerate parallel writes due to the lack of a deep understanding of compression-write performance. To this end, we propose to deeply integrate predictive lossy compression with HDF5 to significantly improve parallel-write performance. Specifically, we propose analytical models that predict the time of compression and parallel write before the actual compression, enabling compression-write overlapping. We also introduce an extra space to handle prediction uncertainty. Moreover, we propose an optimization that reorders the compression tasks to increase the overlapping efficiency. Experiments with up to 4,096 cores show that our solution improves the write performance by up to 4.5x and 2.9x over the non-compression and lossy compression solutions, respectively, with only 1.5% storage overhead (relative to the original data) on two real-world applications.
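Why predicted times enable overlap can be seen with a toy pipeline model; all numbers below are made up for illustration and are not measurements from the paper.

```python
# If chunk k's compression is pipelined with chunk k-1's write, the total time
# drops from sum(compress) + sum(write) toward the longer of the two streams.
comp = [0.8, 1.1, 0.9, 1.0]   # predicted compression time per chunk (s)
write = [1.2, 0.7, 1.3, 0.9]  # predicted parallel-write time per chunk (s)

serial = sum(comp) + sum(write)

# Pipelined schedule: a chunk starts writing once it is compressed and the
# previous write has drained.
t_comp, t_write = 0.0, 0.0
for c, w in zip(comp, write):
    t_comp += c                          # compression runs back to back
    t_write = max(t_write, t_comp) + w   # write waits for data and the device
print(f"serial {serial:.1f}s vs pipelined {t_write:.1f}s")
```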
Tutorial
Recorded
Accelerator-based Architectures
Heterogeneous Systems
Parallel Programming Languages and Models
Performance Portability
Productivity Tools
Software Engineering
TUT
DescriptionThis half-day hands-on tutorial teaches how to accelerate HPC applications using the portable parallelism and concurrency features of the C++17 and C++20 standards, without any language or vendor extensions, such that a single version of the code is portable to multi-core CPU and to GPU systems. We further show how to integrate this approach with MPI to target CPU clusters and multi-GPU platforms. The tutorial exercises follow classical HPC themes like a PDE solver mini-application for the 2D unsteady heat equation. The exercises provide attendees with hands-on experience applying C++ parallel algorithms and execution policies to parallelize and accelerate HPC programs using only standard C++. Attendees are presented with problem-solving strategies for common tasks like computing reductions or running iterative solvers for multi-dimensional problems. Furthermore, the tutorial and exercises give attendees hands-on experience in integrating C++ parallel algorithms into pre-existing MPI applications, teaching how to re-use the pre-existing MPI code to produce MPI/C++ applications that run on multi-CPU and multi-GPU systems. Finally, we conclude with a summary of our professional experience applying the ISO C++ parallel programming model to accelerate large real-world HPC applications and provide an outlook on future topics in C++ standard parallelism.
Workshop
Recorded
HPC Training and Education
W
DescriptionIn response to an increasing demand for digital skills in industry and academia, a series of credentialed short courses covering a variety of topics related to high performance computing were designed and implemented to enable university students and researchers to effectively utilize research computing resources and to bridge the gap for users whose educational backgrounds do not include computational training. The courses cover a diverse array of topics, including subjects in programming, cybersecurity, artificial intelligence/machine learning, bioinformatics, and cloud computing. The courses are designed to enable students to apply the skills they learn to their own research that incorporates use of large-scale computing systems. These courses offer advantages over generic online courses in that they teach computing skills relevant to academic research programs. Finally, the micro-credentials are transcriptable, may be stacked with existing programs to create a larger degree plan, and add to a student’s resume.
Job Posting
DescriptionWe are currently seeking an Account Executive - MIDWEST EDU. The Account Executive will develop new HPC Higher Education and Research business. The position is field-based and involves travel approximately 30%-40% of the time. The job involves managing a territory and growing business through the candidate's own experience, including previously established relationships; a successful applicant must therefore have strong industry contacts and demonstrated success in personally closing business in the HPC and EDU space.
Responsibilities for this role include but are not limited to:
Create and maintain a customer pipeline, hitting revenue goals and growing the territory.
Lead and coordinate complex, team selling efforts (with internal and external partners).
Develop a strong understanding of the customers’ technology infrastructure, strategy and business requirements.
Partner with internal staff to create successful Proposals and Presentations in response to RFPs and other customer needs.
Attend trade shows and other activities to raise DDN’s presence in the industry.
Manage customer relationships post-sale; including a strategy to close repeat business.
Awards Presentation
SC22 Opening Session & Turing Lecture
Recorded
Awards
Keynote
Turing
TP
W
TUT
XO/EX
DescriptionJoin us for the 2021 ACM A.M. Turing Award Lecture featuring Jack Dongarra. A longtime SC supporter, Jack has made pioneering contributions to numerical algorithms and libraries that have enabled HPC software to keep pace with exponential hardware improvements, accelerating HPC for over four decades. With our SC22 conference theme, HPC Accelerates, we’re honored that Jack selected SC22 as the location to present his award lecture.
Be sure to include the ACM A.M. Turing Lecture in your schedule when planning your SC22 conference experience. You won’t want to miss it! This lecture replaces our traditional keynote presentation.
Paper
Recorded
System Software
TP
DescriptionWe present a technique for applying reverse mode automatic differentiation (AD) on a non-recursive second-order functional array language that supports nested parallelism and is primarily aimed at efficient GPU execution.
The key idea is to eliminate the need for a tape by relying on redundant execution to bring into each new scope all program variables that may be needed by the differentiated code. Efficient execution is enabled by the observation that perfectly nested scopes do not introduce re-execution and that such perfect nests can be readily produced by application of known compiler transformations. Our technique differentiates loops and bulk-parallel operators---e.g., map, reduce(-by-index), scan, and scatter---by specific rewrite rules and aggressively optimizes the resulting nested-parallel code. We report an evaluation that compares with established AD solutions and demonstrates competitive performance on ten common benchmarks from recent applied AD literature.
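A toy Python sketch of the tape-free idea, under stated assumptions (a made-up scalar map body f and its hand-derived adjoint): the reverse sweep recomputes what it needs from the inputs rather than reading saved intermediates from a tape, which is the recomputation-over-storage trade the paper's rewrite rules systematize.

```python
import math

def f(x):                    # forward body of the map
    return math.sin(x) * x

def f_adjoint(x, ybar):
    # recompute the forward-pass quantities from x, then push the adjoint back
    return ybar * (math.cos(x) * x + math.sin(x))

def loss(xs):                # y = reduce(+, map(f, xs))
    return sum(f(x) for x in xs)

def grad_loss(xs):
    # reduce(+) gives every map element the same adjoint ybar = 1.0;
    # no intermediate map results were stored, they are re-derived from xs
    return [f_adjoint(x, 1.0) for x in xs]

xs = [0.3, 1.2, 2.5]
g = grad_loss(xs)
eps = 1e-6                   # quick finite-difference sanity check
fd = [(loss(xs[:i] + [x + eps] + xs[i+1:]) - loss(xs)) / eps
      for i, x in enumerate(xs)]
print(g, fd)
```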
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionThe Message Passing Interface (MPI) is the most dominant programming model on HPC systems and has been instrumental in developing efficient, large-scale parallel applications. However, it has a rather static view of compute resources, building on the concept of immutable communicators. While this provides some ease of use and simplicity, it is limiting, in particular for modern workflow-based workloads as well as in its support for resource-adaptive systems. The newly introduced concept of MPI Sessions, however, opens the door to more dynamicity and adaptivity. In this talk I will highlight the opportunities that can arise from such directions and discuss novel approaches we are pursuing as part of several EuroHPC projects. Our ultimate goal is to provide full malleability in MPI as well as the surrounding software layers - from system software to applications - and with that enable us to more efficiently harness the computational capabilities of current and future HPC systems.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionAdditional Questions, Community Discussion, and Supply Chain Issues ...
Birds of a Feather
TP
XO/EX
DescriptionLast year's panel "HPC's Growing Sustainability Challenges and Emerging Approaches" gave an excellent introduction to the carbon impact of HPC along with ideas for carbon mitigation. In this BoF we will focus on concrete actions that data center operators and users can undertake to reduce HPC's carbon footprint. These range from using more energy-efficient processors and improved cooling, to extending the lifetime of computing equipment, to shifting load from regions with carbon-intense electricity to regions where the vast majority of electricity comes from renewable resources. Pros and cons of the various approaches will be discussed. Audience participation and ideas will be welcome.
Paper
Recorded
Applications
Numerical Algorithms
Security
TP
DescriptionSeveral scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of many small dense matrices of varying sizes. Matrix decompositions on such irregular workloads are rarely addressed on GPUs.
This paper addresses irregular workloads of matrix computations on GPUs and shows their impact on a sparse LU solver. We designed an interface for the basic matrix operations supporting problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition on matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on multifrontal sparse LU decomposition applied to linear systems arising from the simulation, using finite element discretization on unstructured meshes, of a high frequency indefinite Maxwell problem.
Tutorial
Recorded
Big Data
Cloud and Distributed Computing
Data Analytics
Data Management
Emerging Technologies
Exascale Computing
File Systems and I/O
In Situ Processing
Performance
Productivity Tools
Reliability and Resiliency
Resource Management and Scheduling
Software Engineering
Visualization
TUT
DescriptionAs concurrency and complexity continue to increase on high-end machines, storage I/O performance is rapidly becoming a fundamental challenge to scientific discovery. At the exascale, online analysis will become a dominant form of data analytics, and thus scalable in situ workflows will become critical, along with high performance I/O to storage. The many components of a workflow running simultaneously pose another challenge of evaluating and improving the performance of these workflows. Therefore, performance data collection needs to be an integral part of the entire workflow.
In this tutorial, we present ADIOS-2 which allows for building in situ and file-based data processing workflows for extreme scale systems, including interactive, on-demand, in situ visualization of the data, and including performance profiling of the entire workflow. Half of this tutorial will be hands-on sessions, where we provide access to the software, and build together a complete MiniApp with in situ analytics and performance analysis that users can run on their laptop and supercomputers at large scale. We will show how ADIOS-2 is fully integrated into three popular visualization and performance tools: Jupyter Notebook, ParaView, and TAU, creating a software ecosystem for in situ processing of both performance and scientific data.
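As a flavor of the programming model, here is a minimal producer/consumer sketch assuming the ADIOS-2 high-level Python bindings (adios2.open); exact call signatures vary across ADIOS-2 releases, and the file name, variable name, and sizes are made up, so treat this as an illustration rather than a drop-in script.

```python
import numpy as np
import adios2

# producer: write one 2D field per step to a BP file/stream
with adios2.open("heat.bp", "w") as fw:
    for step in range(3):
        T = np.random.rand(64, 64)
        fw.write("T", T, T.shape, [0, 0], T.shape)  # global shape, offset, count
        fw.end_step()

# consumer (could run in situ, concurrently): iterate steps as they appear
with adios2.open("heat.bp", "r") as fr:
    for fstep in fr:
        T = fstep.read("T")
        print(fstep.current_step(), float(T.mean()))
```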
Workshop
Recorded
W
DescriptionWe present efforts to encourage the adoption of modules for teaching heterogeneous parallel computing through a faculty development workshop. The workshop was held remotely using a novel format to exploit the advantages of a virtual format and mitigate its disadvantages. Adoption at a wide variety of institutions showed module effectiveness and also gathered feedback leading to several module improvements. We also report on the adoptions themselves, which show the importance of supporting adaptation of the modules for diverse settings.
Job Posting
DescriptionWith over 40 years of semiconductor process control experience, chipmakers around the globe rely on KLA to ensure that their fabs ramp next-generation devices to volume production quickly and cost-effectively. Enabling the movement towards advanced chip design, KLA's Global Products Group (GPG), which is responsible for creating all of KLA’s metrology and inspection products, is looking for the best and the brightest research scientists, software engineers, application development engineers, and senior product technology process engineers. The Film and Scatterometry Technology (FaST) Division provides industry-leading metrology solutions for worldwide semiconductor IC manufacturers. The FaST Division portfolio of metrology products includes hardware and software solutions for optical film thickness, optical critical dimension (CD), composition, and resistivity measurement systems. These products are essential for IC manufacturers as they provide critical metrology capabilities for the development and implementation of their advanced IC processes. The FaST division is committed to supporting our customers in achieving the performance entitlement of our solutions, and we effectively partner with our customers from their early research and development phase to the high-volume in-line manufacturing implementation specific to their process needs. The division consists of a global team located in the US, Israel, China, and India.
The major function of this role is to support investigations into novel metrology technologies. Advanced Development is responsible for both investigating and characterizing new technologies versus current best known methods. Major tasks include the collection of data using KLA tools and software packages, analysis and summary of results, automation of analysis routines.
Responsibilities
Perform guided analysis of KLA tools and software.
Generate reports on data analysis and findings.
Support data collection and analysis.
Tutorial
Recorded
Algorithms
Cloud and Distributed Computing
Datacenter
Parallel Programming Languages and Models
Performance
TUT
DescriptionThe vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. Parallel system architectures are evolving to include complex, heterogeneous nodes comprising general-purpose CPUs as well as accelerators such as GPUs. At the same time, the MPI standard itself is evolving to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid programming (MPI + threads, shared memory, GPUs), topologies and topology mapping, neighborhood and nonblocking collectives, and some of the new performance-oriented features in MPI-4. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
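As a taste of one tutorial topic, here is a condensed halo-exchange sketch on a Cartesian process topology, written with mpi4py rather than C for brevity; the 1D decomposition, array sizes, and script name are illustrative only.

```python
# Run with e.g.: mpiexec -n 4 python stencil.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
cart = comm.Create_cart([comm.Get_size()], periods=[False])
above, below = cart.Shift(0, 1)          # neighbor ranks (MPI.PROC_NULL at edges)

local = np.full((10, 8), float(cart.Get_rank()))   # rows 0 and -1 are ghost rows

# exchange ghost rows with both neighbors; paired Sendrecv avoids deadlock
cart.Sendrecv(sendbuf=local[-2, :], dest=below, recvbuf=local[0, :], source=above)
cart.Sendrecv(sendbuf=local[1, :], dest=above, recvbuf=local[-1, :], source=below)

# one Jacobi-style sweep over the interior using the freshly received halos
local[1:-1, :] = 0.5 * (local[:-2, :] + local[2:, :])
```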
Tutorial
Recorded
Accelerator-based Architectures
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Performance
TUT
DescriptionWith the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported, and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP, but rather from the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance.
We assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of vector units. All topics are accompanied by extensive case studies, and we discuss the corresponding language features in-depth. Continuing the emphasis of this successful tutorial series, we focus solely on performance programming for multi-core architectures. Throughout all topics, we present the recent additions of OpenMP 5.0, 5.1 and 5.2 and comment on developments targeting OpenMP 6.0.
Birds of a Feather
TP
XO/EX
DescriptionFPGAs have gone from niche components to being a central part of many data centers worldwide, to being considered for core HPC installations. The last year has seen tremendous advances in FPGA programmability and technology, and FPGAs for general HPC are apparently within reach. This BoF has two parts. The first is a series of lightning talks presenting advances in tools and technologies, emphasizing work by new investigators. The second part of the BoF will be a general discussion driven by the interests of the attendees, potentially including additional topics.
Birds of a Feather
TP
XO/EX
DescriptionThe goal of this BoF session is to bring the HPC and QC communities closer together, with the objective of scrutinizing HPC codes and workflows for potential hybrid quantum-classical computing.
The focus will be primarily on the identification of the required tool set, including the infrastructure and of the potential applications, and less on the computation acceleration.
The format of the BoF will consist of three short impulse talks followed by a moderated panel discussion, inviting substantial contributions from the audience.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionNext-generation exascale supercomputers increasingly require converged HPC/AI systems, as evidenced by specifications for future systems from government, university, and commercial supercomputing labs, which now call for AI performance in addition to traditional HPC performance.
HPC and AI workloads are similar in that both are compute- and memory-intensive, as well as highly parallel. HPC and AI diverge, however, in the level of precision that is often required. HPC applications typically need double precision or possibly single precision. AI, however, frequently requires lower precision, with the reduced precision enabling much higher performance. Another key difference is that AI workloads benefit from sparsity to maximize performance and efficiency, whereas HPC does not exploit sparsity.
This presentation compares HPC and AI workloads, reviews the trends that are driving AI and HPC convergence for supercomputers, and presents Tachyum’s Prodigy Universal Processor and its revolutionary architecture, which unifies the functionality of CPU, GPU, and TPU to address the demands of both HPC and AI workloads in a single device without needing costly and power-hungry accelerators. Key features that will be highlighted include Prodigy’s advanced HPC and AI subsystems, the benefits of lower-precision and sparse data types for AI applications, and recent innovations Tachyum has made to enhance and accelerate AI processing that are unique to Prodigy.
Invited Talk
Recorded
TP
XO/EX
DescriptionRISC-V has grown from a university project into a global open ISA standard with a thriving computing ecosystem comprising hundreds of collaborating organizations, including most major computing companies. This talk will present how RISC-V is well-suited for future HPC computing needs. RISC-V's technical advantages include a greater inherent efficiency than competing architectures, a sophisticated vector processing extension, and natural support for customized instruction set extensions. RISC-V's non-technical advantages include an open standard model that encourages both competition and collaboration, and which ensures long-term stability to protect investment in the software ecosystem.
Birds of a Feather
TP
XO/EX
DescriptionThe Covid-19 pandemic has shone a light on the increasing importance of HPC in public health, particularly with respect to the genomics of key pathogens. This BoF aims to provide a starting point for building a new network of those from academic institutions, healthcare organizations, public health agencies, and industry who are responsible for the emerging HPC infrastructures that will be increasingly important in the delivery of public health. The BoF will be a forum to share experience and best practice, with the aim of creating a new network of professionals to work together for global benefit.
Students@SC
Posters
Research Posters
TP
XO/EX
DescriptionThe LLVM Flang compiler ("Flang") is currently Fortran 95 compliant, and the frontend can parse Fortran 2018. However, Flang does not have a comprehensive 2018 test suite and does not fully implement the static semantics of the 2018 standard. We are investigating whether agile software development techniques, such as pair programming and test-driven development (TDD), can help Flang to rapidly progress to Fortran 2018 compliance. Because of the paramount importance of parallelism in high-performance computing, we are focusing on Fortran's parallel features, commonly denoted "CoArray Fortran". We are developing what we believe are the first exhaustive, open-source tests for the static semantics of Fortran 2018 parallel features, and contributing them to the LLVM project. A related effort involves writing runtime tests for parallel 2018 features and supporting those tests by developing a new parallel runtime library: the CoArray Fortran Framework of Efficient Interfaces to Network Environments (Caffeine).
Invited Talk
Recorded
TP
XO/EX
DescriptionToday’s era of explosive data growth poses serious challenges for society in transforming massive, random, heterogeneous data streams and structures into useful knowledge, applicable to every aspect of modern life, including national security, economic productivity, scientific discovery, medical breakthroughs, and social interactions. The burgeoning data, which is increasing exponentially not only in volume, but in velocity, variety, and complexity, already far outpaces the abilities of current computing systems to execute the complex data analytics needed to extract meaningful insights in a timely manner.
The key problem with today’s computers is that they were designed to address yesterday’s compute-intensive problems rather than today’s data-intensive problems. Transforming massive data streams and structures into actionable knowledge and meaningful results in near real-time requires a complete rethinking of computing architectures and technologies – one that places the primary focus on data access and data movement rather than on faster compute power. The data of interest today and in the future is typically sparse, random, and heterogeneous, with minimal locality (it is randomly distributed across the computer), and characterized by poor data re-use, streaming updates flowing into the system, and fine-grain data movement and parallelism. The computations to be performed are determined by the data, and multiple applications might need simultaneous access to the same data. These are very different conditions than those characteristic of yesterday’s compute-intensive applications.
IARPA’s new AGILE Program aims to provide data-analytic results in time for appropriate response, e.g., to predict impending adversarial events rather than forensically analyzing them after the fact. It will accomplish this goal by developing new system-level intelligent mechanisms for moving, accessing, and storing large, random, time-varying data streams and structures that allow for the scalable and efficient execution of dynamic graph analytic applications. The program solicited system designs that emphasize optimizing the fully integrated system, not independent optimization of individual functionalities. AGILE aims to develop scalable, energy-efficient computing system designs that enable solutions to data-intensive problems as well as traditional compute-intensive problems. These designs will be cost-effective and realizable in silicon prior to the year 2030.
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionSolving quantum many-body problems is one of the most fascinating research fields in condensed matter physics. An efficient numerical method is crucial to understanding the mechanisms of novel physics, such as high-Tc superconductivity, as one has to find the optimal solution in the exponentially large Hilbert space. The development of Artificial Intelligence (AI) provides a unique opportunity to solve quantum many-body problems, but a large gap to that goal remains. In this work, we present a novel computational framework and adapt it to the Sunway supercomputer. With highly efficient scalability up to 40 million heterogeneous cores, we can drastically increase the number of variational parameters, which greatly improves the accuracy of the solutions. The investigations of the spin-1/2 J1-J2 model and the t-J model achieve unprecedented accuracy and time-to-solution far beyond the previous state of the art.
Birds of a Feather
TP
XO/EX
DescriptionHPC is increasingly employed in AI. Although HPC itself is natively ethically neutral, its use to enable AI applications that can have harmful impacts on humans and society can render HPC collusive and ethically liable. This BoF will consider the ethical implications of the coupling of AI and HPC and the formation of guidelines for the HPC community to ensure that researchers consider potentially harmful consequences of their research and adhere to best practices for sustainable and ethical use of HPC resources.
Job Posting
DescriptionProvide advanced and innovative data and technology leadership and support for a research unit. Hold the role of lead subject matter expert regarding delivery of IT and data services in support of research. Define and implement a sustainable and secure data management strategy which meets the needs of researchers and sponsors. Collaborate with professional peers on campus and nationally. This position will interact on a consistent basis with: Faculty, staff, students, postdocs, and unit management. Some supervision is possible (e.g., students or junior staff). Hybrid remote work options are available.
Responsibilities
Provide high-level expertise, advice, and technology leadership; define and implement the vision and strategy for data management for an AI Institute.
Interact and collaborate with Institute management, multiple research groups across partner organizations, and other stakeholders (e.g., local and external HPC providers).
Define, evaluate, and implement technical IT and data systems, architecture, applications, and services to serve the AI Institute research mission.
Coordinate with other data management efforts on campus to plan for and analyze technology investments; identify common needs among research groups and develop associated solutions, focusing on the collaborative use of resources where possible.
Perform other duties as assigned.
Job Posting
DescriptionWith over 40 years of semiconductor process control experience, chipmakers around the globe rely on KLA to ensure that their fabs ramp next-generation devices to volume production quickly and cost-effectively. Enabling the movement towards advanced chip design, KLA's Global Products Group (GPG), which is responsible for creating all of KLA’s metrology and inspection products, is looking for the best and the brightest research scientists, software engineers, application development engineers, and senior product technology process engineers. The Surfscan group includes a team of engineers, technology developers, applications engineers, and product marketing staff focused on technology that enables wafer, IC, and equipment manufacturers to develop, qualify, and monitor their process tools. Defects and process non-uniformities detected on Surfscan equipment allow for early identification of yield excursions. The flagship Surfscan products include the SPx platforms for wafer surface quality and wafer defect inspection: tools and systems for inspection of polished wafers, epi wafers, and engineered substrates during the wafer fabrication process.
Job Description/Preferred Qualifications
The job focuses on the development of image processing, signal processing, and artificial intelligence algorithms for the next generations of optical inspection and metrology systems.
The position requires a proven innovative track record and solid fundamental knowledge in the related fields of algorithm development, including image segmentation, texture analysis, classification, feature extraction, statistical data analysis, signal processing, filter theory, machine learning, deep learning.
C/C++ and Matlab/Python programming skills are a must. CPU optimization (SSE/AVX) and GPU (CUDA) programming are also highly desired skills.
The responsibilities of this position cover the entire life cycle of algorithms, including modeling, proof-of-concept design, production software design and implementation, performance characterization, documentation, and user support. Since algorithms can affect many aspects of the system, a significant amount of time will be spent on cross-functional team collaboration for prototyping and testing.
The candidate needs to be a self-motivated individual with ability to work independently and/or in a team. Strong written and verbal communications skills are needed for extensive interactions with members of a multi-disciplinary global team.
Job Posting
DescriptionResponsibilities:
An intern with the AI and Modeling Center of Excellence will work in one or more of the following areas. Interns will be technically supported and mentored throughout their stay with KLA.
Work with traditional machine learning and deep learning techniques to meet and improve results on KLA products.
Experiment with new and novel techniques to improve results or reduce compute cost of various modeling techniques.
Build tools for more efficient experimentation.
Manage data used for training and experimentation of AI and physics modeling systems.
Image processing.
Speeding up physics models.
Developing software tools and solutions for KLA products.
Job Posting
DescriptionWith over 40 years of semiconductor process control experience, chipmakers around the globe rely on KLA to ensure that their fabs ramp next-generation devices to volume production quickly and cost-effectively. Enabling the movement towards advanced chip design, KLA's Global Products Group (GPG), which is responsible for creating all of KLA’s metrology and inspection products, is looking for the best and the brightest research scientists, software engineers, application development engineers, and senior product technology process engineers. The Film and Scatterometry Technology (FaST) Division provides industry-leading metrology solutions for worldwide semiconductor IC manufacturers. The FaST Division portfolio of metrology products includes hardware and software solutions for optical film thickness, optical critical dimension (CD), composition, and resistivity measurement systems. These products are essential for IC manufacturers as they provide critical metrology capabilities for the development and implementation of their advanced IC processes. The FaST division is committed to supporting our customers in achieving the performance entitlement of our solutions, and we effectively partner with our customers from their early research and development phase to the high-volume in-line manufacturing implementation specific to their process needs. The division consists of a global team located in the US, Israel, China, and India.
Responsibilities
In this role, you will be a part of the SCD Algorithm group for the FaST division at KLA. We are looking for an algorithm engineering intern to work in one or more of the following areas:
Develop, implement, and improve electromagnetic algorithms
Research and develop physics-directed ML algorithms
Research and develop specialized optimization algorithms
Job Posting
DescriptionWith over 40 years of semiconductor process control experience, chipmakers around the globe rely on KLA to ensure that their fabs ramp next-generation devices to volume production quickly and cost-effectively. Enabling the movement towards advanced chip design, KLA's Global Products Group (GPG), which is responsible for creating all of KLA’s metrology and inspection products, is looking for the best and the brightest research scientists, software engineers, application development engineers, and senior product technology process engineers. The RAPID division is the world-leading provider of reticle inspection solutions for the semiconductor industry. The company provides inspection solutions to both mask shops and semiconductor fabs to ensure that lithography yields are consistently high, thus enabling cost-effective manufacturing.
Responsibilities:
KLA is seeking a motivated individual for an engineering intern position in a world-class algorithm group within the reticle product division (RAPID). Our intern will work in one or more of the following areas:
Computational geometry
Image processing
Work with traditional machine learning and deep learning techniques to meet and improve results on KLA products
Develop software tools and solutions for KLA products
Job Posting
DescriptionLawrence Berkeley National Lab’s (LBNL, https://www.lbl.gov/) Applied Mathematics and Computational Research Division (https://crd.lbl.gov/divisions/amcr/applied-mathematics-dept/) has an opening for an Algorithm/Architecture Research Scientist to join the team.
In this exciting role, you will lead teams to conduct algorithmic co-design for future HPC architectures and quantify the performance of such systems through modeling, simulation, and numerical analysis. The scientist’s research and expertise must include parallel performance analysis, modeling and simulation of computer architectures, analysis of numerical algorithms, and parallel simulation methodologies such as parallel discrete event simulation.
Paper
Recorded
Accelerator-based Architectures
Performance
Visualization
TP
DescriptionSparse Matrix-Vector multiplication (SpMV) is an important computational kernel. Tens of sparse matrix formats and implementations have been designed to speed up SpMV performance. We develop AlphaSparse, which goes beyond the scope of human-designed artificial formats and traditional auto-tuners (which are restricted to pre-existing artificial formats and implementations) by automatically creating new machine-designed formats and SpMV kernel implementations entirely from knowledge of the input sparsity patterns and hardware architectures. Based on our proposed Operator Graph, which expresses the design path of SpMV code, AlphaSparse takes an arbitrary sparse matrix as input and outputs a machine-designed format and SpMV implementation that achieve high performance. Extensive evaluation on 843 matrices from the SuiteSparse Matrix Collection shows that AlphaSparse achieves performance improvements of up to 22.2 times (3.2 times on average) compared to five state-of-the-art artificial formats and up to 2.8 times (1.5 times on average) over the up-to-date implementation of traditional auto-tuning.
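For contrast with machine-designed formats, here is a plain-NumPy sketch of one fixed, human-designed format (CSR) and its SpMV kernel, i.e., the kind of hand-written design point that AlphaSparse searches beyond; the matrix is made up for illustration.

```python
import numpy as np

# dense source matrix, for illustration only
A = np.array([[4.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 2.0, 0.0],
              [3.0, 0.0, 0.0, 5.0]])

# CSR: nonzero values and their column indices, plus per-row offsets
vals = A[A != 0.0]                                       # row-major nonzeros
cols = np.nonzero(A)[1]
rowptr = np.concatenate([[0], np.cumsum((A != 0.0).sum(axis=1))])

def spmv_csr(rowptr, cols, vals, x):
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):                              # one dot product per row
        lo, hi = rowptr[i], rowptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]
    return y

x = np.arange(4, dtype=float)
assert np.allclose(spmv_csr(rowptr, cols, vals, x), A @ x)
```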
Workshop
Recorded
W
DescriptionThe AMD Heterogeneous Accelerated Computing Program (HACC) is an initiative by AMD to provide an infrastructure and exchange platform for studying FPGA acceleration for HPC and data center workloads. The Paderborn Center for Parallel Computing (PC2) was accepted into the HACC initiative, which now comprises five centers worldwide, in spring 2022. I will give a brief overview of the HACC program and highlight the new Alveo U280 partition of our Noctua 2 supercomputer, which is accessible through the HACC program and provides a particularly flexible software and networking environment.
Birds of a Feather
TP
XO/EX
DescriptionThe 2022 edition of the Americas High-Performance Computing (HPC) Collaboration BoF seeks to showcase collaborations that have resulted from the partnerships formed in previous editions. It will also present opportunities and experiences between different HPC Networks and Laboratories from countries in North, Central, South America, and the Caribbean. This BoF aims at showing the current state of the art in continental collaboration in HPC research, the latest developments of regional collaborative networks, and updating the roadmap for the next year for the Americas HPC partnerships.
Posters
Research Posters
TP
XO/EX
DescriptionThe fast Fourier Transform (FFT), a reduced-complexity formulation of the Discrete Fourier Transform (DFT), dominates the computational cost in many areas of science and engineering. Due to the scale of the data involved, multi-node heterogeneous systems aspire to meet the increasing demands of parallel FFT computation in the field of High-Performance Computing (HPC). In this work, we present a highly efficient GPU-based distributed FFT framework that adapts the Cooley-Tukey recursive FFT algorithm. Two major types of optimizations, automatic low-dimensional FFT kernel generation and an asynchronous strategy for multi-GPUs, are presented to enhance the performance of our approach for large-scale distributed FFT, and numerical experiments demonstrate that our work achieves more than 40x speedup over CPU FFT libraries and about 2x speedup over heFFTe, the currently available state of the art, on GPUs.
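For reference, the radix-2 Cooley-Tukey recursion that such frameworks distribute across GPUs fits in a few lines of single-node NumPy (input length assumed a power of two); this is an illustration of the algorithm, not the paper's implementation.

```python
import numpy as np

def fft(x):
    """Radix-2 Cooley-Tukey: an N-point DFT splits into two N/2-point DFTs
    on the even and odd samples, combined with twiddle factors."""
    n = len(x)
    if n == 1:
        return x
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = np.exp(-2j * np.pi * np.arange(n // 2) / n) * odd   # twiddled odd half
    return np.concatenate([even + tw, even - tw])

x = np.random.rand(1024).astype(complex)
assert np.allclose(fft(x), np.fft.fft(x))   # agrees with the library DFT
```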
Workshop
Recorded
W
DescriptionOur team is developing a series of AI Bootcamps for Cyberinfrastructure (CI) Professionals to increase support expertise for researchers with Artificial Intelligence (AI) workloads running at research computing facilities. We have completed the first six-week virtual program covering core foundational topics in AI and machine learning. Our next bootcamp is focused on CI professionals in software and data engineering roles. Our team comprises CI professionals and Computer Science and Engineering faculty, providing a comprehensive curriculum for the professional learner. We saw a great deal of enthusiasm among the CI professional community for this program, and those who attended rated it highly. We plan to refine the materials and make them generally available at the end of the project.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionWe demonstrate a continuous acceptance testing strategy used at NERSC that can be implemented in the broader HPC community. To accomplish this task, we designed a new framework that can handle the complex parts of HPC systems, allowing us to verify that a system is working optimally. buildtest [1] is an acceptance testing framework that can automate the testing of HPC systems and enable HPC support teams to painlessly create and run tests. Testing is initiated by changes to the system/software stack at scheduled system outages, which require NERSC staff to build, run, and monitor tests using GitLab's Continuous Integration (CI) [2]. Test results are clearly communicated to developers and users via the CDash [3] web interface, and test failures are documented as GitHub issues. Together this framework forms a robust method for verifying that cutting-edge software stacks function in challenging HPC environments.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionCryogenic electron microscopy (Cryo-EM) is a method applied to samples cooled to cryogenic temperatures that can reach near-atomic resolution of biological molecules. Recent progress in methodology has created an entirely new set of challenges to overcome - among them, the specific environment of the HPC system and the coordination and automation of the initial stages. Our solution is an automated Cryo-EM image pre-processing service tailored to an HPC environment, with close to real-time feedback allowing researchers to interact with a data acquisition session located in a facility remote from the HPC cluster. We automated the data transfer, created a service around the Pegasus Workflow Management System, kept the required user interaction to a minimum, and offered researchers the option to start the pre-processing right after initiating the microscope session. The users receive real-time feedback enabling them to interact with the data acquisition, adjust it, and collect a better dataset.
Workshop
Recorded
HPC Training and Education
W
DescriptionDelivering training and education on hybrid technologies (including AI, ML, GPU, data and visual analytics including VR, and quantum computing) integrated with HPC resources is key to enabling individuals and businesses to take full advantage of digital technologies, hence enhancing processes within organizations and providing the skills needed to thrive in a digital economy. Supercomputing centers focused on solving industry-led problems face the challenge of having a pool of users with little experience in executing simulations on large-scale facilities, as well as limited knowledge of advanced computational techniques and integrated technologies. We aim not only to educate them in using the available facilities, but also to raise awareness of methods that have the potential to increase their productivity. In this presentation, we provide our perspective on how to efficiently train industry users, how to engage them with wider digital technologies, and how these, used efficiently together, can benefit their business.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionExpanding upon their Scalable Vector Extension (SVE), Arm have introduced the Scalable Matrix Extension (SME) to improve in-core performance for matrix operations such as matrix multiplication. With no hardware or cycle-accurate simulators available that support SME, it is unclear how effective this new instruction set extension will be, and for what types of applications it will provide the most benefit.
By adapting The Simulation Engine (SimEng) from the University of Bristol’s High Performance Computing Group to support SME, we aim to compare the simulated performance of a Fujitsu A64FX core (with native SVE support) to a like-for-like hypothetical core with added SME support. By simulating a wide range of Streaming Vector Lengths for our hypothetical SME core model, we provide and discuss first-of-a-kind results for an SME implementation, before discussing future work that will be carried out to further evaluate the suitability of SME.
Posters
Research Posters
TP
XO/EX
DescriptionAutotuning is a widely used method for guiding developers of large-scale applications to achieve high performance. However, autotuners typically employ black-box optimizations to recommend parameter settings, at the cost of users missing the opportunity to identify performance bottlenecks. Performance analysis fills that gap and identifies problems and optimization opportunities that can result in better runtime and utilization of hardware resources. This work combines the best of both worlds by integrating a systematic performance analysis and visualization approach into a publicly available autotuning framework, GPTune, to suggest to users which configuration parameters are important to tune, to what values, and how tuning these parameters affects hardware-application interactions. Our experiments demonstrate that a subset of the task parameters impacts the execution time of the Hypre application, and that memory traffic and page faults cause performance problems in the Plasma-DGEMM routine on Cori-Haswell.
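One way to surface "which parameters matter" from black-box tuning data is to fit a surrogate model to the observed runtimes and inspect its feature importances. The sketch below illustrates this idea with scikit-learn on synthetic data; it is not GPTune's actual analysis pipeline, and the parameter names are invented.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))     # 200 tuning runs, 4 hypothetical parameters
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.05, size=200)  # runtime proxy
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
names = ["nthreads", "blocksize", "affinity", "sched_policy"]
for name, imp in sorted(zip(names, model.feature_importances_), key=lambda p: -p[1]):
    print(f"{name}: {imp:.2f}")    # the first parameter should dominate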
Workshop
Recorded
W
DescriptionWe present an analysis of the collection of user-support tickets that were created during nearly nine years of operation of the Blue Waters supercomputer. The analysis was based on information obtained from the Jira ticketing system and its corresponding queues. The paper contains a set of statistics showing, in quantitative form, the distribution of tickets across system areas. It also shows the computed metrics related to management of the tickets by our staff. Additionally, we present an analysis, based on machine learning and sentiment analysis techniques, conducted over the text entered in tickets, aimed at detecting trends in users' views and perspectives about Blue Waters. This kind of study, which is uncommon in the literature, could provide guidance for operators of future large systems about the expected volume of user support demanded by each system area, and about how to allocate support staff so that users receive the best possible assistance.
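As a toy illustration of the sentiment-analysis component (the paper's exact models are not shown here), an off-the-shelf scorer such as NLTK's VADER assigns each ticket a polarity score that can then be aggregated per system area or over time:

import nltk
nltk.download("vader_lexicon", quiet=True)     # one-time lexicon download
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
tickets = [
    "The scheduler has been rock solid this month, thanks!",
    "My jobs keep failing at scale and I am losing allocation time.",
]
for text in tickets:
    score = sia.polarity_scores(text)["compound"]  # compound score in [-1, 1]
    print(f"{score:+.2f}  {text}")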
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Recorded
TP
DescriptionOpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and accelerators has gained the model a wide variety of users. This means it is also crucial to analyze the reliability of the compilers' implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite to verify the OpenACC implementations across various compilers with an infrastructure for a more streamlined execution. This paper will: (a) cover the new developments since the last publication on the testsuite, (b) outline the use of the infrastructure, (c) discuss tests that highlight our workflow process, (d) analyze the results from executing the testsuite on various systems, and (e) outline future developments.
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionOpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and accelerators has gained the model a wide variety of users. This means it is also crucial to analyze the reliability of the compilers' implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite to verify the OpenACC implementations across various compilers with an infrastructure for a more streamlined execution. This paper will: (a) cover the new developments since the last publication on the testsuite, (b) outline the use of the infrastructure, (c) discuss tests that highlight our workflow process, (d) analyze the results from executing the testsuite on various systems, and (e) outline future developments.
Posters
Research Posters
TP
XO/EX
DescriptionNOvA is a world-leading neutrino physics experiment that is making measurements of fundamental neutrino physics parameters and performing searches for physics beyond the Standard Model. These measurements must leverage high performance computing facilities to perform data intensive computations and execute complex statistical analyses. We outline the NOvA analysis workflows we have implemented on the NERSC Cori and Perlmutter systems. We have developed an implicitly parallel data-filtering framework for high energy physics data based on pandas and HDF5. We demonstrate the scalability of the framework and the advantages of an aggregated monolithic dataset by using a realistic neutrino cross-section measurement. We also demonstrate the performance and scalability of the computationally intensive profiled Feldman-Cousins procedure for statistical analysis. This procedure performs statistical confidence interval construction based on non-parametric Monte Carlo simulation and was applied to the NOvA sterile neutrino search. We show that the NERSC Perlmutter system provides an order-of-magnitude computing performance gain over Cori.
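The filtering framework's central operation is predicate pushdown over HDF5 event tables, so that each reader touches only the rows it needs. A minimal pandas equivalent is shown below (column and key names are hypothetical; PyTables must be installed for the table format):

import pandas as pd

df = pd.DataFrame({"energy": [0.4, 1.7, 2.9], "slice_id": [1, 2, 3]})
# 'table' format with data_columns enables on-disk where= queries
df.to_hdf("events.h5", key="events", format="table", data_columns=["energy"])
selected = pd.read_hdf("events.h5", "events", where="energy > 1.0")
print(selected)                      # only the rows passing the predicate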
Birds of a Feather
TP
XO/EX
DescriptionParallel I/O performance can be a critical bottleneck for applications, yet users are often ill-equipped for identifying and diagnosing I/O performance issues. Increasingly complex hierarchies of storage hardware and software deployed on many systems only compound this problem. Tools that can effectively capture, analyze, and tune I/O behavior for these systems empower users to realize performance gains for many applications.
In this BoF, we form a community around best practices in analyzing parallel I/O and cover recent advances to help address the problem presented above, drawing on the expertise of users, I/O researchers, and administrators in attendance.
Workshop
Recorded
Reliability and Resiliency
W
DescriptionWith exascale computing, the number of components that comprise high-performance computing (HPC) systems has increased by more than 70%, leading to a shorter mean time between failure (MTBF) and larger power budgets. These issues induce the need for (1) checkpoint/restart (C/R) and (2) energy reduction techniques. C/R has evolved with different software and hardware advances; thus, it is crucial to understand how its energy usage differs under various storage tiers and synchronicity. In this paper, we present a comparison of the energy consumption of leading, state-of-the-art C/R libraries, VELOC and GenericIO. We perform weak and strong scalability tests of the C/R libraries and show that asynchronous C/R provides 4x greater throughput while using 33% less energy than synchronous C/R. Data size and throughput are directly correlated to energy consumption. Therefore, C/R developers should focus on ways to improve/maintain high throughput in order to reduce energy consumption to address exascale needs.
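The synchronous/asynchronous distinction the paper measures can be pictured in a few lines of Python (a conceptual stand-in, not the VELOC or GenericIO API): an asynchronous checkpoint writes in the background while the next compute step proceeds, and only the following checkpoint waits on it.

import threading, time

def write_checkpoint(state, path):
    with open(path, "wb") as f:
        f.write(state)                 # stand-in for a large serialized write

state = bytes(50_000_000)              # ~50 MB of application state
t0 = time.time()
writer = threading.Thread(target=write_checkpoint, args=(state, "ckpt.bin"))
writer.start()                         # asynchronous: compute continues now
time.sleep(0.5)                        # stand-in for the next simulation step
writer.join()                          # wait only before the next checkpoint
print(f"compute and checkpoint overlapped in {time.time() - t0:.2f}s")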
Workshop
Recorded
W
DescriptionCosmology simulations are among some of the largest simulations being currently run on supercomputers, generating terabytes to petabytes of data for each run. Consequently, scientists are seeking to reduce the amount of storage needed while preserving enough quality for analysis and visualization of the data. One of the most commonly used visualization techniques for cosmology simulations is volume rendering. Here, we investigate how different types of lossy error-bound compression algorithms affect the quality of volume-rendered images generated from reconstructed datasets. We also compute a number of image quality assessment metrics to determine which ones are the most effective at identifying artifacts in the visualizations.
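Two of the most common image quality assessment metrics in this setting are PSNR and SSIM, computed between a reference rendering and a rendering of the decompressed data. The sketch below uses a synthetic image and additive noise as a stand-in for compression artifacts:

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(1)
reference = rng.uniform(size=(256, 256))
degraded = reference + rng.normal(scale=0.02, size=reference.shape)
print("PSNR:", peak_signal_noise_ratio(reference, degraded, data_range=1.0))
print("SSIM:", structural_similarity(reference, degraded, data_range=1.0))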
Birds of a Feather
TP
XO/EX
DescriptionThe ultimate goal of outreach activities is to connect with individuals outside or at the periphery of the HPC community and empower them to become the next generation of HPC professionals. While most large centers and organizations have some outreach staff, many small HPC centers find the development and maintenance of an outreach program a serious challenge. This BoF session will gather HPC Outreach facilitators from across the community to share challenges, experiences, lessons learned and strategies for developing sustainable Outreach programs. The discussions will be captured into a shared document that will guide future community efforts.
Birds of a Feather
TP
XO/EX
DescriptionThe goal of this BoF is to introduce the HPC community to the RISC-V ecosystem and how it can enable research and development. We will start with a short panel presentation (20 minutes) on the status of the RISC-V HPC ecosystem. This will be followed by a Q&A session with the panel and audience members. There will be directed questions as well as ad hoc questions from the audience.
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionWhile many good development-oriented tools exist for analyzing and improving the performance of HPC applications, the capability to capture and analyze the dynamic behavior of applications in real production runs is lacking. Many heavily used applications do keep some internal metrics of their performance, but there is no unified way of using these. In this paper we present the initial idea of AppEKG, both a concept of and a prototype tool for providing a unified, understandable view of HPC application behavior in production. Our prototype AppEKG framework achieves less than 1% overhead, making it usable in production, while still providing dynamic data collection that captures time-varying runtime behavior.
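The design constraint here, keeping production overhead under 1%, suggests heartbeat-style instrumentation: O(1) counter updates on the hot path, flushed at coarse intervals. The following is a conceptual sketch of that pattern, not the AppEKG API:

import time
from collections import Counter

counts = Counter()
last_flush = time.time()

def heartbeat(name, flush_every=10.0):
    global last_flush
    counts[name] += 1                      # O(1) update on the hot path
    now = time.time()
    if now - last_flush >= flush_every:    # infrequent flush keeps overhead low
        print(int(now), dict(counts))      # emit a time-stamped sample
        counts.clear()
        last_flush = now

for step in range(1000):
    heartbeat("timestep")                  # called at application phase boundaries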
Workshop
Recorded
W
Workshop
Recorded
HPC Training and Education
W
DescriptionGiven the anticipated growth of the high-performance computing market, HPC is challenged with expanding the size, diversity, and skill of its workforce while also addressing post-pandemic distributed workforce protocols and an ever-expanding ecosystem of architectures, accelerators and software stacks.
As we move toward exascale computing, training approaches need to address how best to prepare future computational scientists and enable established domain researchers to stay current and master tools needed for exascale architectures.
This paper explores adding in-person and virtual hackathons to the training mix to bridge traditional programming curricula and hands-on skills needed among the diverse communities. We outline current learning and development programs available; explain benefits and challenges in implementing hackathons for training; share specific use cases, including training “readiness,” outcomes and sustaining progress; discuss how to engage diverse communities—from early career researchers to veteran scientists; and recommend best practices for implementing these events into their training mix.
Paper
Recorded
Post-Moore Computing
Quantum Computing
TP
DescriptionAs computer system technology approaches the end of Moore's law, new computing paradigms that improve performance become a necessity. One such paradigm is approximate computing (AC). AC can present significant performance improvements, but a challenge lies in providing confidence that approximations will not overly degrade the application output quality. In AC, application domain experts manually identify code regions amenable to approximation. However, automatically guiding a developer where to apply AC is still a challenge.
We propose Puppeteer, a novel method to rank code regions based on amenability to approximation. Puppeteer uses uncertainty quantification methods to measure the sensitivity of application outputs to approximation errors. A developer annotates possible application code regions and Puppeteer estimates the sensitivity of each region. Puppeteer successfully identifies insensitive regions on different benchmarks. We utilize AC on these regions and we obtain speedups of 1.18x, 1.8x, and 1.3x for HPCCG, DCT, and BlackScholes, respectively.
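The underlying sensitivity test can be pictured as Monte Carlo perturbation: inject small errors into one candidate region at a time and observe how much the application output moves. The toy below (our illustration, far simpler than Puppeteer's uncertainty quantification machinery) ranks two regions of a made-up "application":

import numpy as np

def app(x, noise=(0.0, 0.0)):
    a = np.sin(x) + noise[0]               # candidate region A
    b = 1e-3 * np.cos(x) + noise[1]        # candidate region B
    return (a + b).sum()                   # application output

x = np.linspace(0.0, 1.0, 1000)
exact = app(x)
rng = np.random.default_rng(2)
for i, name in enumerate(["region_A", "region_B"]):
    errs = []
    for _ in range(100):
        noise = [0.0, 0.0]
        noise[i] = rng.normal(scale=0.01)  # perturb only this region
        errs.append(abs(app(x, tuple(noise)) - exact))
    print(name, "mean output error:", np.mean(errs))  # lower = safer to approximate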
Workshop
Recorded
Architectures
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionUpdate on the Status of Argonne's New and Expected Systems.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF brings together the Arm HPC community to discuss how current and future standards will influence the growing diversity of Arm-related hardware and software. A panel composed of government, academic, and industry practitioners and vendors will discuss whether hardware standards (e.g., Armv9 and SBSA) and software standards (e.g., C++ Standard Parallelism and OpenMP) can sufficiently support the growing and diverse Arm hardware ecosystem. Audience participation is strongly encouraged with a focus on answering standards-related questions and facilitating the growth and interoperability of future Arm-based extreme scale systems.
Posters
Research Posters
TP
XO/EX
DescriptionHistorical temperature measurements are the basis of important global climate datasets, such as HadCRUT4 and HadCRUT5, used to analyze climate change. These datasets contain many missing values and use low-resolution grids. Here we demonstrate that artificial intelligence can skillfully fill these observational gaps and upscale the data when combined with numerical climate model output. We show that recently developed image inpainting techniques perform accurate reconstructions via transfer learning. In addition, higher resolution has always been a common and ongoing goal of the weather and climate community. We obtain a neural network that reconstructs and downscales these important observational datasets (IPCC AR6) at the same time, which is unique and state-of-the-art in climate research.
Birds of a Feather
TP
XO/EX
DescriptionAs ASEAN's significance in the global landscape has risen, so has its HPC. Multiple world-class supercomputers are now being planned and deployed, and a growing set of users is conducting cutting-edge science. ASEAN has officially sanctioned an "HPC Task Force" among its coalition of major stakeholders to formulate a collective HPC infrastructure, federate it with advanced tools, and collaborate with other regions, e.g., Japan with Fugaku, as well as through a joint HPC school with Europe and Japan. The BoF will present the status quo of ASEAN HPC and discuss further outreach of ASEAN HPC to the global HPC community.
Workshop
Recorded
W
DescriptionSince 2009, Amazon has offered its unused compute capacity as AWS Spot Instances. For the first eight years of spot, pure market dynamics and high pricing variability created an ideal environment for time-series prediction. Following a pricing-scheme change in 2017, this extreme variability was removed, as pricing is artificially smoothed for the end user, making it significantly easier to accurately predict prices. Nevertheless, the literature demonstrates ongoing efforts to accurately predict spot prices. To show that prediction in the modern spot market is unnecessary, we train nearly 2.2 million ARIMA models on new and old data to demonstrate an order of magnitude improvement in accuracy for models trained on new data. Further, we show this new ease of price prediction makes spot instances ideal for large-scale, cost-aware cloud computing, as cost estimation is now trivial. Accordingly, we demonstrate that even naive prediction approaches waste less than $360 for 1,000,000 core hours.
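Fitting one such model is straightforward with statsmodels; the series below is synthetic (a slow sawtooth with small noise, mimicking post-2017 smoothed pricing), and the ARIMA order is illustrative rather than the paper's choice.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
prices = 0.10 + (0.001 * np.arange(500)) % 0.02 + rng.normal(scale=1e-4, size=500)
result = ARIMA(prices, order=(1, 1, 1)).fit()
forecast = result.forecast(steps=24)      # predict the next 24 price points
print(forecast[:5])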
Workshop
Recorded
W
DescriptionMany of Los Alamos National Laboratory's HPC codes are memory bandwidth bound. These codes exhibit high levels of sparse memory access which differ significantly from standard benchmarks. In this paper we present an analysis of the memory access of some of our most important code-bases. We then generate micro-benchmarks that preserve the memory access characteristics of our codes using two approaches, one based on statistical sampling of relative memory offsets in a sliding time window at the function level and another at the loop level. The function-level approach is used to assess the impact of advanced memory technologies such as LPDDR5 and HBM3 using the gem5 simulator. Our simulation results show significant improvements for sparse memory access workloads using HBM3 relative to LPDDR5, and better scaling on a per-core basis. Assessment of two different architectures shows that higher peak memory bandwidth results in higher bandwidth on sparse workloads.
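The function-level statistic is easy to state precisely: within a sliding window of recent accesses, count how often each relative offset occurs. A toy version over a byte-address trace (our illustration, not the paper's tooling):

from collections import Counter, deque

def offset_histogram(trace, window=8):
    hist = Counter()
    recent = deque(maxlen=window)
    for addr in trace:
        for prev in recent:
            hist[addr - prev] += 1         # relative offset within the window
        recent.append(addr)
    return hist

trace = [0, 64, 128, 4096, 4160, 8192]     # toy address trace (bytes)
for off, n in offset_histogram(trace).most_common(3):
    print(f"offset {off:+d}: seen {n} times")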
Job Posting
DescriptionRenaissance Computing Institute (RENCI) is seeking an Assistant Director of Analytics and Data Science to develop an independent research program in Data Science including artificial intelligence, machine learning, knowledge graphs and other analytical methods. The individual is expected to support and manage biomedical and environmental research projects, aid in management of the Analytics and Data Science team, and mentor, guide and evaluate several direct reports.
The Assistant Director will provide technical leadership, setting the analytical direction of research projects. The Assistant Director will apply their Data Science expertise to help advance RENCI’s research portfolio in Data Science, leading the development and execution of new research projects and proposals, both independently and in collaboration with internal and external partners.
The Assistant Director will develop algorithms and tools involving, but not limited to: image analysis, natural language processing, graph analysis, question answering, and semantic search. Prior experience in one or more of these areas is essential.
The Assistant Director will work with colleagues with expertise in data analytics, advanced computing (including cloud computing systems), software engineering, and domain expertise in the biomedical and environmental sciences. The Assistant Director will work in interdisciplinary teams, both within and beyond RENCI, promoting innovation and collaboration.
Responsibilities:
- Develop independent research portfolio in area of expertise
- Manage research projects, and provide analytical direction
- Work with colleagues to develop and apply tools and methods
- Collaborate with software engineers and researchers to help design analytical systems to advance scientific discovery
- Provide leadership in user engagement and experience
- Contribute to interdisciplinary teams
- Train data contributors on the application of data standards and relevant tools
- Develop proposals and business development efforts
Job Posting
DescriptionJob Summary
The Associate Director of the Integrated Cyberinfrastructure (ICI) Directorate is a senior member of the NCSA Director's Office that works as part of a team on both setting NCSA strategy and executing on tactical directions. This position is actively engaged with leaders across campus with initiative development and follow-through as well as with other academic institutions and industry leaders. This position provides experienced management of technology development in support of advanced applications and communities across the disciplines, with the goal of delivering functional advanced technologies to NCSA's academic and industrial users.
Duties & Responsibilities
Strategic Direction and Leadership (50%)
• Set strategic direction for and provide leadership of NCSA’s Integrated Cyberinfrastructure (ICI) Directorate and the groups within it. This includes developing strategic plans for ICI and implementing policies and procedures to bring the strategic plans to fruition.
• Direct the ICI Directorate budgets and Memorandums of Understanding (MOUs) with regards to the NCSA / ICI mission and vision.
• Create a diverse and inclusive working environment for ICI staff that fosters collaboration, operational excellence, and innovation.
• Provide growth opportunities for ICI staff by creating transparent career paths and promoting professional development opportunities.
• Evaluate internal operating policies and procedures relating to ICI Directorate. Determine what changes / improvements should be made and implement said changes / improvements.
• Supervise ICI Division managers, including establishing strategic initiatives, assigning project tasks, staffing, setting goals and evaluating performance. The Associate Director will also work to empower managers and employees, define the broader context of their work, and explain how the team’s work contributes not only to the success of the ICI Directorate, but to NCSA as a whole.
• Participate in discussions with the NCSA Executive Committee and the Director on strategic issues to best position the Center for accomplishing its mission.
• Coordinate with Senior Associate Directors, Associate Directors, and team leaders to coordinate and leverage staff and expertise which may be useful between multiple NCSA divisions and groups.
• Direct and support staff in the proper implementation of University policy and procedures.
Engagement and Outreach (25%)
• Represent NCSA at key national and international meetings and in the community of cyberinfrastructure developers and relevant standards bodies.
• Represent NCSA in interactions with existing and prospective collaborators.
• Serve as an NCSA representative on advanced technology in support of science and engineering.
• Advance technology collaborations between ICI staff and UIUC faculty and students.
• Assist in supporting the business technology needs of NCSA.
• Provide input to national funding agencies for the creation of opportunities appropriate to advance efforts in support of NCSA’s mission and vision.
Research and Proposal Development (25%)
• Identify, develop, assess, and pursue funding opportunities that will advance NCSA’s mission, vision and strategic goals.
• Lead proposals and assist ICI Division managers on proposals to funding agencies, such as the National Science Foundation, Department of Energy, National Institutes of Health, etc. to develop advanced cyberinfrastructure.
• Lead ICI managers and staff in the development, deployment, and support of an advanced HPC/Data research environment.
Workshop
Recorded
Applications
Architectures
Heterogeneous Systems
Hierarchical Parallelism
Parallel Programming Languages and Models
Performance
Performance Portability
Scientific Computing
W
DescriptionWith imbalances caused by both software and ever more complex hardware, applications and runtime systems must adapt to dynamic load imbalances. We present a diffusion-based, reactive, fully asynchronous, and decentralized dynamic load balancer for a distributed actor library. With its asynchronous execution model, features such as remote procedure calls, and support for serialization of arbitrary types, UPC++ is especially well suited to implementing the actor model. While providing a substantial speedup for small- to medium-sized jobs with both predictable and unpredictable workload imbalances, the scalability of the diffusion-based approaches remains below expectations in most presented test cases.
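Diffusive load balancing has a compact mathematical core: each rank repeatedly exchanges a fraction of its load difference with its neighbors, so load spreads like heat. The synchronous toy below (the paper's balancer is asynchronous and decentralized) shows convergence on a ring of six ranks:

import numpy as np

load = np.array([100.0, 10.0, 10.0, 10.0, 10.0, 10.0])   # actors per rank
alpha = 0.25                                # diffusion coefficient
for step in range(40):
    left, right = np.roll(load, 1), np.roll(load, -1)
    load = load + alpha * ((left - load) + (right - load))
print(np.round(load, 1))                    # converges toward the mean (25 each)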
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionThis presentation consists of two parts: a discussion of the SX-Aurora TSUBASA vector supercomputer, and an introduction to a digital annealer running on SX-Aurora TSUBASA called Vector Annealing. The first half of the presentation covers the vector architecture of SX-Aurora TSUBASA, especially its latest vector processors, which provide the highest level of memory bandwidth. Sustained performance and power efficiency are also discussed, as well as NEC’s future plans and roadmap. The second half of the presentation covers NEC’s quantum computing strategies and products for providing higher sustained performance in the annealing/optimization fields. NEC developed Vector Annealing as a digital annealer and has a strong business relationship with D-Wave, provider of a quantum annealer. NEC aims to solve various social issues by using quantum/digital annealing technologies and by developing a hybrid platform combining a supercomputer with a quantum/digital annealer to provide much higher sustained performance.
Workshop
Recorded
W
DescriptionX-ray Bragg coherent diffraction imaging (BCDI) is widely used for materials characterization. However, obtaining X-ray diffraction data is difficult and computationally intensive. Here, we introduce a machine learning approach to identify crystalline line defects in samples from the raw coherent diffraction data. To automate this process, we compose a workflow coupling coherent diffraction data generation with the training and inference of deep neural network defect classifiers. In particular, we adopt a continual learning approach, where we generate training and inference data as needed, based on the accuracy of the defect classifier, instead of generating all training data a priori. The results show that our approach improves the accuracy of defect classifiers while using far fewer data samples.
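The continual learning loop itself is simple to express: keep generating simulated training data only while the classifier's validation accuracy is below target. The sketch below uses scikit-learn and a synthetic stand-in for the diffraction simulator:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

def simulate(n):                            # stand-in for diffraction simulation
    X = rng.normal(size=(n, 16))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # toy "defect" label
    return X, y

X_val, y_val = simulate(500)
X_tr, y_tr = simulate(50)
clf = LogisticRegression().fit(X_tr, y_tr)
while clf.score(X_val, y_val) < 0.95:       # generate data only as needed
    X_new, y_new = simulate(50)
    X_tr, y_tr = np.vstack([X_tr, X_new]), np.concatenate([y_tr, y_new])
    clf = LogisticRegression().fit(X_tr, y_tr)
print("training samples needed:", len(y_tr))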
Workshop
Recorded
Quantum Computing
W
DescriptionCurrent quantum computers suffer from noise that prohibits extracting useful results directly from longer computations. The figure of merit is often an expectation value, which experiences a noise-induced bias. A systematic way to remove such bias is probabilistic error cancellation (PEC). PEC requires noise characterization and introduces an exponential sampling overhead.
Probabilistic error reduction (PER) is a related method that systematically reduces the overhead. In combination with zero-noise extrapolation, PER can yield expectation values with an accuracy comparable to PEC. We present an automated quantum error mitigation software framework that includes noise tomography and application of PER to user-specified circuits. We provide a multi-platform Python package that implements a recently developed Pauli noise tomography technique and exploits a noise scaling method to carry out PER. We also provide software that leverages a previously developed toolchain, employing PyGSTi for gate set tomography and Mitiq for PER.
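Zero-noise extrapolation, with which PER is combined, admits a few-line illustration: measure the expectation value at several amplified noise levels, fit the decay, and read off the zero-noise intercept. The numbers below are synthetic.

import numpy as np

noise_scales = np.array([1.0, 1.5, 2.0, 3.0])
ideal = 0.80                                     # true value (unknown in practice)
measured = ideal * np.exp(-0.3 * noise_scales)   # exponentially damped signal
# Linear fit in log space recovers the zero-noise intercept
slope, intercept = np.polyfit(noise_scales, np.log(measured), 1)
print("mitigated estimate:", np.exp(intercept))  # recovers ~0.80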
Workshop
Recorded
Quantum Computing
W
DescriptionEmerging quantum algorithms that process data require that classical input data be represented as a quantum state. These data-processing algorithms often follow the gate model of quantum computing---which requires qubits to be initialized to a basis state, typically |0> ---and thus often employ state generation circuits to transform the initialized basis state to a data-representation state. There are many ways to encode classical data in a qubit, and the oft-applied approach of basis encoding does not allow optimization to the extent that other variants do. In this work, we thus consider automatic synthesis of addressable, quantum read-only memory (QROM) circuits, which act as data-encoding state-generation circuits. We investigate three data encoding approaches, one of which we introduce to provide improved dynamic range and precision. We present experimental results that compare these encoding methods for QROM synthesis to better understand the implications of and applications for each.
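Of the encoding variants, angle encoding is the easiest to show compactly: each normalized value becomes a single-qubit rotation angle, and the value is recoverable from measurement statistics. This numpy sketch is our illustration of the general idea, not the paper's QROM synthesis method:

import numpy as np

data = np.array([0.2, -0.5, 0.9])
vals = data / np.abs(data).max()            # normalize into [-1, 1]
thetas = np.arcsin(vals)                    # choose theta so sin(theta) = value
for v, t in zip(vals, thetas):
    state = np.array([np.cos(t), np.sin(t)])    # Ry(2*theta)|0>
    p_one = state[1] ** 2                       # probability of measuring |1>
    print(f"value {v:+.2f} -> P(|1>) = {p_one:.3f}")   # equals value**2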
Workshop
Recorded
W
DescriptionUse of heterogeneous architectures has steadily increased during the past decade. However, non-homogeneous systems present a challenge to the programming model, as the execution models of CPU and accelerator might differ considerably. OpenMP, since version 4.0, has been trying to bridge this gap by allowing a code block to be offloaded to a target device. Among the additions to the OpenMP offloading API since then, the most notable is probably asynchronous execution between device and host. By default, offloaded regions are executed synchronously, so the host thread blocks until their completion. The nowait clause allows work to overlap between the host and target device. However, nowait must be added manually by the user, along with the task's data dependencies and appropriate synchronization to avoid race conditions, increasing program complexity and developer burden.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionProvenance registration is becoming more and more important as we increase the size and number of experiments performed using computers. In particular, when provenance is recorded in HPC environments, it must be efficient and scalable. We propose a provenance registration method for scientific workflows that is efficient enough to run on supercomputers (and thus could run in other environments with more relaxed restrictions, such as distributed ones). It must also be scalable in order to deal with the large workflows typical of HPC. We also target transparency for the user, shielding them from having to specify how provenance must be recorded. We implement our design using the COMPSs programming model as a Workflow Management System (WfMS) and use RO-Crate as a well-established standard to record provenance. Experiments are provided, demonstrating the efficiency and scalability of our solution.
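RO-Crate records provenance as a JSON-LD file (ro-crate-metadata.json) describing a run's files and their relations. A minimal, hand-rolled example follows; crates generated by the actual COMPSs integration carry far richer metadata (inputs, outputs, software, agents):

import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
         "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
         "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset",
         "name": "Example COMPSs workflow run",          # hypothetical run
         "hasPart": [{"@id": "results/output.dat"}]},
    ],
}
with open("ro-crate-metadata.json", "w") as f:
    json.dump(crate, f, indent=2)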
Workshop
Recorded
W
DescriptionThis presentation introduces a Cloud orchestrator controller that enables the autoscaling of containerized HPC clusters in the Cloud. The controller triggers the creation or removal of containerized HPC compute nodes according to metrics collected from the containerized HPC scheduler's job queue. Our approach modifies neither the Cloud orchestrator nor the HPC scheduler. The scheme is generic and can be applied to any HPC scheduler. Moreover, containerization improves experimental reproducibility by adding the HPC scheduler itself to the environment replayed by the end user. The presentation exemplifies Cloud and HPC convergence, allowing a high degree of flexibility for users and community platform developers. It also explores continuous integration/deployment approaches from Cloud computing to orchestrate multiple, potentially different, HPC job schedulers that scale under the supervision of the Cloud orchestrator.
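The controller's reconciliation logic reduces to mapping observed queue depth to a desired node count and letting the orchestrator converge to it. In the sketch below, queue_depth() and scale_to() stand in for scheduler and orchestrator calls; all names and ratios are hypothetical.

import math

def desired_nodes(queue_depth, jobs_per_node=4, max_nodes=32):
    return max(1, min(max_nodes, math.ceil(queue_depth / jobs_per_node)))

# Control loop (against the hypothetical scheduler/orchestrator APIs):
# while True:
#     scale_to(desired_nodes(queue_depth()))   # create/remove node containers
#     sleep(60)
print(desired_nodes(10))                        # 10 queued jobs -> 3 nodes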
Birds of a Feather
TP
XO/EX
DescriptionHPC centers around the world use benchmarks to evaluate their machines and to engage with vendors during procurement. The goal of this BoF is twofold. First, a series of short presentations will gather information on the state of the art methodologies for creating and validating the benchmarking sets. Second, an open discussion will gather community feedback on pitfalls of the current methodologies and how these methodologies should evolve to accommodate the growing diversity of the computational workloads and HPC architectures. The intended audience is HPC application developers and users, teams benchmarking HPC data centers, HPC vendors, and performance researchers.
Meeting_notes
Workshop
Recorded
Applications
Architectures
Benchmarking
Exascale Computing
Modeling and Simulation
Performance
Performance Portability
W
DescriptionFortran DO CONCURRENT has emerged as a new way to achieve parallel execution of loops on CPUs and GPUs. This paper studies the performance portability of this construct on a range of processors and compares it with the incumbent models: OpenMP, OpenACC and CUDA. To do this study fairly, we implemented the BabelStream memory bandwidth benchmark from scratch, entirely in modern Fortran, for all of the models considered, which include Fortran DO CONCURRENT, as well as two variants of OpenACC, four variants of OpenMP (2 CPU and 2 GPU), CUDA Fortran, and both loop- and array-based references. BabelStream Fortran matches the C++ implementation as closely as possible, and can be used to make language-based comparisons. This paper represents one of the first detailed studies of the performance of Fortran support on heterogeneous architectures; we include results for AArch64 and x86_64 CPUs as well as AMD, Intel and NVIDIA GPU platforms.
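For readers unfamiliar with the benchmark, the kernel at BabelStream's core is the STREAM triad. The numpy version below (our illustration, not one of the paper's Fortran variants) shows the arithmetic and a naive bandwidth estimate:

import time
import numpy as np

n = 2**24
b, c, scalar = np.ones(n), np.ones(n), 0.4
t0 = time.time()
a = b + scalar * c                          # triad: read b and c, write a
dt = time.time() - t0
print(f"{3 * n * 8 / dt / 1e9:.1f} GB/s")   # three 8-byte arrays moved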
Tutorial
Recorded
Cloud and Distributed Computing
Containers
Datacenter
Productivity Tools
Resource Management and Scheduling
Software Engineering
TUT
DescriptionThe use of cloud computing technologies in HPC has grown considerably during the last few years. The complexity and scale that come with cloud environments can make the first experience a daunting proposition. Cloud technologies offer a number of new capabilities to streamline tasks for HPC users and administrators; however, how to use these in HPC may not be immediately clear.
This tutorial provides a foundation to run HPC workloads in the cloud. It is organized as four series of progressive lectures and labs that provide a hands-on learning experience. It starts with a primer on cloud foundations and how they map to common HPC concepts, dives deeper into cloud core components, and presents best practices to run HPC in the cloud.
This tutorial uses a combination of lectures and hands-on labs on provided temporary Amazon Web Services (AWS) accounts to provide both conceptual and hands-on learning.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionJoin this technical deep dive into Google Cloud’s latest high-performance computing (HPC) advancements, covering the latest VMs, processors, accelerators, and storage solutions. We’ll also discuss our new HPC tools for deploying and managing your HPC environments, and how our customers are benefiting from running their HPC in the cloud.
Birds of a Feather
TP
XO/EX
DescriptionGiven the anticipated growth of the HPC market, HPC is challenged with expanding the size, diversity, and skill of its workforce. As we move toward exascale computing, how best do we prepare future computational scientists, and enable established domain researchers to stay current and master tools needed for exascale architectures?
This BoF invites scientists, researchers, trainers, educators, and the RSEs that support them to discuss current learning and development programs, explore adding in-person and virtual hackathons to existing training modalities, and brainstorm implementation strategies to bridge between traditional programming curricula and hands-on skills needed by diverse communities within different environments.
Tutorial
Recorded
Applications
Computational Science
Productivity Tools
Software Engineering
TUT
DescriptionProducing scientific software is a challenge. The high-performance modeling and simulation community, in particular, faces the confluence of disruptive changes in computing architectures and new opportunities (and demands) for greatly improved simulation capabilities, especially through coupling physics and scales. Simultaneously, computational science and engineering (CSE), as well as other areas of science, are experiencing an increasing focus on scientific reproducibility and software quality.
Computer architecture changes require new software design and implementation strategies, including significant refactoring of existing code. Reproducibility demands require more rigor across the entire software endeavor. Code coupling requires aggregate team interactions including integration of software processes and practices. These challenges demand large investments in scientific software development and improved practices. Focusing on improved developer productivity and software sustainability is both urgent and essential.
This tutorial will provide information about software practices, processes, and tools explicitly tailored for CSE and HPC. Goals are improving the productivity of those who develop CSE software, increasing the sustainability of software artifacts, and trustworthiness in their use. Topics include the software processes for (small) teams, including agile processes, collaboration via version control workflows, reproducibility, and scientific software design, refactoring, and testing (including test design strategies and continuous integration).
Invited Talk
Recorded
TP
XO/EX
DescriptionStorage and compute technologies are no longer improving at pace with exponentially growing global demand. The world’s largest data storage stakeholders already face hard choices about what data to keep in the face of limited capacity, and compute stakeholders are rapidly approaching the resource scaling limits of massive data centers for training the largest AI models.
Biology offers a guide for solving these problems. Living systems store information in DNA with extraordinary density, enough to store all the world’s data in one small room. Living systems also implement natural intelligence – still an aspirational goal for AI – using low-power neural circuit “wetware” that fits between our ears. If we can understand and exploit these capabilities, we can overcome the scaling issues facing the HPC field.
In this talk, I will describe IARPA’s high-risk, high-payoff research programs to address fundamental problems in storage and computing using biology as a guide. This includes the Molecular Information Storage (MIST) program, which is developing DNA data storage technologies that will eventually allow us to store exabytes of data in a tabletop form factor, and the Machine Intelligence from Cortical Networks (MICrONS) program, which has densely mapped the structure and function of neural circuits to guide the development of next-generation computing architectures.
Paper
Recorded
Big Data
Computational Science
TP
DescriptionOut-of-core graph processing is an attractive solution for processing very large graphs that do not fit in the memory of a single machine. The new class of ultra-low-latency SSDs should expand the impact and utility of out-of-core graph processing systems. However, current out-of-core systems cannot fully leverage the high IOPS these devices can deliver.
We introduce Blaze, a new out-of-core graph processing system optimized for ultra-low-latency SSDs. Blaze offers high-performance out-of-core graph analytics by constantly saturating these fast SSDs with a new scatter-gather technique called online binning that allows value propagation among graph vertices without atomic synchronization. Blaze offers succinct APIs to allow programmers to write efficient out-of-core graph algorithms without the burden of managing complex IO execution. Our evaluation shows that Blaze outperforms current out-of-core systems by a wide margin on six datasets and a set of representative graph queries on Intel Optane SSD.
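The out-of-core pattern itself is simple: stream fixed-size chunks of the edge list from storage and update per-vertex state as each chunk arrives, rather than loading the whole graph. A toy degree computation in that style (our illustration; Blaze additionally avoids atomics with its online binning technique):

import numpy as np

edges = np.array([[0, 1], [1, 2], [2, 0], [2, 3]], dtype=np.int64)
edges.tofile("edges.bin")                     # toy on-disk edge list

degree = np.zeros(4, dtype=np.int64)          # per-vertex in-degree
with open("edges.bin", "rb") as f:
    while True:
        chunk = np.fromfile(f, dtype=np.int64, count=4)  # 2 edges per read
        if chunk.size == 0:
            break
        for src, dst in chunk.reshape(-1, 2):
            degree[dst] += 1                  # value propagation (gather)
print(degree)                                 # -> [1 1 1 1]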
Workshop
Recorded
W
DescriptionThe choice of programming model for accelerated computing applications depends on a wide range of factors, which weigh differently across application domains, institutions, and even countries. Why does one application use standard programming languages like C++, while another uses embedded programming models like Kokkos or directives such as OpenACC, and yet another directly programs in vendor-specific languages like CUDA or HIP? This panel will work through a comparison of the various choices, and share hands-on experience from developers in different countries and fields of expertise. We’ll explore both technical and non-technical reasons for how the various approaches are mixed. Join us for a fun and insightful session!
Workshop
Recorded
Accelerator-based Architectures
Algorithms
Architectures
Big Data
Data Analytics
Parallel Programming Languages and Models
Productivity Tools
W
DescriptionResearch to accelerate matrix multiplication, pushed by the growing computational demands of deep learning, has sprouted many efficient architectural solutions, such as NVIDIA’s Tensor Cores. These accelerators are designed to process efficiently a high volume of small dense matrix products in parallel. However, it is not obvious how to leverage these accelerators for sparse matrix multiplication. A natural way to adapt the accelerators to this problem is to divide the matrix into small blocks, and then multiply only the nonzero blocks. In this paper, we investigate ways to reorder the rows of a sparse matrix to reduce the number of nonzero blocks and cluster the nonzero elements into a few dense blocks. While this pre-processing can be computationally expensive, we show that the high speed-up provided by the accelerators can easily repay the cost, especially when several multiplications follow one reordering.
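The quantity being minimized is easy to compute: the number of block-size-aligned tiles containing at least one nonzero. The scipy sketch below counts tiles before and after a crude row reordering (sorting rows by their nonzero count); the paper's reordering strategies are more sophisticated.

import numpy as np
from scipy.sparse import random as sprand

def nonzero_blocks(A, bs=4):
    rows, cols = A.tocsr().nonzero()
    return len(set(zip(rows // bs, cols // bs)))   # tiles touched by nonzeros

A = sprand(64, 64, density=0.05, random_state=5).tocsr()
perm = np.argsort(np.diff(A.indptr))               # sort rows by nonzero count
print("blocks before:", nonzero_blocks(A), "after:", nonzero_blocks(A[perm]))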
Workshop
Recorded
HPC Training and Education
W
DescriptionThe Blue Waters project pursued activities focused on national scale education, outreach, and training. The activities began in 2009. During 2022, the final year of the project, the team is focused on documenting the impact on the national community, lessons learned, and recommendations for programs that adopt/adapt similar activities.
The presentation to the attendees at this workshop will include the impact, lessons learned, and recommendations based on our experiences. If accepted, a full paper will be submitted for publication in the Journal of Computational Science Education that will expand upon the information provided in the presentation.
Paper
Recorded
Accelerator-based Architectures
Performance
Visualization
TP
DescriptionOptimizing application performance in today's hardware architecture landscape is an important, but increasingly complex task, often requiring detailed performance analyses. In particular, data movement and reuse play a crucial role in optimization and are often hard to improve without detailed program inspection. Performance visualizations can assist in the diagnosis of performance problems, but generally rely on data gathered through lengthy program executions. In this paper, we present a performance visualization geared toward analyzing data movement and reuse to inform impactful optimization decisions, without requiring program execution. We propose an approach that combines static dataflow analysis with parameterized program simulations to analyze both global data movement and fine-grained data access and reuse behavior, and visualize insights in-situ on the program representation. Case studies analyzing and optimizing real-world applications demonstrate our tool's effectiveness in guiding optimization decisions and making the performance tuning process more interactive.
Students@SC
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionAs High Performance Computing (HPC) moves from a specialist science to an everyday commodity, there is still an unreasonably large barrier to entry for new users. Traditionally, getting access to HPC resources is both expensive and time consuming, and once you have access, moving between clusters is equally cumbersome.
The Alces Flight team has experimented with various concepts in the pursuit of the question, “How can we lower the barrier to entry for HPC users?” Starting in 2015, the team explored a free subscription model and the impact/usage by an individual user on public cloud, from which the base knowledge of the OpenFlightHPC open-source project emerged in 2019.
OpenFlightHPC is an open-source community developing a flexible, functional and stable HPC stack that can be launched on any platform. The project provides the knowledge and toolsets needed for HPC environment creation in a manner that anyone with basic-level HPC experience can utilize. The toolset assists in helping to create more portable HPC environments using process standardization to promote free interchange of knowledge for shared benefit.
This presentation covers:
- The importance of learning through experimentation and successful failures.
- The community and cultural shifts in people, skills, and sustainability that are feeding the need for greater flexibility in HPC.
- How OpenFlightHPC works, including bare-metal and cloud deployment techniques, process automation using tools including Ansible and Salt, and portability of workloads both in container and shared environments.
Workshop
Recorded
W
DescriptionHigh Performance Computing (HPC) is playing an increasingly important role in industry, research, and everyday life. A central pillar of the European HPC strategy is the Modular Supercomputing Architecture (MSA), which breaks with traditional HPC architectures by integrating heterogeneous computing resources in system-level modules. Nevertheless, HPC content, and MSA content in particular, only rarely finds its way into the computer science curriculum at German universities. In addition, the competencies needed for independent scientific research are hardly addressed, although these skills are essential for students writing their final theses.
We present a blended learning based module concept that promotes the understanding and application of modular supercomputing while connecting it with the techniques of scientific project work. The module was first implemented at Goethe University in Summer 2022. The initial feedback and evaluation results are quite encouraging both in terms of learning outcomes and student engagement and interest.
Birds of a Feather
TP
XO/EX
DescriptionScientific advances designed to address global challenges require researchers to have seamless access to data, computing, and increasingly high performance computing. A certain disconnect has characterized the relationship between the HPC and data communities, and this needs to be addressed in order to fully support today's data- and compute-intensive science. The session will openly explore the sociotechnical and technical differences between the two communities and describe open challenges on the path to closer collaboration. One BoF outcome is to draw in ‘HPC-oriented’ colleagues who wish to learn more or be more aligned with the data community.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionKubernetes has become the de-facto tool for orchestrating containerized workloads, and AI workloads are no different. But can an orchestrator built for long-running (micro)-services meet the needs of research experimentation and simulations? Can IT easily incorporate K8s into their AI & HPC workflows?
Join Gijsbert Janssen van Doorn of Run:ai for a crash course in Kubernetes for AI & HPC. Learn what’s working, what’s not, and some fixes for supporting these demanding environments with K8s.
In this session we will:
- Explain how and why Kubernetes is the top choice for AI & HPC workloads
- See where Kubernetes is challenged when it comes to AI & HPC workloads
- See how using GPUs instead of CPUs can accelerate your development cycles
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
DescriptionAPEX (Autonomic Performance Environment for eXascale) is a performance measurement library for distributed, asynchronous multitasking runtime systems. It provides support for both lightweight measurement and high concurrency. To support performance measurement in systems that employ user-level threading, APEX uses a dependency chain in addition to the call stack to produce traces and task dependency graphs. APEX also provides a runtime adaptation system based on the observed system performance. In this paper, we describe the evolution of APEX from its design for HPX to support an array of programming models and abstraction layers and describe some of the features that have evolved to help understand the asynchrony and high concurrency of asynchronous tasking models.
Workshop
Recorded
W
DescriptionA computational storage device (CSD) must support background tasks for storage service applications without harming user I/O performance (foreground I/O). In practice, however, SPDK often increases foreground I/O latencies and under-utilizes CPU cores in the CSD. These problems stem from allocating foreground I/Os and background tasks to the same CPU core, because SPDK processes them as the same kind of request without distinguishing them. To tackle this, we propose a Background Task-aware Scheduler (BTS) for CSDs built using SPDK. BTS solves the following problems: (i) idle CPU cores in the CSD are not used, and (ii) the latency of foreground I/O increases due to interference with background tasks. For evaluation, we implemented a key-value interface CSD using SPDK. With BTS, the results show that idle CPUs are used to process background tasks while guaranteeing low foreground I/O latency when the background task is deduplication.
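As a conceptual illustration of the scheduling policy (plain Python, not SPDK's API or BTS's actual implementation), the sketch below dispatches foreground I/Os to reserved cores and routes background tasks to the remaining cores, so the two request classes never share a core:

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Core:
    cid: int
    queue: Queue = field(default_factory=Queue)

class BTSDispatcher:
    """Toy background-task-aware dispatcher: foreground I/O gets a
    reserved low-latency lane; background work soaks up the other cores."""
    def __init__(self, n_cores, n_reserved_for_fg):
        cores = [Core(i) for i in range(n_cores)]
        self.fg_cores = cores[:n_reserved_for_fg]
        self.bg_cores = cores[n_reserved_for_fg:]
        self._fg_rr = self._bg_rr = 0

    def submit(self, request, is_background):
        # Distinguishing the two classes is the core idea: background
        # work never lands on a core that serves foreground I/O.
        if is_background:
            core = self.bg_cores[self._bg_rr % len(self.bg_cores)]
            self._bg_rr += 1
        else:
            core = self.fg_cores[self._fg_rr % len(self.fg_cores)]
            self._fg_rr += 1
        core.queue.put(request)
        return core.cid

d = BTSDispatcher(n_cores=8, n_reserved_for_fg=4)
print(d.submit("GET key1", is_background=False))    # lands on core 0
print(d.submit("dedup chunk", is_background=True))  # lands on core 4
```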
Paper
Recorded
Architectures
Networks
TP
Best Paper Finalist
DescriptionHigh-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let data flow from the network to storage targets, reducing overall latency and CPU utilization. Yet, this approach still involves CPUs on the data path to enforce storage policies such as authentication, replication, and erasure coding. We show how storage policies can be offloaded to fully programmable SmartNICs, without involving host CPUs. By using PsPIN, an open-hardware SmartNIC, we show latency improvements for writes (up to 2x), data replication (up to 2x), and erasure coding (up to 2x), when compared to respective CPU- and RDMA-based alternatives.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionIn this session, we’ll paint a picture for removing infrastructure constraints to solve complex computational problems. Imagine agile scalable infrastructure with no fixed assets, and no waiting in the queue to start jobs. We’ll share progress on an extraordinary project using Virtual Flow to do extreme scale screening, and computational drug discovery at scale. Together with academic researchers and partners, we’ve built out a 5-10 billion molecular database to identify targets, using 2.2 million virtual CPUs. Learn how the most vexing societal problems of our generation will be solved through what we at AWS call Impact Computing.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionHigh performance computing has always offered batch computing services, but demand is growing for a wider range of workflow and data services. Container orchestration is a perfect candidate for offering scheduling services for these types of workloads in a similar way. By leveraging container orchestration with Kubernetes, you can build a platform that provides both a service catalog and the ability for users to run their own containerized services directly.
The power of such a platform is being able to stand on the shoulders of giants. This starts with leveraging Kubernetes for container orchestration and running these types of workloads. Next is using the internal Kubernetes’ paradigms with Operators to provide higher level scheduling of specific types of applications to create a service catalog. Third is using the Kubernetes API to tie everything together under a single user experience.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionWhile most deployed networks today use the C-Band, the L-Band has been available for decades and is also deployed on Dispersion Shifted Fiber. Using both (C+L) doubles the capacity per fiber pair, but requires additional equipment to be added to an in-service, traffic-bearing system and yields less-than-optimal performance due to band interaction between separate amplifiers. New C&L-Band systems are being designed with fewer components and provide better performance by lighting the entire spectrum from day one, for a lower cost per bit and superior reach. Hear when, where, and why Verizon is pushing the development of this new technology for its nationwide long-haul network.
Job Posting
DescriptionPassionate about the protection of critical and high value targets? Join our dynamic team and make a difference providing creative solutions to unique national security challenges!
We are seeking R&D Computer Science professionals to join highly productive teams that research and develop innovative solutions to a broad spectrum of problems of national importance. In this role, you will collaborate in an innovative environment to architect, design, develop, test, and deploy modern data processing software for sophisticated, real-time decision support systems!
On any given day, you may be called on to:
Work on a team developing software systems addressing exciting remote sensing problems, including the capture, processing, exploitation, visualization, and distribution of real-time satellite sensor data.
Collaborate with architects, developers, technical leads, customers, and end users to collect requirements, design solutions, and deliver extensible software applications.
Engage with diverse specialists in areas such as data fusion, signal and image processing, analytics, cloud computing, machine learning, modeling and simulation, service-oriented architectures, data management and visualization, and pattern recognition.
Applicants on this posting may be interviewed and hired by multiple organizations within Center 6300.
Due to the nature of the work, the selected applicant must be able to work onsite.
Join our team and achieve your goals while making a difference!
Paper
Recorded
Numerical Algorithms
Scientific Computing
TP
DescriptionThis paper presents the Communication-Avoiding 3D Matrix Multiplication (CA3DMM) algorithm, a simple and novel algorithm with optimal or near-optimal communication cost. CA3DMM is based on a unified view of parallel matrix multiplication that generalizes 1D, 2D, and 3D matrix multiplication algorithms to reduce the data exchange volume for different shapes of input matrices. CA3DMM further minimizes the actual communication costs by carefully organizing its communication patterns. It is much simpler than some other generalized 3D algorithms and does not require low-level optimization. Numerical experiments show that CA3DMM has good parallel scalability and similar or better performance compared to state-of-the-art PGEMM implementations for a wide range of matrix dimensions and numbers of processes.
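As a toy illustration of why the process-grid shape matters, the sketch below uses a simplified per-process communication-volume model (an assumption for illustration, not the paper's exact cost function) and enumerates grid factorizations to pick the cheapest:

```python
def comm_cost(m, n, k, pm, pn, pk):
    # Simplified model: each process owns a brick of A (m x k), B (k x n),
    # and C (m x n); replicating/reducing the three operands costs roughly
    # their brick sizes in words. Illustrative only.
    return m * k / (pm * pk) + k * n / (pn * pk) + m * n / (pm * pn)

def best_grid(m, n, k, p):
    """Enumerate all pm*pn*pk = p factorizations; return the cheapest."""
    best = None
    for pm in range(1, p + 1):
        if p % pm:
            continue
        for pn in range(1, p // pm + 1):
            if (p // pm) % pn:
                continue
            pk = p // (pm * pn)
            c = comm_cost(m, n, k, pm, pn, pk)
            if best is None or c < best[0]:
                best = (c, (pm, pn, pk))
    return best

# Square matrices favor a near-cubic (3D) grid; tall-skinny favors 1D/2D.
print(best_grid(8192, 8192, 8192, 64))   # picks a 4x4x4-like grid
print(best_grid(1_000_000, 64, 64, 64))  # picks a 1D-like grid along m
```

This is the intuition behind the "unified view": 1D, 2D, and 3D algorithms are all points in the same grid-shape space, and the matrix shape determines which point minimizes communication.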
Workshop
Recorded
W
DescriptionThis paper provides an introduction to the CoArray Fortran Framework of Efficient Interfaces to Network Environments (Caffeine), a parallel runtime library built atop the GASNet-EX exascale networking library. Caffeine leverages several non-parallel Fortran features to write type- and rank-agnostic interfaces and corresponding procedure definitions that support parallel Fortran 2018 features, including communication, collective operations, and related services. One major goal is to develop a runtime library that can eventually be considered for adoption by LLVM Flang, enabling that compiler to support the parallel features of Fortran.
The paper describes the motivations behind Caffeine's design and implementation decisions, details the current state of Caffeine's development, and previews future work. We explain how the design and implementation benefit software sustainability by lowering the barrier to user contributions and reducing complexity through Fortran 2018 C-interoperability features, and achieve high performance through a lightweight communication substrate.
Workshop
Recorded
Algorithms
Architectures
Compilers
Computational Science
Exascale Computing
Heterogeneous Systems
Hierarchical Parallelism
Memory Systems
Parallel Programming Languages and Models
Parallel Programming Systems
Resource Management and Scheduling
W
DescriptionWe present the open-source CAMP tool for assessing deep memory hierarchies through performance measurements of synthetic kernels. CAMP provides different access patterns and allows kernels' operational intensities to be varied. We describe the tool's design and implementation, analyse measurements on a compute node of ARCHER2, the UK national supercomputer, and compare them to measurements on a NEXTGenIO compute node. We report results of a strong scaling study of contiguous, strided, and stencil access patterns for various operational intensities, and explore thread placement options and data sizes. The results confirm that bandwidth saturation can be achieved with a relatively small number of threads on AMD Rome, and that underpopulation may be beneficial, as performance drops when the node is fully populated for configurations with lower operational intensity; the effect is less pronounced on the less hierarchical Intel Cascade Lake. Finally, we discuss sub-NUMA-node awareness and directions for extending CAMP.
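The flavor of such a synthetic kernel can be sketched in a few lines of Python (a stand-in for CAMP's native kernels, with a deliberately rough traffic model that ignores cache-line effects): the stride controls the access pattern and the pass count controls the operational intensity.

```python
import numpy as np, time

def kernel(x, passes):
    # Each pass performs 2 flops per element (multiply + add); raising
    # `passes` raises the kernel's operational intensity.
    acc = np.zeros_like(x)
    for _ in range(passes):
        acc += 1.000001 * x
    return acc

n = 1 << 25
for stride in (1, 2, 8):            # contiguous vs. strided access patterns
    x = np.ones(n)[::stride]
    for passes in (1, 16):
        t0 = time.perf_counter()
        kernel(x, passes)
        dt = time.perf_counter() - t0
        flops = 2 * passes * x.size
        oi = flops / x.nbytes        # flops per byte of elements touched
        print(f"stride={stride:2d}  OI={oi:5.2f} flop/B  "
              f"{flops / dt / 1e9:6.2f} GFLOP/s")
```

Low-intensity configurations saturate memory bandwidth with few threads, which is exactly the regime where the paper observes that underpopulating a node can help.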
Paper
Recorded
Cloud and Distributed Computing
TP
DescriptionFunction-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful applications have been migrated to FaaS platforms due to their ease of deployment, scalability, and minimal management overhead. However, failures in FaaS have not been thoroughly investigated, making these otherwise desirable platforms unreliable for guaranteeing function execution and ensuring performance requirements. In this paper, we propose Canary, a highly resilient and fault-tolerant framework for FaaS that mitigates the impact of failures and reduces the overhead of function restart. Canary utilizes replicated container runtimes and application-level checkpoints to reduce application recovery time on FaaS platforms. Our evaluations using representative stateful FaaS applications show that Canary reduces the application recovery time and dollar cost by up to 83% and 12%, respectively, over the default retry-based strategy. Moreover, it improves application availability with an average additional execution time and cost overhead of 14% and 8%, respectively, compared to ideal failure-free execution.
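The checkpointing half of this idea can be sketched as follows (hypothetical file path and helper names, not Canary's actual API; Canary also replicates container runtimes): a stateful handler periodically persists its progress so a failure-triggered restart resumes instead of recomputing.

```python
import json, os

CKPT = "/tmp/canary_demo.ckpt"   # hypothetical checkpoint location

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"i": 0, "acc": 0}

def save_checkpoint(state):
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)        # atomic rename: no torn checkpoints

def handler(n, ckpt_every=1000):
    state = load_checkpoint()    # resume here after a restart
    for i in range(state["i"], n):
        state["acc"] += i        # stand-in for real per-item work
        if (i + 1) % ckpt_every == 0:
            state["i"] = i + 1
            save_checkpoint(state)
    return state["acc"]

print(handler(10_000))
```

The recovery-time savings come from the gap between `ckpt_every` and the full run length: after a crash, only the tail since the last checkpoint is re-executed.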
Invited Talk
Recorded
TP
XO/EX
DescriptionFor decades, Moore’s Law made the economics of specialized chips unattractive because the upfront costs couldn’t be justified when the alternative was fast-improving CPUs. As Moore’s Law fades, however, this is changing. Not only is specialization becoming more economically attractive, but it is now one of the best ways to get performance improvements for many applications. In this talk, I will discuss (1) how the economics of specialization have changed, (2) how specialization is fracturing computing in ways commonly seen in other technologies, and (3) how long we can expect the gains from specialization to make up for the slowdown in Moore’s Law.
Posters
Research Posters
TP
XO/EX
DescriptionQueries on large graphs use stored graph properties to generate responses. Most real-world graphs are dynamic, i.e., the graph topology changes with time, and hence the related graph properties are also time-varying. In such cases, maintaining correct stored graph properties requires recomputation or updates of previous properties. Here, we present CANDY, an efficient framework for updating properties in large dynamic networks. We prove the efficacy of our general framework by applying it to update graph properties such as Single Source Shortest Path (SSSP), Vertex Coloring, and PageRank. Empirically, we show that our shared-memory parallel and NVIDIA GPU-based data-parallel implementations outperform state-of-the-art implementations.
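For one of the listed properties (SSSP), the update-instead-of-recompute idea looks roughly like this (an illustrative sequential Python sketch, not CANDY's parallel implementation): on edge insertion, distances are relaxed only from the affected endpoint rather than from scratch.

```python
import heapq

def sssp(adj, src):
    """Plain Dijkstra over adjacency lists {u: [(v, w), ...]}."""
    dist = {v: float("inf") for v in adj}
    dist[src] = 0
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(pq, (dist[v], v))
    return dist

def insert_edge(adj, dist, u, v, w):
    """Incremental update: if the new edge improves dist[v], propagate
    relaxations from v only; untouched regions keep their distances."""
    adj[u].append((v, w))
    if dist[u] + w < dist[v]:
        dist[v] = dist[u] + w
        pq = [(dist[v], v)]
        while pq:
            d, x = heapq.heappop(pq)
            if d > dist[x]:
                continue
            for y, wy in adj[x]:
                if d + wy < dist[y]:
                    dist[y] = d + wy
                    heapq.heappush(pq, (dist[y], y))

adj = {0: [(1, 4)], 1: [(2, 3)], 2: [], 3: [(2, 1)]}
dist = sssp(adj, 0)
insert_edge(adj, dist, 0, 3, 1)   # new shortcut updates only 3 and 2
print(dist)                       # {0: 0, 1: 4, 2: 2, 3: 1}
```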
Birds of a Feather
TP
XO/EX
DescriptionData centers consume nearly 1% of global electricity demand and contribute 0.3% of all global CO2 emissions, and this is expected to rise without proactive steps. Tempting as it may be to point the finger at big tech, the truth is that users of all sizes have had a hand in the growth of data centers' workloads. How can the HPC community do our part to drive down greenhouse gas emissions without sacrificing the computing power needed to support our mission and services as promised?
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionWe analyze a heart monitoring center for patients wearing electrocardiogram sensors outside hospitals. Such monitoring prevents serious heart damage and increases life expectancy and health-care efficiency. In this paper, we address the problem of providing a scalable infrastructure for the real-time scenario, processing data from at least 10,000 patients simultaneously, and an efficient fast-processing architecture for the postponed scenario, in which patients upload data after measurements have been taken. CardioHPC is a project to realize a simulation of these two scenarios using digital signal processing algorithms and artificial-intelligence-based detection and classification software for automated reporting and alerting.
We elaborate the challenges we met in experimenting with different serverless implementations: 1) container-based on Google Cloud Run, and 2) Function-as-a-Service (FaaS) on AWS Lambda. Experimental results present the effect of overhead in the request and transfer time, and speedup achieved by analyzing the response time and throughput on both container-based and FaaS implementations as serverless workflows.
Students@SC
DescriptionThere are so many unique opportunities in HPC! While many core technical skills are applicable across a wide range of careers, there are also a lot of important differences. This panel brings together representatives from diverse career paths including industry, academia, and research labs. Come learn about the differences and similarities, and gain insight regarding the path that is best for you!
Posters
Research Posters
TP
XO/EX
DescriptionIn this work, we study the performance portability of offloaded lattice Boltzmann kernels and the trade-off between portability and efficiency. The study is based on a proxy application for the lattice Boltzmann method (LBM). The Kokkos performance-portability framework (with CUDA or SYCL backend) is used and compared with the native CUDA and native SYCL programming models. The Kokkos library supports the mainstream GPU products on the market. The performance of the code can vary with the accelerating model, number of GPUs, problem scale, propagation pattern, and architecture. Both the Kokkos library and the CUDA toolkit are studied on the ThetaGPU supercomputer (Argonne Leadership Computing Facility). We find that Kokkos (CUDA) performs almost the same as native CUDA. The automatic data and kernel management in Kokkos may sacrifice efficiency, but Kokkos also exposes parallelization parameters that can be tuned to optimize performance.
Student Cluster Competition
TP
XO/EX
DescriptionWe are proficient in distributed systems and parallel computing, algorithm optimization, computer operating systems, and other knowledge necessary for HPC, and we have participated in a large number of related research projects. Beyond the basic knowledge necessary for supercomputers, our team also has very wide-ranging expertise. Bo-Luo Ge has solid knowledge of computer operation and maintenance, networking, and computer systems. He is a main member of the cluster operation and maintenance team of the CUHKsz supercomputer club. Zi-Fan Liu has deep knowledge of reinforcement learning and deep learning and has also done research on the application of reinforcement learning in Smart Grid. Yi-Liang He has profound compiler-level insights and is excellent at SIMD and RISC-V. Si-Wei Zhang has a unique comprehension of underlying compilation support and has done related research in the CUHKsz laboratory. Bo-Luo Ge, Yi-Liang He, Si-Wei Zhang, and Zi-Fan Liu also participated in the ASC of 2021 and won second prize. Yang-Lin Zhang has solid knowledge of Computer Vision and has done some work in GPU parallel threading. Hao-Nan Xue has been involved in many hardware-related projects.
Beyond professional computer-domain knowledge, our team also has wide non-computer domain knowledge, such as Econometrics, Electricity Grid, Operation Management, Data Mining, etc. The diversity of our directions gives us the advantage of solving large-scale problems in various fields, and the combination of thinking methods from different fields also improves the efficiency of discussion and problem solving within the group.
We are instructed by an outstanding professor, Professor Yeh-Ching Chung. Professor Yeh-Ching Chung established a supercomputing team at National Tsing Hua University before he came to CUHKsz, and led National Tsing Hua University to win the first prize in the final competitions of ASC, ISC, and SC many times. Under the leadership of Professor Yeh-Ching Chung, we participated in the ASC competition of 2021 and won the second prize. We also participated in many parallel optimization-related competitions, such as Intel's PAC to test and improve our skills.
As we know, SCC was developed to provide an immersive high performance computing experience to undergraduate and high school students. As an international platform for students who are interested in HPC, we sincerely hope that we can compete with other groups all over the world to improve our skills and show our ability and knowledge to the world.
Posters
Research Posters
TP
XO/EX
DescriptionThe NERSC Perlmutter HPC system is the most recent large-scale US system that is publicly available. NERSC chose to deploy a first phase of its GPU-based nodes in late 2021 using 2x Slingshot10 connections and has been upgrading them to 4x Slingshot11 connections starting in summer 2022. In this poster we provide benchmark numbers for CGYRO, a popular fusion turbulence simulation tool, comparing the original and the upgraded network setup. CGYRO has previously been shown to be communication-bound on many recent HPC systems, and we show that the upgraded networking provides a significant boost for fusion science.
Exhibitor Forum
Recorded
TP
XO/EX
DescriptionThe HPC landscape is larger, more complex, and more interconnected than ever before. With both cloud HPC and quantum computing entering as disruptors, users face many challenges managing software and data. We discuss some solutions with Covalent, a new open-source Pythonic toolkit for reproducible computational research. We demonstrate using practical examples from classical and quantum machine learning how users can rapidly iterate over hardware and software in order to efficiently identify novel research results. We also go into detail to discuss some of the challenges around quantum-classical interconnects and how hybrid quantum algorithms map to hybrid infrastructure.
Workshop
Recorded
W
DescriptionThis talk will share stories from the CAAR PIConGPU and ECP SOLLVE projects. The stories will present our experiences porting applications from pre-exascale systems to the exascale system Frontier. It will highlight challenges we faced preparing and using relevant software tools, including alpaka and the OpenMP and OpenACC programming models, among others. The talk will also present insights we gathered from profiling and performance-analysis tools. Takeaways will be drawn from both projects to share with the IPDRM community, while at the same time seeking input from the audience so that together we can improve our techniques and approaches.
Workshop
Recorded
W
DescriptionIntroducing undergraduate students to key concepts of distributed computing has become almost essential as the world continues to embrace cloud-based solutions to daily problems and as research continues to grow in scale requiring distributed resources. Although distributed computing is an important part of the computer science curriculum, it can be difficult to introduce at some institutions. We explore some key challenges associated with introducing distributed computing into the computer science curriculum at a small, liberal arts college. We focus on an initial failure introducing a specialized distributed computing course too soon and relay the successes and failures experienced over a one year span of incorporating key distributed computing concepts across multiple systems-level courses. We discuss lessons learned from our first foray into teaching distributed computing and provide recommendations for new adopters of distributed computing curriculum based on our experiences.
Workshop
Recorded
W
DescriptionMPI has been very successful, evolving from a parallel programming model for a single process per core and node to the dominant internode programming model for HPC applications on today's clusters and extreme-scale systems. As MPI approaches its third decade in 2024, what are the challenges to be addressed and changes to be made? This talk will discuss some of the issues facing MPI, with examples from remote memory, I/O, and accelerator-rich nodes.
Birds of a Feather
TP
XO/EX
DescriptionWe envision scientific computing as a key beneficiary of the "deep programmable networks" paradigm, which provides advanced processing capabilities at terabit speeds. Together with high-performance compute nodes, this creates a large distributed system that pushes the performance envelope beyond currently known bounds. Despite holding much promise, the approach is far from mainstream: the key hurdles facing programmable networks lie in building and operating them. This session will benefit the scientific computing, network programming, and operations communities. We intend to have a series of lightning talks followed by a moderated panel discussion, in which the audience can interact with experts and hear their vision.
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionScientific workflow is one of the well-established pillars of large-scale computational science; it has emerged as a torchbearer for formalizing and structuring massive amounts of complex heterogeneous data and accelerating scientific progress. Scientific workflow management systems (SWfMSs) support the automation of repetitive tasks and capture complex analyses through workflows. However, executing workflows is costly and requires significant resources. At different phases of the workflow life cycle, most SWfMSs store provenance information, allowing result reproducibility, sharing, and knowledge reuse in the scientific community. But this provenance information can be many times larger than the workflow and input data, and managing provenance data grows in complexity with large-scale applications. We describe the challenges of managing and reusing provenance in e-science, focusing primarily on scientific workflow approaches by exploring different SWfMSs and provenance management systems, and we investigate ways to overcome these challenges.
Workshop
Recorded
Benchmarking
Cloud and Distributed Computing
Containers
Datacenter
Networks
Privacy
Resource Management and Scheduling
Security
SIGHPC
State of the Practice
System Administration
System Software
W
DescriptionChapter Updates and Closing Remarks
Workshop
Recorded
W
DescriptionAs the scale and complexity of HPC systems keep growing, data compression techniques are often adopted to reduce the data-movement bottleneck. While lossy compression is preferable to lossless compression for its potential to generate high compression ratios, it is not worth the effort unless the right balance between volume reduction and information loss is found. The insight of this paper is that quantifying dominant coefficients at the block level reveals that balance, potentially impacting overall compression ratios. Motivated by this, we characterize three transformation-based lossy compression mechanisms at the block level, using statistical features that capture data characteristics. We build several prediction models using the collected features and the characteristics of dominant coefficients and evaluate the effectiveness of each model using six HPC datasets. Our results demonstrate that the random forest classifier captures the behavior of dominant coefficients precisely, achieving nearly 99% prediction accuracy.
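The shape of that pipeline can be sketched as follows (synthetic blocks and stand-in features, not the paper's datasets or exact feature set): compute cheap block-level statistics, label each block by how many transform coefficients dominate its energy, and train a random forest to predict the label.

```python
import numpy as np
from scipy.fft import dctn
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def block_features(block):
    """Cheap statistics that stand in for the paper's feature set."""
    return [block.mean(), block.std(),
            np.abs(np.diff(block, axis=0)).mean(),
            np.abs(np.diff(block, axis=1)).mean()]

def dominant_count(block, keep=0.95):
    """Number of DCT coefficients needed to hold `keep` of the energy."""
    c = np.sort(np.abs(dctn(block, norm="ortho")).ravel())[::-1] ** 2
    return int(np.searchsorted(np.cumsum(c) / c.sum(), keep) + 1)

X, y = [], []
for _ in range(2000):
    b = rng.normal(size=(8, 8))
    if rng.random() < 0.5:            # smooth blocks compress well
        b = np.cumsum(np.cumsum(b, axis=0), axis=1) / 8
    X.append(block_features(b))
    y.append(dominant_count(b) <= 8)  # label: "few dominant coefficients"

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:1500], y[:1500])
print("held-out accuracy:", clf.score(X[1500:], y[1500:]))
```

The point of the prediction step is that the cheap features are available before compressing, so the model can steer compressor choice without paying for a full transform on every block.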
Paper
Recorded
Post-Moore Computing
Quantum Computing
TP
DescriptionWhen quantum programs are executed on noisy intermediate-scale quantum (NISQ) computers, they experience hardware noise; consequently, the program outputs are often erroneous. To mitigate the adverse effects of hardware noise, it is necessary to understand the effect of hardware noise on the program output and more fundamentally, understand the impact of hardware noise on specific regions within a quantum program. Identifying and optimizing regions that are more noise-sensitive is the key to expanding the capabilities of NISQ computers.
Toward achieving that goal, we propose CHARTER, a novel technique to pinpoint specific gates and regions within a quantum program that are the most affected by the hardware noise and that have the highest impact on the program output. Using CHARTER's methodology, programmers can obtain a precise understanding of how different components of their code affect the output and optimize those components without the need for non-scalable quantum simulation on classical computers.
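A toy illustration of the underlying sensitivity question (pure numpy, not CHARTER's methodology): inject a non-unital noise channel at one gate position at a time in a small density-matrix simulation, and rank positions by the fidelity loss they cause.

```python
import numpy as np

def rx(theta):
    """Single-qubit X-rotation gate."""
    c, s = np.cos(theta / 2), -1j * np.sin(theta / 2)
    return np.array([[c, s], [s, c]])

def amp_damp(rho, g):
    """Amplitude-damping channel (non-unital, so position matters)."""
    K0 = np.array([[1, 0], [0, np.sqrt(1 - g)]])
    K1 = np.array([[0, np.sqrt(g)], [0, 0]])
    return K0 @ rho @ K0.conj().T + K1 @ rho @ K1.conj().T

def run(gates, noisy_at=None, g=0.2):
    rho = np.array([[1, 0], [0, 0]], dtype=complex)   # start in |0><0|
    for i, U in enumerate(gates):
        rho = U @ rho @ U.conj().T
        if i == noisy_at:
            rho = amp_damp(rho, g)
    return rho

gates = [rx(0.3), rx(1.2), rx(0.1)]
ideal = run(gates)                 # noise-free reference (pure state)
for i in range(len(gates)):
    # For a pure ideal state, fidelity = tr(ideal @ noisy).
    fid = np.real(np.trace(ideal @ run(gates, noisy_at=i)))
    print(f"noise after gate {i}: fidelity {fid:.4f}")
```

Positions whose perturbation costs the most fidelity are the natural targets for noise-aware optimization, which is the intuition CHARTER systematizes at program scale.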
Birds of a Feather
TP
XO/EX
DescriptionThe PMIx Standard APIs facilitate interaction between applications, tools, middleware, and system runtimes. PMIx addresses a range of use cases including: application launch and wire-up; inspection, steering, and debugging tools; dynamic application management, fault tolerance, and cross-library coordination; and communication across container boundaries.
We invite all SC attendees to hear about the current version of the PMIx Standard, significant activity in the PMIx Standard working groups, OpenPMIx and PRRTE implementation releases, and broadening adoption of PMIx. We will recap the activities of the past year, showcase community and working group efforts, and discuss the roadmap for the next year.
Workshop
Recorded
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
W
DescriptionConvolutional neural networks (CNNs) are being incorporated into many image-based tasks across a variety of domains. Some of these are real-world safety-critical tasks, such as object detection and lane-line detection for self-driving cars. These applications have strict safety requirements and must be able to guarantee the reliable operation of the network. We propose selectively triplicating the most important parts of the network, as determined via weight-pruning methodologies, in order to maintain a reliable CNN in environments that may be resource-limited.
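A minimal numpy sketch of the selective-triplication idea (illustrative, not the authors' implementation): rank weights with the usual magnitude-pruning criterion, keep three copies of the top fraction, and repair faults by majority vote.

```python
import numpy as np

def protect_top_weights(w, frac=0.1):
    """Triplicate the `frac` largest-magnitude weights (pruning-style rank)."""
    k = max(1, int(frac * w.size))
    idx = np.argsort(np.abs(w).ravel())[-k:]
    return idx, np.stack([w.ravel()[idx]] * 3)   # three redundant copies

def repair(w, idx, copies):
    """Majority vote via median: a single corrupted copy is outvoted."""
    w = w.ravel().copy()
    w[idx] = np.median(copies, axis=0)
    return w

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64))
idx, copies = protect_top_weights(w, frac=0.1)

copies[0, 5] = 1e6                 # simulate a bit-flip in one copy...
w_faulty = w.ravel().copy()
w_faulty[idx[5]] = 1e6             # ...that also hit the live weight
restored = repair(w_faulty, idx, copies)
print(np.allclose(restored, w.ravel()))   # True: the fault is masked
```

Because only the important fraction is triplicated, the memory overhead is roughly 2 * frac of the model instead of the 2x cost of full triple modular redundancy.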
Paper
Recorded
Extreme Scale Computing
Memory Systems
Parallel Programming Systems
State of the Practice
TP
DescriptionThe rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers the possibility of transformational changes in computational science.
The HPL-AI benchmark is designed to measure the performance of mixed-precision arithmetic, as opposed to the HPL benchmark, which measures double-precision performance. Pushing the limits of systems at extreme scale is nontrivial; little public literature explores optimization of mixed-precision computations at this scale.
In this work, we demonstrate how to scale up the HPL-AI benchmark on the pre-exascale Summit and exascale Frontier systems at the Oak Ridge Leadership Computing Facility (OLCF) with a cross-platform design. We present the implementation, performance results, and a guideline of optimization strategies employed for delivering portable performance on both AMD and NVIDIA GPUs at extreme scale.
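The numerical principle HPL-AI exercises, low-precision factorization plus iterative refinement, can be sketched in a few lines (a toy single-node illustration, not the benchmark's distributed implementation):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned system
b = rng.standard_normal(n)

# Factor and solve cheaply in float32...
lu, piv = lu_factor(A.astype(np.float32))
x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

# ...then recover double-precision accuracy by iterative refinement:
# residuals in float64, corrections via the reused float32 factors.
for it in range(5):
    r = b - A @ x
    if np.linalg.norm(r) / np.linalg.norm(b) < 1e-12:
        break
    d = lu_solve((lu, piv), r.astype(np.float32))
    x += d.astype(np.float64)

print(it, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

The factorization, the O(n^3) part, runs entirely in low precision, which is what lets mixed-precision hardware deliver far higher throughput than an HPL-style float64 run.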
Posters
Research Posters
TP
XO/EX
DescriptionApplications of quantum machine learning algorithms are currently still being studied. Recent work suggests that classical gradient descent techniques can effectively train variational quantum circuits. We propose to train quantum variational circuits to find smaller text and image embeddings that preserve contrastive-learning distances based on CLIP large embeddings. This is a critical task since fine-tuning CLIP to produce low-dimensional embeddings is prohibitively expensive. We introduce CLIP-ACQUA, a model trained in a self-supervised configuration from CLIP embeddings to reduce the latent space. We use CLIP-ACQUA on a sizeable unlabelled corpus of text and images to demonstrate its effectiveness. Our experiments show that we can obtain smaller latent spaces that preserve the original embedding distances inferred during contrastive learning. Furthermore, using our model requires no fine-tuning of CLIP, preserving its original robustness and structure. The data used as a demonstration aids in modeling consumer-to-consumer online marketplaces to detect illicit activities.
Workshop
Recorded
Architectures
Data Analytics
Datacenter
Extreme Scale Computing
HPC Community Collaboration
Machine Learning and Artificial Intelligence
Performance
Resource Management and Scheduling
System Software
W
DescriptionIn-person and virtual discussion period covering presentations and position papers.
Workshop
Recorded
AI-HPC Convergence
Extreme Scale Computing
Parallel Programming Languages and Models
Performance
Runtime Systems
W
Workshop
Recorded
Accelerator-based Architectures
Compilers
Dataflow and Tasking
Directive Based Programming
Heterogeneous Systems
Parallel Programming Languages and Models
Runtime Systems
W
DescriptionClosing remarks and awards of the Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022)
Workshop
Recorded
Accelerator-based Architectures
Data Analytics
In Situ Processing
Scientific Computing
Visualization
Workflows
W
Workshop
Recorded
W
DescriptionAs a method to optimize investment in computational resources, cloud bursting is attracting a lot of attention: organizations utilize the cloud computing environment in an on-demand fashion while preserving a minimum amount of on-premise resources for sensitive data processing. Practical cloud bursting requires 1) secure job and data sharing, 2) a uniform job execution environment across on-premise and cloud resources, and 3) on-demand automatic deployment of the execution environment on the cloud. To provide these, we propose a meta-scheduling system called CloudQ. CloudQ 1) uses cloud object storage for data sharing, 2) utilizes container images to provide a uniform job execution environment, and 3) automatically deploys an execution environment on the cloud.
Student Cluster Competition
TP
XO/EX
DescriptionWe are ClusDur, a team of enthusiastic Durham University students who can’t wait to enter the world of HPC! We would love to participate in IndySCC since it is the ideal opportunity to get first insights into supercomputing and gain cluster competition experience. This would be very valuable to us since none of us has previously participated in any cluster competition.
We would do well in IndySCC due to our interdisciplinarity, diversity of skill levels and breadth of technical expertise. ClusDur consists of students from computer science, engineering, mathematics and physics - with three of us being in their very first year of study.
Interdisciplinarity and diversity of skill levels are key to enable our students-teach-students approach and allow us to learn and thrive together. This will help us consolidate and win through difficulties, deadlines and all-nighters of the competition together.
To gain first HPC experience, Harrison, Joseph, and Matthew have already taken an introductory course on the usage of Durham University’s supercomputer Hamilton. Further, Allaida, Jack, Matthew, and Robert have successfully participated in classical hackathons such as DurHack, so they have experience solving challenges as a team under time pressure.
As third-year physics students, Harrison and Robert gained their first scientific computing experience through their computational physics projects. For Harrison, the project allowed him to encounter multiprocessing and apply it to a scientific simulation. Robert used his project as an incentive to gain experience with Arch Linux on a Raspberry Pi. He taught himself system administration skills that are a great asset to the team when it comes to cluster configuration and shell scripting.
Allaida is a first-year Computer Science student. She has already acquired a solid foundation in Python programming and adds experience with machine learning to the team’s skill set thanks to her participation in DurHack. Matthew is a first-year Engineering student and brings domain knowledge and programming experience in Python, C/C++ and MATLAB to the table. Allaida and Matthew aim to further their practical CS skills and gain insights into scientific simulations.
Jack is a first-year Mathematics student, with a strong interest in systems level programming and related industry experience. He is eager to share his software development knowledge with the team. As a second-year computer science student, Joseph has a background and keen interest in hardware optimization and novel computing approaches. Through IndySCC, Jack and Joseph aim to further their knowledge about performance optimization and gain insights into bare metal cloud computing.
Laura conducts research on parallel programming paradigms, especially on task parallelism in molecular dynamics simulations. She participated in the ISC SCC 2015 and aims to share her experience through mentoring. Adam’s research interests include the scheduling behaviour of task-based runtimes and heterogeneous computing. He competed as part of Team Durham in the CIUK SCC 2021 and is keen to mentor ClusDur through their first SCC. Tobias is conducting research on the efficient implementation of multiscale algorithms. He's strongly involved as PI in the UK's exascale programme ExCALIBUR.
Workshop
Recorded
W
DescriptionWith the increasing prevalence of scalable file systems in the context of HPC, the importance of accurate anomaly detection on runtime logs is increasing. As it currently stands, however, many log-based anomaly detection methods encounter challenges when applied to logs from parallel file systems (PFSes) due to the irregularity and ambiguity of their time-based log sequences. To circumvent these problems, this study proposes ClusterLog, a log pre-processing method that clusters the temporal sequence of log keys based on their semantic similarity. By grouping semantically and sentimentally similar logs, it aims to represent log sequences with the smallest number of unique log keys, intending to improve the ability of a downstream sequence-based model to learn the log patterns. The preliminary results indicate not only its effectiveness in reducing the granularity of log sequences without losing important sequence information, but also its generalizability to different file systems' logs.
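The grouping step can be sketched with off-the-shelf components (TF-IDF and k-means here are simple stand-ins; the paper's semantic and sentiment grouping is richer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

log_keys = [
    "failed to write chunk to OST",
    "write to OST failed with EIO",
    "client connected from host",
    "new client connection accepted",
    "slow response on metadata server",
    "metadata server response time exceeded threshold",
]
X = TfidfVectorizer().fit_transform(log_keys)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for key, cid in zip(log_keys, labels):
    print(cid, key)
# A raw sequence of per-key IDs is then rewritten as the much smaller
# sequence of cluster IDs before it is fed to the anomaly detector.
```

Shrinking the key vocabulary this way is what lets a downstream sequence model generalize across PFS logs whose raw keys differ but whose semantics align.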
Workshop
Recorded
Cloud and Distributed Computing
In Situ Processing
Scientific Computing
Workflows
W
DescriptionMolecular dynamics (MD) simulations are widely used to study large-scale molecular systems. However, reaching the timescale necessary to detect rare processes is challenging, even with modern supercomputers. To overcome the timescale limitation, the simulation of one long MD trajectory is replaced by multiple short-range simulations executed simultaneously in an ensemble. Analyses are usually co-scheduled with these simulations to efficiently process large volumes of data in situ. Executing a workflow ensemble of simulations and their in situ analyses requires sophisticated management of computational resources so that they do not slow each other down. In this paper, we propose an efficient method to co-schedule and allocate resources for a workflow ensemble such that the makespan is minimized. We evaluate the proposed approach using an accurate simulator based on the WRENCH simulation framework. Results demonstrate the significance of co-scheduling simulations and in situ analyses that couple data together to benefit from data locality.
Paper
Recorded
Machine Learning and Artificial Intelligence
TP
DescriptionGraph neural networks (GNNs) suffer from low GPU utilization due to frequent memory accesses. Existing concurrent training mechanisms cannot be directly adapted to GNNs because they fail to consider the impact of input irregularity, which requires pre-profiling the memory footprint of concurrent tasks based on input dimensions to ensure successful co-location on a GPU. Moreover, the massive numbers of training tasks generated by scenarios such as hyper-parameter tuning require flexible scheduling strategies. To address these problems, we propose CoGNN, which enables efficient management of GNN training tasks on GPUs. Specifically, CoGNN organizes the tasks in a queue and estimates the memory consumption of each task using per-operator cost functions. In addition, CoGNN implements scheduling policies to generate task groups, which are iteratively submitted for execution. The experimental results show that CoGNN achieves shorter completion and queuing times for training tasks from diverse GNN models.
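A minimal sketch of that queue-plus-cost-function design (the linear cost model below is a made-up stand-in for CoGNN's calibrated per-operator functions): estimate each task's GPU memory from its input dimensions, then greedily pack tasks into co-location groups that fit the device.

```python
def est_mem_gb(n_nodes, n_edges, feat_dim, hidden):
    # Made-up linear-in-inputs cost model: activations, messages,
    # weights, plus a constant for the runtime context.
    return (n_nodes * (feat_dim + hidden) * 4 +   # node activations
            n_edges * hidden * 4 +                # edge messages
            feat_dim * hidden * 4) / 1e9 + 0.5    # weights + context

def make_groups(tasks, gpu_mem_gb=40.0):
    """Greedy first-fit packing: close a group when the next task
    would overflow the device memory budget."""
    groups, cur, used = [], [], 0.0
    for t in sorted(tasks, key=lambda t: -t["mem"]):  # big tasks first
        if used + t["mem"] > gpu_mem_gb and cur:
            groups.append(cur)
            cur, used = [], 0.0
        cur.append(t)
        used += t["mem"]
    if cur:
        groups.append(cur)
    return groups

tasks = [{"name": f"trial{i}",
          "mem": est_mem_gb(n_nodes=2**20, n_edges=2**23,
                            feat_dim=128, hidden=64 * (i + 1))}
         for i in range(6)]
for g in make_groups(tasks):
    print([t["name"] for t in g], f"{sum(t['mem'] for t in g):.1f} GB")
```

Estimating memory before launch is what makes co-location safe: a group is only submitted if the model says its members fit together, avoiding mid-training out-of-memory failures.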