Front Matter
Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters
- Qinghua Zhou,
- Pouya Kousha,
- Quentin Anthony,
- Kawthar Shafie Khorassani,
- Aamir Shafi,
- Hari Subramoni,
- Dhabaleswar K. Panda
As more High-Performance Computing (HPC) and Deep Learning (DL) applications are adapting to scale using GPUs, the communication of GPU-resident data is becoming vital to end-to-end application performance. Among the available MPI operations in ...
NVIDIA’s Quantum InfiniBand Network Congestion Control Technology and Its Impact on Application Performance
Applications running on large scale systems often suffer from degraded performance and lack of reproducible run-times due to network-level congestion, whether caused by the application network traffic itself, or by unrelated background network ...
SU3_Bench on a Programmable Integrated Unified Memory Architecture (PIUMA) and How that Differs from Standard NUMA CPUs
SU3_Bench explores performance portability across multiple programming models using a simple but nontrivial mathematical kernel. This kernel has been derived from the lattice quantum chromodynamics (LQCD) code used in applications ...
“Hey CAI” - Conversational AI Enabled User Interface for HPC Tools
- Pouya Kousha,
- Arpan Jain,
- Ayyappa Kolli,
- Saisree Miriyala,
- Prasanna Sainath,
- Hari Subramoni,
- Aamir Shafi,
- Dhabaleswar K. Panda
HPC system users depend on profiling and analysis tools to obtain insights into the performance of their applications and tweak them. The complexity of modern HPC systems has necessitated advances in the associated HPC tools, making them equally ...
Dynamic Task Fusion for a Block-Structured Finite Volume Solver over a Dynamically Adaptive Mesh with Local Time Stepping
Load balancing of generic wave equation solvers over dynamically adaptive meshes with local time stepping is difficult, as the load changes with every time step. Task-based programming promises to mitigate the load balancing problem. We study a ...
Accelerating Simulated Quantum Annealing with GPU and Tensor Cores
Inspired by quantum annealing, simulated quantum annealing (SQA) mimics quantum tunneling effects on classical computers to perform annealing through a path-integral Monte Carlo simulation, which increases the potential to find the global optima ...
m-Cubes: An Efficient and Portable Implementation of Multi-dimensional Integration for GPUs
The task of multi-dimensional numerical integration is frequently encountered in physics and other scientific fields, e.g., in modeling the effects of systematic uncertainties in physical systems and in Bayesian parameter estimation. Multi-...
Comparative Evaluation of Call Graph Generation by Profiling Tools
Call graphs generated by profiling tools are critical to dissecting the performance of parallel programs. Although many mature and sophisticated profiling tools record call graph data, each tool is different in its runtime overheads, memory ...
MAPredict: Static Analysis Driven Memory Access Prediction Framework for Modern CPUs
Application memory access patterns are crucial in deciding how much traffic is served by the cache and forwarded to the dynamic random-access memory (DRAM). However, predicting such memory traffic is difficult because of the interplay of ...
Understanding Distributed Deep Learning Performance by Correlating HPC and Machine Learning Measurements
Frameworks for Distributed Deep Learning (DDL) have become popular alternatives to distribute training by adding a few lines of code to a single-node script. From a High-Performance Computing (HPC) perspective, traditional profiling tools for ...
A Motivating Case Study on Code Variant Selection by Reinforcement Learning
In this paper, we investigate the applicability of reinforcement learning as a possible approach to select code variants. Our approach is based on the observation that code variants are usually convertible between one another by code ...
Remote OpenMP Offloading
OpenMP has a long and successful history in parallel programming for CPUs. Since the introduction of accelerator offloading, it has evolved into a promising candidate for all intra-node parallel computing needs. While this addition broke with the ...
Hybrid Parallel ILU Preconditioner in Linear Solver Library GaspiLS
Krylov subspace solvers such as GMRES and preconditioners such as incomplete LU (ILU) are the most commonly used methods to solve general-purpose, large-scale linear systems in simulations efficiently. Parallel Krylov subspace solvers and ...
A Subset of the CERN Virtual Machine File System: Fast Delivering of Complex Software Stacks for Supercomputing Resources
Delivering a reproducible environment along with complex and up-to-date software stacks on thousands of distributed and heterogeneous worker nodes is a critical task. The CernVM-File System (CVMFS) has been designed to help various communities to ...
Correction to: “Hey CAI” - Conversational AI Enabled User Interface for HPC Tools
- Pouya Kousha,
- Arpan Jain,
- Ayyappa Kolli,
- Saisree Miriyala,
- Prasanna Sainath,
- Hari Subramoni,
- Aamir Shafi,
- Dhabaleswar K. Panda
In an older version of this paper, the name of the fourth author was missing. This has been corrected.
Index Terms
- High Performance Computing: 37th International Conference, ISC High Performance 2022, Hamburg, Germany, May 29 – June 2, 2022, Proceedings