BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks
Popular big data frameworks commonly run atop the Java Virtual Machine (JVM) and rely on the garbage collection (GC) mechanism to automatically allocate and reclaim in-memory objects. Existing garbage collectors are designed based on the hypothesis that most objects ...
SEED: Speculative Security Metadata Updates for Low-Latency Secure Memory
Securing systems’ main memory is important for building trusted data centers. To ensure memory security, encryption and integrity verification techniques update the security metadata (e.g., encryption counters and integrity trees) during memory data ...
A Lock-free RDMA-friendly Index in CPU-parsimonious Environments
In CPU-parsimonious environments, such as disaggregated memory systems, the limited CPU power on the memory side constrains the number of operations that can be performed. Thus, reducing CPU usage and enhancing concurrency performance are critical for indexing ...
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language
Pipeline parallelism is a crucial technique for large-scale model training, enabling parameter splitting and performance enhancement. However, creating effective pipeline schedules often requires significant manual effort and coding skills, leading to ...
SRSparse: Generating Codes for High-Performance Sparse Matrix-Vector Semiring Computations
Sparse matrix-vector semiring computation is a key operation in sparse matrix computations, with performance strongly dependent on both program design and the features of the sparse matrices. Given the diversity of sparse matrices, designing a tailored ...
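To make the notion of a sparse matrix-vector semiring computation concrete, here is a minimal CSR-based sketch parameterized by the semiring's add, multiply, and identity element. The function and parameter names are illustrative assumptions, not SRSparse's generated code:

```python
# CSR sparse matrix-vector product over a generic semiring (add, mul, zero).
# Illustrative sketch only; SRSparse generates specialized high-performance
# code per matrix, whereas this is the generic reference loop.

def spmv_semiring(indptr, indices, vals, x, add, mul, zero):
    """y[i] = add-reduction over stored entries j of mul(A[i][j], x[j])."""
    n = len(indptr) - 1
    y = [zero] * n
    for i in range(n):
        acc = zero
        for k in range(indptr[i], indptr[i + 1]):  # nonzeros of row i
            acc = add(acc, mul(vals[k], x[indices[k]]))
        y[i] = acc
    return y
```

With (+, *, 0) this is ordinary SpMV; with (min, +, inf) the same loop performs one relaxation step of single-source shortest paths — this generality is what makes semiring SpMV a key building block of sparse matrix computations.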
PANDA: Adaptive Prefetching and Decentralized Scheduling for Dataflow Architectures
Dataflow architectures are considered promising, offering a commendable balance of performance, efficiency, and flexibility. Many prior works have been proposed to improve the performance of dataflow architectures. Nevertheless, these ...
TSN Cache: Exploiting Data Localities in Graph Computing Applications
This paper finds that vertices in the same graph exhibit differing reusability in graph processing, and that high-reuse and low-reuse vertices are stored together. These phenomena leave existing GPU architectures unable to capture the reusability ...
Overlapping Aware Data Placement Optimizations for LSM Tree-Based Store on ZNS SSDs
Solid State Drives (SSDs) based on the NVMe Zoned Namespaces (ZNS) interface can notably reduce the costs of address mapping, garbage collection, and over-provisioning by dividing the storage space into multiple zones for sequential writes and random ...
ODGS: Dependency-Aware Scheduling for High-Level Synthesis with Graph Neural Network and Reinforcement Learning
Scheduling determines the execution order and timing of operations in a program. The order is governed by operation dependencies, including data and resource dependencies. Data dependencies are intrinsic to a program, reflecting its operation data flow. Resource ...
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication
Multiplication plays a critical role in SRAM-based Computing-in-Memory (CIM) architectures. However, current SRAM-based CIMs face three major limitations. First, they do not fully exploit bit-level sparsity, resulting in unnecessary overhead in both ...
LitTLS: Lightweight Thread-Level Speculation on Little Cores
Thread-Level Speculation (TLS) utilizes speculative parallelization to accelerate hard-to-parallelize serial codes on multi-cores. As the heterogeneous multi-core architecture is becoming ubiquitous, it presents an opportunity for TLS to reorganize little ...
FusionFS: A Contention-Resilient File System for Persistent CPU Caches
Byte-addressable storage (BAS), such as persistent memory and CXL-SSDs, does not meet system designers’ expectations for data flushing and access granularity. Persistent CPU caches, enabled by recent techniques like Intel’s eADR and CXL’s Global ...
Sniper: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis
MPI tracing tools are essential for collecting the communication events and performance metrics of large-scale programs for further performance analysis and optimization. However, towards the exascale era, the performance and storage overhead for tracing ...
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework
- Changqing Shi,
- Yufei Sun,
- Rui Chen,
- Jiahao Wang,
- Qiang Guo,
- Chunye Gong,
- Yicheng Sui,
- Yutong Jin,
- Yuzhi Zhang
With the rising demand for computational power and the increasing variety of computational scenarios, considerable interest has emerged in transforming existing CUDA programs into more general-purpose OpenCL programs, enabling them to run across diverse ...
CesASMe and Staticdeps: static detection of memory-carried dependencies for code analyzers
A variety of code analyzers, such as IACA, uiCA, llvm-mca or Ithemal, strive to statically predict the throughput of a computation kernel. Each analyzer is based on its own simplified CPU model reasoning at the scale of a basic block. Facing this ...
Comprehensive Evaluation and Opportunity Discovery for Deterministic Concurrency Control
Deterministic concurrency control (DCC) guarantees that the same input transactions produce the same serializable result. It offers benefits in both distributed databases and blockchain systems. Dozens of DCC algorithms have emerged in the past decade. ...
ARACHNE: Optimizing Distributed Parallel Applications with Reduced Inter-Process Communication
In high-performance computing (HPC), parallelization is essential for improving computational efficiency as data and computation scales exceed single-node capacity. Existing methods, such as the polyhedral model used in Pluto-Distmem, focus on loop and ...
Ceiba: An Efficient and Scalable DNN Scheduler for Spatial Accelerators
Spatial accelerators are domain-specific architectures designed to elevate performance and energy efficiency for deep neural networks (DNNs). They also introduce a large number of schedule parameters that determine the computation and data movement patterns of DNNs. ...
ScaWL: Scaling k-WL (Weisfeiler-Lehman) Algorithms in Memory and Performance on Shared and Distributed-Memory Systems
- Coby Soss,
- Aravind Sukumaran Rajam,
- Janet Layne,
- Edoardo Serra,
- Mahantesh Halappanavar,
- Assefaw H. Gebremedhin
The k-dimensional Weisfeiler-Leman (k-WL) algorithm—developed as an efficient heuristic for testing if two graphs are isomorphic—is a fundamental kernel for node embedding in the emerging field of graph neural networks. Unfortunately, the k-WL algorithm ...
Dynamic Power Management Through Multi-agent Deep Reinforcement Learning for Heterogeneous Systems
Power management and optimization play a significant role in modern computer systems, from battery-powered devices to servers running in data centres. Existing approaches for power capping fail to meet the requirements presented by dynamic workloads, and ...
VersaTile: Flexible Tiled Architectures via Associative Processors
As modern applications demand more data, processing-in-memory (PIM) architectures have emerged to address the challenges of data movement and parallelism. In this paper, we propose VersaTile, a heterogeneous, fully CMOS-based tiled architecture that ...
Exploiting Dynamic Regular Patterns in Irregular Programs for Efficient Vectorization
Modern optimizing compilers are able to exploit memory access or computation patterns to generate vectorized codes. However, such patterns in irregular programs are unknown until runtime due to the input dependence. Thus, either compiler’s static ...
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs
The Iterative Closest Points (ICP) algorithm is the most widely used method for estimating rigid transformation in 3D point cloud registration. However, the ICP relies on repeatedly performing computationally intensive nearest neighbor searches (NNS) ...
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs
- Xueying Wang,
- Shigang Li,
- Hao Qian,
- Fan Luo,
- Zhaoyang Hao,
- Tong Wu,
- Ruiyuan Xu,
- Huimin Cui,
- Xiaobing Feng,
- Guangli Li,
- Jingling Xue
Convolutional Neural Networks (CNNs) are fundamental to advancing computer vision technologies. As CNNs become more complex and larger, optimizing model inference remains a critical challenge in both industry and academia. On modern GPU platforms, CNN ...
Maximizing Data and Hardware Reuse for HLS with Early-Stage Symbolic Partitioning
While traditional HLS (High-Level Synthesis) converts “high-level” C-like programs into hardware automatically, producing high-performance designs still requires hardware expertise. Optimizations such as data partitioning can have a large impact on ...
LIA: Latency-Improved Adaptive routing for Dragonfly networks
Low-diameter network topologies require non-minimal routing, such as Valiant routing, to avoid network congestion under challenging traffic patterns like the so-called adversarial ones. However, this mechanism tends to increase the average path length, base ...
Unveiling and Evaluating Vulnerabilities in Branch Predictors via a Three-Step Modeling Methodology
With the emergence and proliferation of microarchitectural attacks targeting branch predictors, the once-established security boundary in computer systems and architectures is facing unprecedented challenges. This paper introduces an innovative branch ...
KINDRED: Heterogeneous Split-Lock Architecture for Safe Autonomous Machines
With the increasing practicality of autonomous vehicles and drones, the importance of reliability requirements has escalated substantially. In many instances, traditional system designs tend to overlook reliability issues, focusing primarily on ...
Taming Flexible Job Packing in Deep Learning Training Clusters
Job packing is an effective technique for harvesting idle resources that are allocated to deep learning (DL) training jobs but not fully utilized, especially when clusters experience low utilization and users overestimate their resource needs. However,...
Enhancing High-Throughput GPU Random Walks Through Multi-Task Concurrency Orchestration
Random walk is a powerful tool for large-scale graph learning, but its high computational demand presents a challenge. While GPUs can accelerate random walk tasks, current frameworks fail to fully utilize GPU parallelism due to memory-to-compute bandwidth ...
gHyPart: GPU-friendly End-to-End Hypergraph Partitioner
Hypergraph partitioning finds practical applications in various fields, such as high-performance computing and circuit partitioning in VLSI physical design, where high-performance solutions often demand substantial parallelism beyond what existing CPU-...
IBing: An Efficient Interleaved Bidirectional Ring All-Reduce Algorithm for Gradient Synchronization
Ring all-reduce is currently the most commonly used collective communication technique in the fields of data parallel and distributed computing. It consists of three phases: communication establishment, data transmission, and data processing at each step. ...
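The three phases noted above can be illustrated with a minimal single-process simulation of the classic ring all-reduce schedule. The function name and data layout here are illustrative assumptions for exposition, not IBing's actual interface:

```python
# Minimal single-process simulation of ring all-reduce over P nodes,
# each holding P chunks of data. Hypothetical sketch of the classic
# algorithm; not IBing's interleaved bidirectional variant.

def ring_allreduce(data):
    """data[p][c]: value of chunk c on node p (len(data[p]) == len(data)).
    Afterwards, every node holds the per-chunk sum across all nodes."""
    P = len(data)

    # Phase 1: reduce-scatter. In step s, node p sends chunk (p - s) mod P
    # to its ring neighbor, which accumulates it. After P-1 steps, node p
    # holds the fully reduced chunk (p + 1) mod P.
    for s in range(P - 1):
        sends = [(p, (p - s) % P, data[p][(p - s) % P]) for p in range(P)]
        for p, c, val in sends:  # apply all of this step's transfers at once
            data[(p + 1) % P][c] += val

    # Phase 2: all-gather. Completed chunks circulate around the ring for
    # another P-1 steps until every node has every reduced chunk.
    for s in range(P - 1):
        sends = [(p, (p + 1 - s) % P, data[p][(p + 1 - s) % P]) for p in range(P)]
        for p, c, val in sends:
            data[(p + 1) % P][c] = val
    return data
```

Each of the 2(P-1) steps moves only 1/P of the data per link, which is why the ring schedule is bandwidth-optimal even though its step count grows with the ring size.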
gCom: Fine-grained Compressors in Graphics Memory of Mobile GPU
Nowadays, GPUs significantly boost rendering performance. However, the high memory requirements limit their use, especially on low-end mobile platforms. Compression techniques have been widely adopted to reduce memory consumption but face two primary ...
Bubble-Swap Flow Control
Deadlock-free adaptive routing is extensively adopted in both on-chip and off-chip interconnection networks to improve communication bandwidth and reduce latency. Introducing virtual channels (VCs), also known as virtual lanes (VLs), is the mainstream ...
Flexible and Effective Object Tiering for Heterogeneous Memory Systems
Computing platforms that package multiple types of memory, each with their own performance characteristics, are quickly becoming mainstream. To operate efficiently, heterogeneous memory architectures require new data management solutions that are able to ...
COVER: Alleviating Crash-Consistency Error Amplification in Secure Persistent Memory Systems
Data security (including confidentiality, integrity, and availability) and crash consistency guarantees are essential for building trusted persistent memory (PM) systems. Security and consistency metadata are added to enable the guarantees. Recent studies ...
MasterPlan: A Reinforcement Learning Based Scheduler for Archive Storage
With the sheer volume of data in today’s world, archive storage systems play a significant role in persisting cold data. Due to stringent cost concerns, one popular design is to organize disks into groups and periodically switch them to be powered on ...
AIS: An Active Idleness I/O Scheduler to Reduce Buffer-Exhausted Degradation of Solid-State Drives
Modern solid-state drives (SSDs) continue to boost storage density and I/O bandwidth at the cost of flash-access I/O latency, especially for writes, and hence prevalently deploy a built-in buffer to absorb incoming writes. However, when the buffer is used up, ...
TPRepair: Tree-Based Pipelined Repair in Clustered Storage Systems
- Jiahui Yang,
- Fulin Nan,
- Zhirong Shen,
- Jiwu Shu,
- Zhisheng Chen,
- Xiaoli Wang,
- Quanqing Xu,
- Chuanhui Yang,
- Dmitrii Kaplun,
- Yuhui Cai
Erasure coding is an effective technique for guaranteeing data reliability for storage systems, yet it incurs a high repair penalty with amplified repair traffic. The repair becomes more intricate in clustered storage systems with the bandwidth diversity ...
PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration
- Long Zheng,
- Bing Zhu,
- Pengcheng Yao,
- Yuhang Zhou,
- Chengao Pan,
- Wenju Zhao,
- Xiaofei Liao,
- Hai Jin,
- Jingling Xue
Graph processing is pivotal in deriving insights from complex data structures but faces performance limitations due to the irregular nature of graphs. Traditional general-purpose processors often struggle with low instruction-level parallelism and energy ...
GCNTrain+: A Versatile and Efficient Accelerator for Graph Convolutional Neural Network Training
Recently, graph convolutional networks (GCNs) have gained wide attention due to their ability to capture node relationships in graphs. A problem arises when a full-batch GCN is trained on large graph datasets, where the computational and memory ...
exZNS: Extending Zoned Namespace to Support Byte-loggable Zones
Emerging Zoned Namespace (ZNS) provides hosts with fine-grained, performance-predictable storage management. ZNS organizes the address space into zones composed of fixed-size, sequentially written, non-overwritable blocks, making it suitable for log-...
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion
Graph Neural Networks (GNNs) have achieved remarkable successes in various graph-based learning tasks, thanks to their ability to leverage advanced GPUs. However, GNNs currently face challenges arising from the concurrent use of advanced Tensor Cores (TCs)...
ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
Due to the limited GPU memory, the performance of large DNN training is constrained by the unscalable batch size. Existing research partially addresses the issue of the GPU memory limit through tensor recomputation and swapping, but overlooks the exploration ...
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel
The Sparse General Matrix-Matrix multiplication (SpGEMM) is a fundamental component for many applications, such as algebraic multigrid methods (AMG), graphic processing, and deep learning. However, the unbearable latency of computing high-dimensional, ...
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing
Neural architecture search (NAS) for edge devices is often time-consuming because of long-latency deploying and testing on edge devices. The ability to accurately predict the computation cost and memory requirement for convolutional neural networks (CNNs) ...
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search
Today, real-time search over big microblogging data requires low indexing and query latency. Online services, therefore, prefer to host inverted indices in memory. Unfortunately, as datasets grow, indices grow proportionally, and with limited DRAM scaling,...
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory
The CXL (Compute Express Link) technology is an emerging memory interface with high-level commands. Recent studies have applied CXL memory expansion to mitigate the capacity limitation of conventional DDRx memory. Unlike the prior studies to ...
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation
Binary translation enables transparent execution, analysis, and modification of the binary program, serving as a core technology that facilitates instruction set emulation, cross-platform compatibility of software, and program instrumentation. Handling ...
Characterizing and Understanding HGNN Training on GPUs
Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Prior to ...
A High Scalability Memory NoC with Shared-Inside Hierarchical-Groupings for Triplet-Based Many-Core Architecture
Innovative processor architecture designs are shifting towards Many-Core Architectures (MCAs) to meet the future demands of high-performance computing as the limits of Moore’s Law have almost been reached. Many-core processors utilize shared memory ...
An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing
- Jin Zhao,
- Yu Zhang,
- Donghao He,
- Qikun Li,
- Weihang Yin,
- Hui Yu,
- Hao Qi,
- Xiaofei Liao,
- Hai Jin,
- Haikun Liu,
- Linchen Yu,
- Zhang Zhan
Graph processing has become a central concern for many real-world applications and is well-known for its low compute-to-communication ratios and poor data locality. By integrating computing logic into memory, resistive random access memory (ReRAM) tackles ...
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer
The fast development of biomolecular structure determination has enabled the fine-grained study of objects in the micro-world, such as proteins and RNAs, benefiting the world. However, as the computational algorithms are constantly developed, the ...
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers
Pointers are an integral part of C and other programming languages. They enable substantial flexibility from the programmer’s standpoint, allowing the user fine, unmediated control over data access patterns. However, accesses done through pointers are ...
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching
- Yingshuai Dong,
- Chencheng Ye,
- Haikun Liu,
- Liting Tang,
- Xiaofei Liao,
- Hai Jin,
- Cheng Chen,
- Yanjiang Li,
- Yi Wang
Queries on linked data structures, such as trees and graphs, often suffer from frequent cache misses and significant performance loss due to dependent and random pointer-chasing memory accesses. In this paper, we propose a software-hardware co-designed ...
Conflict Management in Vector Register Files
The instruction set architecture (ISA) of vector processors operates on vectors stored in the vector register file (VRF), which must handle several concurrent accesses by functional units (FUs) through multiple ports. When the vector processor is running ...
Shining Light on the Inter-procedural Code Obfuscation: Keep Pace with Progress in Binary Diffing
- Peihua Zhang,
- Chenggang Wu,
- Hanzhi Hu,
- Lichen Jia,
- Mingfan Peng,
- Jiali Xu,
- Mengyao Xie,
- Yuanming Lai,
- Yan Kang,
- Zhe Wang
Software obfuscation techniques have lost their effectiveness due to the rapid development of binary diffing techniques, which can achieve accurate function matching and identification. In this paper, we propose a new inter-procedural code obfuscation ...
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster
Parallelizing CNN inference on heterogeneous edge clusters with data parallelism has gained popularity as a way to meet real-time requirements without sacrificing model accuracy. However, existing algorithms struggle to find optimal parallel granularity ...
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing
In recent years, deploying deep learning models on edge devices has become pervasive, driven by the increasing demand for intelligent edge computing solutions across various industries. From industrial automation to intelligent surveillance and healthcare,...
Multiple Function Merging for Code Size Reduction
Resource-constrained environments, such as embedded devices, have limited amounts of memory and storage. Practical programming languages such as C++ and Rust tend to output multiple similar functions by monomorphizing polymorphic functions. An ...
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power
Mobile devices need to respond quickly to diverse user inputs. Existing approaches often heuristically raise the CPU/GPU frequency according to empirical rules when facing burst inputs and various changes. Although doing so can be effective ...
DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration
Deep Neural Networks (DNNs) are very computationally demanding, which presents a significant barrier to their deployment, especially on resource-constrained devices. Significant work from both the machine learning and computing systems communities has ...
GraphService: Topology-aware Constructor for Large-scale Graph Applications
Graph-based services are becoming integrated into everyday life through graph applications and graph learning systems. While traditional graph processing approaches boast excellent throughput with millisecond-level processing time, the construction phase ...
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration
DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small-but-fast regions to cache frequently accessed data, thereby reducing the average latency. ...