TPRepair: Tree-Based Pipelined Repair in Clustered Storage Systems
- Jiahui Yang,
- Fulin Nan,
- Zhirong Shen,
- Jiwu Shu,
- Zhisheng Chen,
- Xiaoli Wang,
- Quanqing Xu,
- Chuanhui Yang,
- Dmitrii Kaplun,
- Yuhui Cai
Erasure coding is an effective technique for guaranteeing data reliability for storage systems, yet it incurs a high repair penalty with amplified repair traffic. The repair becomes more intricate in clustered storage systems with the bandwidth diversity ...
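As a concrete illustration of the amplified repair traffic the abstract refers to: in an erasure-coded stripe, rebuilding one lost block requires reading several surviving blocks. The sketch below uses simple XOR parity over k data blocks, with made-up block sizes; it is illustrative only and is not the paper's TPRepair scheme.

```python
# Minimal sketch of repair amplification in erasure-coded storage.
# Illustrative XOR parity (k data blocks + 1 parity block); block
# sizes and layout here are invented for the example.
from functools import reduce

K = 4                      # data blocks per stripe
BLOCK = 16                 # bytes per block (toy size)

data = [bytes([i]) * BLOCK for i in range(K)]
parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data)

# Lose one data block; repair must read the K remaining stripe blocks.
lost = 2
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
repaired = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)

assert repaired == data[lost]
traffic = sum(len(b) for b in survivors)   # bytes read to rebuild one block
print(traffic // BLOCK)                    # K blocks read to rebuild 1
```

The K-fold read amplification shown here is what repair-optimizing schemes try to route and pipeline efficiently across clusters.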
PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration
- Long Zheng,
- Bing Zhu,
- Pengcheng Yao,
- Yuhang Zhou,
- Chengao Pan,
- Wenju Zhao,
- Xiaofei Liao,
- Hai Jin,
- Jingling Xue
Graph processing is pivotal in deriving insights from complex data structures but faces performance limitations due to the irregular nature of graphs. Traditional general-purpose processors often struggle with low instruction-level parallelism and energy ...
GCNTrain+: A Versatile and Efficient Accelerator for Graph Convolutional Neural Network Training
Recently, graph convolutional networks (GCNs) have gained wide attention due to their ability to capture node relationships in graphs. A problem arises when a full-batch GCN is trained on large graph datasets, where the computational and memory ...
exZNS: Extending Zoned Namespace to Support Byte-loggable Zones
Emerging Zoned Namespace (ZNS) provides hosts with fine-grained, performance-predictable storage management. ZNS organizes the address space into zones composed of fixed-size, sequentially written, non-overwritable blocks, making it suitable for log-...
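The zone semantics the abstract describes can be sketched as a toy model: fixed-size zones that must be written sequentially at a write pointer, cannot be overwritten in place, and are reclaimed only by a whole-zone reset. This is illustrative only, not the exZNS byte-loggable extension and not a real NVMe ZNS interface.

```python
# Toy model of ZNS zone semantics: sequential-only writes at a write
# pointer, no in-place overwrite, whole-zone reset to reclaim space.
class Zone:
    def __init__(self, size):
        self.size = size
        self.write_pointer = 0          # next writable offset
        self.data = bytearray(size)

    def append(self, buf):
        """Writes are only legal at the write pointer (sequential)."""
        if self.write_pointer + len(buf) > self.size:
            raise IOError("zone full")
        self.data[self.write_pointer:self.write_pointer + len(buf)] = buf
        self.write_pointer += len(buf)

    def read(self, off, n):
        return bytes(self.data[off:off + n])

    def reset(self):
        """The only way to make written space writable again."""
        self.write_pointer = 0

z = Zone(4096)
z.append(b"log-entry-1")
z.append(b"log-entry-2")
print(z.read(0, 11))                    # b'log-entry-1'
```

The append-only discipline is why ZNS maps naturally onto log-structured designs, and why supporting finer-grained (byte-level) logging requires extending the interface.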
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion
Graph Neural Networks (GNNs) have achieved remarkable successes in various graph-based learning tasks, thanks to their ability to leverage advanced GPUs. However, GNNs currently face challenges arising from the concurrent use of advanced Tensor Cores (TCs)...
ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
Due to limited GPU memory, the performance of large-DNN training is constrained by an unscalable batch size. Existing research partially addresses the GPU memory limit through tensor recomputation and swapping, but overlooks the exploration ...
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel
The Sparse General Matrix-Matrix multiplication (SpGEMM) is a fundamental component for many applications, such as algebraic multigrid methods (AMG), graphic processing, and deep learning. However, the unbearable latency of computing high-dimensional, ...
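For readers unfamiliar with the kernel, here is the SpGEMM computation in its simplest sequential form, a row-wise (Gustavson-style) multiply over dict-of-dicts sparse matrices. This sketch is ours; the paper's ApSpGEMM builds heterogeneous collaboration and adaptive panels on top of kernels like this.

```python
# Row-wise (Gustavson-style) SpGEMM on {row: {col: value}} sparse maps.
def spgemm(A, B):
    """C = A @ B for sparse matrices stored as dict-of-dicts."""
    C = {}
    for i, row in A.items():
        acc = {}                        # accumulator for output row i
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 2, 2: 1}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5}, 2: {1: 1}}
print(spgemm(A, B))                     # {0: {1: 9}, 1: {0: 15}}
```

The irregular inner loop over `B`'s rows is the source of the poor locality and load imbalance that SpGEMM accelerators target.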
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing
Neural architecture search (NAS) for edge devices is often time-consuming because of long-latency deploying and testing on edge devices. The ability to accurately predict the computation cost and memory requirement for convolutional neural networks (CNNs) ...
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search
Today, real-time search over big microblogging data requires low indexing and query latency. Online services, therefore, prefer to host inverted indices in memory. Unfortunately, as datasets grow, indices grow proportionally, and with limited DRAM scaling,...
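The in-memory inverted index the abstract refers to can be shown in miniature: a map from term to posting list of document ids, with conjunctive queries by intersection. This is illustrative only; SPIRIT's persistent, scalable design is far more involved.

```python
# Tiny in-memory inverted index: term -> posting list of doc ids.
from collections import defaultdict

index = defaultdict(list)

def ingest(doc_id, text):
    for term in set(text.lower().split()):
        index[term].append(doc_id)

def query(*terms):
    """Conjunctive query: ids of docs containing all terms."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

ingest(1, "real time search over big data")
ingest(2, "big microblogging data in memory")
print(query("big", "data"))             # [1, 2]
```

Because every indexed term grows a posting list, index size grows with the corpus, which is exactly the DRAM-scaling pressure the abstract points at.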
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory
The CXL (Compute Express Link) technology is an emerging memory interface with high-level commands. Recent studies have applied CXL memory expansion to mitigate the capacity limitation of conventional DDRx memory. Unlike the prior studies to ...
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation
Binary translation enables transparent execution, analysis, and modification of the binary program, serving as a core technology that facilitates instruction set emulation, cross-platform compatibility of software, and program instrumentation. Handling ...
Characterizing and Understanding HGNN Training on GPUs
Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Prior to ...
A High Scalability Memory NoC with Shared-Inside Hierarchical-Groupings for Triplet-Based Many-Core Architecture
Innovative processor architecture designs are shifting towards Many-Core Architectures (MCAs) to meet the future demands of high-performance computing as the limits of Moore’s Law have almost been reached. Many-core processors utilize shared memory ...
An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing
- Jin Zhao,
- Yu Zhang,
- Donghao He,
- Qikun Li,
- Weihang Yin,
- Hui Yu,
- Hao Qi,
- Xiaofei Liao,
- Hai Jin,
- Haikun Liu,
- Linchen Yu,
- Zhang Zhan
Graph processing has become a central concern for many real-world applications and is well-known for its low compute-to-communication ratios and poor data locality. By integrating computing logic into memory, resistive random access memory (ReRAM) tackles ...
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer
The fast development of biomolecular structure determination has enabled the fine-grained study of objects in the micro-world, such as proteins and RNAs, benefiting a broad range of scientific fields. However, as the computational algorithms are constantly developed, the ...
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers
Pointers are an integral part of C and other programming languages. They enable substantial flexibility from the programmer’s standpoint, allowing the user fine, unmediated control over data access patterns. However, accesses done through pointers are ...
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching
- Yingshuai Dong,
- Chencheng Ye,
- Haikun Liu,
- Liting Tang,
- Xiaofei Liao,
- Hai Jin,
- Cheng Chen,
- Yanjiang Li,
- Yi Wang
Queries on linked data structures, such as trees and graphs, often suffer from frequent cache misses and significant performance loss due to dependent and random pointer-chasing memory accesses. In this paper, we propose a software-hardware co-designed ...
Conflict Management in Vector Register Files
The instruction set architecture (ISA) of vector processors operates on vectors stored in the vector register file (VRF), which must handle several concurrent accesses by functional units (FUs) with multiple ports. When the vector processor is running ...
Shining Light on the Inter-procedural Code Obfuscation: Keep Pace with Progress in Binary Diffing
- Peihua Zhang,
- Chenggang Wu,
- Hanzhi Hu,
- Lichen Jia,
- Mingfan Peng,
- Jiali Xu,
- Mengyao Xie,
- Yuanming Lai,
- Yan Kang,
- Zhe Wang
Software obfuscation techniques have lost their effectiveness due to the rapid development of binary diffing techniques, which can achieve accurate function matching and identification. In this paper, we propose a new inter-procedural code obfuscation ...
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster
Parallelizing CNN inference on heterogeneous edge clusters with data parallelism has gained popularity as a way to meet real-time requirements without sacrificing model accuracy. However, existing algorithms struggle to find optimal parallel granularity ...
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing
In recent years, deploying deep learning models on edge devices has become pervasive, driven by the increasing demand for intelligent edge computing solutions across various industries. From industrial automation to intelligent surveillance and healthcare,...
Multiple Function Merging for Code Size Reduction
Resource-constrained environments, such as embedded devices, have limited amounts of memory and storage. Practical programming languages such as C++ and Rust tend to output multiple similar functions by monomorphizing polymorphic functions. An ...
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power
Mobile devices need to respond quickly to diverse user inputs. Existing approaches often heuristically raise the CPU/GPU frequency according to empirical rules when facing bursty inputs and varying conditions. Although doing so can be effective ...
DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration
Deep Neural Networks (DNNs) are very computationally demanding, which presents a significant barrier to their deployment, especially on resource-constrained devices. Significant work from both the machine learning and computing systems communities has ...
GraphService: Topology-aware Constructor for Large-scale Graph Applications
Graph-based services are becoming integrated into everyday life through graph applications and graph learning systems. While traditional graph processing approaches boast excellent throughput with millisecond-level processing time, the construction phase ...
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration
DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small-but-fast regions to cache frequently accessed data, thereby reducing the average latency. ...