Cross-core Data Sharing for Energy-efficient GPUs
- Hajar Falahati,
- Mohammad Sadrosadati,
- Qiumin Xu,
- Juan Gómez-Luna,
- Banafsheh Saber Latibari,
- Hyeran Jeon,
- Shaahin Hesaabi,
- Hamid Sarbazi-Azad,
- Onur Mutlu,
- Murali Annavaram,
- Masoud Pedram
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains, because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as CUDA and OpenCL. ...
ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors
Systolic array architecture has significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that can perform multiply-accumulate (MAC). Traditionally, the systolic array can execute a certain amount ...
An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU
The hash-based signature (HBS) is the most conservative and time-consuming among many post-quantum cryptography (PQC) algorithms. Two HBSs, LMS and XMSS, are the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST) ...
Knowledge-Augmented Mutation-Based Bug Localization for Hardware Design Code
Verification of hardware design code is crucial for the quality assurance of hardware products. Being an indispensable part of verification, localizing bugs in the hardware design code is significant for hardware development but is often regarded as a ...
D2Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage
LSM-based key-value stores suffer from sub-optimal performance due to their slow and heavy background compactions. The compaction brings severe CPU and network overhead on high-speed disaggregated storage. This article further reveals that data-intensive ...
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments
This article proposes iSwap, a new memory page swap mechanism that reduces the ineffective I/O swap operations and improves the QoS for applications with a high priority in cloud environments. iSwap works in the OS kernel. iSwap accurately learns the ...
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems
With the explosive growth of graph data, distributed graph processing has become popular, and many graph hardware accelerators use distributed frameworks. Graph partitioning is foundational to distributed graph processing. However, dynamic changes in graph ...
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign
RDMA (Remote Direct Memory Access) networks require efficient congestion control to maintain their high throughput and low latency characteristics. However, congestion control protocols deployed at the software layer suffer from slow response times due to ...
Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloads
The increasing demand for computing power and the emergence of heterogeneous computing architectures have driven the exploration of innovative techniques to address current limitations in both the compute and memory subsystems. One such solution is the ...
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling
For datacenter architects, it is the most important goal to minimize the datacenter’s total cost of ownership for the target performance (i.e., TCO/performance). As the major component of a datacenter is a server farm, the most effective way of reducing ...
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks
More and more storage systems use erasure code to tolerate faults. It takes pieces of data blocks as input and encodes a small number of parity blocks as output, where these blocks form a stripe. When reconsidering the recovery problem in the multi-stripe ...
Fixed-point Encoding and Architecture Exploration for Residue Number Systems
Residue Number Systems (RNS) demonstrate the fascinating potential to serve integer addition/multiplication-intensive applications. The complexity of Artificial Intelligence (AI) models has grown enormously in recent years. From a computer system’s ...
Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs
AMG is one of the most efficient and widely used methods for solving sparse linear systems. The computational process of AMG mainly consists of a series of iterative calculations of generalized sparse matrix-matrix multiplication (SpGEMM) and sparse ...
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access
The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges because its access ...
SAL: Optimizing the Dataflow of Spin-based Architectures for Lightweight Neural Networks
As the Convolutional Neural Network (CNN) goes deeper and more complex, the network becomes memory-intensive and computation-intensive. To address this issue, the lightweight neural network reduces parameters and Multiplication-and-Accumulation (MAC) ...
Scythe: A Low-latency RDMA-enabled Distributed Transaction System for Disaggregated Memory
Disaggregated memory separates compute and memory resources into independent pools connected by RDMA (Remote Direct Memory Access) networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. ...
Lavender: An Efficient Resource Partitioning Framework for Large-Scale Job Colocation
Workload consolidation is a widely used approach to enhance resource utilization in modern data centers. However, the concurrent execution of multiple jobs on a shared server introduces contention for essential shared resources such as CPU cores, Last ...
Achieving Tunable Erasure Coding with Cluster-Aware Redundancy Transitioning
Erasure coding has been demonstrated as a storage-efficient means against failures, yet its tunability remains a challenging issue in data centers, which is prone to induce substantial cross-cluster traffic. In this article, we present ClusterRT, a ...
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture
- Ataberk Olgun,
- F. Nisa Bostanci,
- Geraldo Francisco de Oliveira Junior,
- Yahya Can Tugrul,
- Rahul Bera,
- Abdullah Giray Yaglikci,
- Hasan Hassan,
- Oguz Ergin,
- Onur Mutlu
Modern computing systems access data in main memory at coarse granularity (e.g., at 512-bit cache block granularity). Coarse-grained access leads to wasted energy because the system does not use all individually accessed small portions (e.g., words, each ...
ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection
- Xiaohui Wei,
- Chenyang Wang,
- Hengshan Yue,
- Jingweijia Tan,
- Zeyu Guan,
- Nan Jiang,
- Xinyang Zheng,
- Jianpeng Zhao,
- Meikang Qiu
To satisfy prohibitively massive computational requirements of current deep Convolutional Neural Networks (CNNs), CNN-specific accelerators are widely deployed in large-scale systems. Caused by high-energy neutrons and α-particle strikes, soft error may ...
Characterizing and Optimizing LDPC Performance on 3D NAND Flash Memories
With the development of NAND flash memories’ bit density and stacking technologies, while storage capacity keeps increasing, the issue of reliability becomes increasingly prominent. Low-density parity check (LDPC) code, as a robust error-correcting code, ...
ReHarvest: An ADC Resource-Harvesting Crossbar Architecture for ReRAM-Based DNN Accelerators
- Jiahong Xu,
- Haikun Liu,
- Zhuohui Duan,
- Xiaofei Liao,
- Hai Jin,
- Xiaokang Yang,
- Huize Li,
- Cong Liu,
- Fubing Mao,
- Yu Zhang
ReRAM-based Processing-In-Memory (PIM) architectures have been increasingly explored to accelerate various Deep Neural Network (DNN) applications because they can achieve extremely high performance and energy-efficiency for in-situ analog Matrix-Vector ...
Time-Aware Spectrum-Based Bug Localization for Hardware Design Code with Data Purification
The verification of hardware design code is a critical aspect in ensuring the quality and reliability of hardware products. Finding bugs in hardware design code is important for hardware development and is frequently considered as a notoriously ...