ACM Transactions on Architecture and Code Optimization: Just Accepted
The articles below have recently been accepted to the journal and are currently in the production process. These Author’s Accepted Manuscripts (AAM) will be available for preview until the “Version of Record” is available and assigned to its proper issue. The AAM carries the article’s permanent DOI and can be cited immediately.
research-article
Open Access
December 2024
JUST ACCEPTED
TPRepair: Tree-Based Pipelined Repair in Clustered Storage Systems

Erasure coding is an effective technique for guaranteeing data reliability for storage systems, yet it incurs a high repair penalty with amplified repair traffic. The repair becomes more intricate in clustered storage systems with the bandwidth diversity ...

research-article
Open Access
November 2024
JUST ACCEPTED
PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration

Graph processing is pivotal in deriving insights from complex data structures but faces performance limitations due to the irregular nature of graphs. Traditional general-purpose processors often struggle with low instruction-level parallelism and energy ...

research-article
Open Access
November 2024
JUST ACCEPTED
GCNTrain+: A Versatile and Efficient Accelerator for Graph Convolutional Neural Network Training

Recently, graph convolutional networks (GCNs) have gained wide attention due to their ability to capture node relationships in graphs. One problem appears when full-batch GCN is trained on large graph datasets, where the computational and memory ...

research-article
Open Access
November 2024
JUST ACCEPTED
exZNS: Extending Zoned Namespace to Support Byte-loggable Zones

Emerging Zoned Namespace (ZNS) provides hosts with fine-grained, performance-predictable storage management. ZNS organizes the address space into zones composed of fixed-size, sequentially written, non-overwritable blocks, making it suitable for log-...

research-article
Open Access
November 2024
JUST ACCEPTED
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion

Graph Neural Networks (GNNs) have achieved remarkable successes in various graph-based learning tasks, thanks to their ability to leverage advanced GPUs. However, GNNs currently face challenges arising from the concurrent use of advanced Tensor Cores (TCs)...

research-article
Open Access
November 2024
JUST ACCEPTED
ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management

Because GPU memory is limited, the performance of large DNN training is constrained by the unscalable batch size. Existing research partially addresses the GPU memory limit through tensor recomputation and swapping, but overlooks the exploration ...

research-article
Open Access
November 2024
JUST ACCEPTED
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel

The Sparse General Matrix-Matrix multiplication (SpGEMM) is a fundamental component for many applications, such as algebraic multigrid methods (AMG), graphic processing, and deep learning. However, the unbearable latency of computing high-dimensional, ...

research-article
Open Access
November 2024
JUST ACCEPTED
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing

Neural architecture search (NAS) for edge devices is often time-consuming because of long-latency deploying and testing on edge devices. The ability to accurately predict the computation cost and memory requirement for convolutional neural networks (CNNs) ...

research-article
Open Access
November 2024
JUST ACCEPTED
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search

Today, real-time search over big microblogging data requires low indexing and query latency. Online services, therefore, prefer to host inverted indices in memory. Unfortunately, as datasets grow, indices grow proportionally, and with limited DRAM scaling,...

research-article
Open Access
November 2024
JUST ACCEPTED
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory

The CXL (Compute Express Link) technology is an emerging memory interface with high-level commands. Recent studies applied the CXL memory expanding technique to mitigate the capacity limitation of the conventional DDRx memory. Unlike the prior studies to ...

research-article
Open Access
November 2024
JUST ACCEPTED
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation

Binary translation enables transparent execution, analysis, and modification of the binary program, serving as a core technology that facilitates instruction set emulation, cross-platform compatibility of software, and program instrumentation. Handling ...

research-article
Open Access
November 2024
JUST ACCEPTED
Characterizing and Understanding HGNN Training on GPUs

Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Prior to ...

research-article
Open Access
November 2024
JUST ACCEPTED
A High Scalability Memory NoC with Shared-Inside Hierarchical-Groupings for Triplet-Based Many-Core Architecture

Innovative processor architecture designs are shifting towards Many-Core Architectures (MCAs) to meet the future demands of high-performance computing as the limits of Moore’s Law have almost been reached. Many-core processors utilize shared memory ...

research-article
Open Access
November 2024
JUST ACCEPTED
An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing

Graph processing has become a central concern for many real-world applications and is well-known for its low compute-to-communication ratios and poor data locality. By integrating computing logic into memory, resistive random access memory (ReRAM) tackles ...

research-article
Open Access
October 2024
JUST ACCEPTED
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer

The fast development of biomolecular structure determination has enabled the fine-grained study of objects in the micro-world, such as proteins and RNAs, to the world's benefit. However, as the computational algorithms are constantly developed, the ...

research-article
Open Access
October 2024
JUST ACCEPTED
PARADISE: Criticality-Aware Instruction Reordering for Power Attack Resistance

Power side-channel attacks exploit the correlation of power consumption with the instructions and data being processed to extract secrets from a device (e.g., cryptographic keys). Prior work primarily focused on protecting small embedded micro-controllers ...

research-article
Open Access
October 2024
JUST ACCEPTED
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers

Pointers are an integral part of C and other programming languages. They enable substantial flexibility from the programmer’s standpoint, allowing the user fine, unmediated control over data access patterns. However, accesses done through pointers are ...

research-article
Open Access
October 2024
JUST ACCEPTED
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching

Queries on linked data structures, such as trees and graphs, often suffer from frequent cache misses and significant performance loss due to dependent and random pointer-chasing memory accesses. In this paper, we propose a software-hardware co-designed ...

research-article
Open Access
October 2024
JUST ACCEPTED
Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by using Performance Counters

We find existing benchmark suites for smartphone CPU micro-architecture design such as Geekbench 5.0 fail to authentically represent the micro-architecture level performance behavior of widely used real Android applications with interactive operations ...

research-article
Open Access
October 2024
JUST ACCEPTED
Conflict Management in Vector Register Files

The instruction set architecture (ISA) of vector processors operates on vectors stored in the vector register file (VRF) which needs to handle several concurrent accesses by functional units (FUs) with multiple ports. When the vector processor is running ...

research-article
Open Access
October 2024
JUST ACCEPTED
Shining Light on the Inter-procedural Code Obfuscation: Keep Pace with Progress in Binary Diffing

Software obfuscation techniques have lost their effectiveness due to the rapid development of binary diffing techniques, which can achieve accurate function matching and identification. In this paper, we propose a new inter-procedural code obfuscation ...

research-article
Open Access
October 2024
JUST ACCEPTED
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster

Parallelizing CNN inference on heterogeneous edge clusters with data parallelism has gained popularity as a way to meet real-time requirements without sacrificing model accuracy. However, existing algorithms struggle to find optimal parallel granularity ...

research-article
Open Access
October 2024
JUST ACCEPTED
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing

In recent years, deploying deep learning models on edge devices has become pervasive, driven by the increasing demand for intelligent edge computing solutions across various industries. From industrial automation to intelligent surveillance and healthcare,...

research-article
Open Access
October 2024
JUST ACCEPTED
Multiple Function Merging for Code Size Reduction

Resource-constrained environments, such as embedded devices, have limited amounts of memory and storage. Practical programming languages such as C++ and Rust tend to output multiple similar functions by monomorphizing polymorphic functions. An ...

research-article
Open Access
October 2024
JUST ACCEPTED
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power

Mobile devices need to respond quickly to diverse user inputs. The existing approaches often heuristically raise the CPU/GPU frequency according to the empirical rules when facing burst inputs and various changes. Although doing so can be effective ...

research-article
Open Access
September 2024
JUST ACCEPTED
DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration

Deep Neural Networks (DNNs) are very computationally demanding, which presents a significant barrier to their deployment, especially on resource-constrained devices. Significant work from both the machine learning and computing systems communities has ...

research-article
Open Access
August 2024
JUST ACCEPTED
GraphService: Topology-aware Constructor for Large-scale Graph Applications

Graph-based services are becoming integrated into everyday life through graph applications and graph learning systems. While traditional graph processing approaches boast excellent throughput with millisecond-level processing time, the construction phase ...

research-article
Open Access
February 2024
JUST ACCEPTED
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration

DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small-but-fast regions to cache frequently accessed data, thereby reducing the average latency. ...