ACM Transactions on Architecture and Code Optimization: Just Accepted
The below articles have been recently accepted to the journal and are currently in the production process. These Author’s Accepted Manuscripts (AAM) will be available for preview until the “Version of Record” is available and assigned to its proper issue. The AAM carries the article’s permanent DOI and can be cited immediately.
research-article
Open Access
March 2025
JUST ACCEPTED
BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks

Popular big data frameworks commonly run atop the Java Virtual Machine (JVM) and rely on the garbage collection (GC) mechanism to automatically allocate and reclaim in-memory objects. Existing garbage collectors are designed based on the hypothesis that most objects ...

research-article
Open Access
March 2025
JUST ACCEPTED
SEED: Speculative Security Metadata Updates for Low-Latency Secure Memory

Securing systems’ main memory is important for building trusted data centers. To ensure memory security, encryption and integrity verification techniques update the security metadata (e.g., encryption counters and integrity trees) during memory data ...

research-article
Open Access
March 2025
JUST ACCEPTED
A Lock-free RDMA-friendly Index in CPU-parsimonious Environments

In CPU-parsimonious environments, such as disaggregated memory systems, the limited CPU power on the memory side constrains the ability to perform more operations. Thus, reducing CPU usage and enhancing concurrency performance are critical for indexing ...

research-article
Open Access
March 2025
JUST ACCEPTED
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language

Pipeline parallelism is a crucial technique for large-scale model training, enabling parameter splitting and performance enhancement. However, creating effective pipeline schedules often requires significant manual effort and coding skills, leading to ...

research-article
Open Access
March 2025
JUST ACCEPTED
SRSparse: Generating Codes for High-Performance Sparse Matrix-Vector Semiring Computations

Sparse matrix-vector semiring computation is a key operation in sparse matrix computations, with performance strongly dependent on both program design and the features of the sparse matrices. Given the diversity of sparse matrices, designing a tailored ...

research-article
Open Access
March 2025
JUST ACCEPTED
PANDA: Adaptive Prefetching and Decentralized Scheduling for Dataflow Architectures

Dataflow architectures are considered promising, offering a commendable balance of performance, efficiency, and flexibility. Many prior works have been proposed to improve the performance of dataflow architectures. Nevertheless, these ...

research-article
Open Access
March 2025
JUST ACCEPTED
TSN Cache: Exploiting Data Localities in Graph Computing Applications

This paper finds that, in graph processing, vertices of the same graph differ in reusability, and that high-reuse and low-reuse vertices are stored together. These phenomena leave existing GPU architectures unable to capture the reusability ...

research-article
Open Access
March 2025
JUST ACCEPTED
Overlapping Aware Data Placement Optimizations for LSM Tree-Based Store on ZNS SSDs

Solid State Drives (SSDs) based on the NVMe Zoned Namespaces (ZNS) interface can notably reduce the costs of address mapping, garbage collection, and over-provisioning by dividing the storage space into multiple zones for sequential writes and random ...

research-article
Open Access
March 2025
JUST ACCEPTED
ODGS: Dependency-Aware Scheduling for High-Level Synthesis with Graph Neural Network and Reinforcement Learning

Scheduling determines the execution order and time of operations in a program. The order is related to operation dependencies, including data and resource dependencies. Data dependency is intrinsic to a program, showing the data flow between operations. Resource ...

research-article
Open Access
March 2025
JUST ACCEPTED
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication

Multiplication plays a critical role in SRAM-based Computing-in-Memory (CIM) architectures. However, current SRAM-based CIMs face three major limitations. First, they do not fully exploit bit-level sparsity, resulting in unnecessary overhead in both ...

research-article
Open Access
February 2025
JUST ACCEPTED
LitTLS: Lightweight Thread-Level Speculation on Little Cores

Thread-Level Speculation (TLS) utilizes speculative parallelization to accelerate hard-to-parallelize serial codes on multi-cores. As the heterogeneous multi-core architecture is becoming ubiquitous, it presents an opportunity for TLS to reorganize little ...

research-article
Open Access
February 2025
JUST ACCEPTED
FusionFS: A Contention-Resilient File System for Persistent CPU Caches

Byte-addressable storage (BAS), such as persistent memory and CXL-SSDs, does not meet system designers’ expectations for data flushing and access granularity. Persistent CPU caches, enabled by recent techniques like Intel’s eADR and CXL’s Global ...

research-article
Open Access
February 2025
JUST ACCEPTED
Sniper: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis

MPI tracing tools are essential for collecting the communication events and performance metrics of large-scale programs for further performance analysis and optimization. However, towards the exascale era, the performance and storage overhead for tracing ...

research-article
Open Access
February 2025
JUST ACCEPTED
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework

With the rising demand for computational power and the increasing variety of computational scenarios, considerable interest has emerged in transforming existing CUDA programs into more general-purpose OpenCL programs, enabling them to run across diverse ...

research-article
Open Access
February 2025
JUST ACCEPTED
CesASMe and Staticdeps: static detection of memory-carried dependencies for code analyzers

A variety of code analyzers, such as IACA, uiCA, llvm-mca or Ithemal, strive to statically predict the throughput of a computation kernel. Each analyzer is based on its own simplified CPU model reasoning at the scale of a basic block. Facing this ...

research-article
Open Access
February 2025
JUST ACCEPTED
Comprehensive Evaluation and Opportunity Discovery for Deterministic Concurrency Control

Deterministic concurrency control (DCC) guarantees that the same input transactions produce the same serializable result. It offers benefits in both distributed databases and blockchain systems. Dozens of DCC algorithms have emerged in the past decade. ...

research-article
Open Access
February 2025
JUST ACCEPTED
ARACHNE: Optimizing Distributed Parallel Applications with Reduced Inter-Process Communication

In high-performance computing (HPC), parallelization is essential for improving computational efficiency as data and computation scales exceed single-node capacity. Existing methods, such as the polyhedral model used in Pluto-Distmem, focus on loop and ...

research-article
Open Access
February 2025
JUST ACCEPTED
Ceiba: An Efficient and Scalable DNN Scheduler for Spatial Accelerators

Spatial accelerators are domain-specific architectures to elevate performance and energy efficiency for deep neural networks (DNNs). They also bring a large number of schedule parameters to determine computation and data movement patterns of DNNs. ...

research-article
Open Access
February 2025
JUST ACCEPTED
ScaWL: Scaling k-WL (Weisfeiler-Lehman) Algorithms in Memory and Performance on Shared and Distributed-Memory Systems

The k-dimensional Weisfeiler-Leman (k-WL) algorithm—developed as an efficient heuristic for testing if two graphs are isomorphic—is a fundamental kernel for node embedding in the emerging field of graph neural networks. Unfortunately, the k-WL algorithm ...

research-article
Open Access
February 2025
JUST ACCEPTED
Dynamic Power Management Through Multi-agent Deep Reinforcement Learning for Heterogeneous Systems

Power management and optimization play a significant role in modern computer systems, from battery-powered devices to servers running in data centres. Existing approaches for power capping fail to meet the requirements presented by dynamic workloads, and ...

research-article
Open Access
February 2025
JUST ACCEPTED
VersaTile: Flexible Tiled Architectures via Associative Processors

As modern applications demand more data, processing-in-memory (PIM) architectures have emerged to address the challenges of data movement and parallelism. In this paper, we propose VersaTile, a heterogeneous, fully CMOS-based tiled architecture that ...

research-article
Open Access
February 2025
JUST ACCEPTED
Exploiting Dynamic Regular Patterns in Irregular Programs for Efficient Vectorization

Modern optimizing compilers are able to exploit memory access or computation patterns to generate vectorized codes. However, such patterns in irregular programs are unknown until runtime due to the input dependence. Thus, either compiler’s static ...

research-article
Open Access
February 2025
JUST ACCEPTED
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs

The Iterative Closest Points (ICP) algorithm is the most widely used method for estimating rigid transformation in 3D point cloud registration. However, the ICP relies on repeatedly performing computationally intensive nearest neighbor searches (NNS) ...

research-article
Open Access
February 2025
JUST ACCEPTED
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs

Convolutional Neural Networks (CNNs) are fundamental to advancing computer vision technologies. As CNNs become more complex and larger, optimizing model inference remains a critical challenge in both industry and academia. On modern GPU platforms, CNN ...

research-article
Open Access
January 2025
JUST ACCEPTED
Maximizing Data and Hardware Reuse for HLS with Early-Stage Symbolic Partitioning

While traditional HLS (High-Level Synthesis) converts “high-level” C-like programs into hardware automatically, producing high-performance designs still requires hardware expertise. Optimizations such as data partitioning can have a large impact on ...

research-article
Open Access
January 2025
JUST ACCEPTED
LIA: Latency-Improved Adaptive routing for Dragonfly networks

Low-diameter network topologies require non-minimal routing, such as Valiant routing, to avoid network congestion under challenging traffic patterns like the so-called adversarial. However, this mechanism tends to increase the average path length, base ...

research-article
Open Access
January 2025
JUST ACCEPTED
Unveiling and Evaluating Vulnerabilities in Branch Predictors via a Three-Step Modeling Methodology

With the emergence and proliferation of microarchitectural attacks targeting branch predictors, the once-established security boundary in computer systems and architectures is facing unprecedented challenges. This paper introduces an innovative branch ...

research-article
Open Access
January 2025
JUST ACCEPTED
KINDRED: Heterogeneous Split-Lock Architecture for Safe Autonomous Machines

With the increasing practicality of autonomous vehicles and drones, the importance of reliability requirements has escalated substantially. In many instances, traditional system designs tend to overlook reliability issues, emphasizing primarily on ...

research-article
Open Access
January 2025
JUST ACCEPTED
Taming Flexible Job Packing in Deep Learning Training Clusters

Job packing is an effective technique to harvest the idle resources allocated to the deep learning (DL) training jobs but not fully utilized, especially when clusters may experience low utilization, and users may overestimate their resource needs. However,...

research-article
Open Access
January 2025
JUST ACCEPTED
Enhancing High-Throughput GPU Random Walks Through Multi-Task Concurrency Orchestration

Random walk is a powerful tool for large-scale graph learning, but its high computational demand presents a challenge. While GPUs can accelerate random walk tasks, current frameworks fail to fully utilize GPU parallelism due to memory-to-compute bandwidth ...

research-article
Open Access
January 2025
JUST ACCEPTED
gHyPart: GPU-friendly End-to-End Hypergraph Partitioner

Hypergraph partitioning finds practical applications in various fields, such as high-performance computing and circuit partitioning in VLSI physical design, where high-performance solutions often demand substantial parallelism beyond what existing CPU-...

research-article
Open Access
January 2025
JUST ACCEPTED
IBing: An Efficient Interleaved Bidirectional Ring All-Reduce Algorithm for Gradient Synchronization

Ring all-reduce is currently the most commonly used collective communication technique in the fields of data parallel and distributed computing. It consists of three phases: communication establishment, data transmission, and data processing at each step. ...

research-article
Open Access
January 2025
JUST ACCEPTED
gCom: Fine-grained Compressors in Graphics Memory of Mobile GPU

Nowadays, GPUs significantly boost rendering performance. However, the high memory requirements limit their use, especially on low-end mobile platforms. Compression techniques have been widely adopted to reduce memory consumption but face two primary ...

research-article
Open Access
December 2024
JUST ACCEPTED
Bubble-Swap Flow Control

Deadlock-free adaptive routing is extensively adopted in both on-chip and off-chip interconnection networks to improve communication bandwidth and reduce latency. Introducing virtual channels (VCs), also known as virtual lanes (VLs), is the mainstream ...

research-article
Open Access
December 2024
JUST ACCEPTED
Consequence-based Clustered Architecture

We recognize that the execution of many dynamic instructions has no consequence on the overall execution of the program. For example, the execution of a correctly predicted conditional branch instruction, as well as all the instructions leading up to it, ...

research-article
Open Access
December 2024
JUST ACCEPTED
Flexible and Effective Object Tiering for Heterogeneous Memory Systems

Computing platforms that package multiple types of memory, each with their own performance characteristics, are quickly becoming mainstream. To operate efficiently, heterogeneous memory architectures require new data management solutions that are able to ...

research-article
Open Access
December 2024
JUST ACCEPTED
COVER: Alleviating Crash-Consistency Error Amplification in Secure Persistent Memory Systems

Data security (including confidentiality, integrity, and availability) and crash consistency guarantees are essential for building trusted persistent memory (PM) systems. Security and consistency metadata are added to enable the guarantees. Recent studies ...

research-article
Open Access
December 2024
JUST ACCEPTED
MasterPlan: A Reinforcement Learning Based Scheduler for Archive Storage

With the sheer volume of data in today’s world, archive storage systems play a significant role in persisting the cold data. Due to stringent cost concerns, one popular design is to organize disks into groups and periodically switch them to be powered on ...

research-article
Open Access
December 2024
JUST ACCEPTED
Steered Bubble: An Interposer-based Deadlock Recovery Algorithm for Multi-chiplet Systems

Dividing a single System-on-Chip (SoC) into multiple chiplets and integrating them via an interposer can achieve an optimal balance between continuous transistor integration and monetary cost. However, potential deadlock may arise between the chiplets and ...

research-article
Open Access
December 2024
JUST ACCEPTED
AIS: An Active Idleness I/O Scheduler to Reduce Buffer-Exhausted Degradation of Solid-State Drives

Modern solid-state drives (SSDs) continue to boost storage density and I/O bandwidth at the cost of flash-access I/O latency, especially for writes, and hence commonly deploy a built-in buffer to absorb incoming writes. However, when the buffer is used up, ...

research-article
Open Access
December 2024
JUST ACCEPTED
TPRepair: Tree-Based Pipelined Repair in Clustered Storage Systems

Erasure coding is an effective technique for guaranteeing data reliability for storage systems, yet it incurs a high repair penalty with amplified repair traffic. The repair becomes more intricate in clustered storage systems with the bandwidth diversity ...

research-article
Open Access
November 2024
JUST ACCEPTED
PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration

Graph processing is pivotal in deriving insights from complex data structures but faces performance limitations due to the irregular nature of graphs. Traditional general-purpose processors often struggle with low instruction-level parallelism and energy ...

research-article
Open Access
November 2024
JUST ACCEPTED
GCNTrain+: A Versatile and Efficient Accelerator for Graph Convolutional Neural Network Training

Recently, graph convolutional networks (GCNs) have gained wide attention due to their ability to capture node relationships in graphs. One problem appears when full-batch GCN is trained on large graph datasets, where the computational and memory ...

research-article
Open Access
November 2024
JUST ACCEPTED
exZNS: Extending Zoned Namespace to Support Byte-loggable Zones

Emerging Zoned Namespace (ZNS) provides hosts with fine-grained, performance-predictable storage management. ZNS organizes the address space into zones composed of fixed-size, sequentially written, non-overwritable blocks, making it suitable for log-...

research-article
Open Access
November 2024
JUST ACCEPTED
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion

Graph Neural Networks (GNNs) have achieved remarkable successes in various graph-based learning tasks, thanks to their ability to leverage advanced GPUs. However, GNNs currently face challenges arising from the concurrent use of advanced Tensor Cores (TCs)...

research-article
Open Access
November 2024
JUST ACCEPTED
ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management

Due to limited GPU memory, the performance of large DNN training is constrained by the unscalable batch size. Existing research partially addresses the issue of the GPU memory limit through tensor recomputation and swapping, but overlooks the exploration ...

research-article
Open Access
November 2024
JUST ACCEPTED
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel

The Sparse General Matrix-Matrix multiplication (SpGEMM) is a fundamental component for many applications, such as algebraic multigrid methods (AMG), graphic processing, and deep learning. However, the unbearable latency of computing high-dimensional, ...

research-article
Open Access
November 2024
JUST ACCEPTED
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing

Neural architecture search (NAS) for edge devices is often time-consuming because deploying and testing on edge devices incurs long latency. The ability to accurately predict the computation cost and memory requirement for convolutional neural networks (CNNs) ...

research-article
Open Access
November 2024
JUST ACCEPTED
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search

Today, real-time search over big microblogging data requires low indexing and query latency. Online services, therefore, prefer to host inverted indices in memory. Unfortunately, as datasets grow, indices grow proportionally, and with limited DRAM scaling,...

research-article
Open Access
November 2024
JUST ACCEPTED
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory

The CXL (Compute Express Link) technology is an emerging memory interface with high-level commands. Recent studies applied the CXL memory expanding technique to mitigate the capacity limitation of the conventional DDRx memory. Unlike the prior studies to ...

research-article
Open Access
November 2024
JUST ACCEPTED
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation

Binary translation enables transparent execution, analysis, and modification of the binary program, serving as a core technology that facilitates instruction set emulation, cross-platform compatibility of software, and program instrumentation. Handling ...

research-article
Open Access
November 2024
JUST ACCEPTED
Characterizing and Understanding HGNN Training on GPUs

Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Prior to ...

research-article
Open Access
November 2024
JUST ACCEPTED
A High Scalability Memory NoC with Shared-Inside Hierarchical-Groupings for Triplet-Based Many-Core Architecture

Innovative processor architecture designs are shifting towards Many-Core Architectures (MCAs) to meet the future demands of high-performance computing as the limits of Moore’s Law have almost been reached. Many-core processors utilize shared memory ...

research-article
Open Access
November 2024
JUST ACCEPTED
An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing

Graph processing has become a central concern for many real-world applications and is well-known for its low compute-to-communication ratios and poor data locality. By integrating computing logic into memory, resistive random access memory (ReRAM) tackles ...

research-article
Open Access
October 2024
JUST ACCEPTED
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer

The fast development of biomolecular structure determination has enabled the fine-grained study of objects in the micro-world, such as proteins and RNAs. The world is benefited. However, as the computational algorithms are constantly developed, the ...

research-article
Open Access
October 2024
JUST ACCEPTED
PARADISE: Criticality-Aware Instruction Reordering for Power Attack Resistance

Power side-channel attacks exploit the correlation of power consumption with the instructions and data being processed to extract secrets from a device (e.g., cryptographic keys). Prior work primarily focused on protecting small embedded micro-controllers ...

research-article
Open Access
October 2024
JUST ACCEPTED
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers

Pointers are an integral part of C and other programming languages. They enable substantial flexibility from the programmer’s standpoint, allowing the user fine, unmediated control over data access patterns. However, accesses done through pointers are ...

research-article
Open Access
October 2024
JUST ACCEPTED
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching

Queries on linked data structures, such as trees and graphs, often suffer from frequent cache misses and significant performance loss due to dependent and random pointer-chasing memory accesses. In this paper, we propose a software-hardware co-designed ...

research-article
Open Access
October 2024
JUST ACCEPTED
Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by using Performance Counters

We find existing benchmark suites for smartphone CPU micro-architecture design such as Geekbench 5.0 fail to authentically represent the micro-architecture level performance behavior of widely used real Android applications with interactive operations ...

research-article
Open Access
October 2024
JUST ACCEPTED
Conflict Management in Vector Register Files

The instruction set architecture (ISA) of vector processors operates on vectors stored in the vector register file (VRF) which needs to handle several concurrent accesses by functional units (FUs) with multiple ports. When the vector processor is running ...

research-article
Open Access
October 2024
JUST ACCEPTED
Shining Light on the Inter-procedural Code Obfuscation: Keep Pace with Progress in Binary Diffing

Software obfuscation techniques have lost their effectiveness due to the rapid development of binary diffing techniques, which can achieve accurate function matching and identification. In this paper, we propose a new inter-procedural code obfuscation ...

research-article
Open Access
October 2024
JUST ACCEPTED
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster

Parallelizing CNN inference on heterogeneous edge clusters with data parallelism has gained popularity as a way to meet real-time requirements without sacrificing model accuracy. However, existing algorithms struggle to find optimal parallel granularity ...

research-article
Open Access
October 2024
JUST ACCEPTED
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing

In recent years, deploying deep learning models on edge devices has become pervasive, driven by the increasing demand for intelligent edge computing solutions across various industries. From industrial automation to intelligent surveillance and healthcare,...

research-article
Open Access
October 2024
JUST ACCEPTED
Multiple Function Merging for Code Size Reduction

Resource-constrained environments, such as embedded devices, have limited amounts of memory and storage. Practical programming languages such as C++ and Rust tend to output multiple similar functions by monomorphizing polymorphic functions. An ...

research-article
Open Access
October 2024
JUST ACCEPTED
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power

Mobile devices need to respond quickly to diverse user inputs. The existing approaches often heuristically raise the CPU/GPU frequency according to the empirical rules when facing burst inputs and various changes. Although doing so can be effective ...

research-article
Open Access
September 2024
JUST ACCEPTED
DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration

Deep Neural Networks (DNNs) are very computationally demanding, which presents a significant barrier to their deployment, especially on resource-constrained devices. Significant work from both the machine learning and computing systems communities has ...

research-article
Open Access
August 2024
JUST ACCEPTED
GraphService: Topology-aware Constructor for Large-scale Graph Applications

Graph-based services are becoming integrated into everyday life through graph applications and graph learning systems. While traditional graph processing approaches boast excellent throughput with millisecond-level processing time, the construction phase ...

research-article
Open Access
February 2024
JUST ACCEPTED
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration

DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small-but-fast regions to cache frequently accessed data, thereby reducing the average latency. ...