BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks
Popular big data frameworks commonly run atop the Java Virtual Machine (JVM) and rely on the garbage collection (GC) mechanism to automatically allocate and reclaim in-memory objects. Existing garbage collectors are designed based on the hypothesis that most objects ...
SEED: Speculative Security Metadata Updates for Low-Latency Secure Memory
Securing systems’ main memory is important for building trusted data centers. To ensure memory security, encryption and integrity verification techniques update the security metadata (e.g., encryption counters and integrity trees) during memory data ...
A Lock-free RDMA-friendly Index in CPU-parsimonious Environments
In CPU-parsimonious environments, such as disaggregated memory systems, the limited CPU power on the memory side constrains the number of operations that can be performed. Thus, reducing CPU usage and enhancing concurrency performance are critical for indexing ...
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language
Pipeline parallelism is a crucial technique for large-scale model training, enabling parameter splitting and performance enhancement. However, creating effective pipeline schedules often requires significant manual effort and coding skills, leading to ...
SRSparse: Generating Codes for High-Performance Sparse Matrix-Vector Semiring Computations
Sparse matrix-vector semiring computation is a key operation in sparse matrix computations, with performance strongly dependent on both program design and the features of the sparse matrices. Given the diversity of sparse matrices, designing a tailored ...
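To make the notion of a sparse matrix-vector semiring computation concrete, here is a minimal CSR-based sketch parameterized by the semiring's add, multiply, and identity element. The function and parameter names are illustrative assumptions, not SRSparse's generated code:

```python
# CSR sparse matrix-vector product over a generic semiring (add, mul, zero).
# Illustrative sketch only; SRSparse generates specialized high-performance
# code per matrix, whereas this is the generic reference loop.

def spmv_semiring(indptr, indices, vals, x, add, mul, zero):
    """y[i] = add-reduction over stored entries j of mul(A[i][j], x[j])."""
    n = len(indptr) - 1
    y = [zero] * n
    for i in range(n):
        acc = zero
        for k in range(indptr[i], indptr[i + 1]):  # nonzeros of row i
            acc = add(acc, mul(vals[k], x[indices[k]]))
        y[i] = acc
    return y
```

With (+, *, 0) this is ordinary SpMV; with (min, +, inf) the same loop performs one relaxation step of single-source shortest paths — this generality is what makes semiring SpMV a key building block of sparse matrix computations.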
PANDA: Adaptive Prefetching and Decentralized Scheduling for Dataflow Architectures
Dataflow architectures are considered promising, offering a commendable balance of performance, efficiency, and flexibility. Many prior works have been proposed to improve the performance of dataflow architectures. Nevertheless, these ...
TSN Cache: Exploiting Data Localities in Graph Computing Applications
This paper finds that vertices in the same graph exhibit differing reusability in graph processing, and that high-reuse and low-reuse vertices are stored together. These phenomena leave existing GPU architectures unable to capture the reusability ...
Overlapping Aware Data Placement Optimizations for LSM Tree-Based Store on ZNS SSDs
Solid State Drives (SSDs) based on the NVMe Zoned Namespaces (ZNS) interface can notably reduce the costs of address mapping, garbage collection, and over-provisioning by dividing the storage space into multiple zones for sequential writes and random ...
ODGS: Dependency-Aware Scheduling for High-Level Synthesis with Graph Neural Network and Reinforcement Learning
Scheduling determines the execution order and timing of operations in a program. The order is governed by operation dependencies, including data and resource dependencies. Data dependencies are intrinsic to a program, reflecting its operation data flow. Resource ...
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication
Multiplication plays a critical role in SRAM-based Computing-in-Memory (CIM) architectures. However, current SRAM-based CIMs face three major limitations. First, they do not fully exploit bit-level sparsity, resulting in unnecessary overhead in both ...
LitTLS: Lightweight Thread-Level Speculation on Little Cores
Thread-Level Speculation (TLS) utilizes speculative parallelization to accelerate hard-to-parallelize serial codes on multi-cores. As the heterogeneous multi-core architecture is becoming ubiquitous, it presents an opportunity for TLS to reorganize little ...
FusionFS: A Contention-Resilient File System for Persistent CPU Caches
Byte-addressable storage (BAS), such as persistent memory and CXL-SSDs, does not meet system designers’ expectations for data flushing and access granularity. Persistent CPU caches, enabled by recent techniques like Intel’s eADR and CXL’s Global ...
Sniper: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis
MPI tracing tools are essential for collecting the communication events and performance metrics of large-scale programs for further performance analysis and optimization. However, towards the exascale era, the performance and storage overhead for tracing ...
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework
- Changqing Shi,
- Yufei Sun,
- Rui Chen,
- Jiahao Wang,
- Qiang Guo,
- Chunye Gong,
- Yicheng Sui,
- Yutong Jin,
- Yuzhi Zhang
With the rising demand for computational power and the increasing variety of computational scenarios, considerable interest has emerged in transforming existing CUDA programs into more general-purpose OpenCL programs, enabling them to run across diverse ...
CesASMe and Staticdeps: static detection of memory-carried dependencies for code analyzers
A variety of code analyzers, such as IACA, uiCA, llvm-mca or Ithemal, strive to statically predict the throughput of a computation kernel. Each analyzer is based on its own simplified CPU model reasoning at the scale of a basic block. Facing this ...
Comprehensive Evaluation and Opportunity Discovery for Deterministic Concurrency Control
Deterministic concurrency control (DCC) guarantees that the same input transactions produce the same serializable result. It offers benefits in both distributed databases and blockchain systems. Dozens of DCC algorithms have emerged in the past decade. ...
ARACHNE: Optimizing Distributed Parallel Applications with Reduced Inter-Process Communication
In high-performance computing (HPC), parallelization is essential for improving computational efficiency as data and computation scales exceed single-node capacity. Existing methods, such as the polyhedral model used in Pluto-Distmem, focus on loop and ...
Ceiba: An Efficient and Scalable DNN Scheduler for Spatial Accelerators
Spatial accelerators are domain-specific architectures designed to elevate performance and energy efficiency for deep neural networks (DNNs). They also introduce a large number of schedule parameters that determine the computation and data movement patterns of DNNs. ...
ScaWL: Scaling k-WL (Weisfeiler-Lehman) Algorithms in Memory and Performance on Shared and Distributed-Memory Systems
- Coby Soss,
- Aravind Sukumaran Rajam,
- Janet Layne,
- Edoardo Serra,
- Mahantesh Halappanavar,
- Assefaw H. Gebremedhin
The k-dimensional Weisfeiler-Leman (k-WL) algorithm—developed as an efficient heuristic for testing if two graphs are isomorphic—is a fundamental kernel for node embedding in the emerging field of graph neural networks. Unfortunately, the k-WL algorithm ...
Dynamic Power Management Through Multi-agent Deep Reinforcement Learning for Heterogeneous Systems
Power management and optimization play a significant role in modern computer systems, from battery-powered devices to servers running in data centres. Existing approaches for power capping fail to meet the requirements presented by dynamic workloads, and ...
VersaTile: Flexible Tiled Architectures via Associative Processors
As modern applications demand more data, processing-in-memory (PIM) architectures have emerged to address the challenges of data movement and parallelism. In this paper, we propose VersaTile, a heterogeneous, fully CMOS-based tiled architecture that ...
Exploiting Dynamic Regular Patterns in Irregular Programs for Efficient Vectorization
Modern optimizing compilers are able to exploit memory access or computation patterns to generate vectorized codes. However, such patterns in irregular programs are unknown until runtime due to the input dependence. Thus, either compiler’s static ...
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs
The Iterative Closest Points (ICP) algorithm is the most widely used method for estimating rigid transformation in 3D point cloud registration. However, the ICP relies on repeatedly performing computationally intensive nearest neighbor searches (NNS) ...
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs
- Xueying Wang,
- Shigang Li,
- Hao Qian,
- Fan Luo,
- Zhaoyang Hao,
- Tong Wu,
- Ruiyuan Xu,
- Huimin Cui,
- Xiaobing Feng,
- Guangli Li,
- Jingling Xue
Convolutional Neural Networks (CNNs) are fundamental to advancing computer vision technologies. As CNNs become more complex and larger, optimizing model inference remains a critical challenge in both industry and academia. On modern GPU platforms, CNN ...
Maximizing Data and Hardware Reuse for HLS with Early-Stage Symbolic Partitioning
While traditional HLS (High-Level Synthesis) converts “high-level” C-like programs into hardware automatically, producing high-performance designs still requires hardware expertise. Optimizations such as data partitioning can have a large impact on ...
LIA: Latency-Improved Adaptive routing for Dragonfly networks
Low-diameter network topologies require non-minimal routing, such as Valiant routing, to avoid network congestion under challenging traffic patterns like the so-called adversarial ones. However, this mechanism tends to increase the average path length, base ...
Unveiling and Evaluating Vulnerabilities in Branch Predictors via a Three-Step Modeling Methodology
With the emergence and proliferation of microarchitectural attacks targeting branch predictors, the once-established security boundary in computer systems and architectures is facing unprecedented challenges. This paper introduces an innovative branch ...
KINDRED: Heterogeneous Split-Lock Architecture for Safe Autonomous Machines
With the increasing practicality of autonomous vehicles and drones, the importance of reliability requirements has escalated substantially. In many instances, traditional system designs tend to overlook reliability issues, focusing primarily on ...
Taming Flexible Job Packing in Deep Learning Training Clusters
Job packing is an effective technique for harvesting idle resources that are allocated to deep learning (DL) training jobs but not fully utilized, especially when clusters experience low utilization and users overestimate their resource needs. However,...
Enhancing High-Throughput GPU Random Walks Through Multi-Task Concurrency Orchestration
Random walk is a powerful tool for large-scale graph learning, but its high computational demand presents a challenge. While GPUs can accelerate random walk tasks, current frameworks fail to fully utilize GPU parallelism due to memory-to-compute bandwidth ...
gHyPart: GPU-friendly End-to-End Hypergraph Partitioner
Hypergraph partitioning finds practical applications in various fields, such as high-performance computing and circuit partitioning in VLSI physical design, where high-performance solutions often demand substantial parallelism beyond what existing CPU-...
IBing: An Efficient Interleaved Bidirectional Ring All-Reduce Algorithm for Gradient Synchronization
Ring all-reduce is currently the most commonly used collective communication technique in the fields of data parallel and distributed computing. It consists of three phases: communication establishment, data transmission, and data processing at each step. ...
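The three phases noted above can be illustrated with a minimal single-process simulation of the classic ring all-reduce schedule. The function name and data layout here are illustrative assumptions for exposition, not IBing's actual interface:

```python
# Minimal single-process simulation of ring all-reduce over P nodes,
# each holding P chunks of data. Hypothetical sketch of the classic
# algorithm; not IBing's interleaved bidirectional variant.

def ring_allreduce(data):
    """data[p][c]: value of chunk c on node p (len(data[p]) == len(data)).
    Afterwards, every node holds the per-chunk sum across all nodes."""
    P = len(data)

    # Phase 1: reduce-scatter. In step s, node p sends chunk (p - s) mod P
    # to its ring neighbor, which accumulates it. After P-1 steps, node p
    # holds the fully reduced chunk (p + 1) mod P.
    for s in range(P - 1):
        sends = [(p, (p - s) % P, data[p][(p - s) % P]) for p in range(P)]
        for p, c, val in sends:  # apply all of this step's transfers at once
            data[(p + 1) % P][c] += val

    # Phase 2: all-gather. Completed chunks circulate around the ring for
    # another P-1 steps until every node has every reduced chunk.
    for s in range(P - 1):
        sends = [(p, (p + 1 - s) % P, data[p][(p + 1 - s) % P]) for p in range(P)]
        for p, c, val in sends:
            data[(p + 1) % P][c] = val
    return data
```

Each of the 2(P-1) steps moves only 1/P of the data per link, which is why the ring schedule is bandwidth-optimal even though its step count grows with the ring size.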
gCom: Fine-grained Compressors in Graphics Memory of Mobile GPU
Nowadays, GPUs significantly boost rendering performance. However, the high memory requirements limit their use, especially on low-end mobile platforms. Compression techniques have been widely adopted to reduce memory consumption but face two primary ...
Bubble-Swap Flow Control
Deadlock-free adaptive routing is extensively adopted in both on-chip and off-chip interconnection networks to improve communication bandwidth and reduce latency. Introducing virtual channels (VCs), also known as virtual lanes (VLs), is the mainstream ...
Flexible and Effective Object Tiering for Heterogeneous Memory Systems
Computing platforms that package multiple types of memory, each with their own performance characteristics, are quickly becoming mainstream. To operate efficiently, heterogeneous memory architectures require new data management solutions that are able to ...
COVER: Alleviating Crash-Consistency Error Amplification in Secure Persistent Memory Systems
Data security (including confidentiality, integrity, and availability) and crash consistency guarantees are essential for building trusted persistent memory (PM) systems. Security and consistency metadata are added to enable the guarantees. Recent studies ...
MasterPlan: A Reinforcement Learning Based Scheduler for Archive Storage
With the sheer volume of data in today’s world, archive storage systems play a significant role in persisting cold data. Due to stringent cost concerns, one popular design is to organize disks into groups and periodically switch them to be powered on ...
AIS: An Active Idleness I/O Scheduler to Reduce Buffer-Exhausted Degradation of Solid-State Drives
Modern solid-state drives (SSDs) continue to boost storage density and I/O bandwidth at the cost of flash-access I/O latency, especially for writes, and hence prevalently deploy a built-in buffer to absorb incoming writes. However, when the buffer is used up, ...
TPRepair: Tree-Based Pipelined Repair in Clustered Storage Systems
- Jiahui Yang,
- Fulin Nan,
- Zhirong Shen,
- Jiwu Shu,
- Zhisheng Chen,
- Xiaoli Wang,
- Quanqing Xu,
- Chuanhui Yang,
- Dmitrii Kaplun,
- Yuhui Cai
Erasure coding is an effective technique for guaranteeing data reliability for storage systems, yet it incurs a high repair penalty with amplified repair traffic. The repair becomes more intricate in clustered storage systems with the bandwidth diversity ...
PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration
- Long Zheng,
- Bing Zhu,
- Pengcheng Yao,
- Yuhang Zhou,
- Chengao Pan,
- Wenju Zhao,
- Xiaofei Liao,
- Hai Jin,
- Jingling Xue
Graph processing is pivotal in deriving insights from complex data structures but faces performance limitations due to the irregular nature of graphs. Traditional general-purpose processors often struggle with low instruction-level parallelism and energy ...
GCNTrain+: A Versatile and Efficient Accelerator for Graph Convolutional Neural Network Training
Recently, graph convolutional networks (GCNs) have gained wide attention due to their ability to capture node relationships in graphs. A problem arises when a full-batch GCN is trained on large graph datasets, where the computational and memory ...
exZNS: Extending Zoned Namespace to Support Byte-loggable Zones
Emerging Zoned Namespace (ZNS) provides hosts with fine-grained, performance-predictable storage management. ZNS organizes the address space into zones composed of fixed-size, sequentially written, non-overwritable blocks, making it suitable for log-...
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion
Graph Neural Networks (GNNs) have achieved remarkable successes in various graph-based learning tasks, thanks to their ability to leverage advanced GPUs. However, GNNs currently face challenges arising from the concurrent use of advanced Tensor Cores (TCs)...
ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
Due to the limited GPU memory, the performance of large DNN training is constrained by the unscalable batch size. Existing research partially addresses the issue of the GPU memory limit through tensor recomputation and swapping, but overlooks the exploration ...
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel
The Sparse General Matrix-Matrix multiplication (SpGEMM) is a fundamental component for many applications, such as algebraic multigrid methods (AMG), graphic processing, and deep learning. However, the unbearable latency of computing high-dimensional, ...
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing
Neural architecture search (NAS) for edge devices is often time-consuming because of long-latency deploying and testing on edge devices. The ability to accurately predict the computation cost and memory requirement for convolutional neural networks (CNNs) ...
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search
Today, real-time search over big microblogging data requires low indexing and query latency. Online services, therefore, prefer to host inverted indices in memory. Unfortunately, as datasets grow, indices grow proportionally, and with limited DRAM scaling,...
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory
The CXL (Compute Express Link) technology is an emerging memory interface with high-level commands. Recent studies have applied CXL memory expansion to mitigate the capacity limitation of conventional DDRx memory. Unlike the prior studies to ...
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation
Binary translation enables transparent execution, analysis, and modification of the binary program, serving as a core technology that facilitates instruction set emulation, cross-platform compatibility of software, and program instrumentation. Handling ...
Characterizing and Understanding HGNN Training on GPUs
Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Prior to ...
A High Scalability Memory NoC with Shared-Inside Hierarchical-Groupings for Triplet-Based Many-Core Architecture
Innovative processor architecture designs are shifting towards Many-Core Architectures (MCAs) to meet the future demands of high-performance computing as the limits of Moore’s Law have almost been reached. Many-core processors utilize shared memory ...
An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing
- Jin Zhao,
- Yu Zhang,
- Donghao He,
- Qikun Li,
- Weihang Yin,
- Hui Yu,
- Hao Qi,
- Xiaofei Liao,
- Hai Jin,
- Haikun Liu,
- Linchen Yu,
- Zhang Zhan
Graph processing has become a central concern for many real-world applications and is well-known for its low compute-to-communication ratios and poor data locality. By integrating computing logic into memory, resistive random access memory (ReRAM) tackles ...
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer
The fast development of biomolecular structure determination has enabled the fine-grained study of objects in the micro-world, such as proteins and RNAs, benefiting the world. However, as the computational algorithms are constantly developed, the ...
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers
Pointers are an integral part of C and other programming languages. They enable substantial flexibility from the programmer’s standpoint, allowing the user fine, unmediated control over data access patterns. However, accesses done through pointers are ...
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching
- Yingshuai Dong,
- Chencheng Ye,
- Haikun Liu,
- Liting Tang,
- Xiaofei Liao,
- Hai Jin,
- Cheng Chen,
- Yanjiang Li,
- Yi Wang
Queries on linked data structures, such as trees and graphs, often suffer from frequent cache misses and significant performance loss due to dependent and random pointer-chasing memory accesses. In this paper, we propose a software-hardware co-designed ...
Conflict Management in Vector Register Files
The instruction set architecture (ISA) of vector processors operates on vectors stored in the vector register file (VRF), which must handle several concurrent accesses by functional units (FUs) through multiple ports. When the vector processor is running ...
Shining Light on the Inter-procedural Code Obfuscation: Keep Pace with Progress in Binary Diffing
- Peihua Zhang,
- Chenggang Wu,
- Hanzhi Hu,
- Lichen Jia,
- Mingfan Peng,
- Jiali Xu,
- Mengyao Xie,
- Yuanming Lai,
- Yan Kang,
- Zhe Wang
Software obfuscation techniques have lost their effectiveness due to the rapid development of binary diffing techniques, which can achieve accurate function matching and identification. In this paper, we propose a new inter-procedural code obfuscation ...
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster
Parallelizing CNN inference on heterogeneous edge clusters with data parallelism has gained popularity as a way to meet real-time requirements without sacrificing model accuracy. However, existing algorithms struggle to find optimal parallel granularity ...
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing
In recent years, deploying deep learning models on edge devices has become pervasive, driven by the increasing demand for intelligent edge computing solutions across various industries. From industrial automation to intelligent surveillance and healthcare,...
Multiple Function Merging for Code Size Reduction
Resource-constrained environments, such as embedded devices, have limited amounts of memory and storage. Practical programming languages such as C++ and Rust tend to output multiple similar functions by monomorphizing polymorphic functions. An ...
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power
Mobile devices need to respond quickly to diverse user inputs. Existing approaches often heuristically raise the CPU/GPU frequency according to empirical rules when facing burst inputs and various changes. Although doing so can be effective ...
DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration
Deep Neural Networks (DNNs) are very computationally demanding, which presents a significant barrier to their deployment, especially on resource-constrained devices. Significant work from both the machine learning and computing systems communities has ...
GraphService: Topology-aware Constructor for Large-scale Graph Applications
Graph-based services are becoming integrated into everyday life through graph applications and graph learning systems. While traditional graph processing approaches boast excellent throughput with millisecond-level processing time, the construction phase ...
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration
DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small-but-fast regions to cache frequently accessed data, thereby reducing the average latency. ...