TACO: Vol 12, No 4

Volume 12, Issue 4, January 2016

Editor:

Koen De Bosschere
Ghent University

Publisher:

Association for Computing Machinery
New York, NY, United States

ISSN: 1544-3566

EISSN: 1544-3973

research-article

Open Access

Reuse Distance-Based Probabilistic Cache Replacement

Article No.: 33, Pages 1–22, https://doi.org/10.1145/2818374

This article proposes Probabilistic Replacement Policy (PRP), a novel replacement policy that evicts the line with minimum estimated hit probability under optimal replacement instead of the line with maximum expected reuse distance. The latter is ...

research-article

Open Access

MINIME-GPU: Multicore Benchmark Synthesizer for GPUs

Article No.: 34, Pages 1–25, https://doi.org/10.1145/2818693

We introduce MINIME-GPU, a novel automated benchmark synthesis framework for graphics processing units (GPUs) that serves to speed up architectural simulation of modern GPU architectures. Our framework captures important characteristics of original GPU ...

research-article

Open Access

Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology

Article No.: 35, Pages 1–27, https://doi.org/10.1145/2822893

Ever-growing performance of supercomputers nowadays brings demanding requirements of energy efficiency and resilience, due to rapidly expanding size and duration in use of the large-scale computing systems. Many application/architecture-dependent ...

research-article

Open Access

Tumbler: An Effective Load-Balancing Technique for Multi-CPU Multicore Systems

Article No.: 36, Pages 1–24, https://doi.org/10.1145/2827698

Schedulers used by modern OSs (e.g., Oracle Solaris 11™ and GNU/Linux) balance load by balancing the number of threads in run queues of different cores. While this approach is effective for a single CPU multicore system, we show that it can lead to a ...

research-article

Open Access

Four Metrics to Evaluate Heterogeneous Multicores

Article No.: 37, Pages 1–25, https://doi.org/10.1145/2829950

Semiconductor device scaling has made single-ISA heterogeneous processors a reality. Heterogeneous processors contain a number of different CPU cores that all implement the same Instruction Set Architecture (ISA). This enables greater flexibility and ...

research-article

Open Access

SPCM: The Striped Phase Change Memory

Article No.: 38, Pages 1–25, https://doi.org/10.1145/2829951

Phase Change Memory (PCM) devices are one of the known promising technologies to take the place of DRAM devices with the aim of overcoming the obstacles of reducing feature size and stopping ever growing amounts of leakage power. In exchange for ...

research-article

Open Access

Two-Level Hybrid Sampled Simulation of Multithreaded Applications

Article No.: 39, Pages 1–25, https://doi.org/10.1145/2818353

Sampled microarchitectural simulation of single-threaded applications has been a mature technology for over a decade now. Sampling multithreaded applications, on the other hand, is much more complicated. Not until very recently have researchers proposed ...

research-article

Open Access

Integrated Mapping and Synthesis Techniques for Network-on-Chip Topologies with Express Channels

Article No.: 40, Pages 1–26, https://doi.org/10.1145/2831233

The addition of express channels to a traditional mesh network-on-chip (NoC) has emerged as a viable solution to solve the problem of high latency. In this article, we address the problem of integrated mapping and synthesis for express channel-based ...

research-article

Open Access

PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite

Article No.: 41, Pages 1–22, https://doi.org/10.1145/2829952

In this work, we show how parallel applications can be implemented efficiently using task parallelism. We also evaluate the benefits of such a parallel paradigm with respect to other approaches. We use the PARSEC benchmark suite as our test bed, which ...

research-article

Open Access

A Framework for Application-Guided Task Management on Heterogeneous Embedded Systems

Article No.: 42, Pages 1–25, https://doi.org/10.1145/2835177

In this article, we propose a general framework for fine-grain application-aware task management in heterogeneous embedded platforms, which allows integration of different mechanisms for an efficient resource utilization, frequency scaling, and task ...

research-article

Open Access

Managing Mismatches in Voltage Stacking with CoreUnfolding

Article No.: 43, Pages 1–26, https://doi.org/10.1145/2835178

Five percent to 25% of power could be wasted before it is delivered to the computational resources on a die, due to inefficiencies of voltage regulators and resistive loss. The power delivery could benefit if, at the same power, the delivered voltage ...

research-article

Open Access

FaultSim: A Fast, Configurable Memory-Reliability Simulator for Conventional and 3D-Stacked Systems

Article No.: 44, Pages 1–24, https://doi.org/10.1145/2831234

As memory systems scale, maintaining their Reliability, Availability, and Serviceability (RAS) is becoming more complex. To make matters worse, recent studies of DRAM failures in data centers and supercomputer environments have highlighted that large-...

research-article

Open Access

Adaptive Correction of Sampling Bias in Dynamic Call Graphs

Byeongcheol Lee

Article No.: 45, Pages 1–24, https://doi.org/10.1145/2840806

This article introduces a practical low-overhead adaptive technique of correcting sampling bias in profiling dynamic call graphs. Timer-based sampling keeps the overhead low but sampling bias lowers the accuracy when either observable call events or ...

research-article

Open Access

Fence Placement for Legacy Data-Race-Free Programs via Synchronization Read Detection

Article No.: 46, Pages 1–23, https://doi.org/10.1145/2835179

Shared-memory programmers traditionally assumed Sequential Consistency (SC), but modern systems have relaxed memory consistency. Here, the trend in languages is toward Data-Race-Free (DRF) models, where, assuming annotated synchronizations and the ...

research-article

Open Access

Optimizing Control Transfer and Memory Virtualization in Full System Emulators

Article No.: 47, Pages 1–24, https://doi.org/10.1145/2837027

Full system emulators provide virtual platforms for several important applications, such as kernel and system software development, co-verification with cycle accurate CPU simulators, or application development for hardware still in development. Full ...

research-article

Open Access

The Polyhedral Model of Nonlinear Loops

Article No.: 48, Pages 1–27, https://doi.org/10.1145/2838734

Runtime code optimization and speculative execution are becoming increasingly prominent to leverage performance in the current multi- and many-core era. However, a wider and more efficient use of such techniques is mainly hampered by the prohibitive ...

research-article

Open Access

Citadel: Efficiently Protecting Stacked Memory from TSV and Large Granularity Failures

Article No.: 49, Pages 1–24, https://doi.org/10.1145/2840807

Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. To make matters worse, stacked memory designs are ...

research-article

Open Access

Automatic Vectorization of Interleaved Data Revisited

Article No.: 50, Pages 1–25, https://doi.org/10.1145/2838735

Automatically exploiting short vector instruction sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires ...

research-article

Open Access

A Filtering Mechanism to Reduce Network Bandwidth Utilization of Transaction Execution

Article No.: 51, Pages 1–26, https://doi.org/10.1145/2837028

Hardware Transactional Memory (HTM) relies heavily on the on-chip network for intertransaction communication. However, the network bandwidth utilization of transactions has been largely neglected in HTM designs. In this work, we propose a cost model to ...

research-article

Open Access

Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study

Article No.: 52, Pages 1–26, https://doi.org/10.1145/2842686

Due to its rich memory model, the partitioned global address space (PGAS) parallel programming model strikes a balance between locality-awareness and the ease of use of the global address space model. Although locality-awareness can lead to high ...

research-article

Open Access

On How to Accelerate Iterative Stencil Loops: A Scalable Streaming-Based Approach

Article No.: 53, Pages 1–26, https://doi.org/10.1145/2842615

In high-performance systems, stencil computations play a crucial role as they appear in a variety of different fields of application, ranging from partial differential equation solving, to computer simulation of particles’ interaction, to image ...

research-article

Open Access

Falcon: A Graph Manipulation Language for Heterogeneous Systems

Article No.: 54, Pages 1–27, https://doi.org/10.1145/2842618

Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy—even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of ...

research-article

Open Access

FluidCheck: A Redundant Threading-Based Approach for Reliable Execution in Manycore Processors

Article No.: 55, Pages 1–26, https://doi.org/10.1145/2842620

Soft errors have become a serious cause of concern with reducing feature sizes. The ability to accommodate complex, Simultaneous Multithreading (SMT) cores on a single chip presents a unique opportunity to achieve reliable execution, safe from soft ...

research-article

Open Access

Rethinking Memory Permissions for Protection Against Cross-Layer Attacks

Article No.: 56, Pages 1–27, https://doi.org/10.1145/2842621

The inclusive permissions structure (e.g., the Intel ring model) of modern commodity CPUs provides privileged system software layers with arbitrary permissions to access and modify client processes, allowing them to manage these clients and the system ...

research-article

Open Access

Resistive GP-SIMD Processing-In-Memory

Article No.: 57, Pages 1–22, https://doi.org/10.1145/2845084

GP-SIMD, a novel hybrid general-purpose SIMD architecture, addresses the challenge of data synchronization by in-memory computing, through combining data storage and massive parallel processing. In this article, we explore a resistive implementation of ...

research-article

Open Access

Iteration Interleaving-Based SIMD Lane Partition

Article No.: 58, Pages 1–18, https://doi.org/10.1145/2847253

The efficacy of single instruction, multiple data (SIMD) architectures is limited when handling divergent control flows. This circumstance results in SIMD fragments using only a subset of the available lanes. We propose an iteration interleaving-based ...

research-article

Open Access

Integer Linear Programming-Based Scheduling for Transport Triggered Architectures

Article No.: 59, Pages 1–22, https://doi.org/10.1145/2845082

Static multi-issue machines, such as traditional Very Long Instruction Word (VLIW) architectures, move complexity from the hardware to the compiler. This is motivated by the ability to support high degrees of instruction-level parallelism without ...

research-article

Open Access

Sensible Energy Accounting with Abstract Metering for Multicore Systems

Article No.: 60, Pages 1–26, https://doi.org/10.1145/2842616

Chip multicore processors (CMPs) are the preferred processing platform across different domains such as data centers, real-time systems, and mobile devices. In all those domains, energy is arguably the most expensive resource in a computing system. ...

research-article

Open Access

Symmetry-Agnostic Coordinated Management of the Memory Hierarchy in Multicore Systems

Article No.: 61, Pages 1–26, https://doi.org/10.1145/2847254

In a multicore system, many applications share the last-level cache (LLC) and memory bandwidth. These resources need to be carefully managed in a coordinated way to maximize performance. DRAM is still the technology of choice in most systems. However, ...

research-article

Open Access

RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads

Article No.: 62, Pages 1–26, https://doi.org/10.1145/2836168

This article aims to tackle two fundamental memory bottlenecks: limited off-chip bandwidth (bandwidth wall) and long access latency (memory wall). To achieve this goal, our approach exploits the inherent error resilience of a wide range of applications. ...

ACM Transactions on Architecture and Code Optimization