Issue Downloads
Cooperative Multi-Agent Reinforcement Learning-Based Co-optimization of Cores, Caches, and On-chip Network
Modern multi-core systems provide huge computational capabilities, which can be used to run multiple processes concurrently. To achieve the best possible performance within limited power budgets, the various system resources need to be allocated ...
Bringing Parallel Patterns Out of the Corner: The P3 ARSEC Benchmark Suite
High-level parallel programming is an active research topic aimed at promoting parallel programming methodologies that provide the programmer with high-level abstractions to develop complex parallel software with reduced time to solution. Pattern-based ...
Cache Exclusivity and Sharing: Theory and Optimization
A problem on multicore systems is cache sharing, where the cache occupancy of a program depends on the cache usage of peer programs. Exclusive cache hierarchy as used on AMD processors is an effective solution to allow processor cores to have a large ...
Energy-Efficient Compilation of Irregular Task-Parallel Loops
Energy-efficient compilation is an important problem for multi-core systems. In this context, irregular programs with task-parallel loops present interesting challenges: the threads with lesser work-loads (non-critical-threads) wait at the join-points ...
Compiler-Assisted Loop Hardening Against Fault Attacks
Secure elements widely used in smartphones, digital consumer electronics, and payment systems are subject to fault attacks. To thwart such attacks, software protections are manually inserted requiring experts and time. The explosion of the Internet of ...
A Transactional Correctness Tool for Abstract Data Types
Transactional memory simplifies multiprocessor programming by providing the guarantee that a sequential block of code in the form of a transaction will exhibit atomicity and isolation. Transactional data structures offer the same guarantee to concurrent ...
Power Consumption Models for Multi-Tenant Server Infrastructures
- Matteo Ferroni,
- Andrea Corna,
- Andrea Damiani,
- Rolando Brondolin,
- Juan A. Colmenares,
- Steven Hofmeyr,
- John D. Kubiatowicz,
- Marco D. Santambrogio
Multi-tenant virtualized infrastructures allow cloud providers to minimize costs through workload consolidation. One of the largest costs is power consumption, which is challenging to understand in heterogeneous environments. We propose a power modeling ...
CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution Near In-Order Energy with Near Out-of-Order Performance
We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose processor designed to achieve close to In-Order (InO) processor energy while maintaining Out-of-Order (OoO) performance. CG-OoO is an energy-performance-proportional architecture. Block-...
ECS: Error-Correcting Strings for Lifetime Improvements in Nonvolatile Memories
Emerging nonvolatile memories (NVMs) suffer from low write endurance, resulting in early cell failures (hard errors), which reduce memory lifetime. It was recognized early on that conventional error-correcting codes (ECCs), which are designed for soft ...
SLOOP: QoS-Supervised Loop Execution to Reduce Energy on Heterogeneous Architectures
Most systems allocate computational resources to each executing task without any actual knowledge of the application’s Quality-of-Service (QoS) requirements. Such best-effort policies lead to overprovisioning of the resources and increase energy loss. ...
MBZip: Multiblock Data Compression
Compression techniques at the last-level cache and the DRAM play an important role in improving system performance by increasing their effective capacities. A compressed block in DRAM also reduces the transfer time over the memory bus to the caches, ...
Fuse: Accurate Multiplexing of Hardware Performance Counters Across Executions
Collecting hardware event counts is essential to understanding program execution behavior. Contemporary systems offer few Performance Monitoring Counters (PMCs), thus only a small fraction of hardware events can be monitored simultaneously. We present ...
Could Compression Be of General Use? Evaluating Memory Compression across Domains
Recent proposals present compression as a cost-effective technique to increase cache and memory capacity and bandwidth. While these proposals show potentials of compression, there are several open questions to adopt these proposals in real systems ...
Improving the Efficiency of GPGPU Work-Queue Through Data Awareness
The architecture and programming model of current GPGPUs are best suited for applications that are dominated by structured control and data flows across large regular datasets. Parallel workloads with irregular control and data structures cannot easily ...
A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs
Reducing the precision of floating-point values can improve performance and/or reduce energy expenditure in computer graphics, among other, applications. However, reducing the precision level of floating-point values in a controlled fashion needs ...
Generating Fine-Grain Multithreaded Applications Using a Multigrain Approach
The recent evolution in hardware landscape, aimed at producing high-performance computing systems capable of reaching extreme-scale performance, has reignited the interest in fine-grain multithreading, particularly at the intranode level. Indeed, ...
CAIRO: A Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-In-Memory
Three-dimensional (3D)-stacking technology and the memory-wall problem have popularized processing-in-memory (PIM) concepts again, which offers the benefits of bandwidth and energy savings by offloading computations to functional units inside the ...
Triple Engine Processor (TEP): A Heterogeneous Near-Memory Processor for Diverse Kernel Operations
The advent of 3D memory stacking technology, which integrates a logic layer and stacked memories, is expected to be one of the most promising memory technologies to mitigate the memory wall problem by leveraging the concept of near-memory processing (...
ReDirect: Reconfigurable Directories for Multicore Architectures
As we enter the dark silicon era, architects should not envision designs in which every transistor remains turned on permanently but rather ones in which portions of the chip are judiciously turned on/off depending on the characteristics of a workload. ...
HAShCache: Heterogeneity-Aware Shared DRAMCache for Integrated Heterogeneous Systems
Integrated Heterogeneous System (IHS) processors pack throughput-oriented General-Purpose Graphics Pprocessing Units (GPGPUs) alongside latency-oriented Central Processing Units (CPUs) on the same die sharing certain resources, e.g., shared last-level ...
Optimizing Affine Control With Semantic Factorizations
Hardware accelerators generated by polyhedral synthesis techniques make extensive use of affine expressions (affine functions and convex polyhedra) in control and steering logic. Since the control is pipelined, these affine objects must be evaluated at ...
Data-Driven Concurrency for High Performance Computing
In this work, we utilize dynamic dataflow/data-driven techniques to improve the performance of high performance computing (HPC) systems. The proposed techniques are implemented and evaluated through an efficient, portable, and robust programming ...
SCALO: Scalability-Aware Parallelism Orchestration for Multi-Threaded Workloads
- Giorgis Georgakoudis,
- Hans Vandierendonck,
- Peter Thoman,
- Bronis R. De Supinski,
- Thomas Fahringer,
- Dimitrios S. Nikolopoulos
Shared memory machines continue to increase in scale by adding more parallelism through additional cores and complex memory hierarchies. Often, executing multiple applications concurrently, dividing among them hardware threads, provides greater ...
Optimization of Triangular and Banded Matrix Operations Using 2d-Packed Layouts
Over the past few years, multicore systems have become increasingly powerful and thereby very useful in high-performance computing. However, many applications, such as some linear algebra algorithms, still cannot take full advantage of these systems. ...