No abstract available.
Message from the General Chairs
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor perfor- mance, power dissipation, and thermal management strategies. There are a number of interconnect ...
Process Variation Tolerant 3T1D-Based Cache Architectures
Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with ...
Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing
Parameter variation is detrimental to a processor's frequency and leakage power. One proposed technique to mitigate it is Fine-Grain Body Biasing (FGBB), where different parts of the processor chip are given a voltage bias that changes the speed and ...
Optimal versus Heuristic Global Code Scheduling
We present a global instruction scheduler based on inte- ger linear programming (ILP) that was implemented exper- imentally in the Intel Itanium® product compiler. It features virtually the full scale of known EPIC scheduling optimiza- tions, more than ...
Global Multi-Threaded Instruction Scheduling
Recently, the microprocessor industry has moved toward chip multiprocessor (CMP) designs as a means of utiliz- ing the increasing transistor counts in the face of physi- cal and micro-architectural limitations. Despite this move, CMPs do not directly ...
Revisiting the Sequential Programming Model for Multi-Core
Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software development due to rewriting legacy code, training of the programmer, increased ...
Penelope: The NBTI-Aware Processor
Transistors consist of lower number of atoms with every technology generation. Such atoms may be displaced due to the stress caused by high temperature, frequency and current, leading to failures. NBTI (negative bias temperature instability) is one of ...
Software-Based Online Detection of Hardware Defects Mechanisms, Architectural Support, and Evaluation
As silicon process technology scales deeper into the nanometer regime, hardware defects are becoming more common. Such de- fects are bound to hinder the correct operation of future processor systems, unless new online techniques become available to ...
Self-calibrating Online Wearout Detection
Technology scaling, characterized by decreasing feature size, thin- ning gate oxide, and non-ideal voltage scaling, will become a major hindrance to microprocessor reliability in future technology gener- ations. Physical analysis of device failure ...
Implementing Signatures for Transactional Memory
Transactional Memory (TM) systems must track the read and write sets--items read and written during a transaction--to detect conflicts among concurrent trans- actions. Several TMs use signatures, which summarize unbounded read/write sets in bounded ...
Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs
DRAMs require periodic refresh for preserving data stored in them. The refresh interval for DRAMs depends on the vendor and the de- sign technology they use. For each refresh in a DRAM row, the stored information in each cell is read out and then ...
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors
DRAM memory is a major resource shared among cores in a chip multiprocessor (CMP) system. Memory requests from different threads can interfere with each other. Existing memory access scheduling techniques try to optimize the overall data throughput ...
Impact of Cache Coherence Protocols on the Processing of Network Traffic
Sincetheintroductionofthe10GbEstandardin2002,theabilityofgeneralpurposeprocessorstoefficientlyprocessnetworktrafficwithcommonprotocolssuchasTCP/IPhasbeenrevisitedandcriticallyevaluated.However,recentcommerciallyavailableprocessorssuchasIntel®...
Flattened Butterfly Topology for On-Chip Networks
With the trend towards increasing number of cores in chip multiprocessors, the on-chip interconnect that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip interconnection net- works and ...
Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly
In today's digital world, computer security issues have become increasingly important. In particular, researchers have proposed designs for secure processors which utilize hardware-based mem- ory encryption and integrity verification to protect the ...
Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding
In deep sub-micron ICs, growing amounts of on- die memory and scaling effects make embedded memories increasingly vulnerable to reliability and yield problems. As scaling progresses, soft and hard errors in the memory system will increase and single ...
Argus: Low-Cost, Comprehensive Error Detection in Simple Cores
We have developed Argus, a novel approach for pro- viding low-cost, comprehensive error detection for simple cores. The key to Argus is that the operation of a von Neumann core consists of four fundamental tasks--control flow, dataflow, computation, and ...
Leveraging 3D Technology for Improved Reliability
Aggressive technology scaling over the years has helped improve processor performance but has caused a reduc- tion in processor reliability. Shrinking transistor sizes and lower supply voltages have increased the vulnerability of computer systems ...
Effective Optimistic-Checker Tandem Core Design through Architectural Pruning
Design complexity is rapidly becoming a limiting fac- tor in the design of modern, high-performance micro- processors. This paper introduces an optimization tech- nique to improve the efficiency of complex processors. Us- ing a new metric ( µUtilization)...
FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators
- Derek Chiou,
- Dam Sunwoo,
- Joonsoo Kim,
- Nikhil A. Patil,
- William Reinhart,
- Darrel Eric Johnson,
- Jebediah Keefe,
- Hari Angepat
This paper describes FAST, a novel simulation methodol- ogy that can produce simulators that (i) are orders of mag- nitude faster than comparable simulators, (ii) are cycle- accurate, (iii) model the entire system running unmodified applications and ...
Microarchitectural Design Space Exploration Using an Architecture-Centric Approach
The microarchitectural design space of a new processor is too large for an architect to evaluate in its entirety. Even with the use of statistical simulation, evaluation of a single configuration can take excessive time due to the need to run a set of ...
Informed Microarchitecture Design Space Exploration Using Workload Dynamics
Program runtime characteristics exhibit significant variation. As microprocessor architectures become more complex, their efficiency depends on the capability of adapting with workload dynamics. Moreover, with the approaching billion-transistor ...
Time Interpolation: So Many Metrics, So Few Registers
The performance of computer systems varies over the course of their execution. A system may perform well dur- ing some parts of its execution and poorly during others. To understand why a system behaves in this way performance analysts need to study its ...
Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications
The performance of many important commercial workloads, such as on-line transaction processing, is limited by the frequent stalls due to off-chip instruction and data accesses. These applica- tions are characterized by irregular control flow and complex ...
A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Current on-chip block-centric memory hierarchies exploit access patterns at the fine-grain scale of small blocks. Several recently proposed techniques for coherence traffic reduction and prefetching suggest that further useful patterns emerge with a ...
Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors
Snoopy cache coherence can be implemented in any physical network topology by embedding a logical unidirectional ring in the network. Control messages are forwarded using the ring, while other messages can use any path. While the resulting coherence ...