- Research article, July 2020
An imitation learning approach for cache replacement
ICML'20: Proceedings of the 37th International Conference on Machine Learning, Article No. 579, pages 6237–6247
Program execution speed critically depends on increasing cache hits, as cache hits are orders of magnitude faster than misses. To increase cache hits, we focus on the problem of cache replacement: choosing which cache line to evict upon inserting a new ...
- Research article, April 2019
Software-Defined Far Memory in Warehouse-Scale Computers
- Andres Lagar-Cavilla,
- Junwhan Ahn,
- Suleiman Souhlal,
- Neha Agarwal,
- Radoslaw Burny,
- Shakeel Butt,
- Jichuan Chang,
- Ashwin Chaugule,
- Nan Deng,
- Junaid Shahid,
- Greg Thelen,
- Kamil Adam Yurtsever,
- Yu Zhao,
- Parthasarathy Ranganathan
ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 317–330. https://doi.org/10.1145/3297858.3304053
Increasing memory demand and slowdown in technology scaling pose important challenges to total cost of ownership (TCO) of warehouse-scale computers (WSCs). One promising idea to reduce the memory TCO is to add a cheaper, but slower, "far memory" tier ...
- Research article, September 2018
Nonvolatile Write Buffer-Based Journaling Bypass for Storage Write Reduction in Mobile Devices
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCADICS), Volume 37, Issue 9, pages 1747–1759. https://doi.org/10.1109/TCAD.2017.2774192
In mobile systems, such as smartphones, most storage writes are incurred by the SQLite database (DB) system. These writes consist of two parts: writes to original data (e.g., SQLite DB file) and journaling-induced writes. In this paper, we first ...
- Research article, March 2018
Benzene: An Energy-Efficient Distributed Hybrid Cache Architecture for Manycore Systems
ACM Transactions on Architecture and Code Optimization (TACO), Volume 15, Issue 1, Article No. 10, pages 1–23. https://doi.org/10.1145/3177963
This article proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache for manycore systems running multiple applications. It is based on the observation that a naïve application of hybrid cache techniques to distributed caches in a ...
- Research article, June 2017
Making DRAM Stronger Against Row Hammering
DAC '17: Proceedings of the 54th Annual Design Automation Conference 2017, Article No. 55, pages 1–6. https://doi.org/10.1145/3061639.3062281
Modern DRAM suffers from a new problem called row hammering. The problem is expected to become more severe in future DRAMs mostly due to increased inter-row coupling at advanced technology. In order to address this problem, we present a ...
- Research article, March 2017
A novel zero weight/activation-aware hardware architecture of convolutional neural network
It is imperative to accelerate convolutional neural networks (CNNs) due to their ever-widening application areas, from servers and mobile devices to IoT devices. Based on the fact that CNNs can be characterized by a significant amount of zero values in both kernel ...
- Research article, October 2016
AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy
ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 4, Article No. 34, pages 1–24. https://doi.org/10.1145/2994149
In this article, we propose Aggregation-in-Memory (AIM), a new processing-in-memory system designed for energy efficiency and near-term adoption. In order to efficiently perform aggregation, we implement simple aggregation operations in main memory and ...
- Research article, October 2016
Zero and data reuse-aware fast convolution for deep neural networks on GPU
CODES '16: Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, Article No. 33, pages 1–10. https://doi.org/10.1145/2968456.2968476
Convolution operations dominate the total execution time of deep convolutional neural networks (CNNs). In this paper, we aim at enhancing the performance of the state-of-the-art convolution algorithm (called Winograd convolution) on the GPU. Our work is ...
- Research article, April 2016
Differential Write-Conscious Software Design on Phase-Change Memory: An SQLite Case Study
ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 21, Issue 3, Article No. 47, pages 1–25. https://doi.org/10.1145/2842613
Phase-change memory (PCM) has several benefits including low cost, non-volatility, byte-addressability, etc., and limitations such as write endurance. There have been several hardware approaches to exploit the benefits while minimizing the negative ...
- Article, March 2016
Prediction Hybrid Cache: An Energy-Efficient STT-RAM Cache Architecture
IEEE Transactions on Computers (ITCO), Volume 65, Issue 3, pages 940–951. https://doi.org/10.1109/TC.2015.2435772
Spin-transfer torque RAM (STT-RAM) has emerged as an energy-efficient and high-density alternative to SRAM for large on-chip caches. However, its high write energy has been considered a serious drawback. Hybrid caches mitigate this problem by ...
- Research article, February 2016
Low-Power Hybrid Memory Cubes With Link Power Management and Two-Level Prefetching
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (ITVL), Volume 24, Issue 2, pages 453–464. https://doi.org/10.1109/TVLSI.2015.2420315
The hybrid memory cube (HMC) is a 3-D-stacked DRAM architecture designed for substantially improved memory bandwidth. In particular, its I/O interface achieves up to 320 GB/s of external bandwidth through high-speed serial links. However, it comes at the ...
- Research article, October 2015
A tiny-capacitor-backed non-volatile buffer to reduce storage writes in smartphones
Mobile storage writes are often dominated by writes to SQLite database files. Our characterization shows that they mostly consist of frequent overwrites with small new data (which we call small writes) and relatively infrequent writes with large data ...
- Research article, June 2015
A scalable processing-in-memory accelerator for parallel graph processing
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, pages 105–117. https://doi.org/10.1145/2749469.2750386
The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad ...
Also published in: ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S
- Research article, June 2015
PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, pages 336–348. https://doi.org/10.1145/2749469.2750385
Processing-in-memory (PIM) is rapidly rising as a viable solution for the memory wall crisis, rebounding from its unsuccessful attempts in the 1990s due to practicality concerns, which are alleviated with recent advances in 3D stacking technologies. However, ...
Also published in: ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S
- Research article, March 2015
Memory fast-forward: a low cost special function unit to enhance energy efficiency in GPU for big data processing
DATE '15: Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, pages 1341–1346
Big data processing, e.g., graph computation and MapReduce, is characterized by massive parallelism in computation and a large amount of fine-grained random memory accesses often with structural localities due to graph-like data dependency. Recently, ...
- Research article, June 2014
Dynamic Power Management of Off-Chip Links for Hybrid Memory Cubes
DAC '14: Proceedings of the 51st Annual Design Automation Conference, pages 1–6. https://doi.org/10.1145/2593069.2593128
The Hybrid Memory Cube (HMC) is a 3D-stacked DRAM architecture designed for substantially improved memory bandwidth. In particular, its I/O interface achieves up to 320 GB/s of external bandwidth through high-speed serial links. However, it comes at a ...
- Research article, September 2013
Write intensity prediction for energy-efficient non-volatile caches
ISLPED '13: Proceedings of the 2013 International Symposium on Low Power Electronics and Design, pages 223–228
This paper presents a novel concept called write intensity prediction for energy-efficient non-volatile caches as well as the architecture that implements the concept. The key idea is to correlate write intensity of cache blocks with addresses of memory ...
- Research article, May 2013
Power-Efficient Predication Techniques for Acceleration of Control Flow Execution on CGRA
ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Issue 2, Article No. 8, pages 1–25. https://doi.org/10.1145/2459316.2459319
Coarse-grained reconfigurable architecture typically has an array of processing elements which are controlled by a centralized unit. This makes it difficult to execute programs having control divergence among PEs without predication. However, ...
- Research article, January 2013
Isomorphism-Aware Identification of Custom Instructions With I/O Serialization
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCADICS), Volume 32, Issue 1, pages 34–46. https://doi.org/10.1109/TCAD.2012.2214033
Extensible processors have been widely used to achieve the conflicting demands for performance improvement, low power consumption, and flexibility. As extensible processors have become more popular, several algorithms have been proposed for ...
- Research article, October 2011
An efficient algorithm for isomorphism-aware custom instruction identification for extensible processors
CODES+ISSS '11: Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, pages 345–354. https://doi.org/10.1145/2039370.2039424
Extensible processors have been widely used to achieve the conflicting demands for performance improvement, low power consumption, and flexibility. As extensible processors have become more popular, several algorithms have been proposed for ...