research-article

Domino Cache: An Energy-Efficient Data Cache for Modern Applications

Authors:

Mahmood Naderan-Tahan,

Hamid Sarbazi-Azad

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 23, Issue 3

Article No.: 31, Pages 1 - 23

https://doi.org/10.1145/3174848

Published: 01 February 2018

Abstract

The energy consumed in processing modern workloads is a challenge for data centers. Because cloud workloads have large datasets, the miss rate of the L1 data cache is high, and from an energy-efficiency standpoint such misses are costly for memory instructions because lower levels of the memory hierarchy consume more energy per access than the L1. Moreover, unlike for traditional scientific workloads, large last-level caches are not performance effective for these workloads. The aim of this article is to propose a large L1 data cache, called Domino, that reduces the number of accesses to lower levels in order to improve energy efficiency. In designing Domino, we focus on two components that occupy on-chip area but are not energy efficient, which makes their area a good candidate for enlarging the L1 data cache. Domino is a highly associative cache that extends the conventional cache by borrowing the prefetcher and last-level-cache storage budget and using it as additional ways for the data cache. In Domino, the additional ways are separated from the conventional cache ways; hence, the critical path of the first access is not altered. On a miss in the conventional part, Domino searches the added ways in a mixed parallel-sequential fashion to balance latency against energy consumption. Results on the CloudSuite benchmark suite show that read and write misses are reduced by 30%, along with a 28% reduction in snoop messages. The overall energy consumption per access is then reduced by 20% on average (maximum 38%) as a result of filtering accesses to the lower levels.
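The lookup order described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `DominoSetSketch`, the `group_size` parameter, the way counts, and the idea of counting "probe steps" are all assumptions made for illustration, since the abstract gives no concrete parameters. The sketch only shows one plausible reading of "parallel-sequential": extra ways are probed in small groups, in parallel within a group and sequentially across groups.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DominoSetSketch:
    """One cache set: conventional ways plus separately searched extra ways.

    All sizes and names are hypothetical; they are not taken from the article.
    """
    conventional: List[Optional[int]]  # tags held in the conventional ways
    extra: List[Optional[int]]         # tags held in the added ways
    group_size: int = 2                # assumed number of extra ways probed per step

    def lookup(self, tag: int) -> Tuple[str, int]:
        """Return (where the tag was found, number of sequential probe steps)."""
        # Step 1: probe the conventional ways first, all in parallel, so the
        # first access behaves like a lookup in an ordinary cache.
        if tag in self.conventional:
            return ("conventional", 1)

        # Step 2: on a conventional miss, probe the extra ways one group at a
        # time. Ways inside a group are checked together; groups are checked
        # one after another, trading some latency for fewer ways activated.
        steps = 1
        for start in range(0, len(self.extra), self.group_size):
            steps += 1
            if tag in self.extra[start:start + self.group_size]:
                return ("extra", steps)

        # Step 3: only after every group misses would the request be sent to
        # the lower levels of the hierarchy.
        return ("miss", steps)


cache_set = DominoSetSketch(conventional=[10, 11], extra=[20, 21, 22, 23])
print(cache_set.lookup(10))  # found in the conventional ways
print(cache_set.lookup(22))  # found in the second group of extra ways
print(cache_set.lookup(99))  # not present in this set
```

In this sketch a larger `group_size` means fewer sequential steps but more ways activated per step, which is the latency-versus-energy trade-off the abstract refers to.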

Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
    2. Parallel architectures
      1. Multicore architectures
2. Hardware

Information & Contributors

Information

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 23, Issue 3

May 2018

341 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/3184476

Editor:
Naehyuck Chang
Korea Advanced Institute of Science and Technology, Korea

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 01 February 2018

Accepted: 01 December 2017

Revised: 01 October 2017

Received: 01 May 2017

Published in TODAES Volume 23, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total citations: 0
Total downloads: 263
Downloads (last 12 months): 12
Downloads (last 6 weeks): 0

Reflects downloads up to 04 Jan 2025
