
Domino Cache: An Energy-Efficient Data Cache for Modern Applications

Published: 01 February 2018

Abstract

Energy consumption is a major challenge when processing modern workloads in data centers. Because cloud workloads operate on large datasets, the miss rate of the L1 data cache is high. From an energy-efficiency standpoint, these misses are costly for memory instructions, because the lower levels of the memory hierarchy consume more energy per access than the L1. Moreover, unlike with traditional scientific workloads, large last-level caches are not performance effective for these workloads. This article proposes a large L1 data cache, called Domino, that reduces the number of accesses to the lower levels in order to improve energy efficiency. In designing Domino, we focus on two components, the prefetcher and the last-level cache, that occupy on-chip area but are not energy efficient, which makes their area a good candidate for enlarging the L1 data cache. Domino is a highly associative cache that extends the conventional cache by borrowing the prefetcher and last-level-cache storage budget and using it as additional ways for the data cache. In Domino, the additional ways are separate from the conventional cache ways; hence, the critical path of the first access is not altered. On a miss in the conventional part, Domino searches the added ways in a mixed parallel-sequential fashion to balance latency against energy consumption. Results on the CloudSuite benchmark suite show that read and write misses are reduced by 30%, along with a 28% reduction in snoop messages. As a result of filtering accesses to the lower levels, the overall energy consumption per access is reduced by 20% on average (38% at most).
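
The lookup order described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `DominoSet`, the `group_size` parameter, and the step and probe counters are assumptions introduced here for illustration, and replacement, coherence, and timing are left out. The sketch only shows that the conventional ways are probed first and that the added ways are then probed group by group, in parallel within a group and sequentially across groups.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DominoSet:
    """One cache set holding tags only: conventional ways plus added ways."""
    conventional: List[Optional[int]]
    extra: List[Optional[int]]
    group_size: int = 2  # hypothetical: added ways probed in parallel per step

    def lookup(self, tag: int) -> Tuple[bool, int, int]:
        """Return (hit, sequential_steps, ways_probed) for a tag lookup."""
        # Step 1: probe all conventional ways in parallel, as in a normal
        # set-associative cache, so the first access is unchanged.
        probed = len(self.conventional)
        if tag in self.conventional:
            return True, 1, probed

        # Later steps: probe the added ways one group at a time.
        # A larger group lowers latency but raises energy per step.
        steps = 1
        for start in range(0, len(self.extra), self.group_size):
            group = self.extra[start:start + self.group_size]
            steps += 1
            probed += len(group)
            if tag in group:
                return True, steps, probed

        # Miss in both parts: the request would go to the lower levels.
        return False, steps, probed


if __name__ == "__main__":
    s = DominoSet(conventional=[1, 2], extra=[3, 4, 5, 6], group_size=2)
    for t in (1, 3, 5, 9):
        print(t, s.lookup(t))
```

In this sketch, `ways_probed` stands in for the energy spent on a lookup and `sequential_steps` for its latency, so varying `group_size` shows the trade-off between the two in a purely qualitative way.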




Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 23, Issue 3
May 2018
341 pages
ISSN:1084-4309
EISSN:1557-7309
DOI:10.1145/3184476
Editor: Naehyuck Chang
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2018
Accepted: 01 December 2017
Revised: 01 October 2017
Received: 01 May 2017
Published in TODAES Volume 23, Issue 3

Author Tags

  1. Computer architecture
  2. cache
  3. cloud workloads
  4. energy
  5. prefetching

Qualifiers

  • Research-article
  • Research
  • Refereed
