research-article

MORC: a manycore-oriented compressed cache

Authors:

David Wentzlaff

MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture

Pages 76 - 88

https://doi.org/10.1145/2830772.2830828

Published: 05 December 2015

Abstract

Cache compression has largely focused on improving single-stream application performance. In contrast, this work proposes using cache compression to improve application throughput on manycore processors, potentially at the cost of single-stream performance. The growing interest in throughput-oriented manycore architectures and the widening disparity between on-chip resources and off-chip bandwidth motivate re-evaluating costly compression as a way to conserve off-chip memory bandwidth. This work proposes MORC, a Many-core ORiented Compressed Cache architecture that compresses hundreds of cache lines together to maximize compression ratio. By looking across cache lines, MORC achieves compression ratios beyond those of schemes that compress only within a single cache line. MORC uses a novel log-based cache organization that selects cache lines filled into the cache close together in time as candidates to compress together. The proposed design compresses not only cache data but also cache tags to further save storage. Future manycore processors will likely have smaller caches and less bandwidth per core than current multicore processors. We evaluate MORC on such future manycore processors using the SPEC2006 benchmark suite. We find that MORC offers 37% more throughput than uncompressed caches and 17% more throughput than the next best cache compression scheme, while also reducing memory system energy by 17% compared to uncompressed caches.
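
The benefit of compressing across cache lines rather than within each line can be illustrated with a small experiment. The sketch below is not MORC's compression algorithm or cache organization: it assumes zlib (DEFLATE) as a stand-in compressor, synthetic 64-byte lines built from a small pool of repeated words, and hypothetical helper names (`make_lines`, `per_line_size`, `log_size`). It only shows why a stream of lines appended to one log may compress better than the same lines compressed one at a time.

```python
import random
import zlib

LINE_SIZE = 64  # bytes per cache line (an assumption for this sketch)


def make_lines(n, seed=0):
    """Build n synthetic cache lines from a small pool of 8-byte words,
    so most redundancy sits across lines rather than inside one line."""
    rng = random.Random(seed)
    pool = [rng.getrandbits(64).to_bytes(8, "little") for _ in range(32)]
    return [
        b"".join(rng.choice(pool) for _ in range(LINE_SIZE // 8))
        for _ in range(n)
    ]


def per_line_size(lines):
    """Total compressed bytes when every line is compressed on its own."""
    return sum(len(zlib.compress(line)) for line in lines)


def log_size(lines):
    """Compressed bytes when the lines are appended to one log, in fill
    order, and compressed as a single stream."""
    return len(zlib.compress(b"".join(lines)))


lines = make_lines(256)
raw = len(lines) * LINE_SIZE
print(f"raw bytes:       {raw}")
print(f"per-line ratio:  {raw / per_line_size(lines):.2f}")
print(f"log-based ratio: {raw / log_size(lines):.2f}")
```

With this synthetic data the per-line compressor sees almost no repetition inside a single 64-byte line, while the log-based stream can reference words that appeared in earlier lines. Real workloads will behave differently, and a hardware scheme would need a far cheaper compressor than DEFLATE.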

Index Terms

Computer systems organization > Architectures > Parallel architectures

Published In

MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture

December 2015

787 pages

ISBN:9781450340342

DOI:10.1145/2830772

General Chair:
Milos Prvulovic
Georgia Tech

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

IEEE Computer Society TC-uARCH
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MICRO-48

Sponsor:

SIGMICRO

MICRO-48: The 48th Annual IEEE/ACM International Symposium on Microarchitecture

December 5 - 9, 2015

Waikiki, Hawaii

Acceptance Rates

MICRO-48 Paper Acceptance Rate 61 of 283 submissions, 22%;

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Article Metrics

Total Citations: 12
Total Downloads: 781
Downloads (Last 12 months): 96
Downloads (Last 6 weeks): 12

Reflects downloads up to 21 Dec 2024

Citations

Cited By

Buyuktosunoglu A, Trilla D, Abali B, Berger D, Walters C, Lee J (2024). Enterprise-Class Cache Compression Design. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 996-1011. Online publication date: 2-Mar-2024.
https://doi.org/10.1109/HPCA57654.2024.00080
Orenes-Vera M, Tureci E, Wentzlaff D, Martonosi M (2023). Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 718-730. Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10071089
Kim J, Kang M, Hong J, Kim S (2022). Exploiting Inter-block Entropy to Enhance the Compressibility of Blocks with Diverse Data. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1100-1114. Online publication date: Apr-2022.
https://doi.org/10.1109/HPCA53966.2022.00084
Tomei M, Das S, Seyedzadeh M, Bedoukian P, Beckmann B, Kumar R, Wood D (2021). Byte-Select Compression. ACM Transactions on Architecture and Code Optimization, 18:4, pp. 1-27. Online publication date: 3-Sep-2021.
https://dl.acm.org/doi/10.1145/3462209
Tsai P, Sanchez A, Fletcher C, Sanchez D, Larus J, Ceze L, Strauss K (2020). Safecracker: Leaking Secrets through Compressed Caches. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 1125-1140. Online publication date: 9-Mar-2020.
https://dl.acm.org/doi/10.1145/3373376.3378453
Tsai P, Sanchez D, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). Compress Objects, Not Cache Lines. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 229-242. Online publication date: 4-Apr-2019.
https://dl.acm.org/doi/10.1145/3297858.3304006
Kesavan G, Kumar A (2019). Comparative Study on Data Compression Techniques in Cache to Promote Performance. 2019 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS), pp. 1-6. Online publication date: Apr-2019.
https://doi.org/10.1109/INCOS45849.2019.8951324
Nguyen T, Fuchs A, Wentzlaff D, Oskin M, Inoue K (2018). CABLE. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 312-325. Online publication date: 20-Oct-2018.
https://dl.acm.org/doi/10.1109/MICRO.2018.00033
Kanakagiri R, Panda B, Mutyam M (2017). MBZip. ACM Transactions on Architecture and Code Optimization, 14:4, pp. 1-29. Online publication date: 5-Dec-2017.
https://dl.acm.org/doi/10.1145/3151033
Young V, Nair P, Qureshi M (2017). DICE. ACM SIGARCH Computer Architecture News, 45:2, pp. 627-638. Online publication date: 24-Jun-2017.
https://dl.acm.org/doi/10.1145/3140659.3080243
