research-article

Protozoa: adaptive granularity cache coherence

Authors:

Arrvindh Shriraman,

Snehasish Kumar,

Sandhya Dwarkadas

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

Pages 547 - 558

https://doi.org/10.1145/2485922.2485969

Published: 23 June 2013

Abstract

State-of-the-art multiprocessor cache hierarchies propagate the use of a fixed granularity in the cache organization to the design of the coherence protocol. Unfortunately, the fixed granularity, generally chosen to match average spatial locality across a range of applications, not only results in wasted bandwidth to serve an individual thread's access needs, but also results in unnecessary coherence traffic for shared data. The additional bandwidth has a direct impact on both the scalability of parallel applications and overall energy consumption.

In this paper, we present the design of Protozoa, a family of coherence protocols that eliminate unnecessary coherence traffic and match data movement to an application's spatial locality. Protozoa continues to maintain metadata at a conventional fixed cache line granularity while 1) supporting variable read and write caching granularity so that data transfer matches application spatial granularity, 2) invalidating at the granularity of the write miss request so that readers of disjoint data can co-exist with writers, and 3) potentially supporting multiple non-overlapping writers within the cache line, thereby avoiding the traditional ping-pong effect of both read-write and write-write false sharing. Our evaluation demonstrates that Protozoa consistently reduces miss rate and improves the fraction of transmitted data that is actually utilized.
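To make point 2 concrete, the following is a minimal illustrative sketch and not the paper's protocol. It assumes a 64-byte line split into eight words, and the names `WordRange`, `sharers_to_invalidate`, and `line_granularity_invalidations` are hypothetical. It compares which sharers would lose their copy when a write is invalidated at the granularity of the requested word range versus at whole-line granularity. It does not model protocol states, messages, or where the overlap check is performed.

```python
from dataclasses import dataclass

# Assumption for illustration: a 64-byte line holding eight 8-byte words.
WORDS_PER_LINE = 8


@dataclass(frozen=True)
class WordRange:
    """Inclusive range of word indices inside one cache line (hypothetical type)."""
    start: int
    end: int

    def overlaps(self, other: "WordRange") -> bool:
        # Two inclusive ranges overlap when each starts before the other ends.
        return self.start <= other.end and other.start <= self.end


def sharers_to_invalidate(cached, write):
    """Cores holding at least one word that the write request touches.

    cached maps a core id to the list of WordRange objects it holds
    for this line; write is the WordRange of the write miss request.
    """
    return sorted(
        core for core, ranges in cached.items()
        if any(r.overlaps(write) for r in ranges)
    )


def line_granularity_invalidations(cached):
    """Fixed-granularity baseline: every sharer of the line is invalidated."""
    return sorted(core for core, ranges in cached.items() if ranges)


if __name__ == "__main__":
    # Core 0 reads words 0-1, core 1 reads words 4-7, core 2 reads words 2-5.
    cached = {
        0: [WordRange(0, 1)],
        1: [WordRange(4, 7)],
        2: [WordRange(2, 5)],
    }
    write = WordRange(2, 3)  # another core writes words 2-3
    print("range-granularity:", sharers_to_invalidate(cached, write))
    print("line-granularity: ", line_granularity_invalidations(cached))
```

In this example only core 2 overlaps the written words, so cores 0 and 1 keep their copies, whereas the fixed-granularity baseline would invalidate all three sharers. The same overlap test could also be used to decide whether two writers touch disjoint words (point 3), again only as an illustration.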

Index Terms
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published In

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

June 2013

686 pages

ISBN:9781450320795

DOI:10.1145/2485922

General Chair:
Avi Mendelson
Technion

ACM SIGARCH Computer Architecture News Volume 41, Issue 3
ISCA '13
June 2013
666 pages
ISSN:0163-5964
DOI:10.1145/2508148

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

IEEE CS

In-Cooperation

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Division of Computer and Network Systems
Natural Sciences and Engineering Research Council of Canada
Division of Computing and Communication Foundations
Canadian Microelectronics Corporation
MARCO Gigascale Research Center

Conference

ISCA'13

ISCA'13: The 40th Annual International Symposium on Computer Architecture

June 23 - 27, 2013

Tel-Aviv, Israel

Acceptance Rates

ISCA '13 Paper Acceptance Rate 56 of 288 submissions, 19%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%
