Coherence decoupling: making use of incoherence

Published: 07 October 2004

Abstract

This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness.

The performance benefits of coherence decoupling are evaluated using a full-system simulator and a mix of commercial and scientific benchmarks. Our results show that 40% to 90% of all coherence misses can be speculated correctly, and therefore their latencies can be partially or fully hidden. This capability results in performance improvements ranging from 3% to over 16% in most cases where the latencies of coherence misses have an effect on performance.
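The split described in the abstract can be illustrated with a small software model. This is a minimal sketch and not the paper's hardware design: the names `CacheLine`, `scl_load`, `coherent_fill`, `verify`, and `decoupled_load` are hypothetical, the two protocols run one after the other instead of in parallel, and a misspeculation is modeled simply as returning the coherent value with a flag instead of squashing and replaying instructions.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"


@dataclass
class CacheLine:
    state: State
    words: list  # assumption: stale data is kept after invalidation


@dataclass
class SpeculativeLoad:
    value: int         # value handed to the processor right away
    speculative: bool  # True if the value came from an invalid line


def scl_load(line: CacheLine, offset: int) -> SpeculativeLoad:
    """SCL side: return the requested word even if the line is invalid."""
    return SpeculativeLoad(line.words[offset], line.state is State.INVALID)


def coherent_fill(line: CacheLine, coherent_words: list) -> None:
    """Backing protocol side: install the coherent copy with read permission."""
    line.words = list(coherent_words)
    line.state = State.SHARED


def verify(load: SpeculativeLoad, line: CacheLine, offset: int) -> bool:
    """Compare the speculated word with the coherent word."""
    return load.value == line.words[offset]


def decoupled_load(line: CacheLine, offset: int, coherent_words: list):
    """Return (value, speculation_was_correct) for one load."""
    load = scl_load(line, offset)
    if not load.speculative:
        return load.value, True  # ordinary hit, nothing to verify
    coherent_fill(line, coherent_words)
    ok = verify(load, line, offset)
    # On a mismatch, real hardware would replay dependent instructions;
    # here we just hand back the coherent value.
    return (load.value if ok else line.words[offset]), ok


# False sharing: a remote writer changed word 3, but this load reads word 0.
false_shared = CacheLine(State.INVALID, [1, 2, 3, 4])
print(decoupled_load(false_shared, 0, [1, 2, 3, 99]))  # (1, True)

# True sharing: the load reads the word that was actually modified.
true_shared = CacheLine(State.INVALID, [1, 2, 3, 4])
print(decoupled_load(true_shared, 3, [1, 2, 3, 99]))   # (99, False)
```

In the false-sharing case the stale word matches the coherent one, so the speculation verifies and the miss latency could in principle be hidden. In the true-sharing case the check fails unless something like the speculative update writes mentioned in the abstract has refreshed the invalid line beforehand.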

Published In

ACM SIGOPS Operating Systems Review, Volume 38, Issue 5
ASPLOS '04
December 2004
283 pages
ISSN: 0163-5980
DOI: 10.1145/1037949
  • ACM Conferences
    ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
    October 2004
    296 pages
    ISBN: 1581138040
    DOI: 10.1145/1024393
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 October 2004
Published in SIGOPS Volume 38, Issue 5

Author Tags

  1. coherence decoupling
  2. coherence misses
  3. false sharing
  4. speculative cache lookup

Qualifiers

  • Article

Article Metrics

  • Downloads (Last 12 months): 45
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 15 Jan 2025

Cited By

  • (2015) Hardware Approaches to Transactional Memory in Chip Multiprocessors. Handbook on Data Centers, pages 805-835. DOI: 10.1007/978-1-4939-2092-1_27. Online publication date: 17-Mar-2015
  • (2022) Advanced Topics in Coherence. A Primer on Memory Consistency and Cache Coherence, pages 191-209. DOI: 10.1007/978-3-031-01764-3_9. Online publication date: 28-Mar-2022
  • (2022) Advanced Topics in Coherence. A Primer on Memory Consistency and Cache Coherence, pages 177-195. DOI: 10.1007/978-3-031-01733-9_9. Online publication date: 18-Oct-2022
  • (2021) Ghostwriter: A Cache Coherence Protocol for Error-Tolerant Applications. 50th International Conference on Parallel Processing Workshop, pages 1-10. DOI: 10.1145/3458744.3474045. Online publication date: 9-Aug-2021
  • (2020) A Primer on Memory Consistency and Cache Coherence, Second Edition. Synthesis Lectures on Computer Architecture, 15:1, pages 1-294. DOI: 10.2200/S00962ED2V01Y201910CAC049. Online publication date: 4-Feb-2020
  • (2020) SB-Fetch. Proceedings of the 34th ACM International Conference on Supercomputing, pages 1-12. DOI: 10.1145/3392717.3392735. Online publication date: 29-Jun-2020
  • (2017) TC-Release++: An Efficient Timestamp-Based Coherence Protocol for Many-Core Architectures. IEEE Transactions on Parallel and Distributed Systems, 28:11, pages 3313-3327. DOI: 10.1109/TPDS.2017.2719679. Online publication date: 1-Nov-2017
  • (2016) Selective GPU caches to eliminate CPU-GPU HW cache coherence. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 494-506. DOI: 10.1109/HPCA.2016.7446089. Online publication date: Mar-2016
  • (2016) PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 285-296. DOI: 10.1109/HPCA.2016.7446072. Online publication date: Mar-2016
  • (2014) A NUCA substrate for flexible CMP cache sharing. ACM International Conference on Supercomputing 25th Anniversary Volume, pages 380-389. DOI: 10.1145/2591635.2667186. Online publication date: 10-Jun-2014
