
research-article

An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration

Authors:

Manuel E. Acacio,

Jose M. Garcia,

Jose Duato

IEEE Transactions on Parallel and Distributed Systems, Volume 15, Issue 8

Pages 755 - 768

https://doi.org/10.1109/TPDS.2004.27

Published: 01 August 2004

Abstract

Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit this scale of integration and present a novel node architecture aimed at reducing the long L2 miss latencies and the directory memory overhead that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture and adds a small shared data cache to each node of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by keeping the second- and third-level directories outside the processor chip and by using compressed data structures. We also present a taxonomy of L2 misses, classified according to the actions the directory performs to satisfy them. Using execution-driven simulations, we show that the proposed node architecture obtains significant latency reductions, which translate into reductions of more than 30 percent in application execution time in several cases.
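The lookup order implied by the abstract can be illustrated with a small sketch. It is not the authors' design: the abstract states only that the first-level directory is small and on-chip, and that the second and third levels are off-chip and use compressed data structures. Everything else here is an assumption made for illustration, including the names `HomeNode`, `lookup_sharers`, `compress`, and `expand`, the group size, and the choice of a coarse (one entry per group of nodes) encoding as the compression scheme.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

GROUP_SIZE = 4  # assumption: compressed levels track groups of 4 nodes


def compress(sharers: Set[int]) -> Set[int]:
    """Coarse encoding: keep only the groups that contain at least one sharer."""
    return {node // GROUP_SIZE for node in sharers}


def expand(groups: Set[int]) -> Set[int]:
    """Decode to a superset of the true sharers (every node in each marked group)."""
    return {g * GROUP_SIZE + i for g in groups for i in range(GROUP_SIZE)}


@dataclass
class HomeNode:
    # Hypothetical model: each directory level is a dict keyed by line address.
    l1_dir: Dict[int, Set[int]] = field(default_factory=dict)  # on-chip, small, precise
    l2_dir: Dict[int, Set[int]] = field(default_factory=dict)  # off-chip, compressed
    l3_dir: Dict[int, Set[int]] = field(default_factory=dict)  # off-chip, compressed


def lookup_sharers(home: HomeNode, line: int) -> Tuple[str, Set[int]]:
    """Return (level that answered, nodes to contact), trying the fastest level first."""
    if line in home.l1_dir:
        return "L1", set(home.l1_dir[line])
    if line in home.l2_dir:
        return "L2", expand(home.l2_dir[line])
    return "L3", expand(home.l3_dir.get(line, set()))


home = HomeNode()
home.l1_dir[0x40] = {1, 6}            # precise entry held on-chip
home.l3_dir[0x80] = compress({1, 6})  # same sharers, stored in compressed form
print(lookup_sharers(home, 0x40))
print(lookup_sharers(home, 0x80))
```

In this sketch an on-chip hit returns the exact sharers, while a hit in a compressed level returns a superset of them. That is the usual trade-off of coarse encodings: less directory memory in exchange for some unnecessary coherence messages.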



Index Terms

An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Hardware
  1. Integrated circuits
  2. Very large scale integration design
    1. Application-specific VLSI designs
      1. Application specific processors



Published In

IEEE Transactions on Parallel and Distributed Systems Volume 15, Issue 8

August 2004

84 pages

ISSN:1045-9219

Copyright © 2004.

Publisher

IEEE Press

Qualifiers

Research-article

Cited By

Hu S, Shi F, Ji W, Chen X, Talpur S (2017) Exploring grouped coherence for clustered hierarchical cache. The Journal of Supercomputing 73:9 (4137-4157). Online publication date: 1-Sep-2017
https://dl.acm.org/doi/10.1007/s11227-017-2024-8
Fang H, Brorsson M (2009) Scalable directory architecture for distributed shared memory chip multiprocessors. ACM SIGARCH Computer Architecture News 36:5 (56-64). Online publication date: 20-Jun-2009
https://dl.acm.org/doi/10.1145/1556444.1556452
Ladan-Mozes E, Leiserson C, Meyer auf der Heide F, Shavit N (2008) A consistency architecture for hierarchical shared caches. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures (11-22). Online publication date: 1-Jun-2008
https://dl.acm.org/doi/10.1145/1378533.1378536
Ros A, Acacio M, García J, Alderighi M, Salapura V, McKee S (2006) An efficient cache design for scalable glueless shared-memory multiprocessors. Proceedings of the 3rd conference on Computing frontiers (321-330). Online publication date: 3-May-2006
https://dl.acm.org/doi/10.1145/1128022.1128065
Ros A, Acacio M, García J (2005) A novel lightweight directory architecture for scalable shared-memory multiprocessors. Proceedings of the 11th international Euro-Par conference on Parallel Processing (582-591). Online publication date: 30-Aug-2005
https://dl.acm.org/doi/10.1007/11549468_65
