research-article

An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration

Published: 01 August 2004

Abstract

Recent technology improvements allow multiprocessor designers to integrate key components, such as the memory controller, the coherence hardware, and the network interface/router, into the processor chip. In this paper, we exploit this scale of integration, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the directory memory overhead that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture and adds a small shared data cache to each node of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by keeping the second- and third-level directories off the processor chip and using compressed data structures. A taxonomy of L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show that the proposed node architecture yields significant latency reductions, which translate into reductions of more than 30 percent in application execution time in several cases.
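
As a rough illustration of the lookup order the abstract describes, the following Python sketch models a directory with a small precise first level, a larger compressed second level, and a complete but more heavily compressed third level. Everything specific here is an assumption made for illustration and is not taken from the paper: the class and function names, the FIFO replacement, the cycle counts, and the use of coarse bit vectors as the compression scheme.

```python
from dataclasses import dataclass, field

# Illustrative latencies in cycles (assumed values, not from the paper).
ON_CHIP_CYCLES = 2
OFF_CHIP_CYCLES = 20
MEMORY_CYCLES = 100


def coarse_vector(sharers, nodes_per_bit):
    """Compress a sharer set: one bit stands for a group of nodes."""
    return {node // nodes_per_bit for node in sharers}


def expand(groups, nodes_per_bit, num_nodes):
    """Expand a coarse vector into a superset of the real sharers."""
    return {n for n in range(num_nodes) if n // nodes_per_bit in groups}


def _put(table, capacity, key, value):
    """Insert into a bounded table, evicting the oldest entry (FIFO)."""
    if key not in table and len(table) >= capacity:
        table.pop(next(iter(table)))
    table[key] = value


@dataclass
class ThreeLevelDirectory:
    num_nodes: int
    l1_capacity: int  # few precise entries (assumed to sit on chip)
    l2_capacity: int  # more entries, compressed (assumed off chip)
    l1: dict = field(default_factory=dict)  # line -> exact sharer set
    l2: dict = field(default_factory=dict)  # line -> coarse vector, 2 nodes/bit
    l3: dict = field(default_factory=dict)  # line -> coarse vector, 4 nodes/bit

    def record(self, line, sharers):
        """Store sharing information for a memory line at every level."""
        _put(self.l1, self.l1_capacity, line, set(sharers))
        _put(self.l2, self.l2_capacity, line, coarse_vector(sharers, 2))
        self.l3[line] = coarse_vector(sharers, 4)  # third level is complete

    def lookup(self, line):
        """Return (sharers, cycles, level), trying the fastest level first."""
        if line in self.l1:
            return set(self.l1[line]), ON_CHIP_CYCLES, 1
        cycles = ON_CHIP_CYCLES + OFF_CHIP_CYCLES
        if line in self.l2:
            return expand(self.l2[line], 2, self.num_nodes), cycles, 2
        groups = self.l3.get(line, set())
        return expand(groups, 4, self.num_nodes), cycles + MEMORY_CYCLES, 3


directory = ThreeLevelDirectory(num_nodes=8, l1_capacity=1, l2_capacity=2)
directory.record(0x10, {1})
directory.record(0x20, {5})
directory.record(0x30, {6})
print(directory.lookup(0x30))  # served by the first level, exact sharers
print(directory.lookup(0x20))  # served by the second level, superset of sharers
print(directory.lookup(0x10))  # served by the third level, coarser superset
```

In this sketch the compressed levels return a superset of the true sharers, so a coherence action based on them may reach nodes that hold no copy but never misses one that does. The precision lost at the lower levels is the price of their smaller per-entry storage.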




Information & Contributors

Information

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 15, Issue 8
August 2004
84 pages

Publisher

IEEE Press

Author Tags

  1. L2 miss latency
  2. cc-NUMA multiprocessor
  3. directory memory overhead
  4. on-processor-chip integration
  5. shared data cache
  6. three-level directory

Qualifiers

  • Research-article

Cited By

  • (2017) Exploring grouped coherence for clustered hierarchical cache. The Journal of Supercomputing 73:9 (4137-4157). DOI: 10.1007/s11227-017-2024-8. Online publication date: 1-Sep-2017
  • (2009) Scalable directory architecture for distributed shared memory chip multiprocessors. ACM SIGARCH Computer Architecture News 36:5 (56-64). DOI: 10.1145/1556444.1556452. Online publication date: 20-Jun-2009
  • (2008) A consistency architecture for hierarchical shared caches. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures (11-22). DOI: 10.1145/1378533.1378536. Online publication date: 1-Jun-2008
  • (2006) An efficient cache design for scalable glueless shared-memory multiprocessors. Proceedings of the 3rd conference on Computing frontiers (321-330). DOI: 10.1145/1128022.1128065. Online publication date: 3-May-2006
  • (2005) A novel lightweight directory architecture for scalable shared-memory multiprocessors. Proceedings of the 11th international Euro-Par conference on Parallel Processing (582-591). DOI: 10.1007/11549468_65. Online publication date: 30-Aug-2005
