Article
DOI: 10.1145/237090.237187

SoftFLASH: analyzing the performance of clustered distributed virtual shared memory

Published: 01 September 1996

Abstract

One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because the interconnect within each cluster offers low latency and high bandwidth, there are clear advantages in making the clusters as large as possible. The critical question then becomes whether the latency and bandwidth of the top-level network and the software system are sufficient to support the communication demands generated by the clusters.

To explore these questions, we have built an aggressive kernel implementation of a virtual shared-memory system using SGI multiprocessors and 100 Mbyte/sec HIPPI interconnects. The system obtains speedups on 32 processors (four nodes, eight processors per node, plus additional reserved protocol processors) that range from 6.9 on the communication-intensive FFT program to 21.6 on Ocean (both from the SPLASH-2 suite). In general, clustering is effective in reducing internode miss rates, but as the cluster size grows, increases in remote latency, mostly due to higher TLB synchronization cost, offset these advantages. For communication-intensive applications such as FFT, the overhead of sending out network requests, the limited network bandwidth, and the long network latency prevent good performance. Overall, this approach still appears promising, but our results indicate that large, low-latency networks may be needed to make cluster-based virtual shared-memory machines broadly useful as large-scale shared-memory multiprocessors.
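The clustering trade-off described above (fewer internode misses, but costlier invalidations as nodes grow) can be illustrated with a deliberately simplified counting model. The sketch below is not the SoftFLASH protocol: it assumes a generic single-writer, invalidation-based page protocol, first-touch page placement, a hypothetical nearest-neighbour workload, and that every processor in a node that loses a page copy must be interrupted to flush its TLB. The names `ClusteredDSM` and `run_neighbor_exchange` are hypothetical. The model counts events only and does not model time.

```python
from dataclasses import dataclass, field


@dataclass
class ClusteredDSM:
    """Toy page-granularity DSM with processors grouped into nodes (clusters).

    Hypothetical single-writer, invalidation-based protocol used only for
    illustration; it is NOT the protocol implemented in SoftFLASH.
    """
    num_procs: int
    procs_per_node: int
    holders: dict = field(default_factory=dict)  # page -> set of node ids holding a copy
    internode_misses: int = 0
    invalidations: int = 0
    shootdown_interrupts: int = 0

    def node_of(self, proc: int) -> int:
        return proc // self.procs_per_node

    def _fetch(self, node: int, page: int) -> None:
        copies = self.holders.setdefault(page, set())
        if not copies:
            copies.add(node)  # assumed first-touch placement: allocated locally, no network traffic
        elif node not in copies:
            self.internode_misses += 1  # the copy has to come from another node
            copies.add(node)

    def read(self, proc: int, page: int) -> None:
        self._fetch(self.node_of(proc), page)

    def write(self, proc: int, page: int) -> None:
        node = self.node_of(proc)
        self._fetch(node, page)
        for _other in self.holders[page] - {node}:
            self.invalidations += 1
            # Assumption: any processor in the invalidated node may have the page
            # mapped, so all of them are interrupted to flush their TLBs.
            self.shootdown_interrupts += self.procs_per_node
        self.holders[page] = {node}  # the writer's node becomes the sole holder


def run_neighbor_exchange(num_procs: int, procs_per_node: int, iterations: int) -> ClusteredDSM:
    """Hypothetical workload: each processor writes its own page, then reads its right neighbour's page."""
    dsm = ClusteredDSM(num_procs, procs_per_node)
    for _ in range(iterations):
        for p in range(num_procs):
            dsm.write(p, page=p)
        for p in range(num_procs):
            dsm.read(p, page=(p + 1) % num_procs)
    return dsm


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16, 32):
        d = run_neighbor_exchange(32, k, iterations=10)
        per_inv = d.shootdown_interrupts / d.invalidations if d.invalidations else 0
        print(f"{k:2d} procs/node: {d.internode_misses:3d} internode misses, "
              f"{d.invalidations:3d} invalidations, {per_inv:4.1f} TLB interrupts each")
```

With 32 processors and 10 iterations, internode misses fall from 320 (one processor per node) to 40 (eight per node) and to 0 (a single 32-processor node). Meanwhile, whenever there is more than one node, the total number of TLB interrupts stays at 288, because each invalidation now interrupts eight processors instead of one. In this toy model, larger clusters trade fewer remote misses for more expensive TLB synchronization per invalidation.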




Published In

ASPLOS VII: Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
October 1996
290 pages
ISBN: 0897917677
DOI:10.1145/237090
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article

Conference

ASPLOS96

Acceptance Rates

ASPLOS VII Paper Acceptance Rate 25 of 109 submissions, 23%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%
