
Optimizing memory system performance for communication in parallel computers

Published: 01 May 1995

Abstract

Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message passing systems. Depending on the application, we observe a variety of patterns as part of communication steps, e.g., regular (i.e., blocks of data), strided, or irregular (indexed) memory accesses. The effective speed of these communication steps is determined by the network bandwidth and the memory bandwidth, and measurements on current parallel supercomputers indicate that the performance is limited by the memory bandwidth rather than the network bandwidth.

Current systems provide a wealth of options to perform communication, and a compiler or user is faced with the difficulty of finding the communication operations that best use the available memory and network bandwidth. This paper provides a framework to evaluate different solutions for inter-node communication and presents the copy-transfer model; this model captures the contributions of the memory system to inter-node communication. We demonstrate the usefulness of this simple model by applying it to two commercial parallel systems, the Cray T3D and the Intel Paragon.

In particular, we identify two methods to transfer data between nodes in these two machines. In buffer-packing transfers, a contiguous block of data is transferred across the network. If the data are not stored contiguously, they are copied to (gathering) or from (scattering) buffers in local memory before and after the transfer. Chained transfers perform gathering, transfer, and scattering in one step, reading the data elements with some non-sequential pattern and immediately transferring them on to the destination.

Our model and measurements indicate that chaining of the gather, transfer, and scatter operations results in better performance than buffer packing for many important access patterns.

Most standard message passing libraries (like MPI, PVM, or NX) force the parallelizing compiler (or the programmer) to employ the buffer-packing communication operations. However, the addition of hardware support dedicated to communication (e.g., DMAs, line-transfer units) now gives the compiler a wider range of options.
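
The difference between the two transfer methods described in the abstract can be illustrated with a small single-process sketch. This is not the paper's copy-transfer model and not code for the Cray T3D or Intel Paragon: the function names, the use of Python lists as stand-ins for node memories, and the simple count of element reads and writes are all assumptions made for illustration only.

```python
def buffer_packing_transfer(src, src_idx, dst, dst_idx):
    """Gather into a send buffer, move the contiguous block, then scatter.

    Returns a toy count of element reads and writes (an illustrative
    assumption, not a measured or modeled cost).
    """
    send_buf = [src[i] for i in src_idx]      # gather copy on the sending node
    recv_buf = list(send_buf)                 # stand-in for the contiguous network transfer
    for j, value in zip(dst_idx, recv_buf):   # scatter copy on the receiving node
        dst[j] = value
    # gather: read + write, transfer: read + write, scatter: read + write
    return 6 * len(src_idx)


def chained_transfer(src, src_idx, dst, dst_idx):
    """Read each element with its access pattern and store it directly at the destination."""
    for i, j in zip(src_idx, dst_idx):
        dst[j] = src[i]                       # gather, transfer, and scatter in one step
    # one read at the source and one write at the destination per element
    return 2 * len(src_idx)


if __name__ == "__main__":
    src = list(range(16))
    strided = range(0, 16, 4)                 # strided access pattern on the sender
    indexed = [1, 3, 5, 7]                    # indexed access pattern on the receiver

    dst_a = [None] * 8
    dst_b = [None] * 8
    print(buffer_packing_transfer(src, strided, dst_a, indexed), dst_a)
    print(chained_transfer(src, strided, dst_b, indexed), dst_b)
```

Both functions leave the destination in the same state; the sketch only shows that buffer packing touches each element in local memory more often, which is the kind of memory-system cost the abstract says dominates on these machines.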

Published In

ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture
July 1995
426 pages
ISBN:0897916980
DOI:10.1145/223982
  • ACM SIGARCH Computer Architecture News, Volume 23, Issue 2
    Special Issue: Proceedings of the 22nd annual international symposium on Computer architecture (ISCA '95)
    May 1995
    412 pages
    ISSN:0163-5964
    DOI:10.1145/225830
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ISCA95: International Conference on Computer Architecture
June 22 - 24, 1995
S. Margherita Ligure, Italy

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
