research-article

Performance analysis of MPI collective operations

Published: 01 June 2007

Abstract

Previous studies of application usage show that the performance of collective communications is critical for high-performance computing. Despite active research in the field, a solution to the problem of optimizing collective communication that is both general and feasible is still missing.
In this paper, we analyze and attempt to improve intra-cluster collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP, to collective operations. We compare the models' predictions against experimentally gathered data and, using these results, construct an optimal decision function for the broadcast collective. We quantitatively compare the quality of the model-based decision functions to the experimentally optimal one. In addition, we introduce a new optimized tree-based broadcast algorithm, splitted-binary.
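For reference, the standard point-to-point completion-time formulas of these models are sketched below in LaTeX (m is the message size in bytes). These are the published forms of the models; the extensions to collective trees constructed in the paper build on these terms but are not reproduced here.

% Standard point-to-point cost models (m = message size in bytes).
\begin{align*}
  T_{\text{Hockney}}(m) &= \alpha + \beta m    && \text{startup latency } \alpha,\ \text{per-byte time } \beta \\
  T_{\text{LogP}}       &= L + 2o              && \text{small, fixed-size messages} \\
  T_{\text{LogGP}}(m)   &= L + 2o + (m - 1)G   && \text{per-byte gap } G \text{ for long messages} \\
  T_{\text{PLogP}}(m)   &= L + g(m)            && \text{end-to-end latency } L,\ \text{size-dependent gap } g(m)
\end{align*}
% Under any of these, a binomial-tree broadcast on P processes costs
% roughly \lceil \log_2 P \rceil \cdot T(m).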
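To illustrate the shape of such a decision function, the C sketch below selects a broadcast algorithm from the message size and communicator size. The algorithm set mirrors the ones discussed here, but the switch points are hypothetical placeholders, not the paper's measured values.

/* Hypothetical broadcast decision function. The thresholds are
 * illustrative placeholders; in practice they would come from model
 * predictions or from exhaustive measurement. */
#include <stddef.h>

typedef enum {
    BCAST_BINOMIAL,         /* latency-bound: small messages     */
    BCAST_SPLITTED_BINARY,  /* medium messages, enough processes */
    BCAST_PIPELINE          /* bandwidth-bound: large messages   */
} bcast_alg_t;

static bcast_alg_t choose_bcast(size_t msg_bytes, int comm_size)
{
    if (msg_bytes <= 2048 || comm_size <= 4)
        return BCAST_BINOMIAL;
    if (msg_bytes <= 256 * 1024)
        return BCAST_SPLITTED_BINARY;
    return BCAST_PIPELINE;
}

A model-based decision function fills in such thresholds from the predicted crossover points of the algorithms' completion times; the experimentally optimal one fills them in from measurements.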
Our results show that all of the models can provide useful insights into various aspects of the different algorithms as well as their relative performance. Still, based on our findings, we believe that complete reliance on models would not yield optimal results. Our experiments also identified the gap parameter as the most critical for accurately modeling both the classical point-to-point-based pipeline and our extensions of it to fan-out topologies.
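To make the role of the gap concrete, the sketch below implements a segmented chain (pipeline) broadcast from MPI point-to-point calls; it is a minimal illustration under simplifying assumptions, not the paper's implementation. With k segments on P processes, the last rank finishes after roughly P + k - 2 segment steps, and each steady-state step costs about one per-segment gap, which is why that parameter dominates for large messages.

/* Minimal sketch: segmented pipeline (chain) broadcast built from
 * point-to-point calls. The payload is assumed valid on rank 0 and is
 * forwarded 0 -> 1 -> ... -> P-1 one segment at a time. */
#include <mpi.h>
#include <stdlib.h>

static void pipeline_bcast(char *buf, int count, int seg, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int off = 0; off < count; off += seg) {
        int len = (count - off < seg) ? (count - off) : seg;
        if (rank > 0)            /* receive this segment from the predecessor */
            MPI_Recv(buf + off, len, MPI_BYTE, rank - 1, 0, comm,
                     MPI_STATUS_IGNORE);
        if (rank + 1 < size)     /* forward it to the successor */
            MPI_Send(buf + off, len, MPI_BYTE, rank + 1, 0, comm);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int n = 1 << 20;             /* 1 MiB payload, 8 KiB segments */
    char *buf = calloc(n, 1);
    pipeline_bcast(buf, n, 8192, MPI_COMM_WORLD);
    free(buf);
    MPI_Finalize();
    return 0;
}

Shrinking the segment size amortizes the (P - 1)-step fill latency over more segments but adds per-segment overhead; how a model prices each steady-state step is exactly where the gap parameter enters.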




Published In

Cluster Computing, Volume 10, Issue 2
June 2007, 137 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 June 2007

Author Tags

  1. MPI collective communication
  2. Performance modeling
  3. Parallel communication models
  4. Hockney
  5. LogP
  6. LogGP
  7. PLogP
