research-article

Performance analysis of MPI collective operations

Published: 01 June 2007

Abstract

Previous studies of application usage show that the performance of collective communications is critical for high-performance computing. Despite active research in the field, a solution to the problem of optimizing collective communication that is both general and feasible is still missing.
In this paper, we analyze and attempt to improve intra-cluster collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP, to collective operations. We compare the models' predictions against experimentally gathered data and, using these results, construct an optimal decision function for the broadcast collective. We quantitatively compare the quality of the model-based decision functions to the experimentally optimal one. In addition, we introduce a new optimized tree-based broadcast algorithm, splitted-binary.
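For reference, the standard point-to-point completion-time formulas of these models are sketched below in LaTeX (m is the message size in bytes). These are the published forms of the models; the extensions to collective trees constructed in the paper build on these terms but are not reproduced here.

% Standard point-to-point cost models (m = message size in bytes).
\begin{align*}
  T_{\text{Hockney}}(m) &= \alpha + \beta m    && \text{startup latency } \alpha,\ \text{per-byte time } \beta \\
  T_{\text{LogP}}       &= L + 2o              && \text{small, fixed-size messages} \\
  T_{\text{LogGP}}(m)   &= L + 2o + (m - 1)G   && \text{per-byte gap } G \text{ for long messages} \\
  T_{\text{PLogP}}(m)   &= L + g(m)            && \text{end-to-end latency } L,\ \text{size-dependent gap } g(m)
\end{align*}
% Under any of these, a binomial-tree broadcast on P processes costs
% roughly \lceil \log_2 P \rceil \cdot T(m).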
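To illustrate the shape of such a decision function, the C sketch below selects a broadcast algorithm from the message size and communicator size. The algorithm set mirrors the ones discussed here, but the switch points are hypothetical placeholders, not the paper's measured values.

/* Hypothetical broadcast decision function. The thresholds are
 * illustrative placeholders; in practice they would come from model
 * predictions or from exhaustive measurement. */
#include <stddef.h>

typedef enum {
    BCAST_BINOMIAL,         /* latency-bound: small messages     */
    BCAST_SPLITTED_BINARY,  /* medium messages, enough processes */
    BCAST_PIPELINE          /* bandwidth-bound: large messages   */
} bcast_alg_t;

static bcast_alg_t choose_bcast(size_t msg_bytes, int comm_size)
{
    if (msg_bytes <= 2048 || comm_size <= 4)
        return BCAST_BINOMIAL;
    if (msg_bytes <= 256 * 1024)
        return BCAST_SPLITTED_BINARY;
    return BCAST_PIPELINE;
}

A model-based decision function fills in such thresholds from the predicted crossover points of the algorithms' completion times; the experimentally optimal one fills them in from measurements.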
Our results show that all of the models can provide useful insights into various aspects of the different algorithms as well as their relative performance. Still, based on our findings, we believe that complete reliance on models would not yield optimal results. Our experiments also identified the gap parameter as the most critical for accurately modeling both the classical point-to-point-based pipeline and our extensions of it to fan-out topologies.
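To make the role of the gap concrete, the sketch below implements a segmented chain (pipeline) broadcast from MPI point-to-point calls; it is a minimal illustration under simplifying assumptions, not the paper's implementation. With k segments on P processes, the last rank finishes after roughly P + k - 2 segment steps, and each steady-state step costs about one per-segment gap, which is why that parameter dominates for large messages.

/* Minimal sketch: segmented pipeline (chain) broadcast built from
 * point-to-point calls. The payload is assumed valid on rank 0 and is
 * forwarded 0 -> 1 -> ... -> P-1 one segment at a time. */
#include <mpi.h>
#include <stdlib.h>

static void pipeline_bcast(char *buf, int count, int seg, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int off = 0; off < count; off += seg) {
        int len = (count - off < seg) ? (count - off) : seg;
        if (rank > 0)            /* receive this segment from the predecessor */
            MPI_Recv(buf + off, len, MPI_BYTE, rank - 1, 0, comm,
                     MPI_STATUS_IGNORE);
        if (rank + 1 < size)     /* forward it to the successor */
            MPI_Send(buf + off, len, MPI_BYTE, rank + 1, 0, comm);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int n = 1 << 20;             /* 1 MiB payload, 8 KiB segments */
    char *buf = calloc(n, 1);
    pipeline_bcast(buf, n, 8192, MPI_COMM_WORLD);
    free(buf);
    MPI_Finalize();
    return 0;
}

Shrinking the segment size amortizes the (P - 1)-step fill latency over more segments but adds per-segment overhead; how a model prices each steady-state step is exactly where the gap parameter enters.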




Published In

Cluster Computing, Volume 10, Issue 2
June 2007, 137 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 June 2007

Author Tags

  1. MPI collective communication
  2. Performance modeling
  3. Parallel communication models
  4. Hockney
  5. LogP
  6. LogGP
  7. PLogP
