DOI: 10.1145/3126908.3126954

Scalable reduction collectives with data partitioning-based multi-leader design

Published: 12 November 2017

Abstract

Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors such as Intel Xeon/Xeon Phi, or of the increased communication throughput and advanced high-end features offered by modern interconnects such as InfiniBand and Omni-Path. In this paper, we propose a high-performance and scalable Data Partitioning-based Multi-Leader (DPML) solution for MPI_Allreduce that exploits the parallelism of multi-/many-core architectures in conjunction with the high throughput and high-end features of InfiniBand and Omni-Path to significantly enhance the performance of MPI_Allreduce on modern HPC systems. We also model the DPML-based designs to analyze their communication costs theoretically. Microbenchmark-level evaluations show that the proposed DPML-based designs deliver up to 3.5x performance improvement for MPI_Allreduce on multiple HPC systems at scale. At the application level, up to 35% and 60% improvement is seen in communication time for HPCG and miniAMR, respectively.
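To make the multi-leader idea concrete, the following is a minimal sketch of how a data-partitioning, multi-leader allreduce can be organized with standard MPI calls. It is not the authors' implementation: the function name dpml_allreduce, the fixed leader count, and the use of MPI_Reduce/MPI_Bcast for the intra-node steps are illustrative assumptions. The essential point it shows is that the message is split into one partition per node-level leader, so the inter-node reductions for the different partitions proceed concurrently instead of being funneled through a single leader per node.

```c
/*
 * Minimal sketch of a data-partitioning, multi-leader allreduce in the
 * spirit of DPML.  NOT the paper's implementation: the leader count,
 * communicator construction, and intra-node MPI_Reduce/MPI_Bcast steps
 * are illustrative assumptions.  Assumes homogeneous nodes, MPI_SUM on
 * doubles, and 'count' divisible by the number of leaders.
 */
#include <mpi.h>
#include <string.h>

#define NUM_LEADERS 4   /* leaders per node (assumed, tunable) */

void dpml_allreduce(const double *sendbuf, double *recvbuf, int count,
                    MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* 1. Group the ranks that share a node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    int nleaders = NUM_LEADERS < node_size ? NUM_LEADERS : node_size;
    int chunk    = count / nleaders;   /* one data partition per leader */

    /* Seed recvbuf with the local contribution; leaders reduce in place. */
    memcpy(recvbuf, sendbuf, (size_t)count * sizeof(double));

    /* 2. Intra-node: reduce partition l onto leader l (node ranks 0..nleaders-1). */
    for (int l = 0; l < nleaders; l++) {
        if (node_rank == l)
            MPI_Reduce(MPI_IN_PLACE, recvbuf + l * chunk, chunk,
                       MPI_DOUBLE, MPI_SUM, l, node_comm);
        else
            MPI_Reduce(sendbuf + l * chunk, recvbuf + l * chunk, chunk,
                       MPI_DOUBLE, MPI_SUM, l, node_comm);
    }

    /* 3. Inter-node: leaders that own the same partition on different nodes
     *    all-reduce their chunk; the nleaders reductions run concurrently. */
    MPI_Comm leader_comm;
    int is_leader = node_rank < nleaders;
    MPI_Comm_split(comm, is_leader ? node_rank : MPI_UNDEFINED, rank,
                   &leader_comm);
    if (is_leader) {
        MPI_Allreduce(MPI_IN_PLACE, recvbuf + node_rank * chunk, chunk,
                      MPI_DOUBLE, MPI_SUM, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* 4. Intra-node: each leader broadcasts its finished partition. */
    for (int l = 0; l < nleaders; l++)
        MPI_Bcast(recvbuf + l * chunk, chunk, MPI_DOUBLE, l, node_comm);

    MPI_Comm_free(&node_comm);
}
```

Compared with a single-leader hierarchical scheme, each leader in this sketch injects only 1/nleaders of the message into the network, so the inter-node phase is spread over several cores and network endpoints per node; this concurrency is the effect to which the abstract attributes the reported speedups.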



Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN:9781450351140
DOI:10.1145/3126908
  • General Chair: Bernd Mohr
  • Program Chair: Padma Raghavan
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 November 2017

Author Tags

  1. MPI
  2. MPI_allreduce
  3. SHArP
  4. collectives
  5. data partitioning
  6. multi-leader

Qualifiers

  • Research-article

Conference

SC '17

Acceptance Rates

SC '17 paper acceptance rate: 61 of 327 submissions (19%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 2

Reflects downloads up to 14 Dec 2024

Cited By

  • (2024) Fast-tunable Graphene-based AWGR for Deep Learning Training Networks. Proceedings of the 1st SIGCOMM Workshop on Hot Topics in Optical Technologies and Applications in Networking, 14-20. DOI: 10.1145/3672201.3674121. Online publication date: 4-Aug-2024.
  • (2024) Partitioned Reduction for Heterogeneous Environments. 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 285-289. DOI: 10.1109/PDP62718.2024.00047. Online publication date: 20-Mar-2024.
  • (2023) Impact of Cache Coherence on the Performance of Shared-Memory based MPI Primitives: A Case Study for Broadcast on Intel Xeon Scalable Processors. Proceedings of the 52nd International Conference on Parallel Processing, 295-305. DOI: 10.1145/3605573.3605616. Online publication date: 7-Aug-2023.
  • (2023) Optimizing MPI Collectives on Shared Memory Multi-Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3581784.3607074. Online publication date: 12-Nov-2023.
  • (2023) FMI: Fast and Cheap Message Passing for Serverless Functions. Proceedings of the 37th International Conference on Supercomputing, 373-385. DOI: 10.1145/3577193.3593718. Online publication date: 21-Jun-2023.
  • (2023) COFFEE: Cross-Layer Optimization for Fast and Efficient Executions of Sinkhorn-Knopp Algorithm on HPC Systems. IEEE Transactions on Parallel and Distributed Systems 34(7), 2167-2179. DOI: 10.1109/TPDS.2023.3277915. Online publication date: Jul-2023.
  • (2023) GLEX_Allreduce: Optimization for medium and small message of Allreduce on Tianhe system. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 2806-2809. DOI: 10.1109/ICPADS60453.2023.00388. Online publication date: 17-Dec-2023.
  • (2023) Designing In-network Computing Aware Reduction Collectives in MPI. 2023 IEEE Symposium on High-Performance Interconnects (HOTI), 25-32. DOI: 10.1109/HOTI59126.2023.00018. Online publication date: Aug-2023.
  • (2023) Uniform Algorithms for Reduce-scatter and (most) other Collectives for MPI. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 284-294. DOI: 10.1109/CLUSTER52292.2023.00031. Online publication date: 31-Oct-2023.
  • (2023) 2D-THA-ADMM: communication efficient distributed ADMM algorithm framework based on two-dimensional torus hierarchical AllReduce. International Journal of Machine Learning and Cybernetics 15(2), 207-226. DOI: 10.1007/s13042-023-01903-9. Online publication date: 28-Jun-2023.
