DOI: 10.1145/3126908.3126954

Scalable reduction collectives with data partitioning-based multi-leader design

Published: 12 November 2017

Abstract

Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors such as Intel Xeon/Xeon Phi, or of the increased communication throughput and advanced high-end features offered by modern interconnects such as InfiniBand and Omni-Path. In this paper, we propose a high-performance and scalable Data Partitioning-based Multi-Leader (DPML) solution for MPI_Allreduce that exploits the parallelism of multi-/many-core architectures in conjunction with the high throughput and high-end features of InfiniBand and Omni-Path to significantly enhance the performance of MPI_Allreduce on modern HPC systems. We also model the DPML-based designs to analyze their communication costs theoretically. Microbenchmark-level evaluations show that the proposed DPML-based designs deliver up to 3.5x performance improvement for MPI_Allreduce on multiple HPC systems at scale. At the application level, up to 35% and 60% improvement is seen in communication time for HPCG and miniAMR, respectively.
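To make the multi-leader idea concrete, the following is a minimal sketch of how a data-partitioning, multi-leader allreduce can be organized with standard MPI calls. It is not the authors' implementation: the function name dpml_allreduce, the fixed leader count, and the use of MPI_Reduce/MPI_Bcast for the intra-node steps are illustrative assumptions. The essential point it shows is that the message is split into one partition per node-level leader, so the inter-node reductions for the different partitions proceed concurrently instead of being funneled through a single leader per node.

```c
/*
 * Minimal sketch of a data-partitioning, multi-leader allreduce in the
 * spirit of DPML.  NOT the paper's implementation: the leader count,
 * communicator construction, and intra-node MPI_Reduce/MPI_Bcast steps
 * are illustrative assumptions.  Assumes homogeneous nodes, MPI_SUM on
 * doubles, and 'count' divisible by the number of leaders.
 */
#include <mpi.h>
#include <string.h>

#define NUM_LEADERS 4   /* leaders per node (assumed, tunable) */

void dpml_allreduce(const double *sendbuf, double *recvbuf, int count,
                    MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* 1. Group the ranks that share a node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    int nleaders = NUM_LEADERS < node_size ? NUM_LEADERS : node_size;
    int chunk    = count / nleaders;   /* one data partition per leader */

    /* Seed recvbuf with the local contribution; leaders reduce in place. */
    memcpy(recvbuf, sendbuf, (size_t)count * sizeof(double));

    /* 2. Intra-node: reduce partition l onto leader l (node ranks 0..nleaders-1). */
    for (int l = 0; l < nleaders; l++) {
        if (node_rank == l)
            MPI_Reduce(MPI_IN_PLACE, recvbuf + l * chunk, chunk,
                       MPI_DOUBLE, MPI_SUM, l, node_comm);
        else
            MPI_Reduce(sendbuf + l * chunk, recvbuf + l * chunk, chunk,
                       MPI_DOUBLE, MPI_SUM, l, node_comm);
    }

    /* 3. Inter-node: leaders that own the same partition on different nodes
     *    all-reduce their chunk; the nleaders reductions run concurrently. */
    MPI_Comm leader_comm;
    int is_leader = node_rank < nleaders;
    MPI_Comm_split(comm, is_leader ? node_rank : MPI_UNDEFINED, rank,
                   &leader_comm);
    if (is_leader) {
        MPI_Allreduce(MPI_IN_PLACE, recvbuf + node_rank * chunk, chunk,
                      MPI_DOUBLE, MPI_SUM, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* 4. Intra-node: each leader broadcasts its finished partition. */
    for (int l = 0; l < nleaders; l++)
        MPI_Bcast(recvbuf + l * chunk, chunk, MPI_DOUBLE, l, node_comm);

    MPI_Comm_free(&node_comm);
}
```

Compared with a single-leader hierarchical scheme, each leader in this sketch injects only 1/nleaders of the message into the network, so the inter-node phase is spread over several cores and network endpoints per node; this concurrency is the effect to which the abstract attributes the reported speedups.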



Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN:9781450351140
DOI:10.1145/3126908
  • General Chair: Bernd Mohr
  • Program Chair: Padma Raghavan
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 November 2017

Author Tags

  1. MPI
  2. MPI_allreduce
  3. SHArP
  4. collectives
  5. data partitioning
  6. multi-leader

Qualifiers

  • Research-article

Conference

SC '17

Acceptance Rates

SC '17 paper acceptance rate: 61 of 327 submissions (19%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 2

Reflects downloads up to 14 Dec 2024

Cited By

  • (2024) Fast-tunable Graphene-based AWGR for Deep Learning Training Networks. Proceedings of the 1st SIGCOMM Workshop on Hot Topics in Optical Technologies and Applications in Networking, 14-20. DOI: 10.1145/3672201.3674121. Online publication date: 4-Aug-2024.
  • (2024) Partitioned Reduction for Heterogeneous Environments. 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 285-289. DOI: 10.1109/PDP62718.2024.00047. Online publication date: 20-Mar-2024.
  • (2023) Impact of Cache Coherence on the Performance of Shared-Memory based MPI Primitives: A Case Study for Broadcast on Intel Xeon Scalable Processors. Proceedings of the 52nd International Conference on Parallel Processing, 295-305. DOI: 10.1145/3605573.3605616. Online publication date: 7-Aug-2023.
  • (2023) Optimizing MPI Collectives on Shared Memory Multi-Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3581784.3607074. Online publication date: 12-Nov-2023.
  • (2023) FMI: Fast and Cheap Message Passing for Serverless Functions. Proceedings of the 37th International Conference on Supercomputing, 373-385. DOI: 10.1145/3577193.3593718. Online publication date: 21-Jun-2023.
  • (2023) COFFEE: Cross-Layer Optimization for Fast and Efficient Executions of Sinkhorn-Knopp Algorithm on HPC Systems. IEEE Transactions on Parallel and Distributed Systems 34(7), 2167-2179. DOI: 10.1109/TPDS.2023.3277915. Online publication date: Jul-2023.
  • (2023) GLEX_Allreduce: Optimization for medium and small message of Allreduce on Tianhe system. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 2806-2809. DOI: 10.1109/ICPADS60453.2023.00388. Online publication date: 17-Dec-2023.
  • (2023) Designing In-network Computing Aware Reduction Collectives in MPI. 2023 IEEE Symposium on High-Performance Interconnects (HOTI), 25-32. DOI: 10.1109/HOTI59126.2023.00018. Online publication date: Aug-2023.
  • (2023) Uniform Algorithms for Reduce-scatter and (most) other Collectives for MPI. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 284-294. DOI: 10.1109/CLUSTER52292.2023.00031. Online publication date: 31-Oct-2023.
  • (2023) 2D-THA-ADMM: communication efficient distributed ADMM algorithm framework based on two-dimensional torus hierarchical AllReduce. International Journal of Machine Learning and Cybernetics 15(2), 207-226. DOI: 10.1007/s13042-023-01903-9. Online publication date: 28-Jun-2023.
