DOI: 10.1109/HOTI.2011.14
Article

Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL

Published: 24 August 2011

Abstract

The upcoming MPI-3.0 standard is expected to include non-blocking collective operations. Non-blocking collectives offer a new MPI interface that allows an application to decouple the initiation and completion of collective operations. However, to be effective, the MPI library must provide a high-performance, scalable implementation. One of the major challenges in designing an effective non-blocking collective operation is ensuring that the operation makes progress while processors are busy with application-level computation. The recently introduced Mellanox ConnectX-2 InfiniBand adapters offer a task offload interface (CORE-Direct) that enables communication progress without requiring CPU cycles. In this paper, we present the design of a non-blocking broadcast operation (MPI_Ibcast) using the CORE-Direct offload interface. Our experimental evaluations show that our implementation delivers near-perfect overlap without penalizing the latency of the MPI_Ibcast operation. Since existing MPI implementations do not provide non-blocking collective communication, scientific applications have been modified to implement collectives on top of MPI point-to-point operations to achieve overlap. HPL (High Performance Linpack) is one such application use case for non-blocking collectives. We have explored the benefits of our proposed network-offload-based MPI_Ibcast implementation with HPL, and we observe that HPL can achieve its peak throughput with significantly smaller problem sizes, which also improves its run time by up to 78% with 512 processors. We also observe that our proposed designs can minimize the impact of system noise on applications.
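
The overlap pattern described in the abstract can be illustrated with the MPI-3 non-blocking collective API. The following is a minimal sketch, not code from the paper: the message size and the compute_independent_work() routine are illustrative placeholders, and the portable MPI_Ibcast/MPI_Wait calls stand in for the paper's CORE-Direct-backed implementation, which progresses the broadcast on the network adapter rather than on the CPU.

/* Overlapping computation with a non-blocking broadcast (MPI-3).
 * Sketch only: the buffer size and compute_independent_work() are
 * illustrative placeholders, not taken from the paper. */
#include <mpi.h>
#include <stdlib.h>

static void compute_independent_work(void) {
    /* Application computation that does not depend on the broadcast data. */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                 /* placeholder message size */
    double *buf = malloc(count * sizeof(double));
    if (rank == 0) {
        for (int i = 0; i < count; i++) buf[i] = (double)i;
    }

    /* Initiate the broadcast and return immediately; with a CORE-Direct
     * style offload, the adapter can progress it without CPU cycles. */
    MPI_Request req;
    MPI_Ibcast(buf, count, MPI_DOUBLE, 0 /* root */, MPI_COMM_WORLD, &req);

    /* Overlap window: do useful work while the collective progresses. */
    compute_independent_work();

    /* Complete the broadcast before reading buf on non-root ranks. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(buf);
    MPI_Finalize();
    return 0;
}

This initiate-compute-wait structure matches the HPL use case in the abstract: the broadcast is started early, computation on already-local data continues during the overlap window, and the wait is deferred until the broadcast payload is actually needed.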


      Published In

      HOTI '11: Proceedings of the 2011 IEEE 19th Annual Symposium on High Performance Interconnects
      August 2011
      94 pages
      ISBN: 9780769545370

      Publisher

      IEEE Computer Society

      United States

      Author Tags

      1. High Performance Linpack
      2. InfiniBand
      3. MPI
      4. Non-Blocking Collective Communication

      Cited By

      • (2018) Efficient Asynchronous Communication Progress for MPI without Dedicated Resources. Proceedings of the 25th European MPI Users' Group Meeting, pp. 1-11. DOI: 10.1145/3236367.3236376. Online publication date: 23 Sep 2018.
      • (2018) On construction of a virtual GPU cluster with InfiniBand and 10 Gb Ethernet virtualization. The Journal of Supercomputing, 74(12), pp. 6876-6897. DOI: 10.1007/s11227-018-2484-5. Online publication date: 1 Dec 2018.
      • (2015) Non-blocking PMI extensions for fast MPI startup. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp. 131-140. DOI: 10.1109/CCGrid.2015.151. Online publication date: 4 May 2015.
      • (2015) Nonblocking collectives for scalable Java communications. Concurrency and Computation: Practice & Experience, 27(5), pp. 1169-1187. DOI: 10.1002/cpe.3279. Online publication date: 10 Apr 2015.
      • (2014) Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations. Supercomputing Frontiers and Innovations, 1(2), pp. 58-75. DOI: 10.14529/jsfi140204. Online publication date: 9 Jul 2014.
      • (2014) The TH Express high performance interconnect networks. Frontiers of Computer Science, 8(3), pp. 357-366. DOI: 10.1007/s11704-014-3500-9. Online publication date: 1 Jun 2014.
