DOI: 10.1145/3236367.3236376
Research article

Efficient Asynchronous Communication Progress for MPI without Dedicated Resources

Published: 23 September 2018

Abstract

The overlap of computation and communication is critical to good performance in many HPC applications. State-of-the-art designs for asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), dedicated processor cores, or application modification (e.g., the use of MPI_Test). These techniques suffer from issues such as increased code complexity and cost, and the loss of compute resources otherwise available to end applications. In this paper, we take up this challenge and propose a simple yet effective technique to achieve good overlap without requiring any additional hardware or software resources. The proposed thread-based design allows MPI libraries to self-detect when asynchronous communication progress is needed and minimizes the number of context switches and preemptions between the main thread and the asynchronous progress thread. We evaluate the proposed design against state-of-the-art designs in other MPI libraries, including MVAPICH2, Intel MPI, and Open MPI. We demonstrate the benefits of the proposed approach at the microbenchmark and application levels at scale on four different architectures, including Intel Broadwell, Intel Xeon Phi (KNL), IBM OpenPOWER, and Intel Skylake, with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, the proposed approach shows up to 46%, 37%, and 49% improvement for All-to-one, One-to-all, and All-to-all collective communication patterns, respectively, on 1,024 processes. We also show a 38% performance improvement for compute-intensive SPEC MPI applications on 384 processes and a 44% performance improvement with the P3DFFT application on 448 processes.
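To make the overlap problem concrete, the sketch below illustrates the application-modification baseline the abstract mentions: a non-blocking collective is issued up front, and the application interleaves computation with explicit MPI_Test calls so that the MPI library gets opportunities to progress the operation. This is a minimal, hypothetical example (buffer sizes, the compute kernel, and the file name overlap.c are placeholders); it is not the paper's thread-based in-library design, which removes the need for such manual polling.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 4096  /* elements exchanged with each peer (illustrative size) */

/* Stand-in for application computation that runs while the
   all-to-all is (ideally) progressing in the background. */
static void do_compute_chunk(double *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = buf[i] * 1.000001 + 0.5;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc((size_t)size * COUNT * sizeof(double));
    double *recvbuf = malloc((size_t)size * COUNT * sizeof(double));
    double *work    = malloc(COUNT * sizeof(double));
    for (int i = 0; i < size * COUNT; i++) sendbuf[i] = (double)rank;
    for (int i = 0; i < COUNT; i++)        work[i]    = (double)i;

    /* Issue the non-blocking all-to-all, then overlap it with computation. */
    MPI_Request req;
    MPI_Ialltoall(sendbuf, COUNT, MPI_DOUBLE,
                  recvbuf, COUNT, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    /* Without asynchronous progress, the collective may only advance inside
       MPI calls, so the application sprinkles MPI_Test between compute chunks;
       this is exactly the extra code complexity the paper aims to avoid. */
    int done = 0;
    while (!done) {
        do_compute_chunk(work, COUNT);
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    if (rank == 0)
        printf("all-to-all completed while computing\n");

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}

Built and run with, for example, mpicc overlap.c -o overlap and mpirun -np 4 ./overlap, this pattern overlaps communication with work only as long as the programmer keeps polling; the design proposed in the paper instead lets the MPI library detect when progress is needed, so the MPI_Test loop can be replaced by a single MPI_Wait.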



      Published In

      EuroMPI '18: Proceedings of the 25th European MPI Users' Group Meeting
      September 2018
      187 pages
      ISBN: 9781450364928
      DOI: 10.1145/3236367
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 23 September 2018


      Author Tags

      1. Async progress
      2. Blocking/nonblocking operations
      3. Collective operations
      4. Communication computation overlap
      5. HPC
      6. MPI
      7. P3DFFT
      8. SPEC MPI2007 benchmarks

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      EuroMPI '18: 25th European MPI Users' Group Meeting
      September 23 - 26, 2018
      Barcelona, Spain

      Acceptance Rates

      Overall Acceptance Rate 66 of 139 submissions, 47%


      Cited By

      • (2023) A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 123-133. DOI: 10.1109/IPDPS54959.2023.00022. Online publication date: May 2023.
      • (2022) Improving Communication Asynchrony and Concurrency for Adaptive MPI Endpoints. 2022 IEEE/ACM International Workshop on Exascale MPI (ExaMPI), 11-21. DOI: 10.1109/ExaMPI56604.2022.00007. Online publication date: November 2022.
      • (2022) Compiler-enabled Optimization of Persistent MPI Operations. 2022 IEEE/ACM International Workshop on Exascale MPI (ExaMPI), 1-10. DOI: 10.1109/ExaMPI56604.2022.00006. Online publication date: November 2022.
      • (2022) IMB-ASYNC: A Revised Method and Benchmark to Estimate MPI-3 Asynchronous Progress Efficiency. Cluster Computing, 25(4), 2683-2697. DOI: 10.1007/s10586-021-03452-8. Online publication date: 1 August 2022.
      • (2021) Workload Imbalance in HPC Applications: Effect on Performance of In-Network Processing. 2021 IEEE High Performance Extreme Computing Conference (HPEC), 1-8. DOI: 10.1109/HPEC49654.2021.9622847. Online publication date: 20 September 2021.
      • (2021) Overlapping Communication and Computation with ExaMPI's Strong Progress and Modern C++ Design. 2021 Workshop on Exascale MPI (ExaMPI), 18-26. DOI: 10.1109/ExaMPI54564.2021.00008. Online publication date: November 2021.
      • (2021) Daps: A Dynamic Asynchronous Progress Stealing Model for MPI Communication. 2021 IEEE International Conference on Cluster Computing (CLUSTER), 516-527. DOI: 10.1109/Cluster48925.2021.00027. Online publication date: September 2021.
      • (2020) Overlapping MPI Communications with Intel TBB Computation. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 958-966. DOI: 10.1109/IPDPSW50202.2020.00159. Online publication date: May 2020.
      • (2019) Study on Progress Threads Placement and Dedicated Cores for Overlapping MPI Nonblocking Collectives on Manycore Processor. The International Journal of High Performance Computing Applications. DOI: 10.1177/1094342019860184. Online publication date: 2 July 2019.
      • (2019) An Evaluation of the CORAL Interconnects. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19), 1-18. DOI: 10.1145/3295500.3356166. Online publication date: 17 November 2019.
