DOI: 10.1145/3236367.3236376
Research article

Efficient Asynchronous Communication Progress for MPI without Dedicated Resources

Published: 23 September 2018

Abstract

The overlap of computation and communication is critical to good performance in many HPC applications. State-of-the-art designs for asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), dedicated processor cores, or application modification (e.g., the use of MPI_Test). These techniques suffer from issues such as increased code complexity and cost, and the loss of compute resources otherwise available to end applications. In this paper, we take up this challenge and propose a simple yet effective technique to achieve good overlap without requiring any additional hardware or software resources. The proposed thread-based design allows MPI libraries to self-detect when asynchronous communication progress is needed and minimizes the number of context switches and preemptions between the main thread and the asynchronous progress thread. We evaluate the proposed design against state-of-the-art designs in other MPI libraries, including MVAPICH2, Intel MPI, and Open MPI. We demonstrate the benefits of the proposed approach at the microbenchmark and application levels at scale on four different architectures, including Intel Broadwell, Intel Xeon Phi (KNL), IBM OpenPOWER, and Intel Skylake, with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, the proposed approach shows up to 46%, 37%, and 49% improvement for All-to-one, One-to-all, and All-to-all collective communication patterns, respectively, on 1,024 processes. We also show a 38% performance improvement for compute-intensive SPEC MPI applications on 384 processes and a 44% performance improvement with the P3DFFT application on 448 processes.
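To make the overlap problem concrete, the sketch below illustrates the application-modification baseline the abstract mentions: a non-blocking collective is issued up front, and the application interleaves computation with explicit MPI_Test calls so that the MPI library gets opportunities to progress the operation. This is a minimal, hypothetical example (buffer sizes, the compute kernel, and the file name overlap.c are placeholders); it is not the paper's thread-based in-library design, which removes the need for such manual polling.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 4096  /* elements exchanged with each peer (illustrative size) */

/* Stand-in for application computation that runs while the
   all-to-all is (ideally) progressing in the background. */
static void do_compute_chunk(double *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = buf[i] * 1.000001 + 0.5;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc((size_t)size * COUNT * sizeof(double));
    double *recvbuf = malloc((size_t)size * COUNT * sizeof(double));
    double *work    = malloc(COUNT * sizeof(double));
    for (int i = 0; i < size * COUNT; i++) sendbuf[i] = (double)rank;
    for (int i = 0; i < COUNT; i++)        work[i]    = (double)i;

    /* Issue the non-blocking all-to-all, then overlap it with computation. */
    MPI_Request req;
    MPI_Ialltoall(sendbuf, COUNT, MPI_DOUBLE,
                  recvbuf, COUNT, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    /* Without asynchronous progress, the collective may only advance inside
       MPI calls, so the application sprinkles MPI_Test between compute chunks;
       this is exactly the extra code complexity the paper aims to avoid. */
    int done = 0;
    while (!done) {
        do_compute_chunk(work, COUNT);
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    if (rank == 0)
        printf("all-to-all completed while computing\n");

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}

Built and run with, for example, mpicc overlap.c -o overlap and mpirun -np 4 ./overlap, this pattern overlaps communication with work only as long as the programmer keeps polling; the design proposed in the paper instead lets the MPI library detect when progress is needed, so the MPI_Test loop can be replaced by a single MPI_Wait.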



      Published In

      EuroMPI '18: Proceedings of the 25th European MPI Users' Group Meeting
      September 2018
      187 pages
      ISBN: 9781450364928
      DOI: 10.1145/3236367
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 23 September 2018


      Author Tags

      1. Async progress
      2. Blocking/nonblocking operations
      3. Collective operations
      4. Communication computation overlap
      5. HPC
      6. MPI
      7. P3DFFT
      8. SPEC MPI2007 benchmarks

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      EuroMPI '18: 25th European MPI Users' Group Meeting
      September 23 - 26, 2018
      Barcelona, Spain

      Acceptance Rates

      Overall Acceptance Rate 66 of 139 submissions, 47%


      Cited By

      • (2023) A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 123-133. DOI: 10.1109/IPDPS54959.2023.00022. Online publication date: May 2023.
      • (2022) Improving Communication Asynchrony and Concurrency for Adaptive MPI Endpoints. 2022 IEEE/ACM International Workshop on Exascale MPI (ExaMPI), 11-21. DOI: 10.1109/ExaMPI56604.2022.00007. Online publication date: November 2022.
      • (2022) Compiler-enabled Optimization of Persistent MPI Operations. 2022 IEEE/ACM International Workshop on Exascale MPI (ExaMPI), 1-10. DOI: 10.1109/ExaMPI56604.2022.00006. Online publication date: November 2022.
      • (2022) IMB-ASYNC: A Revised Method and Benchmark to Estimate MPI-3 Asynchronous Progress Efficiency. Cluster Computing, 25(4), 2683-2697. DOI: 10.1007/s10586-021-03452-8. Online publication date: 1 August 2022.
      • (2021) Workload Imbalance in HPC Applications: Effect on Performance of In-Network Processing. 2021 IEEE High Performance Extreme Computing Conference (HPEC), 1-8. DOI: 10.1109/HPEC49654.2021.9622847. Online publication date: 20 September 2021.
      • (2021) Overlapping Communication and Computation with ExaMPI's Strong Progress and Modern C++ Design. 2021 Workshop on Exascale MPI (ExaMPI), 18-26. DOI: 10.1109/ExaMPI54564.2021.00008. Online publication date: November 2021.
      • (2021) Daps: A Dynamic Asynchronous Progress Stealing Model for MPI Communication. 2021 IEEE International Conference on Cluster Computing (CLUSTER), 516-527. DOI: 10.1109/Cluster48925.2021.00027. Online publication date: September 2021.
      • (2020) Overlapping MPI Communications with Intel TBB Computation. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 958-966. DOI: 10.1109/IPDPSW50202.2020.00159. Online publication date: May 2020.
      • (2019) Study on Progress Threads Placement and Dedicated Cores for Overlapping MPI Nonblocking Collectives on Manycore Processor. The International Journal of High Performance Computing Applications. DOI: 10.1177/1094342019860184. Online publication date: 2 July 2019.
      • (2019) An Evaluation of the CORAL Interconnects. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19), 1-18. DOI: 10.1145/3295500.3356166. Online publication date: 17 November 2019.
