
MPI+Threads: runtime contention and remedies

Published: 24 January 2015
DOI: 10.1145/2688500.2688522

Abstract

Hybrid MPI+Threads programming has emerged as an alternative to the “MPI everywhere” model to better handle the increasing core density in cluster nodes. While the MPI standard allows multithreaded concurrent communication, this flexibility comes at the cost of maintaining thread safety within the MPI implementation, typically enforced with critical sections. In contrast to previous work that studied the importance of critical-section granularity in MPI implementations, in this paper we investigate the implications of critical-section arbitration for communication performance. We first analyze the MPI runtime when multithreaded concurrent communication takes place on hierarchical memory systems. Our results indicate that the mutex-based approach that most MPI implementations use today can incur performance penalties due to unfair arbitration. We then present methods to mitigate these penalties with first-come, first-served arbitration and a priority locking scheme that favors threads doing useful work. Through evaluations using several benchmarks and applications, we demonstrate up to a 5-fold improvement in performance.
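
The two remedies named above differ only in how they order threads waiting at the MPI runtime's critical section. As a minimal sketch of the first remedy (first-come, first-served arbitration), the C11 code below implements a classic ticket lock, which serves waiters strictly in arrival order and so avoids the barging that an unfair mutex permits. This is an illustration, not the paper's implementation; the names ticket_lock_t, ticket_lock_acquire, and ticket_lock_release are hypothetical.

#include <stdatomic.h>
#include <sched.h>

/* Ticket lock: threads take numbered tickets and are admitted in
 * ticket order, giving first-come, first-served arbitration. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently admitted */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static void ticket_lock_acquire(ticket_lock_t *l) {
    /* Take a unique ticket; fetch_add returns the pre-increment value. */
    unsigned int me = atomic_fetch_add(&l->next_ticket, 1);
    /* Wait until our ticket is called; no later arrival can overtake us. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        sched_yield();   /* yield rather than burn the core while waiting */
}

static void ticket_lock_release(ticket_lock_t *l) {
    /* Admit the next ticket in line. */
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}

A plain ticket lock cannot express the second remedy on its own: the priority scheme described in the abstract additionally lets threads doing useful work acquire the critical section ahead of threads that are merely polling for progress, which requires a lock with two waiting classes rather than a single FIFO queue.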



Published In

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015, 290 pages
ISBN: 9781450332057
DOI: 10.1145/2688500

Also published in ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015
ISSN: 0362-1340; EISSN: 1558-1160
DOI: 10.1145/2858788
Editor: Andy Gill

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. MPI
  2. critical section
  3. runtime contention
  4. threads

Qualifiers

  • Research-article

Conference

PPoPP '15

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions (23%)


