
MPI+Threads: runtime contention and remedies

Published: 24 January 2015
DOI: 10.1145/2688500.2688522

Abstract

Hybrid MPI+Threads programming has emerged as an alternative to the “MPI everywhere” model to better handle the increasing core density in cluster nodes. While the MPI standard allows multithreaded concurrent communication, this flexibility comes at the cost of maintaining thread safety within the MPI implementation, typically enforced with critical sections. In contrast to previous work that studied the importance of critical-section granularity in MPI implementations, in this paper we investigate the implications of critical-section arbitration for communication performance. We first analyze the MPI runtime when multithreaded concurrent communication takes place on hierarchical memory systems. Our results indicate that the mutex-based approach that most MPI implementations use today can incur performance penalties due to unfair arbitration. We then present methods to mitigate these penalties with first-come, first-served arbitration and a priority locking scheme that favors threads doing useful work. Through evaluations using several benchmarks and applications, we demonstrate up to a 5-fold improvement in performance.
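
The two remedies named above differ only in how they order threads waiting at the MPI runtime's critical section. As a minimal sketch of the first remedy (first-come, first-served arbitration), the C11 code below implements a classic ticket lock, which serves waiters strictly in arrival order and so avoids the barging that an unfair mutex permits. This is an illustration, not the paper's implementation; the names ticket_lock_t, ticket_lock_acquire, and ticket_lock_release are hypothetical.

#include <stdatomic.h>
#include <sched.h>

/* Ticket lock: threads take numbered tickets and are admitted in
 * ticket order, giving first-come, first-served arbitration. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently admitted */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static void ticket_lock_acquire(ticket_lock_t *l) {
    /* Take a unique ticket; fetch_add returns the pre-increment value. */
    unsigned int me = atomic_fetch_add(&l->next_ticket, 1);
    /* Wait until our ticket is called; no later arrival can overtake us. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        sched_yield();   /* yield rather than burn the core while waiting */
}

static void ticket_lock_release(ticket_lock_t *l) {
    /* Admit the next ticket in line. */
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}

A plain ticket lock cannot express the second remedy on its own: the priority scheme described in the abstract additionally lets threads doing useful work acquire the critical section ahead of threads that are merely polling for progress, which requires a lock with two waiting classes rather than a single FIFO queue.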



Published In

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015, 290 pages
ISBN: 9781450332057
DOI: 10.1145/2688500

Also published in ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015
ISSN: 0362-1340; EISSN: 1558-1160
DOI: 10.1145/2858788
Editor: Andy Gill

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. MPI
  2. critical section
  3. runtime contention
  4. threads

Qualifiers

  • Research-article

Conference

PPoPP '15

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions (23%)


