
Noise Injection Techniques to Expose Subtle and Unintended Message Races

Published: 26 January 2017

Abstract

Debugging intermittent bugs in MPI applications is challenging, and message races, in which two or more sends race to match the same receive, are a common root cause. Many debugging tools have been proposed to help programmers resolve them, but the tools' runtime interference perturbs timing enough that subtle races often cannot be reproduced under the tool. We present novel noise injection techniques that expose message races even under a tool's control. We first formalize the race problem in the context of non-deterministic parallel applications and use this analysis to determine an effective noise-injection strategy for uncovering races. We codified these techniques in NINJA (Noise INJection Agent), a tool that exposes these races without modification to the application. Our evaluations on synthetic cases as well as a real-world bug in Hypre-2.10.1 show that NINJA significantly helps expose races.
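
To make the idea of a message race concrete, the following is a minimal sketch in Python. It is a toy timing model and not NINJA's algorithm or any MPI API: the names `match_order` and `explore`, the sender ranks, and the arrival times are all hypothetical and chosen for illustration. It assumes a single wildcard receive that matches whichever message arrives first, and shows how an injected delay can flip a match order that an undisturbed run would never exhibit.

```python
import random


def match_order(send_times, noise=None):
    """Return sender ranks in the order a wildcard receive would match them.

    send_times maps a (hypothetical) sender rank to the time its message
    would arrive with no interference; noise maps a rank to an extra delay.
    """
    noise = noise or {}
    arrivals = [(t + noise.get(rank, 0.0), rank) for rank, t in send_times.items()]
    return [rank for _, rank in sorted(arrivals)]


def explore(send_times, max_delay, trials, seed=0):
    """Inject random per-sender delays and collect every match order seen."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(trials):
        noise = {rank: rng.uniform(0.0, max_delay) for rank in send_times}
        seen.add(tuple(match_order(send_times, noise)))
    return seen


# Rank 0 normally arrives 0.5 time units before rank 1, so an undisturbed
# run always matches rank 0 first and the race stays hidden.
send_times = {0: 1.0, 1: 1.5}
baseline = match_order(send_times)

# Delaying rank 0 by more than the 0.5 gap flips the match order.
flipped = match_order(send_times, {0: 1.0})

# Random delays larger than the gap make both orders observable.
orders = explore(send_times, max_delay=2.0, trials=200)
print(baseline, flipped, sorted(orders))
```

In this model the race only surfaces when the injected delay exceeds the natural gap between the two sends, which is one way to see why the amount and placement of noise matter.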



Published In

ACM SIGPLAN Notices, Volume 52, Issue 8
PPoPP '17
August 2017
442 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3155284
  • PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    January 2017
    476 pages
    ISBN: 9781450344937
    DOI: 10.1145/3018743
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. debugging
  2. mpi
  3. non-determinism

Qualifiers

  • Research-article

Cited By

  • (2018) Dynamic Symbolic Verification of MPI Programs. Formal Methods, pages 466-484. DOI: 10.1007/978-3-319-95582-7_28. Online publication date: 12-Jul-2018
  • (2024) Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-Threaded Programs. 2024 IEEE International Conference on Cluster Computing (CLUSTER), pages 27-38. DOI: 10.1109/CLUSTER59578.2024.00010. Online publication date: 24-Sep-2024
  • (2024) Analysis and prediction of performance variability in large-scale computing systems. The Journal of Supercomputing. DOI: 10.1007/s11227-024-06040-w. Online publication date: 28-Mar-2024
  • (2023) A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing. The International Journal of High Performance Computing Applications, 37:3-4, pages 306-327. DOI: 10.1177/10943420231166610. Online publication date: 5-Apr-2023
  • (2023) MPIRace: A Static Data Race Detector for MPI Programs. Languages and Compilers for Parallel Computing, pages 73-90. DOI: 10.1007/978-3-031-31445-2_6. Online publication date: 10-May-2023
  • (2022) Debugging MPI Implementations via Reduction-to-Primitives. 2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck), pages 1-9. DOI: 10.1109/SuperCheck56652.2022.00007. Online publication date: Nov-2022
  • (2022) Leveraging the Dynamic Program Structure Tree to Detect Data Races in OpenMP Programs. 2022 IEEE/ACM Sixth International Workshop on Software Correctness for HPC Applications (Correctness), pages 54-62. DOI: 10.1109/Correctness56720.2022.00012. Online publication date: Nov-2022
  • (2021) Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph Kernels. IEEE Transactions on Parallel and Distributed Systems, 32:12, pages 2936-2952. DOI: 10.1109/TPDS.2021.3081530. Online publication date: 1-Dec-2021
  • (2021) ANACIN-X: A software framework for studying non-determinism in MPI applications. Software Impacts, volume 10, article 100151. DOI: 10.1016/j.simpa.2021.100151. Online publication date: Nov-2021
  • (2020) Detecting and reproducing error-code propagation bugs in MPI implementations. Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 187-201. DOI: 10.1145/3332466.3374515. Online publication date: 19-Feb-2020
