research-article

Diagnosis of Performance Faults in Large-Scale MPI Applications via Probabilistic Progress-Dependence Inference

Published: 01 May 2015

Abstract

Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about the root causes of failures. Moreover, most debuggers significantly slow down program execution and run sluggishly with massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and to identify the code lines at which the failure arose. We present a novel algorithm that probabilistically infers progress dependence among MPI tasks using a globally constructed Markov model that represents the tasks' control-flow behavior. Compared with previous work, our algorithm infers the least-progressed task more precisely. We combine this technique with static backward slicing analysis to further isolate the code responsible for the current state. A blind study demonstrates that our technique isolates the root cause of a concurrency bug in a molecular dynamics simulation that manifests itself only at 7,996 tasks or more. We extensively evaluate the fault coverage of our technique via fault injections in 10 HPC benchmarks and show that our analysis takes less than a few seconds on thousands of parallel tasks.
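To make the idea of a least-progressed task concrete, the following is a minimal Python sketch. It is not the algorithm from the paper: it assumes each task's control-flow history is available as a list of state labels, builds one Markov model from all traces, and treats state `a` as "behind" state `b` when `b` is more likely to be reached from `a` than the reverse. The function names, the trace format, the state labels, and the `max_steps` cutoff are all illustrative assumptions.

```python
from collections import defaultdict


def build_markov_model(traces):
    """Estimate transition probabilities from all tasks' state traces.

    traces: dict mapping task id -> non-empty list of state labels (assumed format).
    Returns: dict mapping state -> {next_state: probability}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for states in traces.values():
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1
    model = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        model[src] = {dst: n / total for dst, n in dsts.items()}
    return model


def reach_probability(model, src, dst, max_steps=50):
    """Probability that a walk starting at src hits dst within max_steps."""
    if src == dst:
        return 1.0
    dist = {src: 1.0}  # probability mass that has not yet reached dst
    hit = 0.0
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for state, p in dist.items():
            for succ, q in model.get(state, {}).items():
                if succ == dst:
                    hit += p * q
                else:
                    nxt[succ] += p * q
        dist = nxt
        if not dist:
            break
    return hit


def least_progressed_tasks(traces):
    """Return the tasks whose current state is not ahead of any other task's state."""
    model = build_markov_model(traces)
    current = {task: states[-1] for task, states in traces.items()}
    distinct = set(current.values())

    def is_behind(a, b):
        # Simplifying assumption: a is behind b if b is likelier to follow a than vice versa.
        return reach_probability(model, a, b) > reach_probability(model, b, a)

    least = {s for s in distinct
             if not any(is_behind(t, s) for t in distinct if t != s)}
    return sorted(task for task, state in current.items() if state in least)


# Hypothetical traces: task 2 is stuck in "compute" while the others wait at a barrier.
traces = {
    0: ["init", "compute", "send", "barrier"],
    1: ["init", "compute", "send", "barrier"],
    2: ["init", "compute"],
}
print(least_progressed_tasks(traces))  # → [2]
```

In this toy run the tasks waiting at the barrier cannot proceed until task 2 gets there, so task 2 is reported as least progressed. A real tool would also have to handle loops and build the model scalably across thousands of tasks, which this sketch does not attempt.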

Cited By

  • (2024) GVARP: Detecting Performance Variance on Large-Scale Heterogeneous Systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-16. DOI: 10.1109/SC41406.2024.00063. Online publication date: 17-Nov-2024
  • (2022) Vapro. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 150-162. DOI: 10.1145/3503221.3508411. Online publication date: 2-Apr-2022
  • (2022) Detecting Performance Variance for Parallel Applications Without Source Code. IEEE Transactions on Parallel and Distributed Systems 33:12, pp. 4239-4255. DOI: 10.1109/TPDS.2022.3181799. Online publication date: 1-Dec-2022
  • (2022) Leveraging Code Snippets to Detect Variations in the Performance of HPC Systems. IEEE Transactions on Parallel and Distributed Systems 33:12, pp. 3558-3574. DOI: 10.1109/TPDS.2022.3158742. Online publication date: 1-Dec-2022

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 26, Issue 5
May 2015
291 pages

Publisher

IEEE Press

Author Tags

1. parallel applications
2. distributed debugging
3. MPI
4. progress dependence
