[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2851553.2851575acmconferencesArticle/Chapter ViewAbstractPublication PagesicpeConference Proceedingsconference-collections
research-article
Public Access

Communication Characterization and Optimization of Applications Using Topology-Aware Task Mapping on Large Supercomputers

Published: 12 March 2016 Publication History

Abstract

On large supercomputers, the job scheduling systems may assign a non-contiguous node allocation for user applications depending on available resources. With parallel applications using MPI (Message Passing Interface), the default process ordering does not take into account the actual physical node layout available to the application. This contributes to non-locality in terms of physical network topology and impacts communication performance of the application. In order to mitigate such performance penalties, this work describes techniques to identify suitable task mapping that takes the layout of the allocated nodes as well as the application's communication behavior into account. During the first phase of this research, we instrumented and collected performance data to characterize communication behavior of critical US DOE (United States - Department of Energy) applications using an augmented version of the mpiP tool. Subsequently, we developed several reordering methods (spectral bisection, neighbor join tree etc.) to combine node layout and application communication data for optimized task placement. We developed a tool called mpiAproxy to facilitate detailed evaluation of the various reordering algorithms without requiring full application executions. This work presents a comprehensive performance evaluation (14,000 experiments) of the various task mapping techniques in lowering communication costs on Titan, the leadership class supercomputer at Oak Ridge National Laboratory.

References

[1]
AMG2013 - parallel algebraic multigrid solver. https://codesign.llnl.gov/amg2013.php, 2015.
[2]
BoxLibAMR- Block Structured AMR for Combusion Studies. http://exactcodesign.org/sample-page/s3dboxlib/, 2015.
[3]
HPCG:High Performance Conjugate Gradients Benchmark. http://www.hpcg-benchmark.org, 2015.
[4]
LULESH: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. https://codesign.llnl.gov/lulesh.php, 2015.
[5]
MCB - Monte Carlo Benchmark. https://codesign.llnl.gov/mcb.php, 2015.
[6]
MPAS: Model for Prediction Across Scales. https://mpas-dev.github.io/, 2015.
[7]
MultiGridC. http://exactcodesign.org/proxy-app-software/, 2015.
[8]
Nek500. http://nek5000.mcs.anl.gov/, 2015.
[9]
NEKBONE. https://cesar.mcs.anl.gov/content/software/thermal_hydraulics, 2015.
[10]
Titan - Cray XK7 Supercomputer at Oak Ridge National Laboratory. https://www.olcf.ornl.gov/computing-resources/titan-cray-xk7/, 2015.
[11]
Titan - Job Scheduling Policy. https://www.olcf.ornl.gov/kb_articles/titan-scheduling-policy/, 2015.
[12]
TOP500 - Top 500 Supercomputer Sites in the World - June 2015. http://top500.org/lists/2015/06/, 2015.
[13]
G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J. C. Sexton, and R. Walkup. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development, 40:489--500, 2005.
[14]
A. Bhatele. Automating topology aware mapping for supercomputers. PhD thesis, University of Illinois at Urbana-Champaign, 2010.
[15]
S. W. Bollinger and S. F. Midkiff. Processor and link assignment in multicomputers using simulated annealing. In Proceedings of the International Conference on Parallel Processing, ICPP '88, The Pennsylvania State University, University Park, PA, USA, August 1988. Volume 1: Architecture., pages 1--7, 1988.
[16]
CORAL Collaboration. CORAL Request for Proposal B604142, 2015.
[17]
E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference, ACM '69, pages 157--172, New York, NY, USA, 1969. ACM.
[18]
M. Deveci, K. Kaya, B. Uçar, and Ü. V. Çatalyürek. Fast and high quality topology-aware task mapping. In Proceedings of 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2014.
[19]
F. Ercal, J. Ramanujam, and P. Sadayappan. Task allocation onto a hypercube by recursive mincut bipartitioning. In Proceedings of the 3rd Conference on Hypercube Concurrent Computers and Applications, pages 210--221. ACM Press, 1988.
[20]
A. George and J. W. Liu. Computer Solution of Large Sparse Positive Definite. Prentice Hall Professional Technical Reference, 1981.
[21]
J. A. George. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis, 10(2):345--363, 1973.
[22]
J. L. Hintze and R. D. Nelson. Violin plots: A box plot-density trace synergism. The American Statistician, 52(2):181--184, May 1998.
[23]
T. Hoefler and M. Snir. Generic topology mapping strategies for large-scale parallel architectures. In Proceedings of the international conference on Supercomputing, pages 75--84. ACM, 2011.
[24]
S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3), 1967.
[25]
I. Karlin, J. Keasler, and R. Neely. Lulesh 2.0 updates and changes. Technical Report LLNL-TR-641973, August 2013.
[26]
B. Lee, S. McCormick, B. Philip, and D. Quinlan. Asynchronous fast adaptive composite-grid methods: numerical results. SIAM journal on scientific computing, 25(2), 2004.
[27]
B. Lee, S. McCormick, B. Philip, and D. Quinlan. Asynchronous fast adaptive composite-grid methods for elliptic problems: Theoretical foundations. SIAM journal on numerical analysis, 42(1), 2005.
[28]
S. McCormick and D. Quinlan. Asynchronous multilevel adaptive methods for solving partial differential equations on multiprocessors: Performance results* 1. Parallel computing, 12(2), 1989.
[29]
S. F. McCormick. Multilevel Adaptive Methods for Partial Differential Equations. SIAM, Philadelphia, PA, 1989.
[30]
S. F. McCormick and J. W. Thomas. The Fast Adaptive Composite grid (FAC) method for elliptic equations. Math. Comp., 46:439--456, 1986.
[31]
B. Philip, Z. Wang, M. Berrill, M. Birke, and M. Pernice. Dynamic implicit 3d adaptive mesh refinement for non-equilibrium radiation diffusion. Journal of Computational Physics, 262:17 -- 37, 2014.
[32]
N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4(4):406--425, 1987.
[33]
R. Sankaran, J. Angel, and W. M. Brown. Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications. Concurrency and Computation: Practice and Experience, 2015.
[34]
K. Schloegel, G. Karypis, and V. Kumar. Parallel static and dynamic multi-constraint graph partitioning. Concurrency and Computation: Practice and Experience, 14(3):219--240, 2002.
[35]
E. Solomonik, A. Bhatele, and J. Demmel. Improving communication performance in dense linear algebra via topology aware collectives. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 1--11, New York, NY, USA, 2011. ACM.
[36]
S. Sreepathi, M. L. Grodowitz, R. Lim, P. Taffet, P. C. Roth, J. Meredith, S. Lee, D. Li, and J. Vetter. Application Characterization Using Oxbow Toolkit and PADS Infrastructure. In Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing, Co-HPC '14, pages 55--63. IEEE Press, 2014.
[37]
J. Vetter and C. Chambreau. mpip: Lightweight, scalable mpi profiling, 2004. URL http://mpip. sourceforge. net.

Cited By

View all
  • (2019)Reproducibility in Benchmarking Parallel Fast Fourier Transform based ApplicationsCompanion of the 2019 ACM/SPEC International Conference on Performance Engineering10.1145/3302541.3313105(5-8)Online publication date: 27-Mar-2019
  • (2019)Coordinated DMA: Improving the DRAM Access Efficiency for Matrix MultiplicationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2019.290689130:10(2148-2164)Online publication date: 1-Oct-2019
  • (2019)Node aware sparse matrix–vector multiplicationJournal of Parallel and Distributed Computing10.1016/j.jpdc.2019.03.016130:C(166-178)Online publication date: 1-Aug-2019
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ICPE '16: Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering
March 2016
346 pages
ISBN:9781450340809
DOI:10.1145/2851553
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 March 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. communication characterization
  2. reordering algorithms
  3. topology-aware optimization

Qualifiers

  • Research-article

Funding Sources

Conference

ICPE'16

Acceptance Rates

ICPE '16 Paper Acceptance Rate 23 of 74 submissions, 31%;
Overall Acceptance Rate 252 of 851 submissions, 30%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)74
  • Downloads (Last 6 weeks)6
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2019)Reproducibility in Benchmarking Parallel Fast Fourier Transform based ApplicationsCompanion of the 2019 ACM/SPEC International Conference on Performance Engineering10.1145/3302541.3313105(5-8)Online publication date: 27-Mar-2019
  • (2019)Coordinated DMA: Improving the DRAM Access Efficiency for Matrix MultiplicationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2019.290689130:10(2148-2164)Online publication date: 1-Oct-2019
  • (2019)Node aware sparse matrix–vector multiplicationJournal of Parallel and Distributed Computing10.1016/j.jpdc.2019.03.016130:C(166-178)Online publication date: 1-Aug-2019
  • (2018)Level-Spread: A New Job Allocation Policy for Dragonfly Networks2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2018.00121(1123-1132)Online publication date: May-2018
  • (2018)TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A SupercomputerAlgorithms and Architectures for Parallel Processing10.1007/978-3-030-05051-1_17(242-256)Online publication date: 7-Dec-2018
  • (2017)Efficient Mapping of Multi-threaded Applications onto 3D Stacked Chip-Multiprocessor2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)10.1109/HPCC-SmartCity-DSS.2017.43(324-331)Online publication date: Dec-2017

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media