More Web Proxy on the site http://driver.im/

research-article

Public Access

Communication Characterization and Optimization of Applications Using Topology-Aware Task Mapping on Large Supercomputers

Authors:

Sarat Sreepathi,

Patrick WorleyAuthors Info & Claims

ICPE '16: Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering

Pages 225 - 236

https://doi.org/10.1145/2851553.2851575

Published: 12 March 2016 Publication History

Abstract

On large supercomputers, the job scheduling systems may assign a non-contiguous node allocation for user applications depending on available resources. With parallel applications using MPI (Message Passing Interface), the default process ordering does not take into account the actual physical node layout available to the application. This contributes to non-locality in terms of physical network topology and impacts communication performance of the application. In order to mitigate such performance penalties, this work describes techniques to identify suitable task mapping that takes the layout of the allocated nodes as well as the application's communication behavior into account. During the first phase of this research, we instrumented and collected performance data to characterize communication behavior of critical US DOE (United States - Department of Energy) applications using an augmented version of the mpiP tool. Subsequently, we developed several reordering methods (spectral bisection, neighbor join tree etc.) to combine node layout and application communication data for optimized task placement. We developed a tool called mpiAproxy to facilitate detailed evaluation of the various reordering algorithms without requiring full application executions. This work presents a comprehensive performance evaluation (14,000 experiments) of the various task mapping techniques in lowering communication costs on Titan, the leadership class supercomputer at Oak Ridge National Laboratory.

References

[1]

AMG2013 - parallel algebraic multigrid solver. https://codesign.llnl.gov/amg2013.php, 2015.

[2]

BoxLibAMR- Block Structured AMR for Combusion Studies. http://exactcodesign.org/sample-page/s3dboxlib/, 2015.

[3]

HPCG:High Performance Conjugate Gradients Benchmark. http://www.hpcg-benchmark.org, 2015.

[4]

LULESH: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. https://codesign.llnl.gov/lulesh.php, 2015.

[5]

MCB - Monte Carlo Benchmark. https://codesign.llnl.gov/mcb.php, 2015.

[6]

MPAS: Model for Prediction Across Scales. https://mpas-dev.github.io/, 2015.

[7]

MultiGridC. http://exactcodesign.org/proxy-app-software/, 2015.

[8]

Nek500. http://nek5000.mcs.anl.gov/, 2015.

[9]

NEKBONE. https://cesar.mcs.anl.gov/content/software/thermal_hydraulics, 2015.

[10]

Titan - Cray XK7 Supercomputer at Oak Ridge National Laboratory. https://www.olcf.ornl.gov/computing-resources/titan-cray-xk7/, 2015.

[11]

Titan - Job Scheduling Policy. https://www.olcf.ornl.gov/kb_articles/titan-scheduling-policy/, 2015.

[12]

TOP500 - Top 500 Supercomputer Sites in the World - June 2015. http://top500.org/lists/2015/06/, 2015.

[13]

G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J. C. Sexton, and R. Walkup. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development, 40:489--500, 2005.

Digital Library

[14]

A. Bhatele. Automating topology aware mapping for supercomputers. PhD thesis, University of Illinois at Urbana-Champaign, 2010.

Digital Library

[15]

S. W. Bollinger and S. F. Midkiff. Processor and link assignment in multicomputers using simulated annealing. In Proceedings of the International Conference on Parallel Processing, ICPP '88, The Pennsylvania State University, University Park, PA, USA, August 1988. Volume 1: Architecture., pages 1--7, 1988.

[16]

CORAL Collaboration. CORAL Request for Proposal B604142, 2015.

[17]

E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference, ACM '69, pages 157--172, New York, NY, USA, 1969. ACM.

Digital Library

[18]

M. Deveci, K. Kaya, B. Uçar, and Ü. V. Çatalyürek. Fast and high quality topology-aware task mapping. In Proceedings of 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2014.

Digital Library

[19]

F. Ercal, J. Ramanujam, and P. Sadayappan. Task allocation onto a hypercube by recursive mincut bipartitioning. In Proceedings of the 3rd Conference on Hypercube Concurrent Computers and Applications, pages 210--221. ACM Press, 1988.

Digital Library

[20]

A. George and J. W. Liu. Computer Solution of Large Sparse Positive Definite. Prentice Hall Professional Technical Reference, 1981.

Digital Library

[21]

J. A. George. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis, 10(2):345--363, 1973.

Digital Library

[22]

J. L. Hintze and R. D. Nelson. Violin plots: A box plot-density trace synergism. The American Statistician, 52(2):181--184, May 1998.

[23]

T. Hoefler and M. Snir. Generic topology mapping strategies for large-scale parallel architectures. In Proceedings of the international conference on Supercomputing, pages 75--84. ACM, 2011.

Digital Library

[24]

S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3), 1967.

[25]

I. Karlin, J. Keasler, and R. Neely. Lulesh 2.0 updates and changes. Technical Report LLNL-TR-641973, August 2013.

[26]

B. Lee, S. McCormick, B. Philip, and D. Quinlan. Asynchronous fast adaptive composite-grid methods: numerical results. SIAM journal on scientific computing, 25(2), 2004.

Digital Library

[27]

B. Lee, S. McCormick, B. Philip, and D. Quinlan. Asynchronous fast adaptive composite-grid methods for elliptic problems: Theoretical foundations. SIAM journal on numerical analysis, 42(1), 2005.

Digital Library

[28]

S. McCormick and D. Quinlan. Asynchronous multilevel adaptive methods for solving partial differential equations on multiprocessors: Performance results* 1. Parallel computing, 12(2), 1989.

[29]

S. F. McCormick. Multilevel Adaptive Methods for Partial Differential Equations. SIAM, Philadelphia, PA, 1989.

Digital Library

[30]

S. F. McCormick and J. W. Thomas. The Fast Adaptive Composite grid (FAC) method for elliptic equations. Math. Comp., 46:439--456, 1986.

Digital Library

[31]

B. Philip, Z. Wang, M. Berrill, M. Birke, and M. Pernice. Dynamic implicit 3d adaptive mesh refinement for non-equilibrium radiation diffusion. Journal of Computational Physics, 262:17 -- 37, 2014.

Digital Library

[32]

N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4(4):406--425, 1987.

[33]

R. Sankaran, J. Angel, and W. M. Brown. Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications. Concurrency and Computation: Practice and Experience, 2015.

Digital Library

[34]

K. Schloegel, G. Karypis, and V. Kumar. Parallel static and dynamic multi-constraint graph partitioning. Concurrency and Computation: Practice and Experience, 14(3):219--240, 2002.

[35]

E. Solomonik, A. Bhatele, and J. Demmel. Improving communication performance in dense linear algebra via topology aware collectives. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 1--11, New York, NY, USA, 2011. ACM.

Digital Library

[36]

S. Sreepathi, M. L. Grodowitz, R. Lim, P. Taffet, P. C. Roth, J. Meredith, S. Lee, D. Li, and J. Vetter. Application Characterization Using Oxbow Toolkit and PADS Infrastructure. In Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing, Co-HPC '14, pages 55--63. IEEE Press, 2014.

Digital Library

[37]

J. Vetter and C. Chambreau. mpip: Lightweight, scalable mpi profiling, 2004. URL http://mpip. sourceforge. net.

Cited By

Aseeri SMuite BTakahashi DApte VDi Marco ALitoiu MMerseguer J(2019)Reproducibility in Benchmarking Parallel Fast Fourier Transform based ApplicationsCompanion of the 2019 ACM/SPEC International Conference on Performance Engineering10.1145/3302541.3313105(5-8)Online publication date: 27-Mar-2019
https://dl.acm.org/doi/10.1145/3302541.3313105
Ma SLiu ZChen SHuang LGuo YWang ZZhang M(2019)Coordinated DMA: Improving the DRAM Access Efficiency for Matrix MultiplicationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2019.290689130:10(2148-2164)Online publication date: 1-Oct-2019
https://doi.org/10.1109/TPDS.2019.2906891
Bienz AGropp WOlson L(2019)Node aware sparse matrix–vector multiplicationJournal of Parallel and Distributed Computing10.1016/j.jpdc.2019.03.016130:C(166-178)Online publication date: 1-Aug-2019
https://dl.acm.org/doi/10.1016/j.jpdc.2019.03.016
Show More Cited By

Index Terms

Communication Characterization and Optimization of Applications Using Topology-Aware Task Mapping on Large Supercomputers

Recommendations

Towards a Communication Characterization Methodology for Parallel Applications
HPCA '97: Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture

Interconnection network (ICN) is a vital component of a parallel machine and is often the limiting factor in the performance of several parallel applications. While ICN performance evaluation has been a widely researched topic, there have been very few ...
Automating topology aware mapping for supercomputers
Hierarchical task mapping for parallel applications on supercomputers

As the scale of supercomputers grows, so does the size of the interconnect network. Topology-aware task mapping, which maps parallel application processes onto processors to reduce communication cost, becomes increasingly important. Previous works ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

ICPE '16: Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering

March 2016

346 pages

ISBN:9781450340809

DOI:10.1145/2851553

General Chairs:
Alberto Avritzer
Sonatype, Inc., USA
,
Alexandru Iosup
Delft University of Technology, the Netherlands
,
Program Chairs:
Xiaoyun Zhu
Futurewei Technologies, USA
,
Steffen Becker
Chemnitz University of Technology, Germany

Copyright © 2016 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 March 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy

Conference

ICPE'16

Sponsor:

ICPE'16: ACM/SPEC International Conference on Performance Engineering

March 12 - 16, 2016

Delft, The Netherlands

Acceptance Rates

ICPE '16 Paper Acceptance Rate 23 of 74 submissions, 31%;

Overall Acceptance Rate 252 of 851 submissions, 30%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

6
Total Citations
View Citations
520
Total Downloads

Downloads (Last 12 months)74
Downloads (Last 6 weeks)6

Reflects downloads up to 13 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Aseeri SMuite BTakahashi DApte VDi Marco ALitoiu MMerseguer J(2019)Reproducibility in Benchmarking Parallel Fast Fourier Transform based ApplicationsCompanion of the 2019 ACM/SPEC International Conference on Performance Engineering10.1145/3302541.3313105(5-8)Online publication date: 27-Mar-2019
https://dl.acm.org/doi/10.1145/3302541.3313105
Ma SLiu ZChen SHuang LGuo YWang ZZhang M(2019)Coordinated DMA: Improving the DRAM Access Efficiency for Matrix MultiplicationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2019.290689130:10(2148-2164)Online publication date: 1-Oct-2019
https://doi.org/10.1109/TPDS.2019.2906891
Bienz AGropp WOlson L(2019)Node aware sparse matrix–vector multiplicationJournal of Parallel and Distributed Computing10.1016/j.jpdc.2019.03.016130:C(166-178)Online publication date: 1-Aug-2019
https://dl.acm.org/doi/10.1016/j.jpdc.2019.03.016
Zhang YTuncer OKaplan FOlcoz KLeung VCoskun A(2018)Level-Spread: A New Job Allocation Policy for Dragonfly Networks2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2018.00121(1123-1132)Online publication date: May-2018
https://doi.org/10.1109/IPDPS.2018.00121
Chen XLiu JLi SXie PChi LWang Q(2018)TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A SupercomputerAlgorithms and Architectures for Parallel Processing10.1007/978-3-030-05051-1_17(242-256)Online publication date: 7-Dec-2018
https://doi.org/10.1007/978-3-030-05051-1_17
Pandey RSahu A(2017)Efficient Mapping of Multi-threaded Applications onto 3D Stacked Chip-Multiprocessor2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)10.1109/HPCC-SmartCity-DSS.2017.43(324-331)Online publication date: Dec-2017
https://doi.org/10.1109/HPCC-SmartCity-DSS.2017.43

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents