
Exploring the effect of noise on the performance benefit of nonblocking allreduce

Published: 09 September 2014

Abstract

Relaxed synchronization offers the potential to maintain application scalability by allowing many processes to make independent progress when some processes suffer delays. Yet the benefits of this approach in important parallel workloads have not been investigated in detail. In this paper, we use a validated simulation approach to explore the noise-mitigation effects of nonblocking allreduce in workloads where allreduce is a major contributor to total execution time. Although a nonblocking allreduce is unlikely to provide significant benefit in the low-OS-noise environments expected in next-generation HPC systems, we show that it has the potential to improve application runtime in the presence of other noise types.
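
To make the mechanism concrete, the sketch below (illustrative only, assuming an MPI-3 implementation; do_independent_work is a hypothetical placeholder, not code from the paper) shows the pattern the paper studies: a process starts the reduction with the nonblocking MPI_Iallreduce, continues with computation that does not depend on the result, and blocks only in MPI_Wait.

    /* Minimal sketch of overlapping an allreduce with independent work. */
    #include <mpi.h>
    #include <stdio.h>

    static void do_independent_work(void) {
        /* Hypothetical placeholder: work that does not depend on the
           reduction result. A rank delayed by noise here does not stall
           the other ranks inside a blocking collective. */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        double local_sum = 1.0, global_sum = 0.0;
        MPI_Request req;

        /* Start the reduction; the call returns immediately. */
        MPI_Iallreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                       MPI_SUM, MPI_COMM_WORLD, &req);

        do_independent_work();

        /* Synchronize only when the global result is actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        printf("global sum = %f\n", global_sum);
        MPI_Finalize();
        return 0;
    }

Compared with a blocking MPI_Allreduce at the same point, a delay injected by noise on one rank can be absorbed by the overlapped computation instead of immediately stalling every participant in the collective.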


Published In

EuroMPI/ASIA '14: Proceedings of the 21st European MPI Users' Group Meeting
September 2014
183 pages
ISBN: 9781450328753
DOI: 10.1145/2642769

In-Cooperation

  • Kyoto University
  • University of Tokyo
  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Collective operations
  2. Nonblocking collectives
  3. OS noise

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI/ASIA '14

Acceptance Rates

EuroMPI/ASIA '14 paper acceptance rate: 18 of 39 submissions, 46%
Overall acceptance rate: 18 of 39 submissions, 46%

Cited By

  • (2023) Using MPI's Non-Blocking Allreduce for Health Checks in Dynamic Simulations. Parallel and Distributed Computing, Applications and Technologies, pp. 25-31. DOI: 10.1007/978-981-99-8211-0_3. Online publication date: 29-Nov-2023.
  • (2021) Workload Imbalance in HPC Applications: Effect on Performance of In-Network Processing. 2021 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-8. DOI: 10.1109/HPEC49654.2021.9622847. Online publication date: 20-Sep-2021.
  • (2017) Understanding Performance Variability on the Aries Dragonfly Network. 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 809-813. DOI: 10.1109/CLUSTER.2017.76. Online publication date: Sep-2017.
  • (2016) Understanding performance interference in next-generation HPC systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.5555/3014904.3014949. Online publication date: 13-Nov-2016.
  • (2016) How I Learned to Stop Worrying and Love In Situ Analytics. Proceedings of the 23rd European MPI Users' Group Meeting, pp. 140-153. DOI: 10.1145/2966884.2966920. Online publication date: 25-Sep-2016.
  • (2016) Understanding Performance Interference in Next-Generation HPC Systems. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 384-395. DOI: 10.1109/SC.2016.32. Online publication date: Nov-2016.
  • (2016) Scheduling in-situ analytics in next-generation applications. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp. 102-105. DOI: 10.1109/CCGrid.2016.42. Online publication date: 16-May-2016.
  • (2015) Towards Understanding Post-recovery Efficiency for Shrinking and Non-shrinking Recovery. Euro-Par 2015: Parallel Processing Workshops, pp. 656-668. DOI: 10.1007/978-3-319-27308-2_53. Online publication date: 18-Dec-2015.
