research-article

Improving resilience of scientific software through a domain-specific approach

Authors:

S. Maheswaran

Volume 128, Issue C

Pages 99 - 114

https://doi.org/10.1016/j.jpdc.2019.01.015

Published: 01 June 2019

Abstract

In this paper we present research on improving the resilience of the execution of scientific software, an increasingly important concern in High Performance Computing (HPC). We build on an existing high-level abstraction framework, the Oxford Parallel library for Structured meshes (OPS), developed for the solution of multi-block structured mesh-based applications, and implement an algorithm in the library that carries out checkpointing automatically, without user intervention. The target applications are CloverLeaf 3D, a hydrodynamics benchmark application from the Mantevo Suite; TeaLeaf, a sparse linear solver proxy application; and OpenSBLI, a compressible Navier–Stokes direct numerical simulation (DNS) solver. We present (1) the basic algorithm that OPS relies on to determine the optimal checkpoint in terms of size and location, (2) improvements that supply additional information to improve the decision, (3) techniques that reduce the cost of writing the checkpoints to non-volatile storage, and (4) a performance analysis of the developed techniques on a single workstation and on several supercomputers, including ORNL’s Titan. Our results demonstrate the utility of the high-level abstractions approach in automating the checkpointing process and show that performance is comparable to, or better than, the reference in all cases.
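
To make the idea of choosing a checkpoint by size and location concrete, the following is a minimal sketch and not the OPS implementation. It assumes that each parallel loop declares how it accesses each dataset (read, write, or read-write), that a write overwrites the whole dataset, and that the loop sequence repeats every time step. Under those assumptions, a dataset whose next access is a full overwrite need not be saved. All names (`Loop`, `datasets_to_save`, `best_checkpoint`) and the example loops and sizes are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical access modes; a WRITE is assumed to overwrite the whole dataset.
READ, WRITE, RW = "read", "write", "rw"


@dataclass
class Loop:
    name: str
    accesses: dict  # dataset name -> access mode


def first_access(loops, start, dataset):
    """Mode of the first access to `dataset` at or after loops[start].

    The loop sequence is assumed to be periodic, so the scan wraps around
    and covers exactly one full period.
    """
    n = len(loops)
    for k in range(n):
        mode = loops[(start + k) % n].accesses.get(dataset)
        if mode is not None:
            return mode
    return None  # never accessed within one period


def datasets_to_save(loops, start, sizes):
    """Datasets needed to restart just before loops[start].

    A dataset is skipped only if its next access is a full overwrite.
    Datasets that are never accessed are saved, which is conservative.
    """
    return {d for d in sizes if first_access(loops, start, d) != WRITE}


def best_checkpoint(loops, sizes):
    """Return (location, size, datasets) for the smallest checkpoint."""
    best = None
    for start in range(len(loops)):
        saved = datasets_to_save(loops, start, sizes)
        cost = sum(sizes[d] for d in saved)
        if best is None or cost < best[1]:
            best = (start, cost, saved)
    return best


# Hypothetical three-loop time step with equally sized datasets.
sizes = {"density": 8, "energy": 8, "pressure": 8, "work": 8}
loops = [
    Loop("eos", {"density": READ, "energy": READ, "pressure": WRITE}),
    Loop("flux", {"pressure": READ, "work": WRITE}),
    Loop("update", {"work": READ, "density": RW, "energy": RW}),
]
best = best_checkpoint(loops, sizes)
print(best)
```

With this hypothetical loop sequence, the cheapest location is just before the first loop, where only `density` and `energy` need to be saved because `pressure` and `work` are overwritten before they are next read.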

Highlights

• We develop an automated algorithm to determine what data to include in checkpoints.

• We create multi-level checkpointing implementations in the OPS Embedded DSL.

• We benchmark checkpointing on large-scale supercomputers.

Cited By

Mudalige G., Reguly I., Jammy S., Jacobs C., Giles M., Sandham N. (2019) Large-scale performance of a DSL-based multi-block structured-mesh application for Direct Numerical Simulation, Journal of Parallel and Distributed Computing 131:C (130-146), 10.1016/j.jpdc.2019.04.019. Online publication date: 1-Sep-2019
https://dl.acm.org/doi/10.1016/j.jpdc.2019.04.019

Index Terms

Improving resilience of scientific software through a domain-specific approach
1. Computing methodologies
  1. Modeling and simulation
2. Software and its engineering

Published In

Journal of Parallel and Distributed Computing Volume 128, Issue C

Jun 2019

185 pages

ISSN:0743-7315

Copyright © 2019.

Publisher

Academic Press, Inc.

United States
