
Improving resilience of scientific software through a domain-specific approach

Published: 01 June 2019

Abstract

In this paper we present research on improving the resilience of scientific software execution, an increasingly important concern in High Performance Computing (HPC). We build on an existing high-level abstraction framework, the Oxford Parallel library for Structured meshes (OPS), developed for the solution of multi-block structured mesh-based applications, and implement an algorithm in the library that carries out checkpointing automatically, without user intervention. The target applications are CloverLeaf 3D, a hydrodynamics benchmark application from the Mantevo Suite; TeaLeaf, a sparse linear solver proxy application; and OpenSBLI, a compressible Navier–Stokes direct numerical simulation (DNS) solver. We present: (1) the basic algorithm that OPS relies on to determine the optimal checkpoint in terms of size and location, (2) improvements that supply additional information to improve the decision, (3) techniques that reduce the cost of writing the checkpoints to non-volatile storage, and (4) a performance analysis of the developed techniques on a single workstation and on several supercomputers, including ORNL’s Titan. Our results demonstrate the utility of the high-level abstractions approach in automating the checkpointing process and show that performance is comparable to, or better than, the reference in all cases.

Highlights

We develop an automated algorithm to determine what data to include in checkpoints.
We create multi-level checkpointing implementations in the OPS embedded DSL.
We benchmark checkpointing on large-scale supercomputers.
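
To make item (1) of the abstract more concrete, the following Python sketch illustrates one general way a library that knows how each loop accesses its datasets could decide what to save and where. It is not the OPS implementation: the names `Access`, `Loop`, `datasets_to_save`, and `best_checkpoint`, the example loops, and the byte sizes are all hypothetical. The sketch assumes that a write overwrites a whole dataset, so a dataset whose next access after the checkpoint is a write does not need to be saved, and that a dataset not seen again in the lookahead is saved to be safe.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"  # assumption: overwrites the whole dataset
    RW = "rw"        # reads the old values, then writes new ones


@dataclass
class Loop:
    name: str
    accesses: dict  # dataset name -> Access


def datasets_to_save(loops, position, sizes):
    """Datasets needed to restart just before loops[position]."""
    needs_saving = {}
    for loop in loops[position:]:
        for dataset, mode in loop.accesses.items():
            # Only the first access after the checkpoint decides:
            # a read (or read-write) needs the old values, a write does not.
            if dataset not in needs_saving:
                needs_saving[dataset] = mode is not Access.WRITE
    # Datasets not accessed in the lookahead are saved conservatively.
    return {d for d in sizes if needs_saving.get(d, True)}


def best_checkpoint(loops, sizes, candidates):
    """Return (position, bytes) of the cheapest candidate location."""
    best = None
    for position in candidates:
        cost = sum(sizes[d] for d in datasets_to_save(loops, position, sizes))
        if best is None or cost < best[1]:
            best = (position, cost)
    return best


# Hypothetical three-loop chain over four equally sized datasets (bytes).
sizes = {"density": 8, "energy": 8, "flux": 8, "work": 8}
loops = [
    Loop("compute_flux", {"density": Access.READ, "flux": Access.WRITE}),
    Loop("update_energy", {"flux": Access.READ, "energy": Access.RW,
                           "work": Access.WRITE}),
    Loop("update_density", {"work": Access.READ, "density": Access.RW}),
]

position, cost = best_checkpoint(loops, sizes, range(len(loops)))
print(position, cost, sorted(datasets_to_save(loops, position, sizes)))
```

In this example, checkpointing before the first loop only requires `density` and `energy`, because `flux` and `work` are overwritten before they are next read; later locations require more data. The choice of location therefore affects checkpoint size, which is the trade-off the abstract refers to.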


Published In

Journal of Parallel and Distributed Computing, Volume 128, Issue C
Jun 2019
185 pages

Publisher

Academic Press, Inc., United States

Author Tags

1. Domain specific language
2. High performance computing
3. Checkpointing
4. Resilience
5. Parallel I/O
