Abstract
With the evolution of high-performance computing, parallel applications increasingly need fault tolerance, most commonly provided by checkpoint and restart techniques. Checkpointing tools are typically implemented at one of two abstraction levels: the system level or the application level. The latter has become an attractive alternative because of its flexibility and its ability to operate in different environments. However, application-level checkpointing tools often require the user to insert checkpoints manually to ensure that certain requirements are met (e.g., forcing checkpoints to be taken in user code and not inside kernel routines). This paper examines the transformations required to enable automatic checkpointing of parallel applications in the CPPC application-level checkpointing framework. These transformations have been implemented on two very different compiler infrastructures: Cetus and LLVM. Cetus is a Java-based compiler infrastructure that aims to provide an easy-to-use, clean IR and API for program transformation. LLVM is a low-level, SSA-based toolchain. The fundamental differences between the two approaches are analyzed from the structural, behavioral and performance perspectives.
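To make the idea of a manually inserted application-level checkpoint concrete, the following is a minimal single-process sketch in Python. It is not the CPPC API and does not involve MPI, Cetus or LLVM; the names `save_checkpoint`, `load_checkpoint`, `run` and the `fail_at` parameter are hypothetical and exist only for illustration. The call to `save_checkpoint` inside the loop stands in for the kind of statement that a user would otherwise place by hand in user code, and that the compiler transformations discussed in the paper aim to place automatically.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over `path`.

    The rename keeps a failure during the write from corrupting the
    previous checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_iters, every, fail_at=None):
    """Sum 0..n_iters-1, checkpointing every `every` iterations.

    `fail_at` (hypothetical, for demonstration) simulates a crash at the
    start of the given iteration.
    """
    # Restart: resume from the last checkpoint if one is present.
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n_iters:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        # Manually placed checkpoint, located in user code at a point
        # where the loop state is consistent.
        if state["i"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Running `run` once with `fail_at` set and then again without it resumes from the last saved iteration and produces the same result as an uninterrupted run. Only the variables needed for the restart are stored, which is the property that makes application-level checkpoints portable across environments.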
Acknowledgments
This research was supported by the Galician Government (Project 10PXIB105180PR and Consolidation of Competitive Research Groups, Xunta de Galicia ref. 2010/6) and by the Ministry of Science and Innovation of Spain (Project TIN2010-16735).
Cite this article
Rodríguez, G., Martín, M.J., González, P. et al. Compiler-Assisted Checkpointing of Parallel Codes: The Cetus and LLVM Experience. Int J Parallel Prog 41, 782–805 (2013). https://doi.org/10.1007/s10766-012-0231-8
DOI: https://doi.org/10.1007/s10766-012-0231-8