Accelerating incremental checkpointing for extreme-scale computing

Authors:

Kurt B. Ferreira,

Patrick Bridges,

Ron Brightwell

Future Generation Computer Systems, Volume 30, Issue C

Pages 66 - 77

Published: 01 January 2014

Abstract

Concern is growing in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overheads and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.

We describe a novel incremental checkpointing solution using hashing. We examine the performance of this approach with real HPC workloads. We model the benefits of this incremental approach for future systems. We show that GPUs can dramatically increase hashing speeds. However, this increase in speed has little impact on efficiency.
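To make the hashing idea in the abstract concrete, the following is a minimal sketch of hash-based change detection, not the libhashckpt implementation. It assumes the application state is a flat byte buffer of fixed size, splits it into fixed-size blocks, and saves only the blocks whose digest differs from the previous checkpoint. The block size, the use of SHA-256 on the CPU, and the function names `block_hashes`, `incremental_checkpoint`, and `restore` are all illustrative assumptions; the page-protection and GPU parts of the approach are not shown.

```python
import hashlib

# Assumed block granularity (the size of a typical memory page).
BLOCK_SIZE = 4096


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[off:off + block_size]).digest()
        for off in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data, prev_hashes, block_size=BLOCK_SIZE):
    """Return (changed_blocks, new_hashes) for the current state.

    changed_blocks maps a block index to the block contents for every
    block whose digest differs from the previous checkpoint. With an
    empty prev_hashes list, every block is reported as changed, which
    yields a full checkpoint.
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for i, digest in enumerate(new_hashes):
        if i >= len(prev_hashes) or prev_hashes[i] != digest:
            start = i * block_size
            changed[i] = bytes(data[start:start + block_size])
    return changed, new_hashes


def restore(base, changed, block_size=BLOCK_SIZE):
    """Apply the changed blocks on top of an earlier state of the same size."""
    buf = bytearray(base)
    for i, block in changed.items():
        start = i * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)
```

In use, a first call with an empty hash list produces a full checkpoint, and each later call returns only the blocks modified since the previous call. A block that is written but ends up with the same contents hashes to the same digest and is skipped, which page-protection tracking alone would not detect.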

Cited By

M. Rodríguez-Pascual, J. Cao, J. Moríñigo, G. Cooperman, R. Mayo-García, Job migration in HPC clusters by means of checkpoint/restart, The Journal of Supercomputing, 75:10 (2019) 6517-6541. Online publication date: 1-Oct-2019. https://dl.acm.org/doi/10.1007/s11227-019-02857-y

S. Kananizadeh, K. Kononenko, Development of dynamic protection against timing channels, International Journal of Information Security, 16:6 (2017) 641-651. Online publication date: 1-Nov-2017. https://dl.acm.org/doi/10.1007/s10207-016-0356-7

K. Kononenko, An approach to error correction in program code using dynamic optimization in a virtual execution environment, The Journal of Supercomputing, 72:3 (2016) 845-873. Online publication date: 1-Mar-2016. https://dl.acm.org/doi/10.1007/s11227-015-1616-4

Published In

Future Generation Computer Systems Volume 30, Issue C

January 2014

307 pages

ISSN: 0167-739X

Publisher

Elsevier Science Publishers B. V.

Netherlands
