Accelerating incremental checkpointing for extreme-scale computing

Published: 01 January 2014

Abstract

Concern is growing in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overhead and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.

  • We describe a novel incremental checkpointing solution using hashing.
  • We examine the performance of this approach with real HPC workloads.
  • We model the benefits of this incremental approach for future systems.
  • We show that GPUs can dramatically increase hashing speeds.
  • However, this increase in speed has little impact on efficiency.
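The core idea of hash-based incremental checkpointing, detecting changed data by comparing per-block hashes between checkpoints, can be illustrated with a minimal CPU-only sketch. This is not the libhashckpt implementation: it omits page protection and GPU hashing entirely, and the block size, the choice of SHA-256, and all function names are assumptions made for illustration. It also assumes the application state keeps a constant size between checkpoints.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed granularity (a common memory page size)


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[off:off + block_size]).digest()
        for off in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data, prev_hashes, block_size=BLOCK_SIZE):
    """Return (changed_blocks, new_hashes).

    `changed_blocks` maps block index -> block contents for every block
    whose hash differs from the one recorded at the previous checkpoint.
    Only these blocks would need to be written to stable storage.
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for i, digest in enumerate(new_hashes):
        if i >= len(prev_hashes) or prev_hashes[i] != digest:
            changed[i] = bytes(data[i * block_size:(i + 1) * block_size])
    return changed, new_hashes


def restore(base, changed, block_size=BLOCK_SIZE):
    """Apply the changed blocks on top of an earlier full checkpoint."""
    out = bytearray(base)
    for i, block in changed.items():
        out[i * block_size:i * block_size + len(block)] = block
    return bytes(out)


# Hypothetical usage: four blocks of state, one byte modified.
state = bytearray(4 * BLOCK_SIZE)
full, hashes = incremental_checkpoint(state, [])      # first checkpoint: all blocks
base = bytes(state)
state[BLOCK_SIZE + 10] = 7                            # dirty block 1 only
delta, hashes = incremental_checkpoint(state, hashes)
print(sorted(delta))                                  # indices of blocks to save
```

In this sketch the second checkpoint saves one block instead of four. The trade-off is that every block must be hashed at each checkpoint, which is the cost the abstract proposes to reduce with GPUs and page protection.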

Cited By

  • (2019) Job migration in HPC clusters by means of checkpoint/restart. The Journal of Supercomputing 75:10 (6517-6541). DOI: 10.1007/s11227-019-02857-y. Online publication date: 1-Oct-2019
  • (2017) Development of dynamic protection against timing channels. International Journal of Information Security 16:6 (641-651). DOI: 10.1007/s10207-016-0356-7. Online publication date: 1-Nov-2017
  • (2016) An approach to error correction in program code using dynamic optimization in a virtual execution environment. The Journal of Supercomputing 72:3 (845-873). DOI: 10.1007/s11227-015-1616-4. Online publication date: 1-Mar-2016

Published In

Future Generation Computer Systems, Volume 30, Issue C
January 2014
307 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Checkpointing
  2. Fault-tolerance
  3. Graphics processing units
  4. Incremental checkpointing

Qualifiers

  • Article
