Research article
DOI: 10.1145/3203217.3203232

On the theory of speculative checkpointing: time and energy considerations

Published: 08 May 2018

Abstract

Collective checkpoint/rollback is the most popular approach for dealing with fail-stop errors on high-performance computing platforms. Prior work has focused on choosing checkpoint intervals that minimize the total cost of checkpoint/rollback. This work introduces the notion of speculative checkpointing, in which some checkpoints are probabilistically skipped. Carefully selecting which checkpoints to take and which to skip has the potential to reduce the total checkpoint/rollback overhead. We mathematically formulate the overall checkpoint/rollback cost in the presence of speculation. We consider the choice of speculation as either a fixed probability or a probability distribution. We formulate two criteria to be minimized: total execution time and approximate total energy. We derive the criteria for beneficial speculative checkpointing for exponential and arbitrary failure distributions. Furthermore, we analyze the joint optimization of energy and time to express the trade-offs mathematically. We validate the formulations and evaluate various scenarios using discrete-event simulation. The experimental evaluation validates the models and shows that the best performance is achieved when speculation is employed and the decision to speculate is sampled from a distribution derived from the failure distribution.
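
To make the idea of probabilistically skipping checkpoints concrete, the following is a minimal simulation sketch. It is not the authors' model or simulator. It assumes exponentially distributed failures, a fixed checkpoint interval, a fixed skip probability, and no failures during restart. All names (`young_interval`, `simulate`, `skip_prob`, and the parameter values) are hypothetical and chosen only for illustration. The starting interval uses Young's classical first-order approximation, sqrt(2 * C * M), where C is the checkpoint cost and M is the mean time between failures.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order estimate of the checkpoint interval: sqrt(2*C*M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate(work, interval, checkpoint_cost, restart_cost, mtbf, skip_prob, rng):
    """Return the wall-clock time needed to finish `work` units of computation.

    Assumptions (illustrative only): failures are exponential with mean `mtbf`,
    each scheduled checkpoint is skipped with fixed probability `skip_prob`,
    and no failure strikes during the restart itself.
    """
    clock = 0.0   # elapsed wall-clock time
    saved = 0.0   # work protected by the most recent checkpoint
    done = 0.0    # work completed so far, including unprotected work
    next_failure = rng.expovariate(1.0 / mtbf)

    while done < work:
        segment = min(interval, work - done)

        # A failure during computation rolls back to the last checkpoint.
        if clock + segment > next_failure:
            clock = next_failure + restart_cost
            done = saved
            next_failure = clock + rng.expovariate(1.0 / mtbf)
            continue

        clock += segment
        done += segment
        if done >= work:
            break  # no checkpoint is needed after the final segment

        # Speculation: skip this checkpoint with probability skip_prob.
        if rng.random() < skip_prob:
            continue

        # A failure while writing the checkpoint also rolls back.
        if clock + checkpoint_cost > next_failure:
            clock = next_failure + restart_cost
            done = saved
            next_failure = clock + rng.expovariate(1.0 / mtbf)
            continue

        clock += checkpoint_cost
        saved = done

    return clock


if __name__ == "__main__":
    rng = random.Random(42)
    cost, restart, mtbf, work = 1.0, 2.0, 50.0, 1000.0
    tau = young_interval(cost, mtbf)
    for p in (0.0, 0.25, 0.5):
        runs = [simulate(work, tau, cost, restart, mtbf, p, rng) for _ in range(200)]
        print(f"skip_prob={p:.2f}  mean time={sum(runs) / len(runs):.1f}")
```

Under this sketch, skipping a checkpoint saves its cost immediately but lengthens the stretch of unprotected work, so whether a given skip probability helps depends on the ratio of checkpoint cost to mean time between failures. An energy estimate could be layered on top by weighting compute time and checkpoint time with separate (assumed) power figures.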

Cited By

  • (2023) A Checkpointing Recovery Approach for Soft Errors Based on Detector Locations. Electronics 12:4 (805). DOI: 10.3390/electronics12040805. Online publication date: 6-Feb-2023
  • (2022) A Genetic Algorithm-Based Approach to Identify Near-Optimal Non-Equidistant Checkpointing Strategies. 2022 Annual Reliability and Maintainability Symposium (RAMS), 1-6. DOI: 10.1109/RAMS51457.2022.9894018. Online publication date: 24-Jan-2022


Published In

CF '18: Proceedings of the 15th ACM International Conference on Computing Frontiers
May 2018
401 pages
ISBN: 9781450357616
DOI: 10.1145/3203217

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. checkpoint/restart
  2. optimal checkpoint frequency
  3. speculative checkpointing
  4. time vs. energy optimization

Qualifiers

  • Research-article

Conference

CF '18: Computing Frontiers Conference
May 8 - 10, 2018
Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%
