Abstract
Checkpoint/rollback is an effective way to guarantee that long-running applications can complete in the face of failures. However, it does not come for free: the application suffers long downtime and a performance penalty while it is being checkpointed or rolled back, which adds overhead to its execution time. The problem is worse in virtualized environments, mainly because virtual machines are heavyweight. This paper proposes warmCR, a lightweight checkpoint/rollback system for virtual machines that aims to reduce the overhead it imposes on application execution time. First, warmCR employs a redirect-on-write approach to create the disk checkpoint and a copy-on-write method to create the memory checkpoint live, so that both downtime and checkpoint duration are reduced. Second, we propose a working-set-based rollback approach that provides short downtime without compromising application performance. Third, we propose workload-aware batched processing to trade off downtime against performance loss. In addition to presenting warmCR, we detail its implementation and provide extensive experimental results to demonstrate its efficiency and effectiveness.
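To make the two checkpointing ideas named in the abstract concrete, the following is a minimal sketch of generic redirect-on-write (for disk blocks) and copy-on-write (for memory pages). It is not warmCR's implementation: the class names `RowDisk` and `CowMemory`, their methods, and the in-memory dictionaries standing in for disk blocks and guest pages are all hypothetical, chosen only for illustration.

```python
class RowDisk:
    """Toy redirect-on-write disk (hypothetical, not warmCR's code).

    A checkpoint freezes the current overlay; later writes are redirected
    to a fresh overlay, so taking a checkpoint copies no data.
    """

    def __init__(self):
        self.layers = [{}]  # overlays, oldest first; the last one is writable

    def write(self, block, data):
        self.layers[-1][block] = data

    def read(self, block):
        # Search the newest overlay first, then fall back to older ones.
        for layer in reversed(self.layers):
            if block in layer:
                return layer[block]
        return b""  # block was never written

    def checkpoint(self):
        """Freeze the current state and return a checkpoint id."""
        self.layers.append({})
        return len(self.layers) - 1  # number of frozen overlays

    def rollback(self, ckpt_id):
        """Discard every write made after the given checkpoint."""
        if not 1 <= ckpt_id < len(self.layers):
            raise ValueError("unknown checkpoint id")
        self.layers = self.layers[:ckpt_id] + [{}]


class CowMemory:
    """Toy copy-on-write memory checkpoint (hypothetical, not warmCR's code).

    While a checkpoint is in progress, a page's old contents are saved
    just before its first overwrite, so the guest keeps running.
    """

    def __init__(self, num_pages):
        self.pages = [b""] * num_pages
        self.saved = None     # page -> contents at checkpoint time
        self.pending = set()  # pages not yet saved

    def begin_checkpoint(self):
        self.saved = {}
        self.pending = set(range(len(self.pages)))

    def write(self, page, data):
        if self.saved is not None and page in self.pending:
            self.saved[page] = self.pages[page]  # preserve old contents first
            self.pending.discard(page)
        self.pages[page] = data

    def copy_some(self, n):
        """Background saver: persist up to n pages that are still pending."""
        if self.saved is None:
            return
        for page in sorted(self.pending)[:n]:
            self.saved[page] = self.pages[page]
            self.pending.discard(page)

    def end_checkpoint(self):
        """Save the remaining pages and return the checkpoint image."""
        self.copy_some(len(self.pending))
        image = [self.saved[p] for p in range(len(self.pages))]
        self.saved = None
        return image


if __name__ == "__main__":
    disk = RowDisk()
    disk.write(0, b"before")
    ckpt = disk.checkpoint()
    disk.write(0, b"after")
    disk.rollback(ckpt)
    print(disk.read(0))

    mem = CowMemory(3)
    mem.write(1, b"old")
    mem.begin_checkpoint()
    mem.write(1, b"new")  # guest write during the checkpoint
    print(mem.end_checkpoint(), mem.pages)
```

In this sketch the disk checkpoint costs only the creation of an empty overlay, and the memory checkpoint stalls the guest only for the pages it overwrites while the image is being saved, which is the intuition behind the reduced downtime and checkpoint duration claimed above.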
Acknowledgement
We would like to thank the anonymous reviewers for their valuable comments and help in improving this paper. This work is supported by the National Key Technology Support Program under grant No. 2012BAH46B02.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Cui, L. et al. (2015). Lightweight Virtual Machine Checkpoint and Rollback for Long-running Applications. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_42
DOI: https://doi.org/10.1007/978-3-319-27137-8_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27136-1
Online ISBN: 978-3-319-27137-8
eBook Packages: Computer Science, Computer Science (R0)