research-article

COLO: COarse-grained LOck-stepping virtual machines for non-stop service

Authors:

HaiBing Guan

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing

Article No.: 3, Pages 1 - 16

https://doi.org/10.1145/2523616.2523630

Published: 01 October 2013

Abstract

Virtual machine (VM) replication provides a software solution for business continuity and disaster recovery through application-agnostic hardware fault tolerance, replicating the state of the primary VM (PVM) to a secondary VM (SVM) on a different physical node. Unfortunately, current VM replication approaches suffer from excessive overhead, which severely limits their applicability and suitability. In this paper, we leverage a practical property of networked server-client systems: the PVM and SVM can be considered to be in the same state if they generate the same responses from the clients' point of view. We exploit this property to optimize performance. To this end, we propose a generic and highly efficient non-stop service solution named "COLO" (COarse-grained LOck-stepping virtual machines), which uses on-demand VM replication. COLO monitors the output responses of the PVM and SVM and treats the SVM as a valid replica of the PVM as long as their outputs match. If the responses do not match, the network response is withheld until the PVM's state has been synchronized to the SVM. Hence, the system is always capable of failing over to the SVM. Although non-determinism may leave the SVM's internal state different from that of the PVM, that state is equally valid and remains consistent from the point of view of external observers. Unlike earlier instruction-level lock-stepping deterministic execution approaches, COLO can easily support multi-processor (MP) workloads with satisfactory performance. Results show that COLO significantly outperforms existing approaches, particularly on server-client workloads such as online databases and web server applications.
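The output-comparison idea in the abstract can be illustrated with a toy model. The following is a minimal sketch, not the paper's implementation: `Replica`, `OutputComparator`, `counter_handler`, and `sync_count` are hypothetical names, a "VM" is reduced to a request handler plus a state dictionary, and "synchronization" is a plain deep copy of the primary's state.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Replica:
    """Toy stand-in for a VM: a request handler plus opaque internal state."""
    handle: Callable[[dict, bytes], bytes]
    state: dict = field(default_factory=dict)


class OutputComparator:
    """Releases a response only once the secondary is a valid replica."""

    def __init__(self, primary: Replica, secondary: Replica) -> None:
        self.primary = primary
        self.secondary = secondary
        self.sync_count = 0  # number of on-demand synchronizations so far

    def serve(self, request: bytes) -> bytes:
        # Both replicas process the same request independently.
        p_out = self.primary.handle(self.primary.state, request)
        s_out = self.secondary.handle(self.secondary.state, request)
        if p_out != s_out:
            # Outputs diverged: withhold the response until the primary's
            # state has been copied to the secondary (assumed here to be
            # a simple deep copy).
            self.secondary.state = copy.deepcopy(self.primary.state)
            self.sync_count += 1
        # At this point the secondary could take over without the client
        # observing any inconsistency, so the response can be released.
        return p_out


def counter_handler(state: dict, request: bytes) -> bytes:
    """Hypothetical service: echoes the request with a running counter."""
    state["count"] = state.get("count", 0) + 1
    return f"{request.decode()}:{state['count']}".encode()
```

In this sketch, two replicas whose states differ only in fields that never reach the output (for example a scratch value) are never synchronized, which mirrors the abstract's point that a different internal state is acceptable as long as external observations agree.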

Information & Contributors

Information

Published In

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing

October 2013

427 pages

ISBN:9781450324281

DOI:10.1145/2523616

General Chair:
Guy Lohman

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SOCC '13

SOCC '13: ACM Symposium on Cloud Computing

October 1 - 3, 2013

Santa Clara, California

Acceptance Rates

SOCC '13 Paper Acceptance Rate 23 of 114 submissions, 20%;

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 42
Total Downloads: 654
Downloads (Last 12 months): 19
Downloads (Last 6 weeks): 2

Reflects downloads up to 06 Jan 2025

Citations

Cited By

Cerveira F, Ferreira A, Barbosa R (2024). Resilient Virtualization. Computer, 57:2 (70-78). DOI: 10.1109/MC.2023.3306617. Online publication date: 31-Jan-2024
https://dl.acm.org/doi/10.1109/MC.2023.3306617
Kazato Y, Saito S, Fujimoto K (2024). Packet Buffering to Minimize Service Downtime and Packet Loss During Redundancy Switchover. 2024 IEEE 30th International Conference on Telecommunications (ICT) (1-8). DOI: 10.1109/ICT62760.2024.10606135. Online publication date: 24-Jun-2024
https://doi.org/10.1109/ICT62760.2024.10606135
Murata T, Kourai K (2024). Parallel and consistent live checkpointing and restoration of split-memory VMs. Future Generation Computer Systems, 159 (432-443). DOI: 10.1016/j.future.2024.05.024. Online publication date: Oct-2024
https://doi.org/10.1016/j.future.2024.05.024
Decourcelle J, Ngoc T, Teabe B, Hagimont D (2023). Fast VM Replication on Heterogeneous Hypervisors for Robust Fault Tolerance. Proceedings of the 24th International Middleware Conference (15-28). DOI: 10.1145/3590140.3592849. Online publication date: 27-Nov-2023
https://dl.acm.org/doi/10.1145/3590140.3592849
Fernando D, Terner J, Yang P, Gopalan K (2023). V-Recover: Virtual Machine Recovery When Live Migration Fails. IEEE Transactions on Cloud Computing (1-12). DOI: 10.1109/TCC.2023.3282466. Online publication date: 2023
https://doi.org/10.1109/TCC.2023.3282466
Tsai W, Tsao P, Lee C (2022). FVMM: Fast VM Migration for Virtualization-based Fault Tolerance Using Templates. 2022 IEEE International Conference on Cloud Computing Technology and Science (CloudCom) (9-16). DOI: 10.1109/CloudCom55334.2022.00012. Online publication date: Dec-2022
https://doi.org/10.1109/CloudCom55334.2022.00012
Fujita R, He F, Oki E (2021). Analytical Model of Middlebox Unavailability under Shared Protection Allowing Multiple Backups. IEICE Transactions on Communications, E104.B:9 (1147-1158). DOI: 10.1587/transcom.2020EBP3176. Online publication date: 1-Sep-2021
https://doi.org/10.1587/transcom.2020EBP3176
Fukai T, Shinagawa T, Kato K (2021). Live Migration in Bare-Metal Clouds. IEEE Transactions on Cloud Computing, 9:1 (226-239). DOI: 10.1109/TCC.2018.2848981. Online publication date: 1-Jan-2021
https://doi.org/10.1109/TCC.2018.2848981
Cerveira F, Barbosa R, Madeira H (2021). Mitigating Virtualization Failures Through Migration to a Co-Located Hypervisor. IEEE Access, 9 (105255-105269). DOI: 10.1109/ACCESS.2021.3098644. Online publication date: 2021
https://doi.org/10.1109/ACCESS.2021.3098644
Singhvi A, Khalid J, Akella A, Banerjee S, Fonseca R, Delimitrou C, Ooi B (2020). SNF. Proceedings of the 11th ACM Symposium on Cloud Computing (296-310). DOI: 10.1145/3419111.3421295. Online publication date: 12-Oct-2020
https://dl.acm.org/doi/10.1145/3419111.3421295
