DOI: 10.1145/2523616.2523630
research-article

COLO: COarse-grained LOck-stepping virtual machines for non-stop service

Published: 01 October 2013

Abstract

Virtual machine (VM) replication provides a software solution for business continuity and disaster recovery through application-agnostic hardware fault tolerance, replicating the state of the primary VM (PVM) to a secondary VM (SVM) on a different physical node. Unfortunately, current VM replication approaches suffer from excessive overhead, which severely limits their applicability and suitability. In this paper, we leverage a practical property of networked server-client systems: the PVM and SVM are considered to be in the same state only if they generate the same response from the clients' point of view. We exploit this property to optimize performance. To this end, we propose a generic and highly efficient non-stop service solution named "COLO" (COarse-grained LOck-stepping virtual machines), which uses on-demand VM replication. COLO monitors the output responses of the PVM and SVM, and treats the SVM as a valid replica of the PVM according to the similarity of their outputs. If the responses do not match, the commit of the network response is withheld until the PVM's state has been synchronized to the SVM. Hence, the system is always capable of failing over to the SVM. Although non-determinism may mean that the internal state of the SVM differs from that of the PVM, the SVM's state is equally valid and remains consistent from external observation. Unlike earlier instruction-level lock-stepping deterministic execution approaches, COLO can easily support multi-processor (MP) workloads with satisfactory performance. Results show that COLO significantly outperforms existing approaches, particularly on server-client workloads such as online databases and web server applications.
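
The output-comparison idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's implementation: `ToyVM`, `ColoSketch`, the `jitter` parameter, and the request strings are all hypothetical names introduced here, and exact string equality stands in for whatever similarity check a real system would apply to network output. It only shows the control flow described above: both replicas process the same request, matching outputs are released without replication, and a mismatch forces a state synchronization before the response is committed.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a VM: a dict of state plus a request handler."""
    state: dict = field(default_factory=lambda: {"hits": 0, "scratch": 0})

    def handle(self, request: str, jitter: int) -> str:
        # `jitter` models non-determinism (for example, scheduling order).
        # It changes internal state that most responses never expose.
        self.state["hits"] += 1
        self.state["scratch"] += jitter
        if request == "GET /debug":
            return f"scratch={self.state['scratch']}"
        return f"hits={self.state['hits']}"


class ColoSketch:
    """Toy coarse-grained lock-stepping: compare outputs, sync on mismatch."""

    def __init__(self) -> None:
        self.pvm = ToyVM()
        self.svm = ToyVM()
        self.syncs = 0  # number of on-demand synchronizations performed

    def serve(self, request: str, pvm_jitter: int, svm_jitter: int) -> str:
        pvm_out = self.pvm.handle(request, pvm_jitter)
        svm_out = self.svm.handle(request, svm_jitter)
        if pvm_out != svm_out:
            # Outputs diverged: withhold the response until the PVM's
            # state has been copied to the SVM (assumed to be atomic here).
            self.svm.state = copy.deepcopy(self.pvm.state)
            self.syncs += 1
        # Only the PVM's response is ever released to the client.
        return pvm_out


colo = ColoSketch()
first = colo.serve("GET /", pvm_jitter=1, svm_jitter=2)       # outputs match
second = colo.serve("GET /debug", pvm_jitter=0, svm_jitter=0)  # outputs differ
print(first, second, colo.syncs)
```

In this toy run the first request leaves the two replicas with different internal `scratch` values but identical responses, so no synchronization happens; the second request exposes the divergence and triggers one. A real system would compare network packets produced by concurrently running VMs and would have to buffer output during the checkpoint, which this sketch does not model.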

Published In

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing
October 2013
427 pages
ISBN: 9781450324281
DOI: 10.1145/2523616
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SOCC '13: ACM Symposium on Cloud Computing
October 1 - 3, 2013
Santa Clara, California

Acceptance Rates

SOCC '13 Paper Acceptance Rate: 23 of 114 submissions, 20%
Overall Acceptance Rate: 169 of 722 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 19
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 06 Jan 2025

Cited By

  • (2024) Resilient Virtualization. Computer 57:2 (70-78). DOI: 10.1109/MC.2023.3306617. Online publication date: 31-Jan-2024
  • (2024) Packet Buffering to Minimize Service Downtime and Packet Loss During Redundancy Switchover. 2024 IEEE 30th International Conference on Telecommunications (ICT) (1-8). DOI: 10.1109/ICT62760.2024.10606135. Online publication date: 24-Jun-2024
  • (2024) Parallel and consistent live checkpointing and restoration of split-memory VMs. Future Generation Computer Systems 159 (432-443). DOI: 10.1016/j.future.2024.05.024. Online publication date: Oct-2024
  • (2023) Fast VM Replication on Heterogeneous Hypervisors for Robust Fault Tolerance. Proceedings of the 24th International Middleware Conference (15-28). DOI: 10.1145/3590140.3592849. Online publication date: 27-Nov-2023
  • (2023) V-Recover: Virtual Machine Recovery When Live Migration Fails. IEEE Transactions on Cloud Computing (1-12). DOI: 10.1109/TCC.2023.3282466. Online publication date: 2023
  • (2022) FVMM: Fast VM Migration for Virtualization-based Fault Tolerance Using Templates. 2022 IEEE International Conference on Cloud Computing Technology and Science (CloudCom) (9-16). DOI: 10.1109/CloudCom55334.2022.00012. Online publication date: Dec-2022
  • (2021) Analytical Model of Middlebox Unavailability under Shared Protection Allowing Multiple Backups. IEICE Transactions on Communications E104.B:9 (1147-1158). DOI: 10.1587/transcom.2020EBP3176. Online publication date: 1-Sep-2021
  • (2021) Live Migration in Bare-Metal Clouds. IEEE Transactions on Cloud Computing 9:1 (226-239). DOI: 10.1109/TCC.2018.2848981. Online publication date: 1-Jan-2021
  • (2021) Mitigating Virtualization Failures Through Migration to a Co-Located Hypervisor. IEEE Access 9 (105255-105269). DOI: 10.1109/ACCESS.2021.3098644. Online publication date: 2021
  • (2020) SNF. Proceedings of the 11th ACM Symposium on Cloud Computing (296-310). DOI: 10.1145/3419111.3421295. Online publication date: 12-Oct-2020
