DOI: 10.5555/1656980.1657005

Why do upgrades fail and what can we do about it?: toward dependable, online upgrades in enterprise system

Published: 30 November 2009

Abstract

Enterprise-system upgrades are unreliable and often produce downtime or data loss. Errors in the upgrade procedure, such as broken dependencies, constitute the leading cause of upgrade failures. We propose a novel upgrade-centric fault model, based on data from three independent sources, which focuses on the impact of procedural errors rather than software defects. We show that current approaches for upgrading enterprise systems, such as rolling upgrades, are vulnerable to these faults because the upgrade is not an atomic operation and risks breaking hidden dependencies among the distributed system components. We also present a mechanism for tolerating complex procedural errors during an upgrade. Our system, called Imago, improves availability in the fault-free case by performing an online upgrade, and in the faulty case by reducing the risk of failure due to breaking hidden dependencies. Imago performs an end-to-end upgrade atomically and dependably by dedicating separate resources to the new version and by isolating the old version from the upgrade procedure. Through fault injection, we show that Imago is more reliable than online-upgrade approaches that rely on dependency tracking and that create system states with mixed versions.
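
The contrast the abstract draws between a rolling upgrade and an atomic switchover can be illustrated with a toy model. The sketch below is not Imago's implementation; the `Cluster` class, the `rolling_upgrade` and `atomic_switchover` functions, and the `validate` callback are hypothetical names introduced only for illustration. It assumes each node runs exactly one version, and it shows that an in-place, node-by-node upgrade passes through mixed-version states, whereas building the new version on separate nodes leaves the old system untouched until a single switch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Cluster:
    """Toy cluster (hypothetical): maps each node name to the version it runs."""
    nodes: dict[str, str]

    def versions(self) -> set[str]:
        """Return the set of distinct versions currently running."""
        return set(self.nodes.values())


def rolling_upgrade(cluster: Cluster, new_version: str) -> list[set[str]]:
    """Upgrade nodes in place, one at a time.

    Returns the set of running versions observed after each step, so that
    intermediate mixed-version states are visible to the caller.
    """
    observed = []
    for name in list(cluster.nodes):
        cluster.nodes[name] = new_version
        observed.append(cluster.versions())
    return observed


def atomic_switchover(
    old: Cluster, new_version: str, validate: Callable[[Cluster], bool]
) -> Cluster:
    """Install the new version on separate nodes, then switch in one step.

    The old cluster is never modified. If validation of the new copy fails,
    the old cluster is returned unchanged and keeps serving.
    """
    new = Cluster({f"{name}-new": new_version for name in old.nodes})
    if not validate(new):
        return old  # discard the new copy; the old version was never touched
    return new


if __name__ == "__main__":
    in_place = Cluster({"a": "v1", "b": "v1", "c": "v1"})
    steps = rolling_upgrade(in_place, "v2")
    mixed = [s for s in steps if len(s) > 1]
    print(f"rolling upgrade: {len(mixed)} of {len(steps)} steps had mixed versions")

    old = Cluster({"a": "v1", "b": "v1", "c": "v1"})
    live = atomic_switchover(old, "v2", lambda c: c.versions() == {"v2"})
    print("after switchover:", live.versions(), "old copy still:", old.versions())
```

In this model the switchover costs a second set of nodes, which mirrors the abstract's point about dedicating separate resources to the new version in exchange for never exposing a mixed-version state.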



Published In

Middleware '09: Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
November 2009
497 pages

Sponsors

  • Professional
  • USENIX Assoc
  • IFIP

Publisher

Springer-Verlag

Berlin, Heidelberg

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 203 of 948 submissions, 21%

Cited By

  • (2024) ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems. Proceedings of the ACM on Software Engineering 1:FSE (24-46). doi:10.1145/3643728. Online publication date: 12-Jul-2024
  • (2021) Understanding and Detecting Software Upgrade Failures in Distributed Systems. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (116-131). doi:10.1145/3477132.3483577. Online publication date: 26-Oct-2021
  • (2019) Multi-objective Optimisation of Online Distributed Software Update for DevOps in Clouds. ACM Transactions on Internet Technology 19:3 (1-20). doi:10.1145/3338851. Online publication date: 27-Aug-2019
  • (2019) MVEDSUA. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (573-585). doi:10.1145/3297858.3304063. Online publication date: 4-Apr-2019
  • (2017) Improving Timeliness and Visibility in Publishing Software Engineering Research. IEEE Transactions on Software Engineering 43:3 (205-206). doi:10.1109/TSE.2017.2663918. Online publication date: 1-Mar-2017
  • (2017) Automating Live Update for Generic Server Programs. IEEE Transactions on Software Engineering 43:3 (207-225). doi:10.1109/TSE.2016.2584066. Online publication date: 1-Mar-2017
  • (2017) Zero-downtime SQL database schema evolution for continuous deployment. Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Practice Track (143-152). doi:10.1109/ICSE-SEIP.2017.5. Online publication date: 20-May-2017
  • (2016) Evolving multi-tenant SaaS applications through self-adaptive upgrade enactment and tenant mediation. Proceedings of the 11th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (151-157). doi:10.1145/2897053.2897057. Online publication date: 14-May-2016
  • (2015) Middleware for customizable multi-staged dynamic upgrades of multi-tenant SaaS applications. Proceedings of the 8th International Conference on Utility and Cloud Computing (102-111). doi:10.5555/3233397.3233415. Online publication date: 7-Dec-2015
  • (2015) Continuous deployment and schema evolution in SQL databases. Proceedings of the Third International Workshop on Release Engineering (16-19). doi:10.5555/2820690.2820699. Online publication date: 16-May-2015
