[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3173162.3177161acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article

FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems

Published: 19 March 2018 Publication History

Abstract

It is crucial for distributed systems to achieve high availability. Unfortunately, this is challenging given the common component failures (i.e., faults). Developers often cannot anticipate all the timing conditions and system states under which a fault might occur, and introduce time-of-fault (TOF) bugs that only manifest when a node crashes or a message drops at a special moment. Although challenging, detecting TOF bugs is fundamental to developing highly available distributed systems. Unlike previous work that relies on fault injection to expose TOF bugs, this paper carefully models TOF bugs as a new type of concurrency bugs, and develops FCatch to automatically predict TOF bugs by observing correct execution. Evaluation on representative cloud systems shows that FCatch is effective, accurately finding severe TOF bugs.

References

[1]
Hbase-3596. https://issues.apache.org/jira/browse/HBASE-3596, 2011.
[2]
Mapreduce-3858. https://issues.apache.org/jira/browse/MAPREDUCE-3858, 2012.
[3]
Cassandra-5393. https://issues.apache.org/jira/browse/CASSANDRA-5393, 2013.
[4]
Cassandra-6415. https://issues.apache.org/jira/browse/CASSANDRA-6415, 2013.
[5]
Hbase-10090. https://issues.apache.org/jira/browse/HBASE-10090, 2013.
[6]
Mapreduce-5476. https://issues.apache.org/jira/browse/MAPREDUCE-5476, 2013.
[7]
Zookeeper-1653. https://issues.apache.org/jira/browse/ZOOKEEPER-1653, 2013.
[8]
Java platform standard edition 7 documentation. https://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode(), 2017.
[9]
Ramnatthan Alagappan, Aishwarya Ganesan, Yuvraj Patel, Thanumalayan Sankaranarayana Pillai, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. Correlated crash vulnerabilities. In OSDI, 2016.
[10]
Peter Alvaro, Kolton Andrus, Chris Sanden, Casey Rosenthal, Ali Basiri, and Lorin Hochstein. Automating failure testing research at internet scale. In SoCC, 2016.
[11]
Peter Alvaro, Joshua Rosen, and Joseph M. Hellerstein. Lineage-driven Fault Injection. In SIGMOD, 2015.
[12]
Kumud Bhandari, Dhruva R Chakrabarti, and Hans-J Boehm. Makalu: Fast recoverable allocation of non-volatile memory. In OOPSLA, 2016.
[13]
Lucas Brutschy, Dimitar Dimitrov, Peter Müller, and Martin T. Vechev. Serializability for eventual consistency: criterion, analysis, and applications. In POPL, 2017.
[14]
Pierre Castéran and Yves Bertot. Interactive theorem proving and program development. coq'art: The calculus of inductive constructions., 2004.
[15]
Feng Chen, Traian Florin Serbanuta, and Grigore Rosu. jPredictor: a predictive runtime analysis tool for java. In ICSE, 2008.
[16]
Datapath.io. Recent aws outage and how you could have avoided downtime. https://medium.com/@datapath_io/recent-aws-outage-and-how-you-could-have-avoided-downtime-7d9d9443d776, 2017.
[17]
Jeff Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009.
[18]
Pantazis Deligiannis, Alastair F Donaldson, Jeroen Ketema, Akash Lal, and Paul Thomson. Asynchronous programming, analysis and testing with state machines. In PLDI, 2015.
[19]
Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions. In FAST, 2017.
[20]
Haryadi S. Gunawi, Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur. textscFate and textscDestini: A Framework for Cloud Recovery Testing. In NSDI, 2011.
[21]
Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. In SoCC, 2014.
[22]
Haryadi S. Gunawi, Cindy Rubio-González, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Ben Liblit. EIO: error handling is occasionally correct. In FAST, 2008.
[23]
Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Junfeng Yang, and Lintao Zhang. Practical Software Model Checking via Dynamic Interface Reduction. In SOSP, 2011.
[24]
Zhenyu Guo, Sean McDirmid, Mao Yang, Li Zhuang, Pu Zhang, Yingwei Luo, Tom Bergan, Peter Bodik, Madan Musuvathi, Zheng Zhang, and Lidong Zhou. Failure recovery: When the cure is worse than the disease. In HotOS, 2013.
[25]
Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R Lorch, Bryan Parno, Michael L Roberts, Srinath Setty, and Brian Zill. IronFleet: proving practical distributed systems correct. In SOSP, 2015.
[26]
Chun-Hung Hsiao, Jie Yu, Satish Narayanasamy, Ziyun Kong, Cristiano L Pereira, Gilles A Pokam, Peter M Chen, and Jason Flinn. Race detection for event-driven mobile applications. In PLDI, 2014.
[27]
IBM. Main page - walawiki. http://wala.sourceforge.net/wiki/index.php/Main_Page.
[28]
jboss javassist. Javassist. http://jboss-javassist.github.io/javassist/.
[29]
Pallavi Joshi, Malay Ganai, Gogul Balakrishnan, Aarti Gupta, and Nadia Papakonstantinou. Setsud=o: perturbation-based testing framework for scalable distributed systems. In Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems, 2013.
[30]
Charles Killian, James Anderson, Ranjit Jhala, and Amin Vahdat. Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code. In NSDI, 2007.
[31]
Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M Chen, and Thomas F Wenisch. High-performance transactions for persistent memories. In ASPLOS, 2016.
[32]
Eric Koskinen and Junfeng Yang. Reducing crash recoverability to reachability. In POPL, 2016.
[33]
Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558--565, July 1978.
[34]
Leslie Lamport. Specifying systems: the TLA+ language and tools for hardware and software engineers. Addison-Wesley Longman Publishing Co., Inc., 2002.
[35]
Philip Lantz, Dulloor Subramanya Rao, Sanjay Kumar, Rajesh Sankaran, and Jeff Jackson. Yat: A validation framework for persistent memory software. In ATC, 2014.
[36]
Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In OSDI, 2014.
[37]
Tanakorn Leesatapornwongsa, Jeffrey F Lukman, Shan Lu, and Haryadi S Gunawi. TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. In ASPLOS, 2016.
[38]
Kaituo Li, Pallavi Joshi, Aarti Gupta, and Malay K Ganai. Reprolite: A lightweight tool to quickly reproduce hard system bugs. In SoCC, 2014.
[39]
Haopeng Liu, Guangpu Li, Jeffrey F Lukman, Jiaxin Li, Shan Lu, Haryadi S Gunawi, and Chen Tian. DCatch: Automatically detecting distributed concurrency bugs in cloud systems. In ASPLOS, 2017.
[40]
Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca. Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. In SOSP, 2015.
[41]
Pallavi Maiya, Aditya Kanade, and Rupak Majumdar. Race detection for android applications. In PLDI, 2014.
[42]
IHS Markit. Businesses losing $700 billion a year to it downtime, says ihs. http://news.ihsmarkit.com/press-release/technology/businesses-losing-700-billion-year-it-downtime-says-ihs, 2016.
[43]
Ali José Mashtizadeh, Tal Garfinkel, David Terei, David Mazieres, and Mendel Rosenblum. Towards practical default-on multi-core record/replay. In ASPLOS, 2017.
[44]
Robert H. B. Netzer and Barton P. Miller. Improving The Accuracy of Data Race Detection. In PPoPP, 1991.
[45]
Oracle. Virtualbox -- oracle vm virtualbox. https://www.virtualbox.org/wiki/VirtualBox.
[46]
Steven Pelley, Peter M Chen, and Thomas F Wenisch. Memory persistency. In ISCA, 2014.
[47]
Boris Petrov, Martin Vechev, Manu Sridharan, and Julian Dolby. Race detection for web applications. In PLDI, 2012.
[48]
Cindy Rubio-González, Haryadi S. Gunawi, Ben Liblit, Remzi H. Arpaci-Dusseau, and Andrea C. Arpaci-Dusseau. Error propagation analysis for file systems. In PLDI, 2009.
[49]
Suman Saha, Jean-Pierre Lozi, Gaël Thomas, Julia L. Lawall, and Gilles Muller. Hector: Detecting resource-release omission faults in error-handling code for systems software. In DSN, 2013.
[50]
Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM TOCS, 1997.
[51]
Jiri Simsa, Randy Bryant, and Garth Gibson. dBug: Systematic Evaluation of Distributed Systems. In SSV, 2010.
[52]
Chen Tian, Vijay Nagarajan, Rajiv Gupta, and Sriraman Tallam. Dynamic Recognition of Synchronization Operations for Improved Data Race Detection. In ISSTA, 2008.
[53]
Haris Volos, Andres Jaan Tack, and Michael M Swift. Mnemosyne: Lightweight persistent memory. In ASPLOS, 2011.
[54]
James R Wilcox, Doug Woos, Pavel Panchekha, Zachary Tatlock, Xi Wang, Michael D Ernst, and Thomas Anderson. Verdi: a framework for implementing and formally verifying distributed systems. In PLDI, 2015.
[55]
Weiwei Xiong, Soyeon Park, Jiaqi Zhang, Yuanyuan Zhou, and Zhiqiang Ma. Ad Hoc Synchronization Considered Harmful. In OSDI, 2010.
[56]
Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang, and Lidong Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. In NSDI, 2009.
[57]
Junfeng Yang, Can Sar, and Dawson Engler. Explode: a lightweight, general system for finding serious storage system errors. In OSDI, 2006.
[58]
Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. Using model checking to find serious file system errors. In OSDI, 2004.
[59]
Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay Jain, and Michael Stumm. Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems. In OSDI, 2014.
[60]
Mai Zheng, Joseph Tucek, Dachuan Huang, Feng Qin, Mark Lillibridge, Elizabeth S Yang, Bill W Zhao, and Shashank Singh. Torturing databases for fun and profit. In OSDI, 2014.

Cited By

View all
  • (2024)Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault InjectionProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695979(46-62)Online publication date: 4-Nov-2024
  • (2024)FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed SystemsProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings10.1145/3639478.3640036(129-133)Online publication date: 14-Apr-2024
  • (2023)Randomized Testing of Byzantine Fault Tolerant AlgorithmsProceedings of the ACM on Programming Languages10.1145/35860537:OOPSLA1(757-788)Online publication date: 6-Apr-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
March 2018
827 pages
ISBN:9781450349116
DOI:10.1145/3173162
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 53, Issue 2
    ASPLOS '18
    February 2018
    809 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/3296957
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. bug detection
  2. cloud computing
  3. distributed systems
  4. fault tolerance
  5. timing bugs

Qualifiers

  • Research-article

Funding Sources

  • Huawei
  • Google Faculty Research Award
  • CERES Center for Unstoppable Computing
  • CCF
  • CNS
  • IIS

Conference

ASPLOS '18

Acceptance Rates

ASPLOS '18 Paper Acceptance Rate 56 of 319 submissions, 18%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)142
  • Downloads (Last 6 weeks)30
Reflects downloads up to 01 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault InjectionProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695979(46-62)Online publication date: 4-Nov-2024
  • (2024)FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed SystemsProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings10.1145/3639478.3640036(129-133)Online publication date: 14-Apr-2024
  • (2023)Randomized Testing of Byzantine Fault Tolerant AlgorithmsProceedings of the ACM on Programming Languages10.1145/35860537:OOPSLA1(757-788)Online publication date: 6-Apr-2023
  • (2023)Model Checking Guided Testing for Distributed SystemsProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3587442(127-143)Online publication date: 8-May-2023
  • (2022)A deep study of the effects and fixes of server-side request races in web applicationsProceedings of the 19th International Conference on Mining Software Repositories10.1145/3524842.3528463(744-756)Online publication date: 23-May-2022
  • (2022)Common Data Guided Crash Injection for Cloud Systems2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)10.1109/ICSE-Companion55297.2022.9793803(36-40)Online publication date: May-2022
  • (2021)Understanding and detecting server-side request races in web applicationsProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3468264.3468594(842-854)Online publication date: 20-Aug-2021
  • (2021)CausalTester: Measuring the Consistency of Replicated Services via Causality Semantics2021 IEEE 30th Asian Test Symposium (ATS)10.1109/ATS52891.2021.00021(49-54)Online publication date: Nov-2021
  • (2020)Effective Concurrency Testing for Distributed SystemsProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3373376.3378484(1141-1156)Online publication date: 9-Mar-2020
  • (2020)CoFIProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering10.1145/3324884.3416548(536-547)Online publication date: 21-Dec-2020
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media