research-article

FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems

Authors:

Chen Tian

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 419 - 431

https://doi.org/10.1145/3173162.3177161

Published: 19 March 2018

Abstract

It is crucial for distributed systems to achieve high availability. Unfortunately, this is challenging because component failures (i.e., faults) are common. Developers often cannot anticipate all the timing conditions and system states under which a fault might occur, and so introduce time-of-fault (TOF) bugs that manifest only when a node crashes or a message drops at a particular moment. Although challenging, detecting TOF bugs is fundamental to developing highly available distributed systems. Unlike previous work that relies on fault injection to expose TOF bugs, this paper models TOF bugs as a new type of concurrency bug and develops FCatch to automatically predict TOF bugs by observing correct execution. Evaluation on representative cloud systems shows that FCatch is effective, accurately finding severe TOF bugs.
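
To make the notion of a time-of-fault bug concrete, the following is a minimal toy sketch in Python. It is not FCatch and not taken from the paper: the three-step worker, the `in_progress` marker, and the naive recovery check are all hypothetical names invented for illustration. It only shows how the same crash can be harmless at one moment and harmful at another. Note that it enumerates crash points by brute force, which is closer to the fault-injection style the abstract contrasts with than to the prediction approach the paper describes.

```python
from typing import Optional

# Hypothetical three-step task performed by a worker node (illustrative only).
STEPS = ["acquire_marker", "do_work", "release_marker"]


class Crash(Exception):
    """Raised to simulate the worker node crashing."""


def run_worker(store: dict, crash_before: Optional[int]) -> None:
    """Run the steps in order; crash just before step index `crash_before`."""
    for i, step in enumerate(STEPS):
        if crash_before == i:
            raise Crash(step)
        if step == "acquire_marker":
            store["in_progress"] = True   # marker visible to other nodes
        elif step == "do_work":
            store["result"] = 42
        elif step == "release_marker":
            del store["in_progress"]      # marker cleared only on success


def recovery_succeeds(store: dict) -> bool:
    """Naive recovery: a replacement worker refuses to start while the
    marker left by the previous worker is still present."""
    return "in_progress" not in store


def harmful_crash_points() -> list:
    """Try a crash before every step and report the ones that break recovery."""
    harmful = []
    for point in range(len(STEPS)):
        store = {}
        try:
            run_worker(store, crash_before=point)
        except Crash:
            pass
        if not recovery_succeeds(store):
            harmful.append(STEPS[point])
    return harmful


if __name__ == "__main__":
    print(harmful_crash_points())
```

In this toy model a crash before the marker is written is harmless, while a crash at any point after it is written and before it is released leaves a stale marker that blocks recovery. The failure depends on when the fault occurs, not merely on whether it occurs.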

Index Terms

FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging
  2. Software organization and properties
    1. Extra-functional properties
      1. Software reliability
    2. Software system structures
      1. Distributed systems organizing principles
        Cloud computing

Recommendations

CrashTuner: detecting crash-recovery bugs in cloud systems via meta-info analysis
SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles

Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult to detect crash-recovery bugs since these bugs can only be exposed when ...
CloudRaid: hunting concurrency bugs in the cloud via log-mining
ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to detect and often lead to data loss and service outage. This paper presents CloudRaid, a new effective tool to battle distributed concurrency bugs. CloudRaid ...

Information & Contributors

Information

Published In

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

March 2018

827 pages

ISBN:9781450349116

DOI:10.1145/3173162

General Chairs:
Xipeng Shen (North Carolina State University, USA)
James Tuck (North Carolina State University, USA)
Program Chairs:
Ricardo Bianchini (Microsoft Research, USA)
Vivek Sarkar (Georgia Institute of Technology, USA)

ACM SIGPLAN Notices Volume 53, Issue 2
ASPLOS '18
February 2018
809 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3296957
Editor:
Matthew Fluet
Rochester Institute of Technology

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2018

Qualifiers

Research-article

Funding Sources

Huawei
Google Faculty Research Award
CERES Center for Unstoppable Computing
CCF
CNS
IIS

Conference

ASPLOS '18

ASPLOS '18: Architectural Support for Programming Languages and Operating Systems

March 24 - 28, 2018

Williamsburg, VA, USA

Acceptance Rates

ASPLOS '18 Paper Acceptance Rate: 56 of 319 submissions, 18%

Overall Acceptance Rate: 535 of 2,713 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 23
Total Downloads: 935
Downloads (Last 12 months): 142
Downloads (Last 6 weeks): 30

Reflects downloads up to 01 Jan 2025

Citations

Cited By

Pan J, Wu H, Leesatapornwongsa T, Nath S, Huang P, Witchel E, Arpaci-Dusseau A, Rossbach C, Keeton K (2024). Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 46-62. DOI: 10.1145/3694715.3695979. Online publication date: 4-Nov-2024
https://dl.acm.org/doi/10.1145/3694715.3695979
Feng W, Pei Q, Gao Y, Wang D, Dou W, Wei J, Liang Z, Long Z, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed Systems. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pages 129-133. DOI: 10.1145/3639478.3640036. Online publication date: 14-Apr-2024
https://dl.acm.org/doi/10.1145/3639478.3640036
Winter L, Buse F, de Graaf D, von Gleissenthall K, Kulahcioglu Ozkan B (2023). Randomized Testing of Byzantine Fault Tolerant Algorithms. Proceedings of the ACM on Programming Languages, 7:OOPSLA1, pages 757-788. DOI: 10.1145/3586053. Online publication date: 6-Apr-2023
https://dl.acm.org/doi/10.1145/3586053
Wang D, Dou W, Gao Y, Wu C, Wei J, Huang T, Fedorova A, Narayanan D, Di Luna G, Querzoni L (2023). Model Checking Guided Testing for Distributed Systems. Proceedings of the Eighteenth European Conference on Computer Systems, pages 127-143. DOI: 10.1145/3552326.3587442. Online publication date: 8-May-2023
https://dl.acm.org/doi/10.1145/3552326.3587442
Qiu Z, Shao S, Zhao Q, Khan H, Hui X, Jin G, Lo D, McIntosh S, Novielli N (2022). A deep study of the effects and fixes of server-side request races in web applications. Proceedings of the 19th International Conference on Mining Software Repositories, pages 744-756. DOI: 10.1145/3524842.3528463. Online publication date: 23-May-2022
https://dl.acm.org/doi/10.1145/3524842.3528463
Gao Y, Wang D, Dai Q, Dou W, Wei J (2022). Common Data Guided Crash Injection for Cloud Systems. 2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pages 36-40. DOI: 10.1109/ICSE-Companion55297.2022.9793803. Online publication date: May-2022
https://doi.org/10.1109/ICSE-Companion55297.2022.9793803
Qiu Z, Shao S, Zhao Q, Jin G, Spinellis D, Gousios G, Chechik M, Di Penta M (2021). Understanding and detecting server-side request races in web applications. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 842-854. DOI: 10.1145/3468264.3468594. Online publication date: 20-Aug-2021
https://dl.acm.org/doi/10.1145/3468264.3468594
Tang Y, Zhao L, Yuan W, Wang X (2021). CausalTester: Measuring the Consistency of Replicated Services via Causality Semantics. 2021 IEEE 30th Asian Test Symposium (ATS), pages 49-54. DOI: 10.1109/ATS52891.2021.00021. Online publication date: Nov-2021
https://doi.org/10.1109/ATS52891.2021.00021
Yuan X, Yang J, Larus J, Ceze L, Strauss K (2020). Effective Concurrency Testing for Distributed Systems. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1141-1156. DOI: 10.1145/3373376.3378484. Online publication date: 9-Mar-2020
https://dl.acm.org/doi/10.1145/3373376.3378484
Chen H, Dou W, Wang D, Qin F, Grundy J, Le Goues C, Lo D (2020). CoFI. Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pages 536-547. DOI: 10.1145/3324884.3416548. Online publication date: 21-Dec-2020
https://dl.acm.org/doi/10.1145/3324884.3416548
