DOI: 10.1109/ASE.2019.00040
research-article

Understanding exception-related bugs in large-scale cloud systems

Published: 07 February 2020

Abstract

The exception mechanism is widely used in cloud systems, mainly because it separates error-handling code from the main business logic. However, the huge space of potential error conditions and the sophisticated logic of cloud systems make it hard to use the exception mechanism correctly. Mistakes in exception use may lead to severe consequences, such as system downtime and data loss. To address this issue, the community needs a better understanding of exception-related bugs (eBugs), i.e., bugs in cloud systems caused by incorrect use of the exception mechanism.
In this paper, we present a comprehensive study of 210 eBugs from six widely deployed cloud systems: Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper. For all the studied eBugs, we analyze their triggering conditions, root causes, and impacts, as well as the relations among them. To the best of our knowledge, this is the first study on eBugs in cloud systems, and the first one that focuses on triggering conditions. We find that eBugs are severe in cloud systems: 74% of the studied eBugs affect system availability or integrity. Fortunately, exposing eBugs through testing is possible: 54% of the eBugs are triggered by non-semantic conditions, such as network errors, and 40% of the eBugs can be triggered by simulating the triggering conditions at simple system states. Furthermore, we find that the triggering conditions are useful for detecting eBugs. Based on these findings, we build a static analysis tool, called DIET, and apply it to the latest versions of the studied systems. Our results show that DIET reports 31 bugs and bad practices, and 23 of them are confirmed by the developers as "previously-unknown" ones.
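To make the idea of an eBug triggered by a non-semantic condition concrete, the following is a minimal Java sketch. It is not taken from the paper, from DIET, or from any of the studied systems: the names EBugSketch, Transport, FaultyTransport, replicateBuggy, and replicateFixed are all hypothetical. It assumes a replication loop whose send step can fail with an IOException, and it simulates the network error at a simple state (the second send) to show how a swallowed exception silently drops a record.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; no name here comes from the paper or the studied systems. */
class EBugSketch {

    /** A transport whose send() may fail with a non-semantic error such as a network fault. */
    interface Transport {
        void send(String record) throws IOException;
    }

    /** Test double that injects an IOException on the n-th call to send(). */
    static final class FaultyTransport implements Transport {
        private final int failOnCall;
        private int calls = 0;
        final List<String> delivered = new ArrayList<>();

        FaultyTransport(int failOnCall) {
            this.failOnCall = failOnCall;
        }

        @Override
        public void send(String record) throws IOException {
            calls++;
            if (calls == failOnCall) {
                throw new IOException("injected network error");
            }
            delivered.add(record);
        }
    }

    /** Buggy variant: swallows the exception and still reports success. */
    static boolean replicateBuggy(List<String> records, Transport t) {
        for (String r : records) {
            try {
                t.send(r);
            } catch (IOException e) {
                // Exception ignored: the caller never learns that r was dropped.
            }
        }
        return true;
    }

    /** One possible fix: retry once, and report failure if the retry also fails. */
    static boolean replicateFixed(List<String> records, Transport t) {
        for (String r : records) {
            try {
                t.send(r);
            } catch (IOException first) {
                try {
                    t.send(r); // single retry
                } catch (IOException second) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c");

        FaultyTransport t1 = new FaultyTransport(2);
        boolean ok1 = replicateBuggy(records, t1);
        System.out.println("buggy: ok=" + ok1 + " delivered=" + t1.delivered);

        FaultyTransport t2 = new FaultyTransport(2);
        boolean ok2 = replicateFixed(records, t2);
        System.out.println("fixed: ok=" + ok2 + " delivered=" + t2.delivered);
    }
}
```

In this sketch the buggy variant returns true although record "b" never arrives, which is the kind of integrity impact the abstract describes; a test that injects the fault and compares the delivered records against the input is enough to expose it.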


Published In

ASE '19: Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering
November 2019
1333 pages
ISBN: 9781728125084

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%
