DOI: 10.1145/3236024.3236071
research-article

CloudRaid: hunting concurrency bugs in the cloud via log-mining

Published: 26 October 2018

Abstract

Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to detect and often lead to data loss and service outages. This paper presents CloudRaid, a new and effective tool for combating distributed concurrency bugs. CloudRaid automatically detects concurrency bugs in cloud systems by analyzing and testing those message orderings that are likely to expose errors. We observe that large-scale online cloud applications process millions of user requests per second, extensively exercising many permutations of message orderings. Message orderings that are already sufficiently tested are unlikely to expose errors. Hence, CloudRaid mines logs from previous executions to uncover message orderings that are feasible but not sufficiently tested. Specifically, CloudRaid tries to flip the order of a pair of messages <S,P> if they may happen in parallel but S always arrives before P in the existing logs, i.e., it exercises the order PS. This log-based approach makes CloudRaid suitable for live systems.
We have applied CloudRaid to automatically test four representative distributed systems: Apache Hadoop2/Yarn, HBase, HDFS, and Cassandra. CloudRaid can automatically test 40 different versions of the 4 systems (10 versions per system) in 35 hours, and can successfully trigger 28 concurrency bugs, including 8 new bugs that have never been found before. All 8 new bugs have been confirmed by their original developers, and 3 of them are considered critical and have already been fixed.
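
The pair-selection idea described in the abstract (flip <S,P> when the two messages may happen in parallel but the logs only ever show S before P) can be illustrated with a minimal Python sketch. This is not CloudRaid's implementation: the function name `candidate_flips`, the representation of a logged run as a list of message IDs, and the `may_be_parallel` input are all assumptions made for illustration. How the may-happen-in-parallel relation is obtained is outside this sketch, so it is simply passed in.

```python
from itertools import combinations


def candidate_flips(runs, may_be_parallel):
    """Return pairs (S, P) worth flipping, under the sketch's assumptions.

    runs: list of logged executions, each a list of message IDs in arrival order
          (hypothetical representation).
    may_be_parallel: set of frozenset({a, b}) pairs assumed able to run in parallel
          (hypothetical input, computed elsewhere).
    """
    observed = set()  # (a, b) means a arrived before b in at least one run
    for run in runs:
        # combinations() preserves positional order, so a precedes b in the run
        for a, b in combinations(run, 2):
            if a != b:
                observed.add((a, b))

    # Keep pairs seen only in the order S-then-P and never P-then-S,
    # restricted to pairs that may happen in parallel.
    return sorted(
        (s, p)
        for (s, p) in observed
        if (p, s) not in observed and frozenset((s, p)) in may_be_parallel
    )


# Hypothetical logs: S always precedes P, while S and Q appear in both orders.
runs = [["S", "P", "Q"], ["S", "Q", "P"], ["Q", "S", "P"]]
parallel = {frozenset(("S", "P")), frozenset(("S", "Q"))}
print(candidate_flips(runs, parallel))
```

In this toy input only the pair (S, P) is reported: the order SQ and its reverse QS both already occur in the logs, so that pair counts as sufficiently exercised, whereas PS has never been observed and is the ordering a tester would try to force.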

ESEC/FSE ’18, November 4–9, 2018, Lake Buena Vista, FL, USA. Authors: Jie Lu, Feng Li, Lian Li, and Xiaobing Feng.


Information

Published In

ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
October 2018
987 pages
ISBN: 9781450355735
DOI: 10.1145/3236024
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Bug Detection
  2. Cloud Computing
  3. Concurrency Bugs
  4. Distributed Systems

Qualifiers

  • Research-article

Conference

ESEC/FSE '18
Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Article Metrics

  • Downloads (Last 12 months): 55
  • Downloads (Last 6 weeks): 6

Reflects downloads up to 13 Dec 2024

Cited By

  • (2024) A Review of Software Testing Process Log Parsing and Mining. 2024 IEEE International Conference on Software Services Engineering (SSE), 334–343. DOI: 10.1109/SSE62657.2024.00055. Online publication date: 7-Jul-2024.
  • (2024) End-to-end log statement generation at block-level. Journal of Systems and Software 216, 112146. DOI: 10.1016/j.jss.2024.112146. Online publication date: Oct-2024.
  • (2023) LoGenText-Plus: Improving Neural Machine Translation Based Logging Texts Generation with Syntactic Templates. ACM Transactions on Software Engineering and Methodology 33:2, 1–45. DOI: 10.1145/3624740. Online publication date: 22-Dec-2023.
  • (2023) Adonis: Practical and Efficient Control Flow Recovery through OS-level Traces. ACM Transactions on Software Engineering and Methodology 33:1, 1–27. DOI: 10.1145/3607187. Online publication date: 4-Jul-2023.
  • (2023) LogKG: Log Failure Diagnosis Through Knowledge Graph. IEEE Transactions on Services Computing 16:5, 3493–3507. DOI: 10.1109/TSC.2023.3293890. Online publication date: Sep-2023.
  • (2023) LogRule: Efficient Structured Log Mining for Root Cause Analysis. IEEE Transactions on Network and Service Management 20:4, 4231–4243. DOI: 10.1109/TNSM.2023.3282270. Online publication date: Dec-2023.
  • (2023) DyCause: Crowdsourcing to Diagnose Microservice Kernel Failure. IEEE Transactions on Dependable and Secure Computing 20:6, 4763–4777. DOI: 10.1109/TDSC.2022.3233915. Online publication date: Nov-2023.
  • (2022) Investigating and improving log parsing in practice. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1566–1577. DOI: 10.1145/3540250.3558947. Online publication date: 7-Nov-2022.
  • (2022) TeLL: log level suggestions via modeling multi-level code block information. Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 27–38. DOI: 10.1145/3533767.3534379. Online publication date: 18-Jul-2022.
  • (2022) A deep study of the effects and fixes of server-side request races in web applications. Proceedings of the 19th International Conference on Mining Software Repositories, 744–756. DOI: 10.1145/3524842.3528463. Online publication date: 23-May-2022.
