DOI: 10.1145/3694715.3695971
research-article
Open access

If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

Published: 15 November 2024

Abstract

Retry, the re-execution of a task on failure, is a common mechanism to enable resilient software systems. Yet, despite its commonality and long history, retry remains difficult to implement and test.
Guided by our study of real-world retry issues, we propose a novel suite of static and dynamic techniques to detect retry problems in software. We find that the ad-hoc nature of retry implementation poses challenges for traditional program analysis but can be well suited for large language models; and that carefully repurposing existing unit tests can, along with fault injection, expose various types of retry problems.
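
To make the idea of pairing fault injection with an existing test concrete, here is a minimal sketch. It is not the paper's tooling; every name in it (`TransientError`, `retry`, `FaultInjector`, `check_retry_behaviour`) is hypothetical, and the two properties it checks (a cap on attempts and a pause between attempts) are chosen here purely for illustration. A simple task is wrapped so that its first few calls fail, and the check then observes how the retry logic responds.

```python
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a recoverable failure."""


def retry(task, max_attempts=3, delay_s=0.1, sleep=time.sleep):
    """Re-execute task on TransientError, at most max_attempts times,
    pausing delay_s seconds between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # cap reached: surface the failure instead of looping
            sleep(delay_s)


class FaultInjector:
    """Wraps a task so that its first `failures` calls raise TransientError."""

    def __init__(self, task, failures):
        self.task = task
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientError(f"injected fault #{self.calls}")
        return self.task()


def check_retry_behaviour():
    """Run an ordinary task under injected faults and record what happened."""
    sleeps = []  # records each requested pause instead of really sleeping

    # Case 1: two injected faults, then success.
    flaky = FaultInjector(lambda: "ok", failures=2)
    result = retry(flaky, max_attempts=3, delay_s=0.5, sleep=sleeps.append)

    # Case 2: the fault never clears; a capped retry must eventually give up.
    always_failing = FaultInjector(lambda: "ok", failures=10**9)
    try:
        retry(always_failing, max_attempts=3, delay_s=0.5, sleep=lambda s: None)
        gave_up = False
    except TransientError:
        gave_up = True

    return result, flaky.calls, sleeps, gave_up, always_failing.calls
```

Passing the `sleep` function in as a parameter lets the check observe whether the retry pauses between attempts without slowing the test down. A retry loop with no cap would never return in the second case, which is the kind of problem such a check is meant to surface.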

Published In

SOSP '24: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles
November 2024
765 pages
ISBN: 9798400712517
DOI: 10.1145/3694715
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

In-Cooperation

  • USENIX

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

SOSP '24 paper acceptance rate: 43 of 245 submissions (18%)
Overall acceptance rate: 174 of 961 submissions (18%)

Article Metrics

  • Total Citations: 0
  • Total Downloads: 531
  • Downloads (last 12 months): 531
  • Downloads (last 6 weeks): 265

Reflects downloads up to 02 Mar 2025
