DOI: 10.1145/3674805.3690758

Debugging with Open-Source Large Language Models: An Evaluation

Published: 24 October 2024

Abstract

Large language models (LLMs) have shown good potential in supporting software development tasks, and more and more developers turn to LLMs (e.g., ChatGPT) for help with fixing their buggy code. While this can save time and effort, many companies prohibit it because of strict code-sharing policies. To address this, companies can run open-source LLMs locally. However, there is still little research evaluating how well open-source large language models perform at debugging. This work is a preliminary evaluation of the capabilities of open-source LLMs in fixing buggy code. The evaluation covers five open-source large language models and uses the DebugBench benchmark, which includes more than 4,000 buggy code instances written in Python, Java, and C++. The open-source LLMs achieved scores ranging from 43.9% to 66.6%, with DeepSeek-Coder achieving the best score for all three programming languages.
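To make the described workflow concrete, the sketch below shows one way to query a locally hosted open-source code LLM for a bug fix using the Hugging Face transformers library. It is a minimal illustration, not the evaluation harness used in the paper; the model checkpoint, prompt wording, and decoding settings are assumptions. In a DebugBench-style evaluation, the generated fix would then be checked against the benchmark's test cases.

# A minimal sketch (not the paper's harness): ask a locally hosted open-source
# code LLM to repair a buggy Python snippet, in the spirit of the evaluation
# described above. Model checkpoint and prompt wording are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/deepseek-coder-6.7b-instruct"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

buggy_code = """
def is_even(n):
    return n % 2 == 1  # bug: inverted condition
"""

prompt = (
    "The following Python function is buggy. "
    "Return a corrected version of the function only.\n" + buggy_code
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens (the proposed fix), not the prompt.
fix = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(fix)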

References

[1]
Meta AI. [n. d.]. Meta-Llama-3-8B-Instruct. Hugging Face. https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.
[2]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arxiv:2108.07732 [cs.PL]
[3]
Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. 2024. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs. arXiv preprint arXiv:2402.03927 (2024).
[4]
Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual Evaluation of Code Generation Models. In The Eleventh International Conference on Learning Representations.
[5]
Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Earl T. Barr, and Wei Le. 2024. A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection. arXiv preprint arXiv:2403.17218 (2024).
[6]
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
[7]
Jialun Cao, Wuqi Zhang, and Shing-Chi Cheung. 2024. Concerned with Data Contamination? Assessing Countermeasures in Code Language Model. arXiv preprint arXiv:2403.16898 (2024).
[8]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan..., and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arxiv:2107.03374 [cs.LG]
[9]
Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. 2023. An analysis of the automatic bug fixing performance of ChatGPT. In 2023 IEEE/ACM International Workshop on Automated Program Repair (APR). IEEE, 23–30.
[10]
Sidong Feng and Chunyang Chen. 2024. Prompting Is All You Need: Automated Android Bug Replay with Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13.
[11]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. https://arxiv.org/abs/2401.14196
[12]
Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. arXiv preprint arXiv:2403.17134 (2024).
[13]
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974 (2024).
[14]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024).
[15]
Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. In 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023. Association for Computational Linguistics (ACL), 1049–1065.
[16]
Cheryl Lee, Chunqiu Steven Xia, Jen-tse Huang, Zhouruixin Zhu, Lingming Zhang, and Michael R Lyu. 2024. A Unified Debugging Approach via LLM-Based Multi-Agent Synergy. arXiv preprint arXiv:2404.17153 (2024).
[17]
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, and Harm de Vries. 2023. StarCoder: may the source be with you! arxiv:2305.06161 [cs.CL]
[18]
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arxiv:2306.08568 [cs.CL]
[19]
Marcel Böhme, Ezekiel O. Soremekun, Sudipta Chattopadhyay, Emamurho Ugherughe, and Andreas Zeller. 2017. Where is the bug and how is it fixed? An experiment with practitioners. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 117–128.
[20]
Phind. [n. d.]. Phind-CodeLlama-34B-v2. Hugging Face. https://huggingface.co/Phind/Phind-CodeLlama-34B-v2.
[21]
Paul Ralph, Sebastian Baltes, Domenico Bianculli, Yvonne Dittrich, Michael Felderer, Robert Feldt, ..., and Sira Vegas. 2020. ACM SIGSOFT Empirical Standards. CoRR abs/2010.03525 (2020). arXiv:2010.03525 https://arxiv.org/abs/2010.03525
[22]
Ranim Khojah, Mazen Mohamad, Philipp Leitner, and Francisco Gomes de Oliveira Neto. 2024. Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice. Proc. ACM Softw. Eng., ACM International Conference on the Foundations of Software Engineering (FSE).
[23]
Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, and Shafiq Joty. 2024. How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library. arXiv preprint arXiv:2404.00699 (2024).
[24]
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,..., and Gabriel Synnaeve. 2024. Code Llama: Open Foundation Models for Code. arxiv:2308.12950 [cs.CL]
[25]
Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. arXiv preprint arXiv:2310.18018 (2023).
[26]
Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.
[27]
Sungmin Kang, Bei Chen, Shin Yoo, and Jian-Guang Lou. 2023. Explainable Automated Debugging via Large Language Model-driven Scientific Debugging. Proceedings of the 45th International Conference on Software Engineering (2023). https://doi.org/10.48550/ARXIV.2304.02195 arxiv:2304.02195 [cs.SE]
[28]
Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large language models are few-shot testers: Exploring LLM-based general bug reproduction. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2312–2323.
[29]
Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Zhiyuan Liu, and Maosong Sun. 2024. DebugBench: Evaluating Debugging Capability of Large Language Models. arxiv:2401.04621 [cs.SE]
[30]
Vikramank Singh, Kapil Eknath Vaidya, Vinayshekhar Bannihatti Kumar, Sopan Khosla, Murali Narayanaswamy, Rashmi Gangadharaiah, and Tim Kraska. 2024. Panda: Performance debugging for databases using LLM agents. Amazon Science (2024).
[31]
Wilhelm Hasselbring. 2021. Benchmarking as empirical standard in software engineering research. In Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering. 365–372.
[32]
Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E Gonzalez, and Ion Stoica. 2023. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850 (2023).
[33]
Yonghao Wu, Zheng Li, Jie M. Zhang, Mike Papadakis, Mark Harman, and Yong Liu. 2023. Large language models in fault localisation. arXiv preprint arXiv:2308.15276 (2023).
[34]
Li Zhong, Zilong Wang, and Jingbo Shang. 2024. LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step. CoRR (Feb. 2024). https://doi.org/10.48550/ARXIV.2402.16906 arxiv:2402.16906 [cs.SE]

Published In

ESEM '24: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement
October 2024
633 pages
ISBN: 9798400710476
DOI: 10.1145/3674805

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Debugging
  2. Large Language Models
  3. Open-Source LLMs

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ESEM '24

Acceptance Rates

Overall Acceptance Rate 130 of 594 submissions, 22%

