DOI: 10.1145/3674805.3690758

Debugging with Open-Source Large Language Models: An Evaluation

Published: 24 October 2024

Abstract

Large language models (LLMs) have shown good potential in supporting software development tasks, and more and more developers turn to LLMs (e.g., ChatGPT) for help with fixing their buggy code. While this can save time and effort, many companies prohibit it because of strict code-sharing policies. To address this, companies can run open-source LLMs locally. However, there is still little research evaluating how well open-source large language models perform at debugging. This work is a preliminary evaluation of the capabilities of open-source LLMs in fixing buggy code. The evaluation covers five open-source large language models and uses the DebugBench benchmark, which includes more than 4,000 buggy code instances written in Python, Java, and C++. The open-source LLMs achieved scores ranging from 43.9% to 66.6%, with DeepSeek-Coder achieving the best score for all three programming languages.
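To make the described workflow concrete, the sketch below shows one way to query a locally hosted open-source code LLM for a bug fix using the Hugging Face transformers library. It is a minimal illustration, not the evaluation harness used in the paper; the model checkpoint, prompt wording, and decoding settings are assumptions. In a DebugBench-style evaluation, the generated fix would then be checked against the benchmark's test cases.

# A minimal sketch (not the paper's harness): ask a locally hosted open-source
# code LLM to repair a buggy Python snippet, in the spirit of the evaluation
# described above. Model checkpoint and prompt wording are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/deepseek-coder-6.7b-instruct"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

buggy_code = """
def is_even(n):
    return n % 2 == 1  # bug: inverted condition
"""

prompt = (
    "The following Python function is buggy. "
    "Return a corrected version of the function only.\n" + buggy_code
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens (the proposed fix), not the prompt.
fix = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(fix)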

References

[1]
Meta AI. [n. d.]. Meta-Llama-3-8B-Instruct. Hugging Face. https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.
[2]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arxiv:2108.07732 [cs.PL]
[3]
Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. 2024. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs. arXiv preprint arXiv:2402.03927 (2024).
[4]
Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual Evaluation of Code Generation Models. In The Eleventh International Conference on Learning Representations.
[5]
Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Earl T. Barr, and Wei Le. 2024. A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection. arXiv preprint arXiv:2403.17218 (2024).
[6]
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
[7]
Jialun Cao, Wuqi Zhang, and Shing-Chi Cheung. 2024. Concerned with Data Contamination? Assessing Countermeasures in Code Language Model. arXiv preprint arXiv:2403.16898 (2024).
[8]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan..., and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arxiv:2107.03374 [cs.LG]
[9]
Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. 2023. An analysis of the automatic bug fixing performance of ChatGPT. In 2023 IEEE/ACM International Workshop on Automated Program Repair (APR). IEEE, 23–30.
[10]
Sidong Feng and Chunyang Chen. 2024. Prompting Is All You Need: Automated Android Bug Replay with Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13.
[11]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. https://arxiv.org/abs/2401.14196
[12]
Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. arXiv preprint arXiv:2403.17134 (2024).
[13]
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974 (2024).
[14]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024).
[15]
Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. In 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023. Association for Computational Linguistics (ACL), 1049–1065.
[16]
Cheryl Lee, Chunqiu Steven Xia, Jen-tse Huang, Zhouruixin Zhu, Lingming Zhang, and Michael R Lyu. 2024. A Unified Debugging Approach via LLM-Based Multi-Agent Synergy. arXiv preprint arXiv:2404.17153 (2024).
[17]
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, and Harm de Vries. 2023. StarCoder: may the source be with you! arxiv:2305.06161 [cs.CL]
[18]
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arxiv:2306.08568 [cs.CL]
[19]
Marcel Böhme, Ezekiel O. Soremekun, Sudipta Chattopadhyay, Emamurho Ugherughe, and Andreas Zeller. 2017. Where is the bug and how is it fixed? An experiment with practitioners. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 117–128.
[20]
Phind. [n. d.]. Phind-CodeLlama-34B-v2. Hugging Face. https://huggingface.co/Phind/Phind-CodeLlama-34B-v2.
[21]
Paul Ralph, Sebastian Baltes, Domenico Bianculli, Yvonne Dittrich, Michael Felderer, Robert Feldt, ..., and Sira Vegas. 2020. ACM SIGSOFT Empirical Standards. CoRR abs/2010.03525 (2020). arXiv:2010.03525 https://arxiv.org/abs/2010.03525
[22]
Ranim Khojah, Mazen Mohamad, Philipp Leitner, and Francisco Gomes de Oliveira Neto. 2024. Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice. Proc. ACM Softw. Eng., ACM International Conference on the Foundations of Software Engineering (FSE).
[23]
Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, and Shafiq Joty. 2024. How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library. arXiv preprint arXiv:2404.00699 (2024).
[24]
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,..., and Gabriel Synnaeve. 2024. Code Llama: Open Foundation Models for Code. arxiv:2308.12950 [cs.CL]
[25]
Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. arXiv preprint arXiv:2310.18018 (2023).
[26]
Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.
[27]
Sungmin Kang, Bei Chen, Shin Yoo, and Jian-Guang Lou. 2023. Explainable Automated Debugging via Large Language Model-driven Scientific Debugging. Proceedings of the 45th International Conference on Software Engineering (2023). https://doi.org/10.48550/ARXIV.2304.02195 arxiv:2304.02195 [cs.SE]
[28]
Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large language models are few-shot testers: Exploring LLM-based general bug reproduction. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2312–2323.
[29]
Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Zhiyuan Liu, and Maosong Sun. 2024. DebugBench: Evaluating Debugging Capability of Large Language Models. arxiv:2401.04621 [cs.SE]
[30]
Vikramank Singh, Kapil Eknath Vaidya, Vinayshekhar Bannihatti Kumar, Sopan Khosla, Murali Narayanaswamy, Rashmi Gangadharaiah, and Tim Kraska. 2024. Panda: Performance debugging for databases using LLM agents. Amazon Science (2024).
[31]
Wilhelm Hasselbring. 2021. Benchmarking as empirical standard in software engineering research. In Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering. 365–372.
[32]
Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E Gonzalez, and Ion Stoica. 2023. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850 (2023).
[33]
Yonghao Wu, Zheng Li, Jie M. Zhang, Mike Papadakis, Mark Harman, and Yong Liu. 2023. Large language models in fault localisation. arXiv preprint arXiv:2308.15276 (2023).
[34]
Li Zhong, Zilong Wang, and Jingbo Shang. 2024. LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step. CoRR (Feb. 2024). https://doi.org/10.48550/ARXIV.2402.16906 arxiv:2402.16906 [cs.SE]

Published In

ESEM '24: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement
October 2024
633 pages
ISBN: 9798400710476
DOI: 10.1145/3674805

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Debugging
  2. Large Language Models
  3. Open-Source LLMs

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ESEM '24

Acceptance Rates

Overall Acceptance Rate 130 of 594 submissions, 22%

