DOI: 10.1145/3639477.3639719
research-article

CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model

Published: 31 May 2024

Abstract

Code Large Language Models (Code LLMs) have gained significant attention in the industry due to their wide applications in the full lifecycle of software engineering. However, the effectiveness of existing models in understanding non-English inputs for multi-lingual code-related tasks is still far from well studied. This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages. CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset that is carefully filtered by program analyzers and optimized during the training process. Extensive experiments are conducted using real-world usage scenarios, the industry-standard benchmark HumanEval-X, and the specially designed CodeFuseEval for Chinese prompts. To assess the effectiveness of CodeFuse, we actively collected valuable human feedback from Ant Group's software development process, where CodeFuse has been successfully deployed. The results demonstrate that CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%, positioning it as one of the top multi-lingual code LLMs with similar parameter sizes. In practical scenarios, such as code generation, code translation, code comments, and test case generation, CodeFuse performs better than other models when confronted with Chinese prompts.
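
For readers unfamiliar with the pass@1 metric quoted above, the following is a minimal sketch of the unbiased pass@k estimator introduced with HumanEval by Chen et al. (2021). The abstract does not state how the reported score was computed, so treating it as this estimator is an assumption. The function names and the toy per-problem counts are hypothetical and are not data from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over all problems in a benchmark."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (n, c) counts for four problems; illustrative only.
toy_results = [(10, 3), (10, 0), (10, 10), (10, 1)]
score = benchmark_pass_at_k(toy_results, k=1)
print(f"pass@1 = {score:.2%}")
```

For k = 1 the estimate for a single problem reduces to c / n, the fraction of generated samples that pass, so a benchmark-level pass@1 is the mean of those fractions across problems.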


    Published In

    ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice
    April 2024
    480 pages
    ISBN:9798400705014
    DOI:10.1145/3639477

    Sponsors

    In-Cooperation

    • Faculty of Engineering of University of Porto

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. code large language models
    2. multi-lingual
    3. Chinese prompts

    Qualifiers

    • Research-article

    Conference

    ICSE-SEIP '24