
MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

Published: 01 July 2023

Abstract

Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora containing several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), to 18 additional programming languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), and InCoder (Fried et al., 2022). We find that Codex matches or even exceeds its Python performance on several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
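
To make the abstract's central idea concrete, the sketch below shows what "compiling" a unit test-driven benchmark problem into new languages can look like. This is an illustration only, not MultiPL-E's actual implementation: the helper names (translate_value, compile_tests), the assertion templates, and the supported value types are simplifying assumptions, and a real translator must also handle type annotations, doctest rewriting, and deep equality for structured values.

```python
# Minimal sketch: "compile" (args, expected) unit tests for one benchmark
# problem into assertion suites in several target languages. Illustration
# only; MultiPL-E's real translators are far more thorough.

def translate_value(v, lang: str) -> str:
    """Render a Python value as a literal in the target language."""
    if isinstance(v, bool):               # check bool before int: bool
        return "true" if v else "false"   # is a subclass of int in Python
    if isinstance(v, (int, float)):
        return repr(v)
    if isinstance(v, str):
        return '"' + v.replace('"', '\\"') + '"'
    if isinstance(v, list):
        items = ", ".join(translate_value(x, lang) for x in v)
        # Collection literals differ by language: Lua uses braces.
        return ("{" + items + "}") if lang == "lua" else ("[" + items + "]")
    raise TypeError(f"unsupported value for translation: {v!r}")

# One equality assertion per language (hypothetical templates). Note that
# a faithful translator needs deep-equality helpers for lists: Lua's ==
# and JavaScript's === compare tables/arrays by reference, not contents.
ASSERT_TEMPLATES = {
    "lua":        "assert({call} == {expected})",
    "julia":      "@assert {call} == {expected}",
    "javascript": "console.assert({call} === {expected});",
}

def compile_tests(func: str, tests, lang: str) -> str:
    """Compile (args, expected) pairs into a target-language test suite."""
    lines = []
    for args, expected in tests:
        call = f"{func}({', '.join(translate_value(a, lang) for a in args)})"
        lines.append(ASSERT_TEMPLATES[lang].format(
            call=call, expected=translate_value(expected, lang)))
    return "\n".join(lines)

if __name__ == "__main__":
    # A HumanEval-style problem: the model must implement `candidate`,
    # and the same hidden tests are emitted in every target language.
    tests = [((2, 3), 5), ((-1, 1), 0)]
    for lang in ASSERT_TEMPLATES:
        print(f"--- {lang} ---")
        print(compile_tests("candidate", tests, lang))
```

A completion passes when the translated suite runs in the target language's toolchain without an assertion failure; sampling many completions per problem and checking each one yields pass-rate metrics such as pass@k (Chen et al., 2021).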

References

[1]
M. Chen et al., “Evaluating large language models trained on code,” 2021, arXiv:2107.03374.
[2]
J. Austin et al., “Program synthesis with large language models,” 2021, arXiv:2108.07732.
[3]
E. Nijkamp et al., “CodeGen: An open large language model for code with multi-turn program synthesis,” in Proc. Int. Conf. Learn. Representations, 2022.
[4]
D. Fried et al., “InCoder: A generative model for code infilling and synthesis,” in Proc. Int. Conf. Learn. Representations, 2022.
[5]
S. Black et al., “GPT-NeoX-20B: An open-source autoregressive language model,” 2022, arXiv:2204.06745.
[6]
F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A systematic evaluation of large language models of code,” in Proc. ACM SIGPLAN Int. Symp. Mach. Program., 2022, pp. 1–10.
[7]
A. Ziegler et al., “Productivity assessment of neural code completion,” in Proc. ACM SIGPLAN Int. Symp. Mach. Program., 2022, pp. 21–29.
[8]
S. Kulal et al., “SPoC: Search-based pseudocode to code,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 11906–11917.
[9]
D. Hendrycks et al., “Measuring coding challenge competence with APPS,” in Proc. Conf. Neural Inf. Process. Syst. Datasets Benchmarks Track, 2021.
[10]
P. Yin, B. Deng, E. Chen, B. Vasilescu, and G. Neubig, “Learning to mine aligned code and natural language pairs from Stack Overflow,” in Proc. Int. Conf. Mining Softw. Repositories, 2018, pp. 476–486.
[11]
R. Alur et al., “Syntax-guided synthesis,” in Proc. IEEE Conf. Formal Methods Comput.-Aided Des., 2013, pp. 1–8.
[12]
S. Chaudhuri, K. Ellis, O. Polozov, R. Singh, A. Solar-Lezama, and Y. Yue, “Neurosymbolic programming,” Found. Trends Program. Lang., vol. 7, no. 3, pp. 158–243, 2021.
[13]
S. Gulwani et al., “Program synthesis,” Found. Trends Program. Lang., vol. 4, no. 1/2, pp. 1–119, 2017.
[14]
T. Brown et al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1877–1901.
[15]
B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 billion parameter autoregressive language model,” 2021. [Online]. Available: https://github.com/kingoflolz/mesh-transformer-jax
[16]
Z. Feng et al., “CodeBERT: A pre-trained model for programming and natural languages,” in Proc. Findings Assoc. Comput. Linguistics, 2020, pp. 1536–1547.
[17]
C. Clement, D. Drain, J. Timcheck, A. Svyatkovskiy, and N. Sundaresan, “PyMT5: Multi-mode translation of natural language and Python code with transformers,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2020, pp. 9052–9065.
[18]
M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers and focal context,” 2020, arXiv:2009.05617.
[19]
S. Lu et al., “CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,” in Proc. Conf. Neural Inf. Process. Syst. Datasets Benchmarks Track, 2021.
[20]
T. Ahmed and P. Devanbu, “Multilingual training for software engineering,” in Proc. IEEE/ACM Int. Conf. Softw. Eng., 2022, pp. 1443–1455.
[21]
J. Wei, M. Goyal, G. Durrett, and I. Dillig, “LambdaNet: Probabilistic type inference using graph neural networks,” in Proc. Int. Conf. Learn. Representations, 2020.
[22]
V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, “Deep learning type inference,” in Proc. 26th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2018, pp. 152–162.
[23]
M. Pradel, G. Gousios, J. Liu, and S. Chandra, “TypeWriter: Neural type prediction with search-based validation,” in Proc. ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2020, pp. 209–220.
[24]
I. Drori et al., “A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level,” Proc. Nat. Acad. Sci. USA, vol. 119, no. 32, 2022, Art. no. e2123433119.
[25]
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in Proc. Int. Conf. Learn. Representations, 2020.
[26]
S. Ren et al., “CodeBLEU: A method for automatic evaluation of code synthesis,” 2020, arXiv:2009.10297.
[27]
T. Yu et al., “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 3911–3921.
[28]
V. Zhong, C. Xiong, and R. Socher, “Seq2SQL: Generating structured queries from natural language using reinforcement learning,” 2017, arXiv:1709.00103.
[29]
A. Vaswani et al., “Attention is all you need,” in Proc. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[30]
A. Aghajanyan et al., “CM3: A causal masked multimodal model of the Internet,” 2022, arXiv:2201.07520.
[31]
L. Gao et al., “The Pile: An 800GB dataset of diverse text for language modeling,” 2021, arXiv:2101.00027.
[32]
D. Bates, M. Mächler, B. Bolker, and S. Walker, “Fitting linear mixed-effects models using lme4,” J. Statist. Softw., vol. 67, no. 1, pp. 1–48, 2015.
[33]
L. B. Allal et al., “SantaCoder: Don't reach for the stars!,” in Proc. Deep Learn. Code Workshop, 2023.
[34]
B. Chen et al., “CodeT: Code generation with generated tests,” 2022, arXiv:2207.10397.
[35]
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
[36]
Z. Wang, G. Cuenca, S. Zhou, F. F. Xu, and G. Neubig, “MCoNaLa: A benchmark for code generation from multiple natural languages,” 2022, arXiv:2203.08388.
[37]
S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer, “Mapping language to code in programmatic context,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 1643–1652.
[38]
L. Tunstall, L. von Werra, and T. Wolf, Natural Language Processing With Transformers. Sebastopol, CA, USA: O’Reilly Media, 2022.
[39]
Y. Li et al., “Competition-level code generation with AlphaCode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022.
[40]
A. Chowdhery et al., “PaLM: Scaling language modeling with pathways,” 2022, arXiv:2204.02311.
[41]
B. Athiwaratkun et al., “Multi-lingual evaluation of code generation models,” in Proc. Int. Conf. Learn. Representations, 2022.


Published In

IEEE Transactions on Software Engineering, Volume 49, Issue 7, July 2023, 295 pages

Publisher

IEEE Press

Qualifiers

  • Research-article


Cited By

  • Self-infilling code generation. Proceedings of the 41st International Conference on Machine Learning, pp. 61614–61648, Jul. 2024. DOI: 10.5555/3692070.3694618
  • Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs. Proceedings of the ACM on Programming Languages, vol. 8, no. OOPSLA2, pp. 677–708, Oct. 2024. DOI: 10.1145/3689735
  • LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models. ACM Transactions on Software Engineering and Methodology, vol. 33, no. 7, pp. 1–38, Aug. 2024. DOI: 10.1145/3664812
  • Generating and Reviewing Programming Codes with Large Language Models: A Systematic Mapping Study. Proceedings of the 20th Brazilian Symposium on Information Systems, pp. 1–10, May 2024. DOI: 10.1145/3658271.3658342
  • CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 124–136, Sep. 2024. DOI: 10.1145/3650212.3652115
  • Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. ACM Transactions on Software Engineering and Methodology, vol. 33, no. 5, pp. 1–26, Jun. 2024. DOI: 10.1145/3643674
  • Can Large Language Models Write Parallel Code? Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 281–294, Jun. 2024. DOI: 10.1145/3625549.3658689
  • How Beginning Programmers and Code LLMs (Mis)read Each Other. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–26, May 2024. DOI: 10.1145/3613904.3642706
  • Isolating Compiler Bugs by Generating Effective Witness Programs With Large Language Models. IEEE Transactions on Software Engineering, vol. 50, no. 7, pp. 1768–1788, Jul. 2024. DOI: 10.1109/TSE.2024.3397822
  • Promoting open science in test-driven software experiments. Journal of Systems and Software, vol. 212, Jun. 2024. DOI: 10.1016/j.jss.2024.111971
