
MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

Published: 01 July 2023

Abstract

Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora containing several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), to 18 additional programming languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), and InCoder (Fried et al., 2022). We find that Codex matches or even exceeds its Python performance on several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
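
To make the abstract's central idea concrete, the sketch below shows what "compiling" a unit test-driven benchmark problem into new languages can look like. This is an illustration only, not MultiPL-E's actual implementation: the helper names (translate_value, compile_tests), the assertion templates, and the supported value types are simplifying assumptions, and a real translator must also handle type annotations, doctest rewriting, and deep equality for structured values.

```python
# Minimal sketch: "compile" (args, expected) unit tests for one benchmark
# problem into assertion suites in several target languages. Illustration
# only; MultiPL-E's real translators are far more thorough.

def translate_value(v, lang: str) -> str:
    """Render a Python value as a literal in the target language."""
    if isinstance(v, bool):               # check bool before int: bool
        return "true" if v else "false"   # is a subclass of int in Python
    if isinstance(v, (int, float)):
        return repr(v)
    if isinstance(v, str):
        return '"' + v.replace('"', '\\"') + '"'
    if isinstance(v, list):
        items = ", ".join(translate_value(x, lang) for x in v)
        # Collection literals differ by language: Lua uses braces.
        return ("{" + items + "}") if lang == "lua" else ("[" + items + "]")
    raise TypeError(f"unsupported value for translation: {v!r}")

# One equality assertion per language (hypothetical templates). Note that
# a faithful translator needs deep-equality helpers for lists: Lua's ==
# and JavaScript's === compare tables/arrays by reference, not contents.
ASSERT_TEMPLATES = {
    "lua":        "assert({call} == {expected})",
    "julia":      "@assert {call} == {expected}",
    "javascript": "console.assert({call} === {expected});",
}

def compile_tests(func: str, tests, lang: str) -> str:
    """Compile (args, expected) pairs into a target-language test suite."""
    lines = []
    for args, expected in tests:
        call = f"{func}({', '.join(translate_value(a, lang) for a in args)})"
        lines.append(ASSERT_TEMPLATES[lang].format(
            call=call, expected=translate_value(expected, lang)))
    return "\n".join(lines)

if __name__ == "__main__":
    # A HumanEval-style problem: the model must implement `candidate`,
    # and the same hidden tests are emitted in every target language.
    tests = [((2, 3), 5), ((-1, 1), 0)]
    for lang in ASSERT_TEMPLATES:
        print(f"--- {lang} ---")
        print(compile_tests("candidate", tests, lang))
```

A completion passes when the translated suite runs in the target language's toolchain without an assertion failure; sampling many completions per problem and checking each one yields pass-rate metrics such as pass@k (Chen et al., 2021).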

References

[1]
M. Chen et al., “Evaluating large language models trained on code,” 2021, arXiv:2107.03374.
[2]
J. Austin et al., “Program synthesis with large language models,” 2021, arXiv:2108.07732.
[3]
E. Nijkamp et al., “CodeGen: An open large language model for code with multi-turn program synthesis,” in Proc. Int. Conf. Learn. Representations, 2022.
[4]
D. Fried et al., “InCoder: A generative model for code infilling and synthesis,” in Proc. Int. Conf. Learn. Representations, 2022.
[5]
S. Black et al., “GPT-NeoX-20B: An open-source autoregressive language model,” 2022, arXiv:2204.06745.
[6]
F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A systematic evaluation of large language models of code,” in Proc. ACM SIGPLAN Int. Symp. Mach. Program., 2022, pp. 1–10.
[7]
A. Ziegler et al., “Productivity assessment of neural code completion,” in Proc. ACM SIGPLAN Int. Symp. Mach. Program., 2022, pp. 21–29.
[8]
S. Kulal et al., “SPoC: Search-based pseudocode to code,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 11906–11917.
[9]
D. Hendrycks et al., “Measuring coding challenge competence with APPS,” in Proc. Conf. Neural Inf. Process. Syst. Datasets Benchmarks Track, 2021.
[10]
P. Yin, B. Deng, E. Chen, B. Vasilescu, and G. Neubig, “Learning to mine aligned code and natural language pairs from Stack Overflow,” in Proc. Int. Conf. Mining Softw. Repositories, 2018, pp. 476–486.
[11]
R. Alur et al., “Syntax-guided synthesis,” in Proc. IEEE Conf. Formal Methods Comput.-Aided Des., 2013, pp. 1–8.
[12]
S. Chaudhuri, K. Ellis, O. Polozov, R. Singh, A. Solar-Lezama, and Y. Yue, “Neurosymbolic programming,” Found. Trends Program. Lang., vol. 7, no. 3, pp. 158–243, 2021.
[13]
S. Gulwani et al., “Program synthesis,” Found. Trends Program. Lang., vol. 4, no. 1/2, pp. 1–119, 2017.
[14]
T. Brown et al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1877–1901.
[15]
B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 billion parameter autoregressive language model,” 2021. [Online]. Available: https://github.com/kingoflolz/mesh-transformer-jax
[16]
Z. Feng et al., “CodeBERT: A pre-trained model for programming and natural languages,” in Proc. Findings Assoc. Comput. Linguistics, 2020, pp. 1536–1547.
[17]
C. Clement, D. Drain, J. Timcheck, A. Svyatkovskiy, and N. Sundaresan, “PyMT5: Multi-mode translation of natural language and Python code with transformers,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2020, pp. 9052–9065.
[18]
M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers and focal context,” 2020, arXiv:2009.05617.
[19]
S. Lu et al., “CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,” in Proc. Conf. Neural Inf. Process. Syst. Datasets Benchmarks Track, 2021.
[20]
T. Ahmed and P. Devanbu, “Multilingual training for software engineering,” in Proc. IEEE/ACM Int. Conf. Softw. Eng., 2022, pp. 1443–1455.
[21]
J. Wei, M. Goyal, G. Durrett, and I. Dillig, “LambdaNet: Probabilistic type inference using graph neural networks,” in Proc. Int. Conf. Learn. Representations, 2020.
[22]
V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, “Deep learning type inference,” in Proc. 26th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2018, pp. 152–162.
[23]
M. Pradel, G. Gousios, J. Liu, and S. Chandra, “TypeWriter: Neural type prediction with search-based validation,” in Proc. ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2020, pp. 209–220.
[24]
I. Drori et al., “A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level,” Proc. Nat. Acad. Sci. USA, vol. 119, no. 32, 2022, Art. no. e2123433119.
[25]
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in Proc. Int. Conf. Learn. Representations, 2020.
[26]
S. Ren et al., “CodeBLEU: A method for automatic evaluation of code synthesis,” 2020, arXiv:2009.10297.
[27]
T. Yu et al., “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 3911–3921.
[28]
V. Zhong, C. Xiong, and R. Socher, “Seq2SQL: Generating structured queries from natural language using reinforcement learning,” 2017, arXiv:1709.00103.
[29]
A. Vaswani et al., “Attention is all you need,” in Proc. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[30]
A. Aghajanyan et al., “CM3: A causal masked multimodal model of the Internet,” 2022, arXiv:2201.07520.
[31]
L. Gao et al., “The Pile: An 800GB dataset of diverse text for language modeling,” 2021, arXiv:2101.00027.
[32]
D. Bates, M. Mächler, B. Bolker, and S. Walker, “Fitting linear mixed-effects models using lme4,” J. Statist. Softw., vol. 67, no. 1, pp. 1–48, 2015.
[33]
L. B. Allal et al., “SantaCoder: Don't reach for the stars!,” in Proc. Deep Learn. Code Workshop, 2023.
[34]
B. Chen et al., “CodeT: Code generation with generated tests,” 2022, arXiv:2207.10397.
[35]
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
[36]
Z. Wang, G. Cuenca, S. Zhou, F. F. Xu, and G. Neubig, “MCoNaLa: A benchmark for code generation from multiple natural languages,” 2022, arXiv:2203.08388.
[37]
S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer, “Mapping language to code in programmatic context,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 1643–1652.
[38]
L. Tunstall, L. von Werra, and T. Wolf, Natural Language Processing With Transformers. Sebastopol, CA, USA: O’Reilly Media, 2022.
[39]
Y. Li et al., “Competition-level code generation with AlphaCode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022.
[40]
A. Chowdhery et al., “PaLM: Scaling language modeling with pathways,” 2022, arXiv:2204.02311.
[41]
B. Athiwaratkun et al., “Multi-lingual evaluation of code generation models,” in Proc. Int. Conf. Learn. Representations, 2022.


Published In

IEEE Transactions on Software Engineering, Volume 49, Issue 7, July 2023, 295 pages

Publisher

IEEE Press

Qualifiers

  • Research-article


Cited By

  • Self-infilling code generation. Proceedings of the 41st International Conference on Machine Learning, pp. 61614–61648, Jul. 2024. DOI: 10.5555/3692070.3694618
  • Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs. Proceedings of the ACM on Programming Languages, vol. 8, no. OOPSLA2, pp. 677–708, Oct. 2024. DOI: 10.1145/3689735
  • LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models. ACM Transactions on Software Engineering and Methodology, vol. 33, no. 7, pp. 1–38, Aug. 2024. DOI: 10.1145/3664812
  • Generating and Reviewing Programming Codes with Large Language Models: A Systematic Mapping Study. Proceedings of the 20th Brazilian Symposium on Information Systems, pp. 1–10, May 2024. DOI: 10.1145/3658271.3658342
  • CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 124–136, Sep. 2024. DOI: 10.1145/3650212.3652115
  • Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. ACM Transactions on Software Engineering and Methodology, vol. 33, no. 5, pp. 1–26, Jun. 2024. DOI: 10.1145/3643674
  • Can Large Language Models Write Parallel Code? Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 281–294, Jun. 2024. DOI: 10.1145/3625549.3658689
  • How Beginning Programmers and Code LLMs (Mis)read Each Other. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–26, May 2024. DOI: 10.1145/3613904.3642706
  • Isolating Compiler Bugs by Generating Effective Witness Programs With Large Language Models. IEEE Transactions on Software Engineering, vol. 50, no. 7, pp. 1768–1788, Jul. 2024. DOI: 10.1109/TSE.2024.3397822
  • Promoting open science in test-driven software experiments. Journal of Systems and Software, vol. 212, Jun. 2024. DOI: 10.1016/j.jss.2024.111971
