ManyTypes4TypeScript: a comprehensive TypeScript dataset for sequence-based type inference
DOI: 10.1145/3524842.3528507
Short paper · Open access


Published: 17 October 2022

Abstract

In this paper, we present ManyTypes4TypeScript, a very large corpus for training and evaluating machine-learning models for sequence-based type inference in TypeScript. The dataset includes over 9 million type annotations across 13,953 projects and 539,571 files. It is approximately 10x larger than analogous type-inference datasets for Python, and is the largest available for TypeScript. We also provide API access to the dataset, which can be integrated with any tokenizer and used with any state-of-the-art sequence-based model. Finally, we provide analysis and baseline performance results for state-of-the-art code-specific models. ManyTypes4TypeScript is available on Huggingface, Zenodo, and CodeXGLUE.


Cited By

  • OppropBERL: A GNN and BERT-Style Reinforcement Learning-Based Type Inference. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 131-136. DOI: 10.1109/SANER60148.2024.00021. Published: 12 Mar 2024.
  • Learning Program Representations with a Tree-Structured Transformer. 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 248-259. DOI: 10.1109/SANER56733.2023.00032. Published: Mar 2023.
  • FlexType: A Plug-and-Play Framework for Type Inference Models. Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1-5. DOI: 10.1145/3551349.3559527. Published: 10 Oct 2022.


Published In

MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories, May 2022, 815 pages.
ISBN: 9781450393034
DOI: 10.1145/3524842
This work is licensed under a Creative Commons Attribution 4.0 International License.

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. TypeScript
  2. code properties
  3. machine learning
  4. type inference

Qualifiers

  • Short-paper

Funding Sources

  • NSF CISE

Conference

MSR '22


