poster

Learning to mine parallel natural language/source code corpora from stack overflow

Authors:

Graham Neubig

ICSE '18: Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings

Pages 388 - 389

https://doi.org/10.1145/3183440.3195021

Published: 27 May 2018

Abstract

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for creating such a data set, but existing heuristic methods are limited in both their coverage and the correctness of the NL-code pairs they obtain. In this paper, we propose a method to mine high-quality aligned data from SO by training a classifier using two sets of features: hand-crafted features that consider the structure of the extracted snippets, and correspondence features obtained by training a neural network model to capture the correlation between NL and code. Experiments using Python and Java as test beds show that the proposed method greatly improves coverage and accuracy over existing mining methods, even when using only a small number of labeled examples.
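The two-feature-set classifier described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the structural features, the toy labeled pairs, and every function name below are hypothetical, and the correspondence feature is a simple token-overlap placeholder standing in for the score a trained neural model would provide. A small logistic regression written in plain Python plays the role of the classifier.

```python
import math
import re

def structural_features(snippet, is_full_block):
    """Hypothetical hand-crafted features describing snippet structure."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    return [
        1.0 if is_full_block else 0.0,    # snippet is an entire code block
        1.0 if len(lines) == 1 else 0.0,  # snippet is a single line
        1.0 if lines and lines[0].startswith(("import ", "from ")) else 0.0,
    ]

def correspondence_feature(intent, snippet):
    """Placeholder for a neural NL-code correlation score (token overlap)."""
    nl_tokens = set(re.findall(r"[a-z]+", intent.lower()))
    code_tokens = set(re.findall(r"[a-z]+", snippet.lower()))
    return len(nl_tokens & code_tokens) / len(nl_tokens) if nl_tokens else 0.0

def featurize(intent, snippet, is_full_block):
    """Concatenate both feature sets into one vector."""
    return structural_features(snippet, is_full_block) + [
        correspondence_feature(intent, snippet)
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability that the (intent, snippet) pair is a valid alignment."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic(X, y, lr=0.5, epochs=200):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Toy labeled candidates: (intent, snippet, is_full_block, label). Invented data.
examples = [
    ("sort a list in reverse", "sorted(items, reverse=True)", True, 1),
    ("sort a list in reverse", "import os", False, 0),
    ("read a file line by line", "for line in open(path): print(line)", True, 1),
    ("read a file line by line", "x = 1", False, 0),
]
X = [featurize(i, s, b) for i, s, b, _ in examples]
y = [label for *_, label in examples]
weights, bias = train_logistic(X, y)
scores = [predict_proba(weights, bias, x) for x in X]
```

In this sketch, candidate pairs whose score exceeds a chosen threshold (0.5, say) would be kept as mined NL-code pairs. Swapping the placeholder in `correspondence_feature` for a learned model's score leaves the rest of the pipeline unchanged.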

Published In

ICSE '18: Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings

May 2018

231 pages

ISBN: 9781450356633

DOI: 10.1145/3183440

Conference Chair: Michel Chaudron (Chalmers University of Technology, University of Gothenburg, Sweden)

General Chair: Ivica Crnkovic (Chalmers University of Technology, University of Gothenburg, Sweden)

Program Chairs: Marsha Chechik (University of Toronto, Canada), Mark Harman (Facebook and University College London, United Kingdom)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

ICSE '18

Sponsors: SIGSOFT, IEEE-CS

ICSE '18: 40th International Conference on Software Engineering

May 27 - June 3, 2018

Gothenburg, Sweden

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
