poster

Learning to mine parallel natural language/source code corpora from stack overflow

Published: 27 May 2018

Abstract

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for creating such a data set, but existing heuristic methods are limited in both their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a method to mine high-quality aligned data from SO by training a classifier using two sets of features: hand-crafted features that consider the structure of the extracted snippets, and correspondence features obtained by training a neural network model to capture the correlation between NL and code. Experiments using Python and Java as test beds show that the proposed method greatly improves coverage and accuracy over existing mining methods, even when using only a small number of labeled examples.
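The two-feature-set idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the specific structural features, the function names, and the toy training pairs are all hypothetical, and the neural correspondence model is replaced here by a plain token-overlap (Jaccard) score as a stand-in. The classifier is a small logistic regression trained with stochastic gradient descent, chosen only to keep the example self-contained.

```python
import math
import re


def tokenize(text):
    # Lowercased identifier-like tokens; a deliberately crude tokenizer.
    return re.findall(r"[A-Za-z_]+", text.lower())


def structural_features(snippet):
    # Hypothetical hand-crafted features describing the snippet's shape.
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    return [
        1.0 if len(lines) == 1 else 0.0,  # single-line snippet
        1.0 if snippet.lstrip().startswith(("import ", "from ")) else 0.0,
        min(len(lines), 10) / 10.0,       # capped, normalized line count
    ]


def correspondence_feature(intent, snippet):
    # Stand-in for the neural correspondence score (assumption):
    # Jaccard overlap between NL tokens and code tokens.
    a, b = set(tokenize(intent)), set(tokenize(snippet))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def featurize(intent, snippet):
    return structural_features(snippet) + [correspondence_feature(intent, snippet)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, epochs=500, lr=0.5):
    # Logistic regression via SGD; w[0] is the bias term.
    xs = [featurize(i, s) for i, s in pairs]
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = p - y
            w[0] -= lr * err
            for k, xk in enumerate(x):
                w[k + 1] -= lr * err * xk
    return w


def score(w, intent, snippet):
    # Estimated probability that the snippet answers the intent.
    x = featurize(intent, snippet)
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


# Toy labeled pairs (invented for illustration, not taken from the paper).
PAIRS = [
    ("reverse a list", "my_list.reverse()"),
    ("reverse a list", "import os"),
    ("read a file", "open(path).read()"),
    ("read a file", "print(x)"),
]
LABELS = [1, 0, 1, 0]
WEIGHTS = train(PAIRS, LABELS)
```

Given a question title and the candidate snippets extracted from its answers, `score(WEIGHTS, title, snippet)` could then be used to rank candidates and keep those above a chosen threshold.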


Published In

ICSE '18: Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings
May 2018
231 pages
ISBN: 9781450356633
DOI: 10.1145/3183440
Conference Chair: Michel Chaudron
General Chair: Ivica Crnkovic
Program Chairs: Marsha Chechik, Mark Harman

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Poster

Conference

ICSE '18
Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
