DOI: 10.1145/2723742.2723758
research-article

Naturalness of Natural Language Artifacts in Software

Published: 18 February 2015

Abstract

We present a study on the naturalness of the natural language artifacts in software. Naturalness is essentially repetitiveness or predictability. By natural language artifacts, we mean source code comments, revision history messages, bug reports and so on. We measure "naturalness" using a standard measure, the cross-entropy (or perplexity) of text under the widely used N-gram language models. Previously, Hindle et al. demonstrated empirically that source code is more repetitive or regular (i.e., more natural) than traditional English text. A question that logically follows from their work is how natural the other artifacts associated with software are. We present our findings on source code comments, commit logs, bug reports, string messages and content from the popular question and answer forum StackOverflow. Each artifact that we examine is a natural language artifact associated with software. However, they do not all exhibit the same amount of regularity (naturalness). Commit logs were the most regular, followed by string literal messages and source code comments. Content from StackOverflow (viz., titles, questions and answers) behaved like traditional English text, i.e., it showed comparatively less regularity. Bug reports from industrial projects exhibited more regularity than bug reports from open source projects, whose naturalness resembled that of typical English text. Our findings have implications for the feasibility of building tools such as comment and bug report completion engines. We describe a next-word prediction tool that we built using the N-gram language model. This tool achieved an accuracy ranging from 70 to 90% on commit messages in different projects, and from 56 to 78% on source code comments. We also present a part-of-speech-based analysis of the words that are easy to predict and those that are difficult to predict.
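To make the measure concrete, the following is a minimal sketch of how cross-entropy, perplexity and next-word prediction can be computed from an N-gram model. It is not the authors' implementation: it assumes a bigram model (N = 2), add-one smoothing and whitespace tokenization, none of which are specified in the abstract. The function names and the sample commit messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

BOS = "<s>"  # hypothetical sentence-start marker


def train_bigram(sentences):
    """Count bigrams over lower-cased, whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        tokens = [BOS] + sentence.lower().split()
        vocab.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][word] += 1
    return context_counts, bigram_counts, vocab


def probability(model, prev, word):
    """Add-one smoothed P(word | prev); all unseen words share one extra slot."""
    context_counts, bigram_counts, vocab = model
    v = len(vocab) + 1
    return (bigram_counts[prev][word] + 1) / (context_counts[prev] + v)


def cross_entropy(model, sentences):
    """Average negative log2 probability per token (bits); lower is more regular."""
    total_log_prob, n_tokens = 0.0, 0
    for sentence in sentences:
        tokens = [BOS] + sentence.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            total_log_prob += math.log2(probability(model, prev, word))
            n_tokens += 1
    return -total_log_prob / n_tokens


def predict_next(model, prev):
    """Return the word seen most often after prev, or None if prev was never seen."""
    _, bigram_counts, _ = model
    followers = bigram_counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical commit messages, used only for illustration.
train = [
    "fix typo in readme",
    "fix typo in docs",
    "fix build on windows",
    "update readme",
]
model = train_bigram(train)

h_regular = cross_entropy(model, ["fix typo in readme"])
h_irregular = cross_entropy(model, ["the quick brown fox"])
print("cross-entropy (regular):  ", h_regular, "perplexity:", 2 ** h_regular)
print("cross-entropy (irregular):", h_irregular, "perplexity:", 2 ** h_irregular)
print(predict_next(model, "fix"))  # "typo"
```

Text that repeats patterns seen during training receives a lower cross-entropy than unrelated text, which is the sense in which an artifact is called more "natural". Perplexity is 2 raised to the cross-entropy, so the two measures rank texts identically.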

References

[1] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann. What makes a good bug report? In SIGSOFT FSE, 2008.
[2] T. J. Biggerstaff, B. G. Mitbander, and D. Webster. The concept assignment problem in program understanding. In ICSE, 1993.
[3] D. Binkley, M. Hearn, and D. Lawrie. Improving identifier informativeness using part of speech information. In MSR, 2011.
[4] M. Bruch, M. Monperrus, and M. Mezini. Learning from examples to improve code completion systems. In ESEC/FSE, 2009.
[5] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker. Mining source code to automatically split identifiers for software analysis. In 6th Intl Working Conf on Mining Software Repositories, 2009.
[6] S. Han, D. R. Wallace, and R. C. Miller. Code completion of multiple keywords from abbreviated input. Automated Software Engg., 2011.
[7] E. Hill, L. Pollock, and K. Vijay-Shanker. Exploring the neighborhood with Dora to expedite software maintenance. In ASE, 2007.
[8] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In ICSE, 2012.
[9] E. W. Host and B. M. Ostvold. Debugging method names. In ECOOP, 2009.
[10] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Second Edition. Prentice Hall PTR, 2008.
[11] R. Robbes and M. Lanza. Improving code completion with program history. Automated Software Engg., 2010.
[12] P. Runeson, M. Alexandersson, and O. Nyholm. In 29th ICSE, Washington, DC, USA, 2007.
[13] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker. Using Natural Language Program Analysis to Locate and Understand Action-oriented Concerns. In AOSD '07: 6th Intl Conf on Aspect-oriented Software Development, 2007.
[14] G. Sridhara. Automatic Generation of Descriptive Summary Comments for Methods in Object-Oriented Programs. PhD thesis, University of Delaware, 2012.
[15] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for Java methods. In Intl Conf on Automated Software Engineering (ASE '10), 2010.
[16] G. Sridhara, E. Hill, L. Pollock, and K. Vijay-Shanker. Identifying Word Relations in Software: A Comparative Study of Semantic Similarity Tools. In 16th IEEE Intl Conf on Program Comprehension, 2008.
[17] G. Sridhara, L. Pollock, and K. Vijay-Shanker. Automatically detecting and describing high level actions within methods. In Intl Conf on Software Engineering (ICSE '11), 2011.
[18] G. Sridhara, L. Pollock, and K. Vijay-Shanker. Generating parameter comments and integrating with method summaries. In Intl Conf on Program Comprehension (ICPC '11), 2011.
[19] K. Trnka, D. Yarrington, J. McCaw, K. F. McCoy, and C. Pennington. The effects of word prediction on communication rate for AAC. In NAACL-Short '07, 2007.
[20] Wikipedia. Language model. Online. http://en.wikipedia.org/wiki/Language_model/.

Cited By

  • (2016) Part of Speech Tagging Java Method Names. 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 483-487. DOI: 10.1109/ICSME.2016.80. Online publication date: Oct-2016

Published In

ISEC '15: Proceedings of the 8th India Software Engineering Conference
February 2015
207 pages
ISBN: 9781450334327
DOI: 10.1145/2723742
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • iSOFT
  • ACM India

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ISEC '15: 8th India Software Engineering Conference
February 18 - 20, 2015
Bangalore, India

Acceptance Rates

Overall Acceptance Rate 76 of 315 submissions, 24%
