Naturalness of Natural Language Artifacts in Software

Authors:

Giriprasad Sridhara,

Vibha Singhal Sinha,

Senthil Mani

ISEC '15: Proceedings of the 8th India Software Engineering Conference

Pages 156 - 165

https://doi.org/10.1145/2723742.2723758

Published: 18 February 2015

Abstract

We present a study on the naturalness of the natural language artifacts in software. Naturalness is essentially repetitiveness or predictability. By natural language artifacts, we mean source code comments, revision history messages, bug reports and so on. We measure "naturalness" using a standard measure, cross-entropy (or, equivalently, perplexity), computed from the widely used N-Gram models. Previously, Hindle et al. demonstrated empirically that source code is more repetitive or regular (i.e., more natural) than traditional English text. A question that logically follows from their work is how natural the other artifacts associated with software are. We present our findings on source code comments, commit logs, bug reports, string messages and content from the popular question and answer forum, StackOverflow. Each of the artifacts that we examine is a natural language artifact associated with software. However, they do not all exhibit the same amount of regularity (naturalness). Commit logs were the most regular, followed by string literal messages and source code comments. Content from StackOverflow (viz., titles, questions and answers) behaved like traditional English text, i.e., it showed comparatively less regularity. Bug reports from industrial projects exhibited more regularity than bug reports from open source projects, whose naturalness resembled that of typical English text. Our findings have implications for the feasibility of building tools such as comment and bug report completion engines. We describe a next-word prediction tool that we built using the N-Gram language model. This tool achieved an accuracy ranging from 70 to 90% on commit messages in different projects. It also achieved an accuracy ranging from 56 to 78% on source comments. We also present a part-of-speech-based analysis of the words that are easy to predict and those that are difficult to predict.
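As a rough illustration of the measure named in the abstract, the sketch below trains a small N-Gram model, computes cross-entropy and perplexity on held-out text, and scores next-word prediction. It is not the authors' tool. The bigram order, the add-one smoothing, the whitespace tokenizer, the toy commit messages, and every name (`BigramModel`, `predict_next`, `next_word_accuracy`) are assumptions made for the example, since the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen for simplicity.
    return text.lower().split()


class BigramModel:
    """Bigram (N=2) language model with add-one smoothing (an assumption)."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = defaultdict(Counter)
        self.vocab = {"</s>", "<unk>"}
        for sentence in sentences:
            tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[prev][word] += 1

    def prob(self, prev, word):
        # Words never seen in training are mapped to the <unk> token.
        if word not in self.vocab:
            word = "<unk>"
        count = self.bigram_counts[prev][word]
        return (count + 1) / (self.context_counts[prev] + len(self.vocab))

    def cross_entropy(self, sentences):
        """Average negative log2 probability per token (bits per token)."""
        log_prob, n = 0.0, 0
        for sentence in sentences:
            tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
            for prev, word in zip(tokens, tokens[1:]):
                log_prob += math.log2(self.prob(prev, word))
                n += 1
        return -log_prob / n

    def predict_next(self, prev):
        """Most frequent follower of `prev` in training, or None if unseen."""
        followers = self.bigram_counts.get(prev)
        if not followers:
            return None
        return followers.most_common(1)[0][0]


def next_word_accuracy(model, sentences):
    """Fraction of tokens whose top prediction matches the actual word."""
    hits, total = 0, 0
    for sentence in sentences:
        tokens = ["<s>"] + tokenize(sentence)
        for prev, word in zip(tokens, tokens[1:]):
            hits += model.predict_next(prev) == word
            total += 1
    return hits / total if total else 0.0


# Made-up commit messages, for illustration only.
train = [
    "fix null pointer exception",
    "fix typo in readme",
    "fix null check in parser",
]
held_out = ["fix null pointer in parser"]

model = BigramModel(train)
h = model.cross_entropy(held_out)
print(f"cross-entropy: {h:.3f} bits/token, perplexity: {2 ** h:.3f}")
print(f"next-word accuracy: {next_word_accuracy(model, held_out):.2f}")
```

With cross-entropy H measured in bits per token, perplexity is 2 raised to H, so the two rank corpora identically. A lower value means the held-out text is more predictable under the model, which is the sense of "more natural" used above.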

References

[1] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann. What makes a good bug report? In SIGSOFT FSE, 2008.

[2] T. J. Biggerstaff, B. G. Mitbander, and D. Webster. The concept assignment problem in program understanding. In ICSE, 1993.

[3] D. Binkley, M. Hearn, and D. Lawrie. Improving identifier informativeness using part of speech information. In MSR, 2011.

[4] M. Bruch, M. Monperrus, and M. Mezini. Learning from examples to improve code completion systems. In ESEC/FSE, 2009.

[5] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker. Mining source code to automatically split identifiers for software analysis. 6th Intl Working Conf on Mining Software Repositories, 2009.

[6] S. Han, D. R. Wallace, and R. C. Miller. Code completion of multiple keywords from abbreviated input. Automated Software Engg., 2011.

[7] E. Hill, L. Pollock, and K. Vijay-Shanker. Exploring the neighborhood with dora to expedite software maintenance. In ASE, 2007.

[8] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In ICSE 2012.

[9] E. W. Host and B. M. Ostvold. Debugging method names. In ECOOP 2009, 2009.

[10] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition Second Edition. Prentice Hall PTR, 2008.

[11] R. Robbes and M. Lanza. Improving code completion with program history. Automated Software Engg., 2010.

[12] P. Runeson, M. Alexandersson, and O. Nyholm. In 29th ICSE, Washington, DC, USA, 2007.

[13] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker. Using Natural Language Program Analysis to Locate and Understand Action-oriented Concerns. In AOSD '07: 6th Intl Conf on Aspect-oriented Software Development, 2007.

[14] G. Sridhara. Automatic Generation of Descriptive Summary Comments for Methods in Object-Oriented Programs. PhD thesis, University of Delaware, 2012.

[15] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for java methods. In International conference on Automated software engineering, ASE '10, 2010.

[16] G. Sridhara, E. Hill, L. Pollock, and K. Vijay-Shanker. Identifying Word Relations in Software: A Comparative Study of Semantic Similarity Tools. In 16th IEEE Intl Conf on Program Comprehension, 2008.

[17] G. Sridhara, L. Pollock, and K. Vijay-Shanker. Automatically detecting and describing high level actions within methods. In Intl Conf on Software Engineering (ICSE'11), 2011.

[18] G. Sridhara, L. Pollock, and K. Vijay-Shanker. Generating parameter comments and integrating with method summaries. In Intl Conf on Program Comprehension (ICPC'11), 2011.

[19] K. Trnka, D. Yarrington, J. McCaw, K. F. McCoy, and C. Pennington. The effects of word prediction on communication rate for aac. NAACL-Short '07, 2007.

[20] Wikipedia. Language model. online. http://en.wikipedia.org/wiki/Language_model/.

Cited By

W. Olney, E. Hill, C. Thurber, and B. Lemma. Part of Speech Tagging Java Method Names. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 483-487, Oct. 2016. https://doi.org/10.1109/ICSME.2016.80

Published In

ISEC '15: Proceedings of the 8th India Software Engineering Conference

February 2015

207 pages

ISBN:9781450334327

DOI:10.1145/2723742

General Chairs: Srinivas Padmanabhuni (Infosys Labs), Raghu Nambiar (Siemens)
Program Chairs: Prem Devanbu (University of California, Davis), Murali Krishna Ramanathan (IISc, Bangalore)
Publications Chair: Ashish Sureka (IIIT Delhi)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

In-Cooperation

iSOFT
ACM India

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ISEC '15: 8th India Software Engineering Conference

February 18 - 20, 2015

Bangalore, India

Acceptance Rates

Overall Acceptance Rate 76 of 315 submissions, 24%
