Abstract
Commit messages are important complementary information for understanding code changes. To address the scarcity of such messages, several approaches have been proposed for generating commit messages automatically. However, most of these approaches summarize the changed software entities at a superficial level, without considering the intent behind the code changes (e.g., existing approaches cannot generate a message such as “fixing null pointer exception”). Since developers often describe the intent behind a code change when writing messages, we propose ChangeDoc, an approach that reuses existing messages in version control systems for automatic commit message generation. Our approach considers syntax, semantic, pre-syntax, and pre-semantic similarities. For a given commit without a message, it discovers the most similar past commit in a large commit repository and recommends that commit's message as the message of the given commit. Our repository contains half a million commits collected from SourceForge. We evaluate our approach on commits from 10 projects. The results show that 21.5% of the messages recommended by ChangeDoc can be used directly without modification, and 62.8% require minor modifications. To evaluate the quality of the commit messages recommended by ChangeDoc, we performed two empirical studies involving a total of 40 participants (10 professional developers and 30 students). The results indicate that the recommended messages are very good approximations of the ones written by developers and often include important intent information that is not included in the messages generated by other tools.
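The retrieval idea described in the abstract (find the most similar past commit and reuse its message) can be illustrated with a minimal sketch. This is not ChangeDoc's implementation: the abstract does not define the four similarity measures, so the sketch substitutes a single token-based cosine similarity as a stand-in. The function names (`tokenize`, `cosine`, `recommend_message`) and the two-entry `history` list are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(diff_text):
    """Split a diff into lowercase identifier-like tokens (illustrative only)."""
    return re.findall(r"[A-Za-z_]\w*", diff_text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_message(new_diff, history):
    """Return (message, score) of the past commit most similar to new_diff.

    history is a list of (diff_text, message) pairs. The similarity used
    here is an assumption standing in for the measures named in the abstract.
    """
    query = Counter(tokenize(new_diff))
    best_message, best_score = None, -1.0
    for past_diff, message in history:
        score = cosine(query, Counter(tokenize(past_diff)))
        if score > best_score:
            best_message, best_score = message, score
    return best_message, best_score


# Hypothetical toy repository of past commits.
history = [
    ("+ if (user == null) { return; }", "fixing null pointer exception"),
    ("+ logger.info(\"server started\");", "add startup logging"),
]
message, score = recommend_message("+ if (order == null) { return; }", history)
print(message)
```

In this toy run the new change shares the null-check tokens with the first past commit, so its message is the one recommended. A realistic system would need richer change representations and an index over the repository rather than a linear scan.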
Cite this article
Huang, Y., Jia, N., Zhou, HJ. et al. Learning Human-Written Commit Messages to Document Code Changes. J. Comput. Sci. Technol. 35, 1258–1277 (2020). https://doi.org/10.1007/s11390-020-0496-0
DOI: https://doi.org/10.1007/s11390-020-0496-0