Abstract
Social coding facilitates knowledge sharing in GitHub projects. In particular, issue reports, an important form of knowledge in software development, usually contain relevant information and can thus be shared and linked in developers’ discussions to aid issue resolution. Linking issues to potentially related issues, i.e., issue knowledge acquisition, would provide developers with more targeted resources and information when they search for and resolve issues. However, identifying and acquiring related issues is generally challenging, because in practice it is time-consuming and depends mainly on the experience and knowledge of individual developers. Acquiring related issues automatically is therefore a meaningful task that can improve the development efficiency of GitHub projects. In this paper, we formulate the problem of acquiring related issue knowledge as a recommendation problem. To solve it, we propose a novel approach, iLinker, which combines an information retrieval technique, i.e., TF-IDF, with deep learning techniques, i.e., word embedding and document embedding. Our evaluation results show that iLinker outperforms the baseline approaches in both coarse-grained and fine-grained recommendation tasks.
Notes
In our dataset, duplicate issues account for less than 20% of the links created by developers. During the analysis, we randomly selected 250 link cases from the data collected for the Request project (with a population of 1,110 and a confidence level of 95%, the confidence interval is \(\simeq\) 5.46) and manually checked the duplicate relationship between the two linked issues, following the strategy used by Ye et al. [31]. The analysis was performed separately by two coders (the first and third authors). The inter-rater agreement between the two coders is almost perfect (Fleiss’s Kappa value [32] is 0.83). All authors reviewed and agreed on the final result.
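The reported interval can be reproduced with a short calculation. The sketch below assumes the standard margin-of-error formula for a proportion with a finite population correction, a worst-case proportion of 0.5, and z = 1.96 for the 95% level. The note does not state which formula was used, so these choices and the function name `margin_of_error` are assumptions made for illustration.

```python
import math


def margin_of_error(sample_size, population, z=1.96, p=0.5):
    """Margin of error in percentage points, with finite population correction.

    Assumes simple random sampling without replacement; p=0.5 is the
    worst-case proportion and z=1.96 corresponds to a 95% confidence level.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    # The correction shrinks the error when the sample is a large share of the population.
    correction = math.sqrt((population - sample_size) / (population - 1))
    return 100 * z * standard_error * correction


# Values taken from the note: 250 sampled link cases out of 1,110.
moe = margin_of_error(sample_size=250, population=1110)
print(f"{moe:.2f}")
```

Under these assumptions the result rounds to 5.46, which agrees with the value given in the note.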
In this study, we use the Lancaster stemmer as implemented in NLTK, because it works very well in Python programs and is a very aggressive stemming algorithm with the fastest processing speed. It greatly reduces our working set of words, which helps GitHub projects train on issue data quickly and build practical tools.
In our study, for each query issue, we calculate its metric values for NextBug and iLinker. We compute p-value and Cliff’s delta based on all query issues. We use Bonferroni correction to counteract the impact of multiple hypothesis tests.
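As a rough illustration of two of the statistics named above, the following sketch computes Cliff’s delta for two samples and applies a Bonferroni correction to a list of p-values. The metric values and p-values are made up for demonstration and are not results from the study, and the function names `cliffs_delta` and `bonferroni` are hypothetical. The p-values themselves would come from a separate significance test, which is not implemented here.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (|xs| * |ys|)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def bonferroni(p_values, alpha=0.05):
    """Multiply each p-value by the number of tests (capped at 1.0) and
    flag which adjusted values remain below alpha."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


# Hypothetical per-query metric values for two approaches (not real data).
approach_a = [0.8, 0.7, 0.9, 0.6]
approach_b = [0.5, 0.6, 0.4, 0.7]
delta = cliffs_delta(approach_a, approach_b)

# Hypothetical raw p-values from three hypothesis tests (not real data).
corrected = bonferroni([0.01, 0.04, 0.001])
significant = [flag for _, flag in corrected]

print(delta)
print(significant)
```

Cliff’s delta ranges from -1 to 1; values near 1 mean that observations in the first sample almost always exceed those in the second. The Bonferroni step is deliberately conservative, since it scales every p-value by the total number of tests.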
For each group, the Wilcoxon test results and Cliff’s delta confirm that their differences are significant and substantial.
References
Dabbish, L., Stuart, C., Tsay, J., Herbsleb, J.: Social Coding in Github: Transparency and Collaboration in an Open Software Repository. In: CSCW, pp. 1277–1286. ACM (2012)
Zhang, Y., Wang, H., Yin, G., et al.: Social media in GitHub: the role of @-mention in assisting software development. Sci. China Inf. Sci. 60(3), 032102 (2017)
Gharehyazie, M., Ray, B., Filkov, V.: Some from Here, Some from There: Cross-Project Code Reuse in Github. In: MSR, pp. 291–301. IEEE (2017)
Sun, C., Lo, D., Khoo, S. -C., Jiang, J.: Towards More Accurate Retrieval of Duplicate Bug Reports. In: ASE, pp. 253–262. IEEE (2011)
Zhou, J., Zhang, H., Lo, D.: Where Should the Bugs Be Fixed? More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports. In: ICSE, pp. 14–24. IEEE (2012)
Rocha, H., Valente, M. T., Marques-Neto, H., Murphy, G. C.: An Empirical Study on Recommendations of Similar Bugs. In: SANER, pp. 46–56. IEEE (2016)
Le, Q., Mikolov, T.: Distributed Representations of Sentences and Documents. In: ICML, pp. 1188–1196 (2014)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J.: Distributed Representations of Words and Phrases and Their Compositionality. In: NIPS, pp. 3111–3119 (2013)
Xu, B., Ye, D., Xing, Z., Xia, X., Chen, G., Li, S.: Predicting Semantically Linkable Knowledge in Developer Online Forums via Convolutional Neural Network. In: ASE, pp. 51–62. ACM (2016)
Ye, X., Shen, H., Ma, X., Bunescu, R., Liu, C.: From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering. In: ICSE, pp. 404–415. ACM (2016)
Yang, X., Lo, D., Xia, X., Bao, L., Sun, J.: Combining Word Embedding with Information Retrieval to Recommend Similar Bug Reports. In: ISSRE, pp. 127–137. IEEE (2016)
Fan, Y., Xia, X., Lo, D., Hassan, A.E.: Chaff from the wheat: Characterizing and determining valid bug reports. IEEE Transactions on Software Engineering (2018)
Li, L., Ren, Z., Li, X., Zou, W., Jiang, H.: How are Issue Units Linked? Empirical Study on the Linking Behavior in GitHub. In: APSEC, pp. 386–395. IEEE (2018)
Zampetti, F., Ponzanelli, L., Bavota, G., Mocci, A., Penta, M. D., Lanza, M.: How Developers Document Pull Requests with External References. In: ICPC, pp. 23–33. IEEE (2017)
Zhang, Y., Yu, Y., Wang, H., Vasilescu, B., Filkov, V.: Within-Ecosystem Issue Linking: a Large-Scale Study of Rails. In: Software Mining, pp. 12–19. ACM (2018)
Zhang, Y., Wu, Y., Wang, T., et al.: A novel approach for recommending semantically linkable issues in GitHub projects. Sci. China Inf. Sci. 62(9), 199105 (2019)
Boisselle, V., Adams, B.: The Impact of Cross-Distribution Bug Duplicates, Empirical Study on Debian and Ubuntu. In: SCAM, pp. 131–140. IEEE (2015)
Blei, D. M., Ng, A. Y., Jordan, M. I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
Dai, A. M., Olah, C., Le, Q. V.: Document embedding with paragraph vectors. arXiv:1507.07998 (2015)
Crowston, K., Scozzi, B.: Bug fixing practices within free/libre open source software development teams (2008)
Jeong, G., Kim, S., Zimmermann, T.: Improving Bug Triage with Bug Tossing Graphs. In: ESEC/FSE, pp. 111–120. ACM (2009)
Xia, X., Lo, D., Ding, Y., Al-Kofahi, J. M., Nguyen, T. N., Wang, X.: Improving automated bug triaging with specialized topic model. IEEE Trans. Softw. Eng. 43(3), 272–297 (2017)
Yan, M., Zhang, X., Yang, D., Xu, L., Kymer, J. D.: A component recommender for bug reports using Discriminative Probability Latent Semantic Analysis. Inf. Softw. Technol. 73, 37–51 (2016)
Anvik, J., Hiew, L., Murphy, G. C.: Who Should Fix This Bug? In: ICSE, pp. 361–370. ACM (2006)
Guo, P. J., Zimmermann, T., Nagappan, N., Murphy, B.: Characterizing and Predicting Which Bugs Get Fixed: an Empirical Study of Microsoft Windows. In: ICSE, pp. 495–504. IEEE (2010)
Bachmann, A., Bird, C., Rahman, F., Devanbu, P., Bernstein, A.: The Missing Links: Bugs and Bug-Fix Commits. In: FSE, pp. 97–106. ACM (2010)
Ye, X., Bunescu, R., Liu, C.: Learning to Rank Relevant Files for Bug Reports Using Domain Knowledge. In: FSE, pp. 689–699. ACM (2014)
Zhang, Y., Yin, G., Wang, T., Yu, Y., Wang, H.: Evaluating Bug Severity Using Crowd-Based Knowledge: an Exploratory Study. In: Internetware, pp. 70–73. ACM (2015)
Wang, X., Zhang, L., Xie, T., Anvik, J., Sun, J.: An Approach to Detecting Duplicate Bug Reports Using Natural Language and Execution Information. In: ICSE, pp. 461–470. IEEE (2008)
Ye, D., Xing, Z., Kapre, N.: The structure and dynamics of knowledge network in domain-specific q&a sites: a case study of stack overflow. Empir. Softw. Eng. 22(1), 375–406 (2017)
Landis, J. R., Koch, G. G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)
Paice, C.: A Word Stemmer Based on the Lancaster Stemming Algorithm. In: ACM SIGIR, pp. 56–61 (1990)
Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: IJCAI, pp. 1137–1145 (1995)
Hindle, A., Alipour, A., Stroulia, E.: A contextual approach towards more accurate duplicate bug report detection and ranking. Empir. Softw. Eng. 21(2), 368–410 (2016)
Thung, F., Kochhar, P. S., Lo, D.: Dupfinder: Integrated Tool Support for Duplicate Bug Report Detection. In: ASE, pp. 871–874. ACM (2014)
Tian, Y., Sun, C., Lo, D.: Improved Duplicate Bug Report Identification. In: CSMR, pp. 385–390. IEEE (2012)
Zhang, Y., Lo, D., Xia, X., Sun, J.-L.: Multi-factor duplicate question detection in stack overflow. J. Comput. Sci. Technol. 30(5), 981–997 (2015)
Zhang, W. E., Sheng, Q. Z., Tang, Z., Ruan, W.: Related Or Duplicate: Distinguishing Similar CQA Questions via Convolutional Neural Networks. In: SIGIR, pp. 1153–1156. ACM (2018)
Acknowledgements
We thank the anonymous reviewers for their insightful comments on earlier versions of this paper. This work was supported by A New Generation of Artificial Intelligence 2030 Program (Grant No.2018AAA0102304), National Grand R&D Plan (Grant No. 2018YFB1003903), and National Natural Science Foundation of China (Grant No. 61432020).
Cite this article
Zhang, Y., Wu, Y., Wang, T. et al. iLinker: a novel approach for issue knowledge acquisition in GitHub projects. World Wide Web 23, 1589–1619 (2020). https://doi.org/10.1007/s11280-019-00770-1
DOI: https://doi.org/10.1007/s11280-019-00770-1