Abstract
Software developers draw on many sources of knowledge, such as forums, question-and-answer sites, and social media platforms, to help them with various tasks. Extracting software-related knowledge from these platforms poses many challenges. In this paper, we propose an approach that improves the effectiveness of knowledge extraction tasks through cross-platform analysis. Our approach is based on transfer representation learning and word embedding: it leverages information extracted from a source platform that contains rich domain-related content, and uses that information to solve tasks on another platform (the target platform) with less domain-related content. We first build a word embedding model as a representation learned from the source platform, and then use the model to improve the performance of knowledge extraction tasks on the target platform. We experiment with Software Engineering Stack Exchange and Stack Overflow as source platforms, and with two target platforms, Twitter and YouTube. Our experiments show that our approach improves on the performance of existing work for the tasks of identifying software-related tweets and helpful YouTube comments.
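The two-step idea above (learn a word representation on the source platform, then reuse it on the target platform) can be illustrated with a minimal sketch. This is not the authors' implementation: plain co-occurrence counts stand in for a trained word embedding model, a cosine score stands in for the downstream classifier, and the corpus and all names (`SOURCE_POSTS`, `build_embeddings`, `source_centroid`, `relatedness`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy "source platform" corpus (assumption, not the paper's data).
SOURCE_POSTS = [
    "how to fix null pointer exception in java code",
    "python code throws exception when parsing json",
    "java compiler error in generic class code",
    "debug python function that returns wrong json",
]


def build_embeddings(sentences, window=2):
    """Learn one sparse vector per word from co-occurrence counts.

    Stand-in for training a word embedding model on the source platform.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return dict(vectors)


def embed_text(text, embeddings):
    """Represent a target-platform text as the sum of its known word vectors."""
    total = Counter()
    for token in text.lower().split():
        total.update(embeddings.get(token, {}))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def source_centroid(embeddings):
    """Sum of all source word vectors, used as a rough 'software' direction."""
    centroid = Counter()
    for vector in embeddings.values():
        centroid.update(vector)
    return centroid


def relatedness(text, embeddings, centroid):
    """Score a target text against the source-platform representation."""
    return cosine(embed_text(text, embeddings), centroid)


embeddings = build_embeddings(SOURCE_POSTS)
centroid = source_centroid(embeddings)
for tweet in ["java exception again", "lovely sunny weather at the beach today"]:
    print(tweet, "->", round(relatedness(tweet, embeddings, centroid), 3))
```

In this sketch a target text containing no words seen on the source platform scores exactly zero, while one that shares vocabulary with it scores above zero. A real pipeline would more plausibly feed the embedded text into a trained classifier rather than compare it with a single centroid.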
Acknowledgments
This research is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its International Research Centres in Singapore Funding Initiative.
Additional information
Communicated by: Denys Poshyvanyk
About this article
Cite this article
Sulistya, A., Prana, G.A.A., Sharma, A. et al. SIEVE: Helping developers sift wheat from chaff via cross-platform analysis. Empir Software Eng 25, 996–1030 (2020). https://doi.org/10.1007/s10664-019-09775-w
DOI: https://doi.org/10.1007/s10664-019-09775-w