SIEVE: Helping developers sift wheat from chaff via cross-platform analysis

Agus Sulistya ORCID: orcid.org/0000-0002-2835-5781¹,
Gede Artha Azriadi Prana²,
Abhishek Sharma²,
David Lo² &
…
Christoph Treude³

543 Accesses
6 Citations
Explore all metrics

Abstract

Software developers have benefited from various sources of knowledge such as forums, question-and-answer sites, and social media platforms to help them in various tasks. Extracting software-related knowledge from different platforms involves many challenges. In this paper, we propose an approach to improve the effectiveness of knowledge extraction tasks by performing cross-platform analysis. Our approach is based on transfer representation learning and word embedding, leveraging information extracted from a source platform which contains rich domain-related content. The information extracted is then used to solve tasks in another platform (considered as target platform) with less domain-related content. We first build a word embedding model as a representation learned from the source platform, and use the model to improve the performance of knowledge extraction tasks in the target platform. We experiment with Software Engineering Stack Exchange and Stack Overflow as source platforms, and two different target platforms, i.e., Twitter and YouTube. Our experiments show that our approach improves performance of existing work for the tasks of identifying software-related tweets and helpful YouTube comments.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

Analyzing and Categorization Developer Intent on Twitch Live Chat

Article 26 September 2024

DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Article Open access 04 February 2022

Mining crowd sourcing repositories for open innovation in software engineering

Article 08 January 2024

Notes

References

Achananuparp P, Lubis IN, Tian Y, Lo D, Lim E-P (2012) Observatory of trends in software related microblogs. In: 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. IEEE, pp 334–337
Andrews JTA, Tanay T, Morton EJ, Griffin LD (2016) Transfer representation-learning for anomaly detection. In: Anomaly detection workshop. ICML
Aniche M, Treude C, Steinmacher I, Wiese I, Pinto G, Storey M-A, Gerosa M A (2018) How modern news aggregators help development communities shape and share knowledge. In: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, pp 499–510
Asaduzzaman M, Mashiyat AS, Roy CK, Schneider KA (2013) Answering questions about unanswered questions of stack overflow. In: Proceedings of the 10th Working Conference on Mining Software Repositories. IEEE Press, pp 97–100
Azad S, Rigby PC, Guerrouj L (2017) Generating api call rules from version history and stack overflow posts. ACM Transactions on Software Engineering and Methodology (TOSEM)
Bacchelli A, Sasso TD, D’Ambros M, Lanza M (2012a) Content classification of development emails. In: 2012 34Th international conference on software engineering (ICSE). IEEE, pp 375–385
Bacchelli A, Ponzanelli L, Lanza M (2012b) Harnessing stack overflow for the ide. In: Proceedings of the Third International Workshop on Recommendation Systems for Software Engineering. IEEE Press, pp 26–30
Barua A, Thomas SW, Hassan AE (2014) What are developers talking about? an analysis of topics and trends in stack overflow. Empir Softw Eng 19(3):619–654
Article Google Scholar
Begel Andrew, Bosch Jan (2013) Social networking meets software development: Perspectives from github, msdn, stack exchange, and topcoder. IEEE Softw 30(1):52–66
Article Google Scholar
Bengio Y, Courville A (2013) Representation learning: A review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
Article Google Scholar
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146
Article Google Scholar
Bougie G, Starke J, Storey M-A, German DM (2011) Towards understanding twitter use in software engineering: preliminary findings, ongoing challenges and future questions. In: Proceedings of the 2nd international workshop on Web 2.0 for software engineering, pp 31–36
Cai X, Zhu J, Shen B, Chen Y (2016) Greta: Graph-based tag assignment for github repositories. In: Computer software and applications conference (COMPSAC), 2016 IEEE 40th annual. IEEE, vol 1, pp 63–72
Calefato F, Lanubile F, Maiorano F, Novielli N (2018) Sentiment polarity detection for software development. Empir Softw Eng 23(3):1352–1382
Article Google Scholar
Chen C, Sa G, Xing Z (2016a) Mining analogical libraries in q&a discussions–incorporating relational and categorical knowledge into word embedding. In: 2016 IEEE 23rd international conference on Software analysis, evolution, and reengineering (SANER). IEEE, vol 1, pp 338–348
Chen C, Xing Z (2016b) Similartech: automatically recommend analogical libraries across different programming languages. In: 2016 31st IEEE/ACM international conference on Automated software engineering (ASE). IEEE, pp 834–839
Chen G, Chen C, Xing Z, Xu B (2016c) Learning a dual-language vector space for domain-specific cross-lingual question retrieval. In: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, pp 744–755
Chen C, Xing Z, Wang X (2017) Unsupervised software-specific morphological forms inference from informal discussions. In: Proceedings of the 39th International Conference on Software Engineering. IEEE Press, pp 450–461
Chen C, Xing Z, Liu Y (2018) By the community & for the community: A deep learning approach to assist collaborative editing in q&a sites. In: Proceedings of the 21st ACM Conference on Computer-Supported Cooperative Work and Social Computing. ACM, pp 32:1–32:21
Chenail RJ (2008) Youtube as a qualitative research asset: Reviewing user generated videos as learning resources. Q Rep 13(3):18–24
Google Scholar
De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2014) Labeling source code with information retrieval methods: an empirical study. Empir Softw Eng 19(5):1383–1420
Article Google Scholar
El Mezouar M, Zhang F, Zou Y (2018) Are tweets useful in the bug fixing process? an empirical study on firefox and chrome. Empir Softw Eng 23(3):1704–1742
Article Google Scholar
Guo J, Cheng J, Cleland-Huang J (2017) Semantically enhanced software traceability using deep learning techniques. In: 2017 IEEE/ACM 39th international conference on Software engineering (ICSE). IEEE, pp 3–14
Guzman E, Alkadhi R, Seyff N (2016) A needle in a haystack What do twitter users say about software?. In: 2016 IEEE 24th international Requirements engineering conference (RE). IEEE, pp 96–105
Guzman E, Alkadhi R, Seyff N (2017a) An exploratory study of twitter messages about software applications. Requir Eng 22(3):387–412
Article Google Scholar
Guzman E, Ibrahim M, Glinz Martin (2017b) A little bird told me: mining tweets for requirements and software evolution. In: 2017 IEEE 25Th international requirements engineering conference (RE). IEEE, pp 11–20
Hindle A, Onuczko C (2019) Preventing duplicate bug reports by continuously querying bug reports. Empir Softw Eng 24(2):902–936
Article Google Scholar
Johnson R, Zhang T (2015) Semi-supervised convolutional neural networks for text categorization via region embedding. In: Advances in neural information processing systems, pp 919–927
Johnson R, Zhang T (2016) Supervised and semi-supervised text categorization using lstm for region embeddings. arXiv:1602.02373
Kenter T, De Rijke M (2015) Short text similarity with word embeddings. In: Proceedings of the 24th ACM international on conference on information and knowledge management. ACM, pp 1411–1420
Kwak H, Lee C, Park H, Moon SB (2010) What is twitter, a social network or a news media?. In: Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, pp 591–600
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. biometrics pp 159–174
Lee JY, Dernoncourt F, Szolovits P (2017) Transfer learning for named-entity recognition with neural networks. arXiv:1705.06273
Maalej W, Tiarks R, Roehm T, Koschke R (2014) On the comprehension of program comprehension. ACM Trans Softw Eng Methodol (TOSEM) 23(4):31
Article Google Scholar
MacLeod L, Storey M-A, Bergen A (2015) Code, camera, action: how software developers document and share program knowledge using youtube. In: Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension. IEEE Press, pp 104–114
MacLeod L, Bergen A, Storey M-A (2017) Documenting and sharing software knowledge using screencasts. Empir Softw Eng 22(3):1478–1507
Article Google Scholar
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. arXiv:1301.3781
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Mou L, Meng Z, Yan R, Li G, Xu Y, Lu Z, Jin Z (2016) How transferable are neural networks in nlp applications?. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp 479–489
Ott J, Atchison A, Harnack P, Best N, Anderson H, Firmani C, Linstead E (2018) Learning lexical features of programming languages from imagery using convolutional neural networks. In: Proceedings of the 26th Conference on Program Comprehension, ICPC ’18. ACM, New York, pp 336–339
Palomba F, Panichella A, De Lucia A, Oliveto R, Zaidman A (2016) A textual-based technique for smell detection. In: 2016 IEEE 24Th international conference on program comprehension (ICPC). IEEE, pp 1–10
Pan SJ, Yang Q, et al. (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359
Article Google Scholar
Parnin C, Treude C (2011) Measuring api documentation on the web. In: Proceedings of the 2nd international workshop on Web 2.0 for software engineering. ACM, pp 25–30
Parnin C, Treude C, Grammel L (2012) Crowd documentation: Exploring the coverage and the dynamics of api discussions on stack overflow. Georgia Institute of Technology, Technical Report
Parra E, Escobar-Avila J, Haiduc S (2018) Automatic tag recommendation for software development video tutorials. In: Proceedings of the 26th Conference on Program Comprehension. ACM, pp 222–232
Pennington J, Socher R (2014) Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Poché E, Jha N, Williams G, Staten J, Vesper M, Mahmoud A (2017) Analyzing user comments on youtube coding tutorial videos. In: Proceedings of the 25th International Conference on Program Comprehension. IEEE Press, pp 196–206
Ponzanelli L, Bacchelli A, Lanza M (2013) Seahawk: Stack overflow in the ide. In: Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, pp 1295–1298
Ponzanelli L, Mocci A, Bacchelli A, Lanza M (2014a) Understanding and classifying the quality of technical forum questions. In: Quality software (QSIC), 2014 14th international conference on. IEEE, pp 343–352
Ponzanelli L, Mocci A, Bacchelli A, Lanza M, Fullerton D (2014b) Improving low quality stack overflow post detection. In: 2014 IEEE international conference on Software maintenance and evolution (ICSME). IEEE, pp 541–544
Ponzanelli L, Bavota G, Mocci A, Di Penta M, Oliveto R, Hasan M, Russo B, Haiduc S, Lanza M (2016a) Too long; didn’t watch!: extracting relevant fragments from software development video tutorials. In: Proceedings of the 38th International Conference on Software Engineering. ACM, pp 261–272
Ponzanelli L, Bavota G, Mocci A, Di Penta M, Oliveto R, Russo B, Haiduc S, Lanza M (2016b) Codetube: extracting relevant fragments from software development video tutorials. In: Proceedings of the 38th International Conference on Software Engineering Companion. ACM, pp 645–648
Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
Article Google Scholar
Posnett D, Warburg E, Devanbu P, Filkov V (2012) Mining stack exchange: Expertise is evident from initial contributions. In: 2012 international conference on Social informatics (socialinformatics). IEEE, pp 199–204
Prasetyo PK, Lo D, Achananuparp P, Tian Y, Lim E-P (2012) Automatic classification of software related microblogs. In: 2012 28th IEEE international conference on Software maintenance (ICSM). IEEE, pp 596–599
Rahman MM, Roy CK (2015) An insight into the unresolved questions at stack overflow. In: Proceedings of the 12th Working Conference on Mining Software Repositories. IEEE Press, pp 426–429
Rahman MM, Roy CK (2017) Strict: Information retrieval based search term identification for concept location. In: 2017 IEEE 24Th international conference on software analysis, evolution and reengineering (SANER). IEEE, pp 79–90
Semwal T, Yenigalla P, Mathur G, Nair SB (2018) A practitioners’ guide to transfer learning for text classification using convolutional neural networks. In: Proceedings of the 2018 SIAM International Conference on Data Mining. SIAM, pp 513–521
Sharma A, Tian Y, Lo D (2015a) Nirmal: Automatic identification of software relevant tweets leveraging language model. In: 2015 IEEE 22nd international conference on Software analysis, evolution and reengineering (SANER). IEEE, pp 449–458
Sharma A, Tian Y, Lo D (2015b) What’s hot in software engineering twitter space?. In: 2015 IEEE international conference on Software maintenance and evolution (ICSME). IEEE, pp 541–545
Sharma A, Tian Y, Sulistya A, Lo D, Yamashita AF (2017c) Harnessing twitter to support serendipitous learning of developers. In: 2017 IEEE 24th international conference on Software analysis, evolution and reengineering (SANER). IEEE, pp 387–391
Sharma A, Tian Y, Sulistya A, Wijedasa D, Lo D (2018) Recommending who to follow in the software engineering twitter space. ACM Trans Softw Eng Methodol 27(4):16:1–16:33
Article Google Scholar
Singer L, Filho FF, Storey M-A (2014) Software engineering at the speed of light: how developers stay current using twitter. In: Proceedings of the 36th International Conference on Software Engineering. ACM, pp 211–221
Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng A, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 1631–1642
StackExchange (2019) About software engineering stack exchange. [Online; accessed 16-April-2019]
Stolcke A (2002) Srilm-an extensible language modeling toolkit. In: Seventh international conference on spoken language processing
Storey M-A, Singer L, Cleary B, Filho FF, Zagalsky A (2014) The (r) evolution of social media in software engineering. In: Proceedings of the on Future of Software Engineering. ACM, pp 100–116
Storey M-A, Zagalsky A, Singer L, German D et al (2017) How social and communication channels shape and challenge a participatory culture in software development. IEEE Transactions on Software Engineering, (1):1–1
Tian Y, Achananuparp P, Lubis IN, Lo D, Lim E-P (2012) What does software engineering community microblog about?. In: 2012 9th IEEE working conference on Mining software repositories (MSR). IEEE, pp 247–250
Tian Y, Lo D (2014) An exploratory study on software microblogger behaviors. In: 2014 IEEE 4Th workshop on mining unstructured data. IEEE, pp 1–5
Treude C, Barzilay O, Storey M-A (2011) How do programmers ask and answer questions on the web?: Nier track. In: 2011 33rd international conference on Software engineering (ICSE). IEEE, pp 804–807
Uddin G, Khomh F (2017a) Automatic summarization of api reviews. In: 2017 32Nd IEEE/ACM international conference on automated software engineering (ASE). IEEE, pp 159–170
Uddin Gx, Khomh F (2017b) Opiner: An opinion search and summarization engine for apis. In: 2017 32Nd IEEE/ACM international conference on automated software engineering (ASE). IEEE, pp 978–983
Van Nguyen T, Nguyen AT, Phan HD, Nguyen TD, Nguyen TN (2017) Combining word2vec with revised vector space model for better code retrieval. In: Proceedings of the 39th International Conference on Software Engineering Companion. IEEE Press, pp 183–185
Vasilescu B, Filkov V, Serebrenik A (2013) Stackoverflow and github: Associations between software development and crowdsourced knowledge. In: 2013 international conference on Social computing (socialcom). IEEE, pp 188–195
Vasilescu B, Serebrenik A, Devanbu P, Filkov V (2014) How social q&a sites are changing knowledge sharing in open source software communities. In: Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. ACM, pp 342–354
Wang X, Kuzmickaja I, Stol K-J, Abrahamsson P, Fitzgerald B (2013) Microblogging in open source software development: The case of drupal and twitter, Software. IEEE
Wang S, Lo D, Vasilescu B, Serebrenik A (2014) Entagrec: An enhanced tag recommendation system for software information sites. In: 2014 IEEE international conference on Software maintenance and evolution (ICSME). IEEE, pp 291–300
Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bullet 1 (6):80–83
Article Google Scholar
Williams G, Mahmoud A (2017) Mining twitter feeds for software user requirements. In: 2017 IEEE 25Th international requirements engineering conference (RE). IEEE, pp 1–10
Xia X, Bao L, Lo D, Kochhar PS, Hassan AE, Xing Z (2017) What do developers search for on the web? Empir Softw Eng 22(6):3149–3185
Article Google Scholar
Xu B, Xing Z, Xia X, Lo D, Wang Q, Li S (2016) Domain-specific cross-language relevant question retrieval. In: Proceedings of the 13th International Conference on Mining Software Repositories. ACM, pp 413–424
Xu C, Sun X, Li B, Lu X, Guo H (2018) Mulapi: Improving api method recommendation with api usage location. J Syst Softw 142:195–205
Article Google Scholar
Yadid S, Yahav E (2016) Extracting code from programming tutorial videos. In: Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. ACM, pp 98–111
Yang X, Lo D, Xia X, Bao L, Sun J (2016) Combining word embedding with information retrieval to recommend similar bug reports. In: 2016 IEEE 27th international symposium on Software reliability engineering (ISSRE). IEEE, pp 127–137
Ye D, Xing Z, Kapre N (2017) The structure and dynamics of knowledge network in domain-specific q&a sites: a case study of stack overflow. Empir Softw Eng 22(1):375–406
Article Google Scholar
Ye X, Shen H, Ma X, Bunescu R, Liu C (2016) From word embeddings to document similarities for improved information retrieval in software engineering. In: Proceedings of the 38th international conference on software engineering. ACM, pp 404–415
YouTube (2017) Youtube. [Online; accessed 20-AUG-2018]
Yu J, Qiu M, Jiang J, Huang J, Song S, Chu W, Chen H (2018) Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, pp 682–690
Zhang J, He J, Ren Z, Chen X (2018) Recommending apis for api related questions in stack overflow. IEEE Access 6:6205–6219
Article Google Scholar
Zhao T, Cao Q, Sun Q (2017) An improved approach to traceability recovery based on word embeddings. In: 2017 24th Asia-pacific software engineering conference (APSEC). IEEE, pp 81–89
Zhou P, Liu J, Yang Z, Zhou G (2017) Scalable tag recommendation for software information sites. In: 2017 IEEE 24th international conference on Software analysis, evolution and reengineering (SANER). IEEE, pp 272–282
Zhenchang HL, Han XZ, Li X, Feng Z (2018) Reasoning common software weaknesses via knowledge graph embedding. In: 2018 IEEE 25rd international conference on Software analysis, evolution, and reengineering (SANER)

Download references

Acknowledgments

This research is supported by the National Research Foundation, Prime Ministers Office, Singapore under its International Research Centres in Singapore Funding Initiative.

Author information

Authors and Affiliations

PT Telkom Indonesia, Jakarta, Indonesia
Agus Sulistya
Singapore Management University, Singapore, Singapore
Gede Artha Azriadi Prana, Abhishek Sharma & David Lo
University of Adelaide, Adelaide, Australia
Christoph Treude

Authors

Agus Sulistya
View author publications
You can also search for this author in PubMed Google Scholar
Gede Artha Azriadi Prana
View author publications
You can also search for this author in PubMed Google Scholar
Abhishek Sharma
View author publications
You can also search for this author in PubMed Google Scholar
David Lo
View author publications
You can also search for this author in PubMed Google Scholar
Christoph Treude
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Agus Sulistya.

Additional information

Communicated by: Denys Poshyvanyk

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Sulistya, A., Prana, G.A.A., Sharma, A. et al. SIEVE: Helping developers sift wheat from chaff via cross-platform analysis. Empir Software Eng 25, 996–1030 (2020). https://doi.org/10.1007/s10664-019-09775-w

Download citation

Published: 02 October 2019
Issue Date: January 2020
DOI: https://doi.org/10.1007/s10664-019-09775-w

SIEVE: Helping developers sift wheat from chaff via cross-platform analysis

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Analyzing and Categorization Developer Intent on Twitch Live Chat

DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Mining crowd sourcing repositories for open innovation in software engineering

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

SIEVE: Helping developers sift wheat from chaff via cross-platform analysis

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Analyzing and Categorization Developer Intent on Twitch Live Chat

DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Mining crowd sourcing repositories for open innovation in software engineering

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation