Analyzing Techniques for Duplicate Question Detection on Q&A Websites for Game Developers

Published: 01 January 2023

Abstract

Game development is currently the largest industry in the entertainment segment and has a high demand for skilled game developers who can produce high-quality games. To meet this demand, game developers need resources that provide the knowledge they need to learn and improve their skills. Question and Answer (Q&A) websites are one such resource and provide a valuable source of knowledge about game development practices. However, duplicate questions on Q&A websites hinder their ability to provide information to their users effectively. While several researchers have created and analyzed techniques for duplicate question detection on websites such as Stack Overflow, no studies so far have explored how well those techniques work on Q&A websites for game development. In this paper, we analyze how pre-trained and unsupervised techniques can be used to detect duplicate questions on Q&A websites focused on game development, using data extracted from the Game Development Stack Exchange and Stack Overflow. We also explore how a small set of labelled data can be leveraged to improve the performance of those techniques. The pre-trained technique based on MPNet achieved the highest results in identifying duplicate questions about game development, and we achieved better performance by combining multiple unsupervised techniques into a single supervised model. Furthermore, the supervised models could identify duplicate questions on websites different from those they were trained on with little to no decrease in performance. Our results lay the groundwork for building better duplicate question detection systems on Q&A websites for game developers and, ultimately, for providing game developers with a more effective Q&A community.
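The idea of combining several unsupervised similarity scores into one supervised model can be illustrated with a minimal sketch. It is not the authors' implementation: the two measures (Jaccard and term-frequency cosine similarity), the hand-written logistic regression, the function names, and the four toy question pairs are all assumptions chosen so that the example needs only the Python standard library. The abstract does not say which unsupervised techniques or which supervised learner the paper uses.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity between the token sets of two questions."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between raw term-frequency vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def features(a, b):
    """One feature per unsupervised similarity score."""
    return [jaccard(a, b), cosine(a, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def duplicate_score(model, a, b):
    """Probability-like score that questions a and b are duplicates."""
    w, bias = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features(a, b))) + bias)


# Hypothetical labelled pairs (1 = duplicate, 0 = not duplicate).
LABELLED = [
    ("how do i rotate a sprite in unity", "how to rotate a sprite in unity", 1),
    ("best way to detect collision between two boxes",
     "how to detect collision between two boxes", 1),
    ("how do i rotate a sprite in unity", "what is a good pathfinding algorithm", 0),
    ("how to load textures in opengl", "why is my audio stuttering", 0),
]

X = [features(a, b) for a, b, _ in LABELLED]
y = [label for _, _, label in LABELLED]
model = train_logistic(X, y)
```

Calling `duplicate_score(model, q1, q2)` on a new pair returns a value between 0 and 1 that can be thresholded or used to rank candidate duplicates. In practice the feature list would hold scores from stronger techniques, such as similarities between sentence embeddings, and the learner would come from an established library.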

Cited By

  • (2024) On the Helpfulness of Answering Developer Questions on Discord with Similar Conversations and Posts from the Past. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp 1–13. doi:10.1145/3597503.3623341. Online publication date: 20-May-2024
  • (2024) Quantifying and characterizing clones of self-admitted technical debt in build systems. Empirical Software Engineering 29:2. doi:10.1007/s10664-024-10449-5. Online publication date: 26-Feb-2024

Information

Published In

Empirical Software Engineering, Volume 28, Issue 1
Jan 2023
827 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 January 2023
Accepted: 02 November 2022

Author Tags

  1. Q&A communities
  2. Game development

Qualifiers

  • Research-article
