DOI: 10.1145/3038912.3052701
research-article

Detecting Duplicate Posts in Programming QA Communities via Latent Semantics and Association Rules

Published: 03 April 2017

Abstract

Programming community-based question-answering (PCQA) websites such as Stack Overflow enable programmers to find working solutions to their questions. Despite detailed posting guidelines, duplicates of already-answered questions are frequently posted. To tackle this problem, Stack Overflow provides a mechanism for reputable users to manually mark duplicate questions. This is a laborious effort, and it leaves many duplicate questions undetected. Existing duplicate detection methodologies from traditional community-based question-answering (CQA) websites are difficult to adapt directly to PCQA, as PCQA posts often contain source code, which is linguistically very different from natural language. In this paper, we propose a methodology designed for the PCQA domain to detect duplicate questions. We model the detection as a classification problem over question pairs. To extract features for question pairs, our methodology leverages continuous word vectors from the deep learning literature, topic model features, and phrase pairs that co-occur frequently in duplicate questions, mined using machine translation systems. These features capture semantic similarities between questions and produce strong performance for duplicate detection. Experiments on a range of real-world datasets demonstrate that our method works very well, in some cases achieving over 30% improvement compared to state-of-the-art benchmarks. As a product of one of the proposed features, the association score feature, we have mined a set of associated phrases from duplicate questions on Stack Overflow and released the dataset to the public.
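
To make the "classification over question pairs" framing concrete, the following is a minimal sketch of how one question pair might be turned into a small feature dictionary. It is not the paper's feature set: the averaged-word-vector cosine and the token Jaccard overlap are simple stand-ins, and the names `TOY_VECTORS`, `pair_features`, `mean_vector`, `cosine`, and `jaccard`, along with the vector values, are hypothetical and chosen only for illustration. In practice the vectors would come from a trained embedding model, and the resulting features would be fed to a classifier trained on labeled duplicate and non-duplicate pairs.

```python
import math

# Hypothetical 3-dimensional word vectors, for illustration only.
# A real system would load vectors from a trained embedding model.
TOY_VECTORS = {
    "convert": [0.9, 0.1, 0.0],
    "parse":   [0.8, 0.2, 0.1],
    "string":  [0.1, 0.9, 0.0],
    "int":     [0.0, 0.2, 0.9],
    "integer": [0.1, 0.1, 0.9],
    "java":    [0.3, 0.3, 0.3],
}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def mean_vector(tokens, vectors):
    """Average the vectors of known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is missing or all zeros."""
    if a is None or b is None:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(tokens_a, tokens_b):
    """Share of distinct tokens that the two questions have in common."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(q1, q2, vectors=TOY_VECTORS):
    """Build a feature dictionary for one question pair."""
    t1, t2 = tokenize(q1), tokenize(q2)
    return {
        "vector_cosine": cosine(mean_vector(t1, vectors), mean_vector(t2, vectors)),
        "token_jaccard": jaccard(t1, t2),
    }


if __name__ == "__main__":
    print(pair_features("convert string int java", "parse string integer java"))
```

In this toy pair, the two questions share only half of their words, yet their averaged vectors point in nearly the same direction. That gap between lexical overlap and vector similarity is the kind of signal the semantic features described in the abstract are meant to capture.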


Published In

WWW '17: Proceedings of the 26th International Conference on World Wide Web
April 2017
1678 pages
ISBN: 9781450349130

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland

Author Tags

  1. association rules
  2. classification
  3. community-based question answering
  4. latent semantics
  5. question quality

Qualifiers

  • Research-article

Conference

WWW '17
Sponsor:
  • IW3C2

Acceptance Rates

WWW '17 Paper Acceptance Rate 164 of 966 submissions, 17%;
Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 22
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 14 Dec 2024

Cited By

  • (2025) How are discussions linked? A link analysis study on GitHub Discussions. Journal of Systems and Software 219 (112196). DOI: 10.1016/j.jss.2024.112196. Online publication date: Jan-2025
  • (2024) INCEPT: A Framework for Duplicate Posts Classification with Combined Text Representations. ACM Transactions on the Web 18:3 (1-24). DOI: 10.1145/3677322. Online publication date: 15-Jul-2024
  • (2024) Refining GPT-3 Embeddings with a Siamese Structure for Technical Post Duplicate Detection. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (114-125). DOI: 10.1109/SANER60148.2024.00019. Online publication date: 12-Mar-2024
  • (2023) I Know What You Are Searching for: Code Snippet Recommendation from Stack Overflow Posts. ACM Transactions on Software Engineering and Methodology 32:3 (1-42). DOI: 10.1145/3550150. Online publication date: 26-Apr-2023
  • (2023) DupHunter: Detecting Duplicate Pull Requests in Fork-Based Development. IEEE Transactions on Software Engineering 49:4 (2920-2940). DOI: 10.1109/TSE.2023.3235942. Online publication date: 1-Apr-2023
  • (2023) UMTCSF: A Graduate Forum Platform Utilizing an Ensemble Similarity Model. 2023 International Conference on Informatics Engineering, Science & Technology (INCITEST) (1-6). DOI: 10.1109/INCITEST59455.2023.10397021. Online publication date: 25-Oct-2023
  • (2023) A Programming Language Learning Service by Linking Stack Overflow with Textbooks. 2023 IEEE International Conference on Web Services (ICWS) (234-245). DOI: 10.1109/ICWS60048.2023.00043. Online publication date: Jul-2023
  • (2023) Analyzing Techniques for Duplicate Question Detection on Q&A Websites for Game Developers. Empirical Software Engineering 28:1. DOI: 10.1007/s10664-022-10256-w. Online publication date: 1-Jan-2023
  • (2022) An Empirical Evaluation of Document Embeddings and Similarity Metrics for Scientific Articles. Applied Sciences 12:11 (5664). DOI: 10.3390/app12115664. Online publication date: 2-Jun-2022
  • (2022) Towards exploring the code reuse from stack overflow during software development. Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension (548-559). DOI: 10.1145/3524610.3527923. Online publication date: 16-May-2022
