research-article

Near-duplicate detection for web-forums

Authors:

Klemens Muthmann,

Wojciech M. Barczyński,

Alexander Löser

IDEAS '09: Proceedings of the 2009 International Database Engineering & Applications Symposium

Pages 142 - 151

https://doi.org/10.1145/1620432.1620447

Published: 16 September 2009

Abstract

Current forum search technologies lack the ability to identify threads with near-duplicate content and to group these threads in the search results. As a result, forum users are overloaded with duplicated search results and prefer to create new threads without trying to find existing ones. In this paper we therefore identify common reasons leading to near-duplicates and develop a new near-duplicate detection algorithm for forum threads. The algorithm is implemented using a large case study of a real-world forum serving more than one million users. We compare this work with current algorithms, similar to [4, 5], for detecting near-duplicates on machine-generated web pages. Our preliminary results show that we significantly outperform these algorithms and that we are able to group forum threads with a precision of 74%.
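For orientation, the kind of baseline the abstract compares against can be sketched as word-shingle resemblance in the spirit of [4]. This is only an illustrative sketch, not the algorithm developed in the paper: the function names, the shingle width `w = 3`, the tokenization rule, and the 0.5 threshold are all assumptions chosen for the example.

```python
from __future__ import annotations

import re


def shingles(text: str, w: int = 3) -> set[tuple[str, ...]]:
    """Return the set of contiguous w-word shingles of a text.

    Tokenization (lowercase alphanumeric runs) is an assumption made
    for this sketch; texts shorter than w words yield one short shingle.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard coefficient of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a: str, b: str, threshold: float = 0.5, w: int = 3) -> bool:
    """Flag a pair as near-duplicate when resemblance reaches the threshold.

    The threshold value is hypothetical and would need tuning on real data.
    """
    return resemblance(a, b, w) >= threshold
```

Two thread texts that differ only in case or punctuation share all shingles under this tokenization and so have resemblance 1.0. Forum threads that ask the same question in different words would share few shingles, which hints at why approaches designed for machine-generated pages may fall short on forum content.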

References

[1]

E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. Finding high quality content in social media, with an application to community-based question answering. In Proceedings of ACM WSDM, pages 183--194, Stanford, CA, USA, February 2008. ACM Press.

[2]

R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999.

[3]

W. M. Barczynski, F. Brauer, A. Loeser, and A. Mocan. Algebraic information extraction of enterprise data: Methodology and operators. In IK-KR Workshop at IJCAI 2009 (to be published), 2009.

[4]

A. Z. Broder. Identifying and filtering near-duplicate documents. In CPM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, pages 1--10, London, UK, 2000. Springer-Verlag.

[5]

M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC '02: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380--388, New York, NY, USA, 2002. ACM.

[6]

A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2):171--191, 2002.

[7]

X. Z. Fern and W. Lin. Cluster ensemble selection. In SDM, pages 787--797. SIAM, 2008.

[8]

A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases, pages 518--529, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[9]

M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 284--291, New York, NY, USA, 2006. ACM.

[10]

U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1--10, San Francisco, CA, USA, January 1994.

[11]

R. Ramakrishnan and A. Tomkins. Toward a peopleweb. IEEE Computer, 40(8):63--72, 2007.

[12]

M. Theobald, J. Siddharth, and A. Paepcke. Spotsigs: robust and efficient near duplicate detection in large web collections. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 563--570, New York, NY, USA, 2008. ACM.

[13]

T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. In IEEE Data Engineering Bulletin, May 2006.

[14]

W. Xi, E. A. Fox, W. Fan, B. Zhang, Z. Chen, J. Yan, and D. Zhuang. Simfusion: measuring similarity using unified relationship matrix. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 130--137, New York, NY, USA, 2005. ACM.

[15]

S. Ye, R. Song, J.-R. Wen, and W.-Y. Ma. A query-dependent duplicate detection approach for large scale search engines. In APWeb, pages 48--58, 2004.

Index Terms

Near-duplicate detection for web-forums
1. Information systems
  1. Information retrieval

Information & Contributors

Information

Published In

IDEAS '09: Proceedings of the 2009 International Database Engineering & Applications Symposium

September 2009

347 pages

ISBN:9781605584027

DOI:10.1145/1620432

General Chair:
Bipin C. Desai, Concordia University, Montreal, Canada

Program Chairs:
Domenico Sacca, Universita della Calabria, Rende, Italy
Sergio Greco, Universita della Calabria, Rende, Italy

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery
ICAR-CNR, Rende (CS), Italy
Universita della Calabria, Rende (CS), Italy
BytePress
Concordia University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Seventh Framework Programme

Conference

IDEAS '09

Sponsor:

ACM
Concordia University

IDEAS '09: Thirteenth International Database Engineering & Applications Symposium

September 16 - 18, 2009

Cetraro - Calabria, Italy

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%

Bibliometrics & Citations

Bibliometrics

Total Citations: 9
Total Downloads: 313
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Dec 2024

Cited By

P. Roy, S. Saumya, J. Singh, S. Banerjee, and A. Gutub. Analysis of community question-answering issues via machine learning and deep learning. CAAI Transactions on Intelligence Technology, 8(1):95--117, 2022. Online publication date: 4-May-2022.
https://dl.acm.org/doi/10.1049/cit2.12081

C. Sillaber and R. Breu. Improving Near-Duplicate Detection in Multi-Layered Collaborative Requirements Engineering Discussions Through Discussion Clustering. In The 8th International Conference on Knowledge Management in Organizations, pages 249--261, 2013. Online publication date: 6-Sep-2013.
https://doi.org/10.1007/978-94-007-7287-8_20

L. Wang, S. Kim, and T. Baldwin. The Utility of Discourse Structure in Forum Thread Retrieval. In Information Retrieval Technology, pages 284--295, 2013. Online publication date: 2013.
https://doi.org/10.1007/978-3-642-45068-6_25

Q. Zhang, Y. Wu, Z. Ding, X. Huang, W. Hersh, J. Callan, Y. Maarek, and M. Sanderson. Learning hash codes for efficient content reuse detection. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 405--414, 2012. Online publication date: 12-Aug-2012.
https://dl.acm.org/doi/10.1145/2348283.2348339

Y. Wei, S. Wang, C. Yuan, and Y. Huang. Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages. In Proceedings of the 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 523--528, 2012. Online publication date: 14-Dec-2012.
https://dl.acm.org/doi/10.1109/PDCAT.2012.108

L. Wang, M. Lui, S. Kim, J. Nivre, T. Baldwin, P. Merlo, R. Barzilay, and M. Johnson. Predicting thread discourse structure over technical web forums. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 13--25, 2011. Online publication date: 27-Jul-2011.
https://dl.acm.org/doi/10.5555/2145432.2145435

M. Tsagkias, M. de Rijke, W. Weerkamp, W. Ma, J. Nie, R. Baeza-Yates, T. Chua, and W. Croft. Hypergeometric language models for republished article finding. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 485--494, 2011. Online publication date: 24-Jul-2011.
https://dl.acm.org/doi/10.1145/2009916.2009983

Q. Zhang, Y. Zhang, H. Yu, X. Huang, F. Crestani, S. Marchand-Maillet, H. Chen, E. Efthimiadis, and J. Savoy. Efficient partial-duplicate detection based on sequence matching. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 675--682, 2010. Online publication date: 19-Jul-2010.
https://dl.acm.org/doi/10.1145/1835449.1835562

F. Brauer, M. Huber, G. Hackenbroich, U. Leser, F. Naumann, W. Barczynski, M. Rappa, P. Jones, J. Freire, and S. Chakrabarti. Graph-based concept identification and disambiguation for enterprise search. In Proceedings of the 19th international conference on World wide web, pages 171--180, 2010. Online publication date: 26-Apr-2010.
https://dl.acm.org/doi/10.1145/1772690.1772709
