DOI: 10.1145/1620432.1620447
research-article

Near-duplicate detection for web-forums

Published: 16 September 2009

Abstract

Current forum search technologies lack the ability to identify threads with near-duplicate content and to group these threads in the search results. As a result, forum users are overloaded with duplicated search results and prefer to create new threads without trying to find existing ones. In this paper we therefore identify common reasons leading to near-duplicates and develop a new near-duplicate detection algorithm for forum threads. The algorithm is implemented using a large case study of a real-world forum serving more than one million users. We compare this work with current algorithms, similar to [4, 5], for detecting near-duplicates on machine-generated web pages. Our preliminary results show that we significantly outperform these algorithms and that we are able to group forum threads with a precision of 74%.
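
For orientation, the family of baseline algorithms the abstract compares against can be sketched in a few lines. The block below is a minimal illustration of word-level shingling with Jaccard resemblance, in the spirit of [4]. It is not the forum-specific algorithm developed in the paper, and the function names, the shingle width w=3, and the threshold of 0.5 are assumptions chosen only for the example.

```python
from __future__ import annotations


def shingles(text: str, w: int = 3) -> set[tuple[str, ...]]:
    """Return the set of contiguous w-word shingles of a text.

    Tokenization here is plain lowercasing and whitespace splitting,
    which is an assumption made for brevity.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Texts shorter than w words yield one shingle holding all tokens.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard resemblance |A n B| / |A u B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str,
                      w: int = 3, threshold: float = 0.5) -> bool:
    """Flag two texts as near-duplicates when their resemblance meets the threshold.

    The threshold value is hypothetical and would need tuning on real data.
    """
    return jaccard(shingles(text_a, w), shingles(text_b, w)) >= threshold


if __name__ == "__main__":
    a = "how do I reset my password on the forum"
    b = "how do I reset my password on this forum"
    print(jaccard(shingles(a), shingles(b)))
    print(is_near_duplicate(a, b))
```

A sketch like this compares texts pairwise, so it scales quadratically with the number of threads. Approaches such as those in [4, 5] avoid that by comparing compact fingerprints instead of full shingle sets.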

References

[1]
E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. Finding high quality content in social media, with an application to community-based question answering. In Proceedings of ACM WSDM, pages 183--194, Stanford, CA, USA, February 2008. ACM Press.
[2]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999.
[3]
W. M. Barczynski, F. Brauer, A. Loeser, and A. Mocan. Algebraic information extraction of enterprise data: Methodology and operators. In IK-KR Workshop at IJCAI 2009 (to be published), 2009.
[4]
A. Z. Broder. Identifying and filtering near-duplicate documents. In CPM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, pages 1--10, London, UK, 2000. Springer-Verlag.
[5]
M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC '02: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380--388, New York, NY, USA, 2002. ACM.
[6]
A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2):171--191, 2002.
[7]
X. Z. Fern and W. Lin. Cluster ensemble selection. In SDM, pages 787--797. SIAM, 2008.
[8]
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases, pages 518--529, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[9]
M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 284--291, New York, NY, USA, 2006. ACM.
[10]
U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1--10, San Francisco, CA, USA, January 1994.
[11]
R. Ramakrishnan and A. Tomkins. Toward a peopleweb. IEEE Computer, 40(8):63--72, 2007.
[12]
M. Theobald, J. Siddharth, and A. Paepcke. Spotsigs: robust and efficient near duplicate detection in large web collections. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 563--570, New York, NY, USA, 2008. ACM.
[13]
T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. In IEEE Data Engineering Bulletin, May 2006.
[14]
W. Xi, E. A. Fox, W. Fan, B. Zhang, Z. Chen, J. Yan, and D. Zhuang. Simfusion: measuring similarity using unified relationship matrix. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 130--137, New York, NY, USA, 2005. ACM.
[15]
S. Ye, R. Song, J.-R. Wen, and W.-Y. Ma. A query-dependent duplicate detection approach for large scale search engines. In APWeb, pages 48--58, 2004.

    Published In

    IDEAS '09: Proceedings of the 2009 International Database Engineering & Applications Symposium
    September 2009
    347 pages
    ISBN:9781605584027
    DOI:10.1145/1620432
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data and process mining
    2. data mining
    3. databases for e-commerce
    4. knowledge discovery
    5. knowledge management
    6. semantic web
    7. web-forum analysis

    Qualifiers

    • Research-article

    Conference

    IDEAS '09
    Sponsor:
    • ACM
    • Concordia University

    Acceptance Rates

    Overall Acceptance Rate 74 of 210 submissions, 35%

    Cited By

    • (2022) Analysis of community question‐answering issues via machine learning and deep learning. CAAI Transactions on Intelligence Technology, 8(1), pp. 95-117. DOI: 10.1049/cit2.12081. Online publication date: 4-May-2022
    • (2013) Improving Near-Duplicate Detection in Multi-Layered Collaborative Requirements Engineering Discussions Through Discussion Clustering. The 8th International Conference on Knowledge Management in Organizations, pp. 249-261. DOI: 10.1007/978-94-007-7287-8_20. Online publication date: 6-Sep-2013
    • (2013) The Utility of Discourse Structure in Forum Thread Retrieval. Information Retrieval Technology, pp. 284-295. DOI: 10.1007/978-3-642-45068-6_25. Online publication date: 2013
    • (2012) Learning hash codes for efficient content reuse detection. Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 405-414. DOI: 10.1145/2348283.2348339. Online publication date: 12-Aug-2012
    • (2012) Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages. Proceedings of the 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 523-528. DOI: 10.1109/PDCAT.2012.108. Online publication date: 14-Dec-2012
    • (2011) Predicting thread discourse structure over technical web forums. Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 13-25. DOI: 10.5555/2145432.2145435. Online publication date: 27-Jul-2011
    • (2011) Hypergeometric language models for republished article finding. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 485-494. DOI: 10.1145/2009916.2009983. Online publication date: 24-Jul-2011
    • (2010) Efficient partial-duplicate detection based on sequence matching. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 675-682. DOI: 10.1145/1835449.1835562. Online publication date: 19-Jul-2010
    • (2010) Graph-based concept identification and disambiguation for enterprise search. Proceedings of the 19th international conference on World wide web, pp. 171-180. DOI: 10.1145/1772690.1772709. Online publication date: 26-Apr-2010
