A fusion of algorithms in near duplicate document detection

Authors:

Jun Fan,

Tiejun Huang

PAKDD'11: Proceedings of the 15th international conference on New Frontiers in Applied Data Mining

Pages 234 - 242

https://doi.org/10.1007/978-3-642-28320-8_20

Published: 24 May 2011

Abstract

With the rapid growth of the World Wide Web, a huge number of pages on the Internet are fully or partially duplicated. Returning these near-duplicate results to users degrades the user experience. When deploying digital libraries, the protection of intellectual property and the removal of duplicate content must also be considered. This paper fuses several state-of-the-art algorithms to achieve better performance. We first introduce the three major algorithms for duplicate document detection (shingling, I-Match, and simhash) and their subsequent developments. We take sequences of words (shingles) as the features of the simhash algorithm. We then incorporate the random-lexicon-based multi-fingerprint generation method into the shingling-based simhash algorithm and name the result the shingling-based multi-fingerprint simhash algorithm. We ran preliminary experiments on a synthetic dataset based on the "China-US Million Book Digital Library Project". The experimental results demonstrate the efficiency of these algorithms.
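
The combination described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the shingle length, the MD5-based 64-bit hash, the unweighted features, the way a random sub-lexicon is simulated (a salted hash decides whether a shingle belongs to lexicon `j`), the fallback for empty sub-lexicons, and the "any fingerprint within a Hamming threshold" decision rule are all assumptions made for illustration. The function names are hypothetical as well.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (word sequences) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hash64(feature):
    """Stable 64-bit hash of a string (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features, bits=64):
    """Simhash fingerprint: per-bit majority vote over the feature hashes."""
    counts = [0] * bits
    for feature in features:
        h = hash64(feature)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def in_lexicon(shingle, j, keep):
    """Assumed stand-in for a random lexicon: a salted hash keeps roughly
    a `keep` fraction of all shingles, consistently across documents."""
    return hash64(f"{j}:{shingle}") / 2**64 < keep


def multi_fingerprints(text, n_lexicons=4, keep=0.7, k=3):
    """One simhash over all shingles plus one per random sub-lexicon."""
    feats = shingles(text, k)
    fingerprints = [simhash(feats)]
    for j in range(1, n_lexicons + 1):
        sub = [f for f in feats if in_lexicon(f, j, keep)]
        # Assumption: fall back to the full shingle set if nothing survives.
        fingerprints.append(simhash(sub if sub else feats))
    return fingerprints


def near_duplicate(fps_a, fps_b, max_dist=3):
    """Assumed rule: match if any aligned fingerprint pair is close enough."""
    return any(hamming(a, b) <= max_dist for a, b in zip(fps_a, fps_b))
```

Two documents would be compared with `near_duplicate(multi_fingerprints(doc_a), multi_fingerprints(doc_b))`. The intent of the extra fingerprints is that a small edit which flips bits in the primary fingerprint may leave at least one sub-lexicon fingerprint nearly unchanged; the values of `keep`, `n_lexicons`, and `max_dist` would need tuning on real data.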

Index Terms

A fusion of algorithms in near duplicate document detection
1. Information systems
  1. Information systems applications
    1. Digital libraries and archives

Published In

PAKDD'11: Proceedings of the 15th international conference on New Frontiers in Applied Data Mining

May 2011

506 pages

ISBN:9783642283192

Editors:
Longbing Cao, Faculty of Engineering and Information Technology, University of Technology Sydney, Broadway, Sydney, Australia
Joshua Zhexue Huang, Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences, Shenzhen, China
James Bailey, The University of Melbourne, Melbourne, Australia
Yun Sing Koh, The University of Auckland, Auckland, New Zealand
Jun Luo, Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences, Shenzhen, China

Publisher

Springer-Verlag

Berlin, Heidelberg
