More Web Proxy on the site http://driver.im/

Article

Detecting near-duplicates for web crawling

Authors:

Gurmeet Singh Manku,

Anish Das SarmaAuthors Info & Claims

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 141 - 150

https://doi.org/10.1145/1242572.1242592

Published: 08 May 2007 Publication History

Abstract

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.

References

[1]

A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the web. ACM Transactions on Internet Technology, 1(1):2--43, 2001.

Digital Library

[2]

A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In Proc. ACM SIGMOD 2003, pages 337--348, 2003.

Digital Library

[3]

A.N. Arslan and Ö Eugeciouglu. Dictionary look-up within small edit distance. In Proc. 8th Annual Intl. Computing and Combinatorics Conference (COCOON'02), pages 127--136, 2002.

Digital Library

[4]

B. S. Baker. A theory of parameterized pattern matching algorithms and applications. In Proc. 25th Annual Symposium on Theory of Computing (STOC 1993), pages 71--80, 1993.

Digital Library

[5]

B. S. Baker. On finding duplication and near-duplication in large software systems. In Proc. 2nd Working Conference on Reverse Engineering, page 86, 1995.

Digital Library

[6]

K. Bharat and A. Broder. Mirror, mirror on the Web: A study of hst pairs with replicated content. In Proc. 8th International Conference on World Wide Web (WWW 1999), pages 1579--1590, 1999.

Digital Library

[7]

K. Bharat, A. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. J American Society for Information Science, 51(12):1114--1122, Aug. 2000.

Digital Library

[8]

S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proc. ACM SIGMOD Annual Conference, pages 398--409, May 1995.

Digital Library

[9]

S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1--7):107--117, 1998.

Digital Library

[10]

G. S. Brodal and L. Gcasieniec. Approximate dictionary queries. In Proc 7th Combinatorial Pattern Matching Symposium, pages 65--74, 1996.

Digital Library

[11]

G. S. Brodal and S. Venkatesh. Improved bounds for dictionary look-up with one error. Information Processing Letters, 75(1--2):57--59, 2000.

Digital Library

[12]

A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, 1998.

Digital Library

[13]

A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In Proc. 30th Annual Symposium on Theory of Computing (STOC 1998), pages 327--336, 1998.

Digital Library

[14]

A. Broder, S. C. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997.

Digital Library

[15]

A. Z. Broder, M. Najork, and J. L. Wiener. Efficient URL caching for World Wide Web crawling. In International conference on World Wide Web, 2003.

Digital Library

[16]

S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan-Kauffman, 2002.

Digital Library

[17]

M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual Symposium on Theory of Computing (STOC 2002), pages 380--388, 2002.

Digital Library

[18]

Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity estimation for boolean queries. In Proc. PODS 2000, pages 216-225, 2000.

Digital Library

[19]

J. Cho, H. García-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7): 161--172, 1998.

Digital Library

[20]

A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems, 20(2):171--191, 2002.

Digital Library

[21]

E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. In Proc. 16th Intl. Conf. on Data Engineering (ICDE 2000), pages 489--499, 2000.

Digital Library

[22]

J. G. Conrad and C. P. Schriber. Constructing a text corpus for inexact duplicate detection. In SIGIR 2004, pages 582--583, July 2004.

Digital Library

[23]

J. W. Cooper, A. R. Coden, and E. W. Brown. Detecting similar documents using salient terms. In Proc. 1st Intl. Conf. on Information and Knowledge Management (CIKM 2002), pages 245--251, Nov. 2002.

Digital Library

[24]

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137--150, Dec. 2004.

Digital Library

[25]

J. Dean and M. Henzinger. Finding related pages in the World Wide Web. Computer Networks, 31(11--16):1467--1479, 1999.

Digital Library

[26]

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J American Society for Information Science, 41(6):391--407, 1990.

[27]

M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In 26th International Conference on Very Large Databases, (VLDB 2000), pages 527--534, sep 2000.

Digital Library

[28]

D. Dolev, Y. Harari, M. Linial, N. Nisan, and M. Parnas. Neighborhood preserving hashing and approximate queries. In Proc. 5th ACM Symposium on Discrete Algorithms (SODA 1994), 1994.

Digital Library

[29]

S. Ghemawat, H. Gobioff, and S.T. Leung. The Google File System. In 19th ACM Symposium on Operating Systems Principles (SOSP 2003), pages 29--43, Oct. 2003.

Digital Library

[30]

A. Gionis, D. Gunopulos, and N. Koudas. Efficient and tunable similar set retrieval. In Proc. SIGMOD 2001, pages 247--258, 2001.

Digital Library

[31]

D. Greene, MParnas, and FYao. Multi-index hashing for information retrieval. In Proc. 35th Annual Symposium on Foundations of Computer Science (FOCS 1994), pages 722--731, 1994.

Digital Library

[32]

K.M. Hammouda and M.S. Kamel. Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge and Data Engineering, 6(10):1279--1296, Aug. 2004.

Digital Library

[33]

T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In Proc. 3rd Intl. Workshop on the Web and Databases (WebDB 2000), pages 129--134, May 2000.

[34]

T. H. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the Web. In Proc. 11th International World Wide Web Conference, pages 432--442, May 2002.

Digital Library

[35]

M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR 2006, pages 284--291, 2006.

Digital Library

[36]

T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarised documents. J of the American Society for Information Science and Technology, 54(3):203--215, Feb. 2003.

Digital Library

[37]

D. A. Huffman. A method for the construction of minimum-redundancy codes. In Proc. Institute of Radio Engineering, volume 40, pages 1098--1102, Sept. 1952.

[38]

S. Joshi, N. Agrawal, R. Krishnapuram, and S. Negi. A bag of paths model for measuring structural similarity in Web documents. In Proc. 9th ACM Intl. Conf. on Knowledge Discovery and Data Mining (SIGKDD 2003), pages 577--582, 2003.

Digital Library

[39]

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J of the ACM, 46(5):604--632, Sept. 1999.

Digital Library

[40]

A. Kolcz, A. Chowdhury, and J. Alspector. Improved robustness of signature-based near-replica detection via lexicon randomization. In SIGKDD 2004, pages 605--610, Aug. 2004.

Digital Library

[41]

R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. Computer Networks: The Intl. J of Computer and Telecommunications Networks, 31:1481--1493, 1999.

Digital Library

[42]

U. Manber. Finding similar files in a large file system. In Proc. 1994 USENIX Conference, pages 1--10, Jan. 1994.

Digital Library

[43]

F. Menczer, G. Pant, P. Srinivasan, and M. E. Ruiz. Evaluating topic-driven web crawlers. In Research and Development in Information Retrieval, pages 241--249, 2001.

Digital Library

[44]

M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.

[45]

A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proc. 18th ACM Symposium on Operating System Principles (SOSP 2001), pages 174--187, Oct. 2001.

Digital Library

[46]

S. Pandey and C. Olston. User-centric web crawling. In WWW '05: Proc. 14th Intl. Conference on World Wide Web, pages 401--411, 2005.

Digital Library

[47]

W. Pugh and M. R. Henzinger. Detecting duplicate and near-duplicate files. United States Patent 6,658,423, granted on Dec 2, 2003, 2003.

[48]

S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In First USENIX Conference on File and Storage Technologies, pages 89--101, 2002.

Digital Library

[49]

M. O. Rabin. Fingerprinting by random polynomials. Technical Report Report TR-15-81, Center for Research in Computing Techonlogy, Harvard University, 1981.

[50]

S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In Proc. SIGMOD 2003, pages 76--85, June 2003.

Digital Library

[51]

N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries, 1995.

[52]

A. Yao and F. Yao. The complexity of searching in an ordered random table. In Proc. 17th Annual Symposium on Foundations of Computer Science (FOCS 1976), pages 173--177, 1976.

Digital Library

[53]

A. C. Yao and F. F. Yao. Dictionary look-up with one error. J of Algorithms, 25(1):194--202, 1997.

Digital Library

Cited By

De Mofo GWacka ATchakounte FFotso J(2024)Detection of Vulnerabilities in Cryptocurrency Smart Contracts Based on Image ProcessingGlobal Perspectives on the Applications of Computer Vision in Cybersecurity10.4018/978-1-6684-8127-1.ch004(102-123)Online publication date: 23-Feb-2024
https://doi.org/10.4018/978-1-6684-8127-1.ch004
Backes TDietze S(2024)Connected Components for Scaling Partial-order Blocking to Billion EntitiesJournal of Data and Information Quality10.1145/364655316:1(1-29)Online publication date: 19-Mar-2024
https://dl.acm.org/doi/10.1145/3646553
Toney ACurlee KProbasco E(2024)Trust Issues: Discrepancies in Trustworthy AI Keywords Use in Policy and ResearchProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency10.1145/3630106.3659035(2222-2233)Online publication date: 3-Jun-2024
https://dl.acm.org/doi/10.1145/3630106.3659035
Show More Cited By

Index Terms

Detecting near-duplicates for web crawling
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage
2. Mathematics of computing
  1. Discrete mathematics
    1. Combinatorics
      1. Combinatorial algorithms
      2. Permutations and combinations

Recommendations

Fixing the threshold for effective detection of near duplicate web documents in web crawling
ADMA'10: Proceedings of the 6th international conference on Advanced data mining and applications: Part I

The drastic development of the WWW in recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to web search engines making their results less ...
Achieving both high precision and high recall in near-duplicate detection
CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

To find near-duplicate documents, fingerprint-based paradigms such as Broder's shingling and Charikar's simhash algorithms have been recognized as effective approaches and are considered the state-of-the-art. Nevertheless, we see two aspects of these ...
Current challenges in web crawling
ICWE'13: Proceedings of the 13th international conference on Web Engineering

Web crawling, a process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents starting from a simple program for website backup to a major web search engine. Due to an ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

1382 pages

ISBN:9781595936547

DOI:10.1145/1242572

General Chairs:
Carey Williamson
University of Calgary, Canada
,
Mary Ellen Zurko
IBM, USA
,
Program Chairs:
Peter Patel-Schneider
Bell Labs Research, USA
,
Prashant Shenoy
University of Massachusetts at Amherst, USA

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 May 2007

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

WWW'07

Sponsor:

ACM

WWW'07: 16th International World Wide Web Conference

May 8 - 12, 2007

Alberta, Banff, Canada

Acceptance Rates

Overall Acceptance Rate 1,068 of 6,946 submissions, 15%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

389
Total Citations
View Citations
3,773
Total Downloads

Downloads (Last 12 months)128
Downloads (Last 6 weeks)17

Reflects downloads up to 12 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

De Mofo GWacka ATchakounte FFotso J(2024)Detection of Vulnerabilities in Cryptocurrency Smart Contracts Based on Image ProcessingGlobal Perspectives on the Applications of Computer Vision in Cybersecurity10.4018/978-1-6684-8127-1.ch004(102-123)Online publication date: 23-Feb-2024
https://doi.org/10.4018/978-1-6684-8127-1.ch004
Backes TDietze S(2024)Connected Components for Scaling Partial-order Blocking to Billion EntitiesJournal of Data and Information Quality10.1145/364655316:1(1-29)Online publication date: 19-Mar-2024
https://dl.acm.org/doi/10.1145/3646553
Toney ACurlee KProbasco E(2024)Trust Issues: Discrepancies in Trustworthy AI Keywords Use in Policy and ResearchProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency10.1145/3630106.3659035(2222-2233)Online publication date: 3-Jun-2024
https://dl.acm.org/doi/10.1145/3630106.3659035
Clerissi DDenaro GMobilio MMariani L(2024)Guess the State: Exploiting Determinism to Improve GUI Exploration EfficiencyIEEE Transactions on Software Engineering10.1109/TSE.2024.336658650:4(836-853)Online publication date: 16-Feb-2024
https://dl.acm.org/doi/10.1109/TSE.2024.3366586
Pan QCao HZhu YLiu JLi B(2024)Contextual Client Selection for Efficient Federated Learning Over Edge DevicesIEEE Transactions on Mobile Computing10.1109/TMC.2023.332364523:6(6538-6548)Online publication date: Jun-2024
https://doi.org/10.1109/TMC.2023.3323645
Wu WLi BLuo CNejdl WTan X(2024)MPSketch: Message Passing Networks via Randomized Hashing for Efficient Attributed Network EmbeddingIEEE Transactions on Cybernetics10.1109/TCYB.2023.324376354:5(2941-2954)Online publication date: May-2024
https://doi.org/10.1109/TCYB.2023.3243763
Wang CWang KLi MWei FXiong N(2024)Chunk2vec: A novel resemblance detection scheme based on Sentence‐BERT for post‐deduplication delta compression in network transmissionIET Communications10.1049/cmu2.12719Online publication date: 4-Jan-2024
https://doi.org/10.1049/cmu2.12719
Zhong KJackson TWest ACosma G(2024)Natural Language Processing Approaches in Industrial MaintenanceProcedia Computer Science10.1016/j.procs.2024.02.029232:C(2082-2097)Online publication date: 2-Jul-2024
https://dl.acm.org/doi/10.1016/j.procs.2024.02.029
Rahhal IKassou IGhogho M(2024)Data science for job market analysis: A survey on applications and techniquesExpert Systems with Applications10.1016/j.eswa.2024.124101251(124101)Online publication date: Oct-2024
https://doi.org/10.1016/j.eswa.2024.124101
Bao HLi WChen HMiao HWang QTang ZLiu FWang W(2024)Stories behind decisions: Towards interpretable malware family classification with hierarchical attentionComputers & Security10.1016/j.cose.2024.103943144(103943)Online publication date: Sep-2024
https://doi.org/10.1016/j.cose.2024.103943
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents