[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1242572.1242592acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
Article

Detecting near-duplicates for web crawling

Published: 08 May 2007 Publication History

Abstract

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.

References

[1]
A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the web. ACM Transactions on Internet Technology, 1(1):2--43, 2001.
[2]
A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In Proc. ACM SIGMOD 2003, pages 337--348, 2003.
[3]
A.N. Arslan and Ö Eugeciouglu. Dictionary look-up within small edit distance. In Proc. 8th Annual Intl. Computing and Combinatorics Conference (COCOON'02), pages 127--136, 2002.
[4]
B. S. Baker. A theory of parameterized pattern matching algorithms and applications. In Proc. 25th Annual Symposium on Theory of Computing (STOC 1993), pages 71--80, 1993.
[5]
B. S. Baker. On finding duplication and near-duplication in large software systems. In Proc. 2nd Working Conference on Reverse Engineering, page 86, 1995.
[6]
K. Bharat and A. Broder. Mirror, mirror on the Web: A study of hst pairs with replicated content. In Proc. 8th International Conference on World Wide Web (WWW 1999), pages 1579--1590, 1999.
[7]
K. Bharat, A. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. J American Society for Information Science, 51(12):1114--1122, Aug. 2000.
[8]
S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proc. ACM SIGMOD Annual Conference, pages 398--409, May 1995.
[9]
S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1--7):107--117, 1998.
[10]
G. S. Brodal and L. Gcasieniec. Approximate dictionary queries. In Proc 7th Combinatorial Pattern Matching Symposium, pages 65--74, 1996.
[11]
G. S. Brodal and S. Venkatesh. Improved bounds for dictionary look-up with one error. Information Processing Letters, 75(1--2):57--59, 2000.
[12]
A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, 1998.
[13]
A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In Proc. 30th Annual Symposium on Theory of Computing (STOC 1998), pages 327--336, 1998.
[14]
A. Broder, S. C. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997.
[15]
A. Z. Broder, M. Najork, and J. L. Wiener. Efficient URL caching for World Wide Web crawling. In International conference on World Wide Web, 2003.
[16]
S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan-Kauffman, 2002.
[17]
M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual Symposium on Theory of Computing (STOC 2002), pages 380--388, 2002.
[18]
Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity estimation for boolean queries. In Proc. PODS 2000, pages 216-225, 2000.
[19]
J. Cho, H. García-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7): 161--172, 1998.
[20]
A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems, 20(2):171--191, 2002.
[21]
E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. In Proc. 16th Intl. Conf. on Data Engineering (ICDE 2000), pages 489--499, 2000.
[22]
J. G. Conrad and C. P. Schriber. Constructing a text corpus for inexact duplicate detection. In SIGIR 2004, pages 582--583, July 2004.
[23]
J. W. Cooper, A. R. Coden, and E. W. Brown. Detecting similar documents using salient terms. In Proc. 1st Intl. Conf. on Information and Knowledge Management (CIKM 2002), pages 245--251, Nov. 2002.
[24]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137--150, Dec. 2004.
[25]
J. Dean and M. Henzinger. Finding related pages in the World Wide Web. Computer Networks, 31(11--16):1467--1479, 1999.
[26]
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J American Society for Information Science, 41(6):391--407, 1990.
[27]
M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In 26th International Conference on Very Large Databases, (VLDB 2000), pages 527--534, sep 2000.
[28]
D. Dolev, Y. Harari, M. Linial, N. Nisan, and M. Parnas. Neighborhood preserving hashing and approximate queries. In Proc. 5th ACM Symposium on Discrete Algorithms (SODA 1994), 1994.
[29]
S. Ghemawat, H. Gobioff, and S.T. Leung. The Google File System. In 19th ACM Symposium on Operating Systems Principles (SOSP 2003), pages 29--43, Oct. 2003.
[30]
A. Gionis, D. Gunopulos, and N. Koudas. Efficient and tunable similar set retrieval. In Proc. SIGMOD 2001, pages 247--258, 2001.
[31]
D. Greene, MParnas, and FYao. Multi-index hashing for information retrieval. In Proc. 35th Annual Symposium on Foundations of Computer Science (FOCS 1994), pages 722--731, 1994.
[32]
K.M. Hammouda and M.S. Kamel. Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge and Data Engineering, 6(10):1279--1296, Aug. 2004.
[33]
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In Proc. 3rd Intl. Workshop on the Web and Databases (WebDB 2000), pages 129--134, May 2000.
[34]
T. H. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the Web. In Proc. 11th International World Wide Web Conference, pages 432--442, May 2002.
[35]
M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR 2006, pages 284--291, 2006.
[36]
T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarised documents. J of the American Society for Information Science and Technology, 54(3):203--215, Feb. 2003.
[37]
D. A. Huffman. A method for the construction of minimum-redundancy codes. In Proc. Institute of Radio Engineering, volume 40, pages 1098--1102, Sept. 1952.
[38]
S. Joshi, N. Agrawal, R. Krishnapuram, and S. Negi. A bag of paths model for measuring structural similarity in Web documents. In Proc. 9th ACM Intl. Conf. on Knowledge Discovery and Data Mining (SIGKDD 2003), pages 577--582, 2003.
[39]
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J of the ACM, 46(5):604--632, Sept. 1999.
[40]
A. Kolcz, A. Chowdhury, and J. Alspector. Improved robustness of signature-based near-replica detection via lexicon randomization. In SIGKDD 2004, pages 605--610, Aug. 2004.
[41]
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. Computer Networks: The Intl. J of Computer and Telecommunications Networks, 31:1481--1493, 1999.
[42]
U. Manber. Finding similar files in a large file system. In Proc. 1994 USENIX Conference, pages 1--10, Jan. 1994.
[43]
F. Menczer, G. Pant, P. Srinivasan, and M. E. Ruiz. Evaluating topic-driven web crawlers. In Research and Development in Information Retrieval, pages 241--249, 2001.
[44]
M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.
[45]
A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proc. 18th ACM Symposium on Operating System Principles (SOSP 2001), pages 174--187, Oct. 2001.
[46]
S. Pandey and C. Olston. User-centric web crawling. In WWW '05: Proc. 14th Intl. Conference on World Wide Web, pages 401--411, 2005.
[47]
W. Pugh and M. R. Henzinger. Detecting duplicate and near-duplicate files. United States Patent 6,658,423, granted on Dec 2, 2003, 2003.
[48]
S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In First USENIX Conference on File and Storage Technologies, pages 89--101, 2002.
[49]
M. O. Rabin. Fingerprinting by random polynomials. Technical Report Report TR-15-81, Center for Research in Computing Techonlogy, Harvard University, 1981.
[50]
S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In Proc. SIGMOD 2003, pages 76--85, June 2003.
[51]
N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries, 1995.
[52]
A. Yao and F. Yao. The complexity of searching in an ordered random table. In Proc. 17th Annual Symposium on Foundations of Computer Science (FOCS 1976), pages 173--177, 1976.
[53]
A. C. Yao and F. F. Yao. Dictionary look-up with one error. J of Algorithms, 25(1):194--202, 1997.

Cited By

View all
  • (2024)Detection of Vulnerabilities in Cryptocurrency Smart Contracts Based on Image ProcessingGlobal Perspectives on the Applications of Computer Vision in Cybersecurity10.4018/978-1-6684-8127-1.ch004(102-123)Online publication date: 23-Feb-2024
  • (2024)Connected Components for Scaling Partial-order Blocking to Billion EntitiesJournal of Data and Information Quality10.1145/364655316:1(1-29)Online publication date: 19-Mar-2024
  • (2024)Trust Issues: Discrepancies in Trustworthy AI Keywords Use in Policy and ResearchProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency10.1145/3630106.3659035(2222-2233)Online publication date: 3-Jun-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN:9781595936547
DOI:10.1145/1242572
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 May 2007

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. fingerprint
  2. hamming distance
  3. near-duplicate
  4. search
  5. similarity
  6. sketch
  7. web crawl
  8. web document

Qualifiers

  • Article

Conference

WWW'07
Sponsor:
WWW'07: 16th International World Wide Web Conference
May 8 - 12, 2007
Alberta, Banff, Canada

Acceptance Rates

Overall Acceptance Rate 1,068 of 6,946 submissions, 15%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)128
  • Downloads (Last 6 weeks)17
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Detection of Vulnerabilities in Cryptocurrency Smart Contracts Based on Image ProcessingGlobal Perspectives on the Applications of Computer Vision in Cybersecurity10.4018/978-1-6684-8127-1.ch004(102-123)Online publication date: 23-Feb-2024
  • (2024)Connected Components for Scaling Partial-order Blocking to Billion EntitiesJournal of Data and Information Quality10.1145/364655316:1(1-29)Online publication date: 19-Mar-2024
  • (2024)Trust Issues: Discrepancies in Trustworthy AI Keywords Use in Policy and ResearchProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency10.1145/3630106.3659035(2222-2233)Online publication date: 3-Jun-2024
  • (2024)Guess the State: Exploiting Determinism to Improve GUI Exploration EfficiencyIEEE Transactions on Software Engineering10.1109/TSE.2024.336658650:4(836-853)Online publication date: 16-Feb-2024
  • (2024)Contextual Client Selection for Efficient Federated Learning Over Edge DevicesIEEE Transactions on Mobile Computing10.1109/TMC.2023.332364523:6(6538-6548)Online publication date: Jun-2024
  • (2024)MPSketch: Message Passing Networks via Randomized Hashing for Efficient Attributed Network EmbeddingIEEE Transactions on Cybernetics10.1109/TCYB.2023.324376354:5(2941-2954)Online publication date: May-2024
  • (2024)Chunk2vec: A novel resemblance detection scheme based on Sentence‐BERT for post‐deduplication delta compression in network transmissionIET Communications10.1049/cmu2.12719Online publication date: 4-Jan-2024
  • (2024)Natural Language Processing Approaches in Industrial MaintenanceProcedia Computer Science10.1016/j.procs.2024.02.029232:C(2082-2097)Online publication date: 2-Jul-2024
  • (2024)Data science for job market analysis: A survey on applications and techniquesExpert Systems with Applications10.1016/j.eswa.2024.124101251(124101)Online publication date: Oct-2024
  • (2024)Stories behind decisions: Towards interpretable malware family classification with hierarchical attentionComputers & Security10.1016/j.cose.2024.103943144(103943)Online publication date: Sep-2024
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media