DOI: 10.5555/951953.952397
Article

On the Evolution of Clusters of Near-Duplicate Web Pages

Published: 10 November 2003

Abstract

This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
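
The abstract does not say how near-duplicates were detected or clustered. As a rough illustration only, the sketch below uses one common approach: compare pages by the Jaccard similarity of their word shingles, then group pages that exceed a threshold using union-find. The function names, the shingle length `k=3`, and the `threshold=0.5` are all assumptions made for this example, not values from the paper, and the all-pairs comparison would not scale to 150 million pages.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(pages, threshold=0.5, k=3):
    """Group URLs whose shingle sets have Jaccard similarity >= threshold.

    pages maps a URL to its text. Hypothetical helper: threshold and k
    are illustrative, not taken from the paper.
    """
    parent = {url: url for url in pages}

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    sets = {url: shingles(text, k) for url, text in pages.items()}
    for u, v in combinations(pages, 2):
        if jaccard(sets[u], sets[v]) >= threshold:
            parent[find(u)] = find(v)

    clusters = {}
    for url in pages:
        clusters.setdefault(find(url), set()).add(url)
    # Only clusters with at least two members are near-duplicate clusters.
    return [c for c in clusters.values() if len(c) > 1]


pages = {
    "a": "the quick brown fox jumps over the lazy dog today",
    "b": "the quick brown fox jumps over the lazy dog tonight",
    "c": "completely different text about web crawlers and search",
}
print(cluster_near_duplicates(pages))
```

Under this sketch, a crawler following the abstract's suggestion could recrawl one representative URL per returned cluster and lower the download priority of the remaining members.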



Published In

LA-WEB '03: Proceedings of the First Conference on Latin American Web Congress
November 2003
ISBN: 0769520588

Publisher

IEEE Computer Society

United States


Qualifiers

  • Article


Cited By

  • (2021) Web Application Testing. Proceedings of the 15th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1-6. DOI: 10.1145/3475716.3484187. Online publication date: 11-Oct-2021
  • (2014) An Anti-Phishing System Employing Diffused Information. ACM Transactions on Information and System Security 16:4, pp. 1-31. DOI: 10.1145/2584680. Online publication date: 1-Apr-2014
  • (2013) Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Systems with Applications: An International Journal 40:5, pp. 1467-1476. DOI: 10.1016/j.eswa.2012.08.045. Online publication date: 1-Apr-2013
  • (2013) Learning URL Normalization Rules Using Multiple Alignment of Sequences. Proceedings of the 20th International Symposium on String Processing and Information Retrieval - Volume 8214, pp. 197-205. DOI: 10.1007/978-3-319-02432-5_23. Online publication date: 7-Oct-2013
  • (2012) Detecting quilted web pages at scale. Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 385-394. DOI: 10.1145/2348283.2348337. Online publication date: 12-Aug-2012
  • (2011) Detection of near-duplicate user generated contents. Proceedings of the 3rd international workshop on Search and mining user-generated contents, pp. 27-34. DOI: 10.1145/2065023.2065031. Online publication date: 28-Oct-2011
  • (2011) CANTINA+. ACM Transactions on Information and System Security 14:2, pp. 1-28. DOI: 10.1145/2019599.2019606. Online publication date: 1-Sep-2011
  • (2011) Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems 36:3, pp. 1-41. DOI: 10.1145/2000824.2000825. Online publication date: 26-Aug-2011
  • (2011) A fusion of algorithms in near duplicate document detection. Proceedings of the 15th international conference on New Frontiers in Applied Data Mining, pp. 234-242. DOI: 10.1007/978-3-642-28320-8_20. Online publication date: 24-May-2011
  • (2010) A hierarchical adaptive probabilistic approach for zero hour phish detection. Proceedings of the 15th European conference on Research in computer security, pp. 268-285. DOI: 10.5555/1888881.1888903. Online publication date: 20-Sep-2010
