Article

Do not crawl in the DUST: different URLs with similar text

Authors:

Ziv Bar-Yossef,

Uri Schonfeld

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 111 - 120

https://doi.org/10.1145/1242572.1242588

Published: 08 May 2007

Abstract

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
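To make the idea of a URL-transforming rule concrete, here is a minimal sketch in Python. It is not the DustBuster algorithm, which the abstract does not spell out. It assumes a rule is a simple substring substitution (`alpha` replaced by `beta`), that a rule's support can be estimated by counting logged URLs whose rewritten form also appears in the log, and that validation compares fetched content for exact equality as a stand-in for similarity. The function names, the `fetch` callable, and the example URLs are all hypothetical.

```python
import random


def apply_rule(url, alpha, beta):
    """Rewrite the first occurrence of alpha in url with beta (None if absent)."""
    if alpha not in url:
        return None
    return url.replace(alpha, beta, 1)


def rule_support(urls, alpha, beta):
    """Count distinct logged URLs whose rewritten form also occurs in the log.

    Only URL strings are inspected; no page content is examined.
    """
    seen = set(urls)
    hits = 0
    for url in seen:
        rewritten = apply_rule(url, alpha, beta)
        if rewritten is not None and rewritten != url and rewritten in seen:
            hits += 1
    return hits


def validate_rule(urls, alpha, beta, fetch, sample_size, rng):
    """Fetch a small sample of URL pairs and return the fraction that match.

    fetch is an injected callable (hypothetical); exact equality is an
    assumption standing in for a real similarity measure.
    """
    candidates = sorted(u for u in set(urls) if alpha in u)
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    if not sample:
        return 0.0
    same = sum(1 for u in sample if fetch(u) == fetch(apply_rule(u, alpha, beta)))
    return same / len(sample)


# Hypothetical log and page store, for illustration only.
LOG = [
    "http://example.com/index.html",
    "http://example.com/",
    "http://example.com/news/index.html",
    "http://example.com/news/",
    "http://example.com/about.html",
]
PAGES = {
    "http://example.com/index.html": "home",
    "http://example.com/": "home",
    "http://example.com/news/index.html": "news",
    "http://example.com/news/": "news",
    "http://example.com/about.html": "about",
}

support = rule_support(LOG, "/index.html", "/")
precision = validate_rule(LOG, "/index.html", "/", PAGES.get, 2, random.Random(0))
canonical = apply_rule("http://example.com/news/index.html", "/index.html", "/")
```

In this toy log the rule "/index.html" to "/" is supported by two URL pairs, and a crawler that trusted the rule could fetch only the rewritten form of each such URL.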

Index Terms

Do not crawl in the DUST: different URLs with similar text
1. Information systems
  1. Information retrieval

Information & Contributors

Information

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

1382 pages

ISBN: 9781595936547

DOI: 10.1145/1242572

General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)

Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

WWW'07

Sponsor:

ACM

WWW'07: 16th International World Wide Web Conference

May 8 - 12, 2007

Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

Cambazoglu B, Baeza-Yates R (2015) Scalability Challenges in Web Search Engines. Synthesis Lectures on Information Concepts, Retrieval, and Services 7:6 (1-138). Online publication date: 29-Dec-2015
https://doi.org/10.2200/S00662ED1V01Y201508ICR045
Valkanas G, Gunopulos D, Akerkar R, Bassiliades N, Davies J, Ermolayev V (2014) Predicting Download Directories for Web Resources. Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14) (1-12). Online publication date: 2-Jun-2014
https://dl.acm.org/doi/10.1145/2611040.2611076
Zhou Y, Zhang Q, Huang X, Wu L, He Q, Iyengar A, Nejdl W, Pei J, Rastogi R (2013) A pattern-based selective recrawling approach for object-level vertical search. Proceedings of the 22nd ACM international conference on Information & Knowledge Management (1441-1450). Online publication date: 27-Oct-2013
https://dl.acm.org/doi/10.1145/2505515.2505707
Plegas Y, Stamou S, Shin S, Maldonado J (2013) Reducing information redundancy in search results. Proceedings of the 28th Annual ACM Symposium on Applied Computing (886-893). Online publication date: 18-Mar-2013
https://dl.acm.org/doi/10.1145/2480362.2480533
He Y, Xin D, Ganti V, Rajaraman S, Shah N, Leonardi S, Panconesi A, Ferragina P, Gionis A (2013) Crawling deep web entity pages. Proceedings of the sixth ACM international conference on Web search and data mining (355-364). Online publication date: 4-Feb-2013
https://dl.acm.org/doi/10.1145/2433396.2433442
Prieto V, Alvarez M, Cacheda F (2013) Analysis and detection of Soft-404 pages. Third International Conference on Innovative Computing Technology (INTECH 2013) (217-226). Online publication date: Aug-2013
https://doi.org/10.1109/INTECH.2013.6653695
Jiang J, Yu N, Lin C, Mille A, Gandon F, Misselis J, Rabinovich M, Staab S (2012) FoCUS. Proceedings of the 21st International Conference on World Wide Web (33-42). Online publication date: 16-Apr-2012
https://dl.acm.org/doi/10.1145/2187980.2187985
Hernández I, Rivero C, Ruiz D, Corchuelo R (2012) Towards discovering conceptual models behind web sites. Proceedings of the 31st international conference on Conceptual Modeling (166-175). Online publication date: 15-Oct-2012
https://dl.acm.org/doi/10.1007/978-3-642-34002-4_13
Patro S, Wang W (2011) Learning top-k transformation rules. Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I (172-186). Online publication date: 29-Aug-2011
https://dl.acm.org/doi/10.5555/2035368.2035384
Mudhasir Y, Deepika J, Sendhilkumar S (2011) An evaluation of provenance-based near-duplicates detection. International Journal of Knowledge and Web Intelligence 2:2/3 (168-184). Online publication date: 1-Dec-2011
https://dl.acm.org/doi/10.1504/IJKWI.2011.044122
