DOI: 10.1145/1242572.1242588
Article

Do not crawl in the DUST: different URLs with similar text

Published: 08 May 2007

Abstract

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
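
The idea of a URL transformation rule can be illustrated with a minimal sketch. It assumes a rule is a plain substring substitution (alpha replaced by beta), counts from a URL log alone how many URLs map onto another URL in the same log, and then checks a small sample by fetching pages through a caller-supplied `fetch` function. The names `apply_rule`, `rule_support`, and `validate_by_sampling`, the example.com URLs, the substitution form, and the use of exact equality as a stand-in for "similar" are all assumptions made for illustration; this is not the DustBuster algorithm itself, whose details the abstract does not give.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumed rule form for illustration: replace substring alpha with beta.
Rule = Tuple[str, str]


def apply_rule(url: str, rule: Rule) -> Optional[str]:
    """Replace the first occurrence of alpha with beta; None if alpha is absent."""
    alpha, beta = rule
    if alpha not in url:
        return None
    return url.replace(alpha, beta, 1)


def rule_support(urls: Iterable[str], rule: Rule) -> int:
    """Count log URLs whose transformed form also appears in the log.

    Only URL strings are inspected; no page content is fetched.
    """
    seen = set(urls)
    support = 0
    for url in seen:
        target = apply_rule(url, rule)
        if target is not None and target != url and target in seen:
            support += 1
    return support


def validate_by_sampling(
    urls: Iterable[str],
    rule: Rule,
    fetch: Callable[[str], str],
    sample_size: int = 3,
) -> Tuple[int, int]:
    """Fetch a few (url, transformed url) pairs and count exact content matches.

    Returns (matches, pairs_tested). Exact equality is a simplification;
    a real check would use some similarity measure.
    """
    applicable = sorted(
        u for u in set(urls) if apply_rule(u, rule) not in (None, u)
    )
    matches = 0
    tested = 0
    for url in applicable[:sample_size]:
        target = apply_rule(url, rule)
        tested += 1
        if fetch(url) == fetch(target):
            matches += 1
    return matches, tested


if __name__ == "__main__":
    # Hypothetical log: two of the three "/index.html" URLs have an alias in the log.
    log = [
        "http://example.com/index.html",
        "http://example.com/",
        "http://example.com/a/index.html",
        "http://example.com/a/",
        "http://example.com/b/index.html",
    ]
    print(rule_support(log, ("/index.html", "/")))
```

A rule with high support in the log is only a candidate; the sampling step is what confirms or rejects it, and it touches a handful of pages rather than the whole site.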



    Published In

    WWW '07: Proceedings of the 16th international conference on World Wide Web
    May 2007
    1382 pages
    ISBN:9781595936547
    DOI:10.1145/1242572
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. URL normalization
    2. anti-aliasing
    3. crawling
    4. duplicate detection
    5. search engines

    Qualifiers

    • Article

    Conference

    WWW'07: 16th International World Wide Web Conference
    May 8 - 12, 2007
    Banff, Alberta, Canada

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


    Cited By

    • (2015) Scalability Challenges in Web Search Engines. Synthesis Lectures on Information Concepts, Retrieval, and Services 7:6 (1-138). DOI: 10.2200/S00662ED1V01Y201508ICR045. Online publication date: 29-Dec-2015
    • (2014) Predicting Download Directories for Web Resources. Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14) (1-12). DOI: 10.1145/2611040.2611076. Online publication date: 2-Jun-2014
    • (2013) A pattern-based selective recrawling approach for object-level vertical search. Proceedings of the 22nd ACM international conference on Information & Knowledge Management (1441-1450). DOI: 10.1145/2505515.2505707. Online publication date: 27-Oct-2013
    • (2013) Reducing information redundancy in search results. Proceedings of the 28th Annual ACM Symposium on Applied Computing (886-893). DOI: 10.1145/2480362.2480533. Online publication date: 18-Mar-2013
    • (2013) Crawling deep web entity pages. Proceedings of the sixth ACM international conference on Web search and data mining (355-364). DOI: 10.1145/2433396.2433442. Online publication date: 4-Feb-2013
    • (2013) Analysis and detection of Soft-404 pages. Third International Conference on Innovative Computing Technology (INTECH 2013) (217-226). DOI: 10.1109/INTECH.2013.6653695. Online publication date: Aug-2013
    • (2012) FoCUS. Proceedings of the 21st International Conference on World Wide Web (33-42). DOI: 10.1145/2187980.2187985. Online publication date: 16-Apr-2012
    • (2012) Towards discovering conceptual models behind web sites. Proceedings of the 31st international conference on Conceptual Modeling (166-175). DOI: 10.1007/978-3-642-34002-4_13. Online publication date: 15-Oct-2012
    • (2011) Learning top-k transformation rules. Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I (172-186). DOI: 10.5555/2035368.2035384. Online publication date: 29-Aug-2011
    • (2011) An evaluation of provenance-based near-duplicates detection. International Journal of Knowledge and Web Intelligence 2:2/3 (168-184). DOI: 10.1504/IJKWI.2011.044122. Online publication date: 1-Dec-2011
