Research article
DOI: 10.1145/1526709.1526842

Sitemaps: above and beyond the crawl of duty

Published: 20 April 2009

Abstract

Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages' outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web, the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol supported jointly by major search engines to help content creators and search engines unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how "classic" discovery crawling compares with Sitemaps, in key measures such as coverage and freshness over key representative websites as well as over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well, and offer different tradeoffs.

    Information & Contributors

    Information

    Published In

    WWW '09: Proceedings of the 18th international conference on World wide web
    April 2009
    1280 pages
    ISBN: 9781605584874
    DOI: 10.1145/1526709

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. crawling
    2. metrics
    3. quality
    4. search engines
    5. sitemaps

    Qualifiers

    • Research-article

    Conference

    WWW '09

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 12
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 22 Dec 2024

    Citations

    Cited By

    • (2023) Improving the Exploration/Exploitation Trade-Off in Web Content Discovery. Companion Proceedings of the ACM Web Conference 2023, pp. 1183-1189. DOI: 10.1145/3543873.3587574. Online publication date: 30-Apr-2023.
    • (2022) Investigating bloom filters for web archives' holdings. Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries, pp. 1-10. DOI: 10.1145/3529372.3530934. Online publication date: 20-Jun-2022.
    • (2022) Characterizing "permanently dead" links on Wikipedia. Proceedings of the 22nd ACM Internet Measurement Conference, pp. 388-394. DOI: 10.1145/3517745.3561451. Online publication date: 25-Oct-2022.
    • (2016) Accessing Online Data: Web-Crawling and Information-Scraping Techniques to Automate the Assembly of Research Data. Journal of Business Logistics, 37(1), pp. 34-42. DOI: 10.1111/jbl.12120. Online publication date: 22-Mar-2016.
    • (2016) A quantitative approach to evaluate Website Archivability using the CLEAR+ method. International Journal on Digital Libraries, 17(2), pp. 119-141. DOI: 10.1007/s00799-015-0144-4. Online publication date: 1-Jun-2016.
    • (2015) Scalability Challenges in Web Search Engines. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7(6), pp. 1-138. DOI: 10.2200/S00662ED1V01Y201508ICR045. Online publication date: 29-Dec-2015.
    • (2015) Considerations on the functions and importance of a web crawler. 2015 7th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), pp. Y-17 to Y-22. DOI: 10.1109/ECAI.2015.7301171. Online publication date: Jun-2015.
    • (2015) Efficient watcher based web crawler design. Aslib Journal of Information Management, 67(6), pp. 663-686. DOI: 10.1108/AJIM-02-2015-0019. Online publication date: 16-Nov-2015.
    • (2014) Crawling the page flipping links. International Conference on Information Communication and Embedded Systems (ICICES2014), pp. 1-6. DOI: 10.1109/ICICES.2014.7033885. Online publication date: Feb-2014.
    • (2014) Reliable data delivery in MANETs using VDVH framework. International Conference on Information Communication and Embedded Systems (ICICES2014), pp. 1-6. DOI: 10.1109/ICICES.2014.7033858. Online publication date: Feb-2014.
