research-article

Sitemaps: above and beyond the crawl of duty

Authors:

Uri Schonfeld, Narayanan Shivakumar

WWW '09: Proceedings of the 18th international conference on World wide web

Pages 991 - 1000

https://doi.org/10.1145/1526709.1526842

Published: 20 April 2009

Abstract

Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages' outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web, the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol supported jointly by major search engines to help content creators and search engines unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how "classic" discovery crawling compares with Sitemaps, in key measures such as coverage and freshness over key representative websites as well as over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well, and offer different tradeoffs.
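To make the comparison in the abstract concrete, the following is a minimal sketch, not the authors' measurement pipeline. It reads the URLs a site lists in a Sitemap and sets them against URLs a link-following crawler found. The sample Sitemap, the `DISCOVERED` set, and the helper names `parse_sitemap` and `compare_sources` are all hypothetical; only the XML namespace and the `urlset`/`url`/`loc`/`lastmod` element names come from the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical Sitemap for an illustrative site; not data from the paper.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2009-04-20</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return {url: lastmod or None} for every <url> entry in a Sitemap."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue  # skip malformed entries without a location
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries[loc] = lastmod.strip() if lastmod else None
    return entries


def compare_sources(sitemap_urls, discovered_urls):
    """Split URLs into those seen by both sources and those unique to each."""
    sitemap_urls, discovered_urls = set(sitemap_urls), set(discovered_urls)
    return {
        "both": sitemap_urls & discovered_urls,
        "sitemap_only": sitemap_urls - discovered_urls,
        "discovery_only": discovered_urls - sitemap_urls,
    }


# Hypothetical set of URLs a link-following crawler found on the same site.
DISCOVERED = {"http://www.example.com/", "http://www.example.com/about"}

if __name__ == "__main__":
    result = compare_sources(parse_sitemap(SAMPLE_SITEMAP), DISCOVERED)
    for source, urls in result.items():
        print(source, sorted(urls))
```

In this toy setup the form-backed catalog URL appears only in the Sitemap, while the linked "about" page appears only through discovery, which mirrors the complementary behaviour the abstract describes. The set-difference view used here is an illustrative assumption and should not be read as the paper's definition of coverage or freshness.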

Index Terms

Sitemaps: above and beyond the crawl of duty
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

WWW '09: Proceedings of the 18th international conference on World wide web

April 2009

1280 pages

ISBN: 9781605584874

DOI: 10.1145/1526709

General Chairs: Juan Quemada (DIT-UPM), Gonzalo León (DIT-UPM)

Program Chairs: Yoelle Maarek (Google Inc., Israel), Wolfgang Nejdl (L3S and Hannover University)

Copyright © 2009 IW3C2 org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW '09

WWW '09: The 18th International World Wide Web Conference

April 20 - 24, 2009

Madrid, Spain

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

Total Citations: 14
Total Downloads: 512
Downloads (Last 12 months): 12
Downloads (Last 6 weeks): 0

Reflects downloads up to 22 Dec 2024

Cited By

Schulam P, Muslea I (2023). Improving the Exploration/Exploitation Trade-Off in Web Content Discovery. Companion Proceedings of the ACM Web Conference 2023, pp. 1183-1189. Online publication date: 30-Apr-2023.
https://dl.acm.org/doi/10.1145/3543873.3587574
Klein M, Balakireva L, Holub K, Celjak D, Rudomino I, Aizawa A, Mandl T, Carevic Z, Hinze A, Mayr P, Schaer P (2022). Investigating bloom filters for web archives' holdings. Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries, pp. 1-10. Online publication date: 20-Jun-2022.
https://dl.acm.org/doi/10.1145/3529372.3530934
Nyayachavadi A, Zhu J, Madhyastha H, Barakat C, Pelsser C, Benson T, Choffnes D (2022). Characterizing "permanently dead" links on Wikipedia. Proceedings of the 22nd ACM Internet Measurement Conference, pp. 388-394. Online publication date: 25-Oct-2022.
https://dl.acm.org/doi/10.1145/3517745.3561451
Massimino B (2016). Accessing Online Data: Web-Crawling and Information-Scraping Techniques to Automate the Assembly of Research Data. Journal of Business Logistics 37:1, pp. 34-42. Online publication date: 22-Mar-2016.
https://doi.org/10.1111/jbl.12120
Banos V, Manolopoulos Y (2016). A quantitative approach to evaluate Website Archivability using the CLEAR+ method. International Journal on Digital Libraries 17:2, pp. 119-141. Online publication date: 1-Jun-2016.
https://dl.acm.org/doi/10.1007/s00799-015-0144-4
Cambazoglu B, Baeza-Yates R (2015). Scalability Challenges in Web Search Engines. Synthesis Lectures on Information Concepts, Retrieval, and Services 7:6, pp. 1-138. Online publication date: 29-Dec-2015.
https://doi.org/10.2200/S00662ED1V01Y201508ICR045
Pirnau M (2015). Considerations on the functions and importance of a web crawler. 2015 7th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), pp. Y-17-Y-22. Online publication date: Jun-2015.
https://doi.org/10.1109/ECAI.2015.7301171
Alqaraleh S, Ramadan O, Salamah M (2015). Efficient watcher based web crawler design. Aslib Journal of Information Management 67:6, pp. 663-686. Online publication date: 16-Nov-2015.
https://doi.org/10.1108/AJIM-02-2015-0019
Priya K, Dhanalakshmi S (2014). Crawling the page flipping links. International Conference on Information Communication and Embedded Systems (ICICES2014), pp. 1-6. Online publication date: Feb-2014.
https://doi.org/10.1109/ICICES.2014.7033885
Priya S, Gopinathan B (2014). Reliable data delivery in MANETs using VDVH framework. International Conference on Information Communication and Embedded Systems (ICICES2014), pp. 1-6. Online publication date: Feb-2014.
https://doi.org/10.1109/ICICES.2014.7033858
