DOI: 10.1145/1645953.1646277
poster

Graph-based seed selection for web-scale crawlers

Published: 02 November 2009

Abstract

One of the most important steps in web crawling is seed selection: determining the starting points of the crawl. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is a non-trivial and very important problem. Selecting proper seeds can increase the number of pages a crawler discovers and can yield a repository with more "good" and fewer "bad" pages. We propose a graph-based framework for crawler seed selection and present several algorithms within this framework. Evaluation on real web data showed significant improvements over heuristic seed selection approaches.
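
The abstract does not describe the proposed algorithms, so the following is only a rough illustration of what graph-based seed selection can look like, not the paper's method. It assumes the web is given as an adjacency map of out-links and greedily picks the page that adds the most not-yet-covered pages within a fixed number of link hops (a greedy maximum-coverage heuristic). The names `reachable`, `greedy_seeds`, `toy_graph`, and the `max_depth` parameter are hypothetical and chosen for this sketch.

```python
from collections import deque


def reachable(graph, start, max_depth):
    """Return the pages reachable from `start` within `max_depth` link hops (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not follow links past the hop limit
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def greedy_seeds(graph, k, max_depth=2):
    """Pick up to k seeds, each time taking the page that covers the most new pages."""
    # Assumption: only pages that appear as keys of `graph` are seed candidates.
    coverage = {page: reachable(graph, page, max_depth) for page in graph}
    covered, seeds = set(), []
    for _ in range(k):
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(coverage), key=lambda p: len(coverage[p] - covered), default=None)
        if best is None or not coverage[best] - covered:
            break  # no candidate adds anything new
        seeds.append(best)
        covered |= coverage[best]
    return seeds, covered


# Hypothetical toy link graph: page -> list of pages it links to.
toy_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["e"],
    "e": ["d"],
    "f": [],
}
seeds, covered = greedy_seeds(toy_graph, k=2, max_depth=2)
```

On this toy graph the sketch chooses one seed from each of the two larger link clusters rather than two pages from the same cluster, which is the intuition behind preferring coverage-aware selection over picking pages independently. A real crawler would also need to weigh page quality, which this sketch ignores.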

References

[1] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web. Computer Networks, 33(1-6):309-320, 2000.
[2] D. Hochbaum and A. Pathria. Analysis of the Greedy Approach in Problems of Maximum k-Coverage. Naval Research Logistics, 45(6):615-627, 1998.
[3] G. Pant, P. Srinivasan, and F. Menczer. Crawling the Web. Web Dynamics, pages 153-178, 2004.

    Published In

    CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
    November 2009
    2162 pages
    ISBN: 9781605585123
    DOI: 10.1145/1645953
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. crawler
    2. graph analysis
    3. pagerank
    4. seed selection

    Qualifiers

    • Poster

    Conference

    CIKM '09

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2023) Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması (Effective Seed URL Selection and Scope Extension Algorithm for Web Crawler). International Journal of Advances in Engineering and Pure Sciences, 35:1 (27-38). DOI: 10.7240/jeps.1174193. Online publication date: 30-Mar-2023
    • (2020) Modeling Updates of Scholarly Webpages Using Archived Data. 2020 IEEE International Conference on Big Data (Big Data) (1868-1877). DOI: 10.1109/BigData50022.2020.9377796. Online publication date: 10-Dec-2020
    • (2018) ABC Algorithm for URL Extraction. Current Trends in Web Engineering (143-148). DOI: 10.1007/978-3-319-74433-9_12. Online publication date: 22-Feb-2018
    • (2016) Finding seeds to bootstrap focused crawlers. World Wide Web, 19:3 (449-474). DOI: 10.1007/s11280-015-0331-7. Online publication date: 1-May-2016
    • (2016) Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts. Research and Advanced Technology for Digital Libraries (133-146). DOI: 10.1007/978-3-319-43997-6_11. Online publication date: 10-Aug-2016
    • (2015) Set Cover at Web Scale. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1125-1133). DOI: 10.1145/2783258.2783315. Online publication date: 10-Aug-2015
    • (2015) Considerations on the functions and importance of a web crawler. 2015 7th International Conference on Electronics, Computers and Artificial Intelligence (ECAI) (Y-17-Y-22). DOI: 10.1109/ECAI.2015.7301171. Online publication date: Jun-2015
    • (2012) The evolution of a crawling strategy for an academic document search engine. Proceedings of the 4th Annual ACM Web Science Conference (340-343). DOI: 10.1145/2380718.2380762. Online publication date: 22-Jun-2012
    • (2011) SPRINT. Proceedings of the 14th International Conference on Extending Database Technology (546-549). DOI: 10.1145/1951365.1951437. Online publication date: 21-Mar-2011
    • (2011) The SHARC framework for data quality in Web archiving. The VLDB Journal — The International Journal on Very Large Data Bases, 20:2 (183-207). DOI: 10.1007/s00778-011-0219-9. Online publication date: 1-Apr-2011
