DOI: 10.1145/1526709.1526870
poster

Graph based crawler seed selection

Published: 20 April 2009

Abstract

This paper identifies and explores the problem of seed selection in a web-scale crawler. We argue that seed selection is a nontrivial and very important problem. Selecting proper seeds can increase the number of pages a crawler discovers and can result in a collection with more "good" and fewer "bad" pages. Based on an analysis of the graph structure of the web, we propose several seed selection algorithms. Our experimental results on real web data demonstrate the effectiveness of these algorithms.

References

[1] D. Hochbaum and A. Pathria. Analysis of the Greedy Approach in Problems of Maximum k-Coverage. Naval Research Logistics, 45(6):615-627, 1998.
[2] G. Pant, P. Srinivasan, and F. Menczer. Crawling the Web. Web Dynamics, pages 153-178, 2004.

    Published In

    WWW '09: Proceedings of the 18th international conference on World wide web
    April 2009
    1280 pages
    ISBN: 9781605584874
    DOI: 10.1145/1526709

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. crawler
    2. graph analysis
    3. pagerank
    4. seed selection

    Qualifiers

    • Poster

    Conference

    WWW '09

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Cited By

    • (2023) Web Tarayıcılarında Tohum URL Seçimi ve Performans Analizi: Kapsamlı Bir İnceleme [Seed URL Selection and Performance Analysis in Web Crawlers: A Comprehensive Review]. Düzce Üniversitesi Bilim ve Teknoloji Dergisi, 11:3 (1399-1423). DOI: 10.29130/dubited.1097123. Online publication date: 31-Jul-2023
    • (2020) Gene-Similarity Normalization in a Genetic Algorithm for the Maximum k-Coverage Problem. Mathematics, 8:4 (513). DOI: 10.3390/math8040513. Online publication date: 2-Apr-2020
    • (2020) Crawling the German Health Web: Exploratory Study and Graph Analysis. Journal of Medical Internet Research, 22:7 (e17853). DOI: 10.2196/17853. Online publication date: 24-Jul-2020
    • (2019) Using micro-collections in social media to generate seeds for web archive collections. Proceedings of the 18th Joint Conference on Digital Libraries (251-260). DOI: 10.1109/JCDL.2019.00042. Online publication date: 2-Jun-2019
    • (2018) Bootstrapping Web Archive Collections from Social Media. Proceedings of the 29th on Hypertext and Social Media (64-72). DOI: 10.1145/3209542.3209560. Online publication date: 3-Jul-2018
    • (2018) Scraping SERPs for Archival Seeds. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (263-272). DOI: 10.1145/3197026.3197056. Online publication date: 23-May-2018
    • (2017) A web page distillation strategy for efficient focused crawling based on optimized Naïve Bayes (ONB) classifier. Applied Soft Computing, 53:C (181-204). DOI: 10.1016/j.asoc.2016.12.028. Online publication date: 1-Apr-2017
    • (2014) Focused Crawling for Structured Data. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (1039-1048). DOI: 10.1145/2661829.2661902. Online publication date: 3-Nov-2014
    • (2014) Seed selection for domain-specific search. Proceedings of the 23rd International Conference on World Wide Web (923-928). DOI: 10.1145/2567948.2579216. Online publication date: 7-Apr-2014
    • (2011) Finding potential seeds through rank aggregation of web searches. Proceedings of the 4th international conference on Pattern recognition and machine intelligence (227-234). DOI: 10.5555/2026851.2026894. Online publication date: 27-Jun-2011
