poster

Graph based crawler seed selection

Authors:

Shuyi Zheng,

Pavel Dmitriev,

C. Lee Giles

WWW '09: Proceedings of the 18th international conference on World wide web

Pages 1089 - 1090

https://doi.org/10.1145/1526709.1526870

Published: 20 April 2009

Abstract

This paper identifies and explores the problem of seed selection in a web-scale crawler. We argue that seed selection is not trivial and is very important. Selecting proper seeds can increase the number of pages a crawler discovers and can yield a collection with more "good" and fewer "bad" pages. Based on an analysis of the graph structure of the web, we propose several seed selection algorithms. Our experimental results on real web data demonstrate the effectiveness of these algorithms.
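
The abstract does not spell out the proposed algorithms, so the following is only an illustrative sketch of one plausible graph-based approach: a greedy coverage heuristic that repeatedly picks the page whose bounded-depth neighborhood adds the most not-yet-covered pages. The function names (`reachable`, `greedy_seeds`), the `max_depth` parameter, and the toy graph are all assumptions made for illustration and are not taken from the paper.

```python
from collections import deque


def reachable(graph, start, max_depth):
    """Return the set of pages reachable from `start` within `max_depth` hops (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def greedy_seeds(graph, k, max_depth=2):
    """Pick up to k seeds greedily by marginal coverage.

    Hypothetical helper: only pages that appear as keys of `graph`
    are considered as seed candidates. Ties are broken alphabetically.
    """
    cover = {page: reachable(graph, page, max_depth) for page in graph}
    covered, seeds = set(), []
    for _ in range(k):
        best = max(sorted(cover), key=lambda p: len(cover[p] - covered), default=None)
        if best is None or not (cover[best] - covered):
            break  # no candidate adds any new page
        seeds.append(best)
        covered |= cover[best]
    return seeds, covered


# Toy link graph (assumed data): page -> list of pages it links to.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["e"],
    "e": ["d"],
    "f": ["a"],
}

seeds, covered = greedy_seeds(toy_web, k=2, max_depth=2)
print(seeds, sorted(covered))
```

On this toy graph the heuristic first picks `f`, which reaches four pages within two hops, and then `d`, which covers the remaining two-page component. A real crawler would also need to weigh page quality, which this sketch ignores.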

Index Terms

Graph based crawler seed selection
1. Information systems
  1. Information retrieval

Recommendations

Graph-based seed selection for web-scale crawlers
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

One of the most important steps in web crawling is determining the starting points, or seed selection. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is not a trivial but ...

Web graph analyzer tool
valuetools '06: Proceedings of the 1st international conference on Performance evaluation methodolgies and tools

We present the software tool "Web Graph Analyzer". This tool is designed to perform a comprehensive analysis of the Web Graph structure. By Web Graph we mean a graph whose vertices are Web pages and whose edges are hyper-links. With the help of the Web ...

A Web Mining Architectural Model of Distributed Crawler for Internet Searches Using PageRank Algorithm
APSCC '08: Proceedings of the 2008 IEEE Asia-Pacific Services Computing Conference

As the World Wide Web is growing rapidly and data in the present day scenario is stored in a distributed manner. The need to develop a search engine based architectural model for people to search through the Web. Broad web search engines as well as many ...

Information & Contributors

Information

Published In

WWW '09: Proceedings of the 18th international conference on World wide web

April 2009

1280 pages

ISBN:9781605584874

DOI:10.1145/1526709

General Chairs: Juan Quemada (DIT-UPM), Gonzalo León (DIT-UPM)
Program Chairs: Yoelle Maarek (Google Inc., Israel), Wolfgang Nejdl (L3S and Hannover University)

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

WWW '09

WWW '09: The 18th International World Wide Web Conference

April 20 - 24, 2009

Madrid, Spain

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 12
Total Downloads: 310
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 07 Dec 2024

Citations

Cited By

ALANOĞLU Z, AKCAYOL M (2023). Web Tarayıcılarında Tohum URL Seçimi ve Performans Analizi: Kapsamlı Bir İnceleme [Seed URL Selection and Performance Analysis in Web Crawlers: A Comprehensive Review]. Düzce Üniversitesi Bilim ve Teknoloji Dergisi, 11:3 (1399-1423). Online publication date: 31-Jul-2023
https://doi.org/10.29130/dubited.1097123

Yoon Y, Kim Y (2020). Gene-Similarity Normalization in a Genetic Algorithm for the Maximum k-Coverage Problem. Mathematics, 8:4 (513). Online publication date: 2-Apr-2020
https://doi.org/10.3390/math8040513

Zowalla R, Wetter T, Pfeifer D (2020). Crawling the German Health Web: Exploratory Study and Graph Analysis. Journal of Medical Internet Research, 22:7 (e17853). Online publication date: 24-Jul-2020
https://doi.org/10.2196/17853

Nwala A, Weigle M, Nelson M, Downie S, Zhu Q, Hahn J (2019). Using micro-collections in social media to generate seeds for web archive collections. Proceedings of the 18th Joint Conference on Digital Libraries (251-260). Online publication date: 2-Jun-2019
https://dl.acm.org/doi/10.1109/JCDL.2019.00042

Nwala A, Weigle M, Nelson M, Lee D, Sastry N, Weber I (2018). Bootstrapping Web Archive Collections from Social Media. Proceedings of the 29th on Hypertext and Social Media (64-72). Online publication date: 3-Jul-2018
https://dl.acm.org/doi/10.1145/3209542.3209560

Nwala A, Weigle M, Nelson M, Chen J, Gonçalves M, Allen J, Fox E, Kan M, Petras V (2018). Scraping SERPs for Archival Seeds. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (263-272). Online publication date: 23-May-2018
https://dl.acm.org/doi/10.1145/3197026.3197056

Saleh A, Abulwafa A, Al Rahmawy M (2017). A web page distillation strategy for efficient focused crawling based on optimized Naïve Bayes (ONB) classifier. Applied Soft Computing, 53:C (181-204). Online publication date: 1-Apr-2017
https://dl.acm.org/doi/10.1016/j.asoc.2016.12.028

Meusel R, Mika P, Blanco R, Li J, Wang X, Garofalakis M, Soboroff I, Suel T, Wang M (2014). Focused Crawling for Structured Data. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (1039-1048). Online publication date: 3-Nov-2014
https://dl.acm.org/doi/10.1145/2661829.2661902

Priyatam P, Dubey A, Perumal K, Praneeth S, Kakadia D, Varma V, Chung C, Broder A, Shim K, Suel T (2014). Seed selection for domain-specific search. Proceedings of the 23rd International Conference on World Wide Web (923-928). Online publication date: 7-Apr-2014
https://dl.acm.org/doi/10.1145/2567948.2579216

Prasath R, Oztürk P (2011). Finding potential seeds through rank aggregation of web searches. Proceedings of the 4th international conference on Pattern recognition and machine intelligence (227-234). Online publication date: 27-Jun-2011
https://dl.acm.org/doi/10.5555/2026851.2026894
