
research-article

Incorporating site-level knowledge for incremental crawling of web forums: a list-wise strategy

Authors:

Jiang-Ming Yang,

Wei-Ying Ma

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1375 - 1384

https://doi.org/10.1145/1557019.1557166

Published: 28 June 2009

Abstract

In this paper we study the problem of incremental crawling of web forums, a fundamental yet challenging step in many web applications. Traditional approaches mainly focus on scheduling the revisiting strategy of each individual page. However, simply assigning different weights to individual pages is usually inefficient for crawling forum sites, because forum sites have different characteristics from general websites. Instead of treating each page independently, we propose a list-wise strategy that takes site-level knowledge into account. This site-level knowledge is mined by reconstructing the linking structure, called the sitemap, of a given forum site. With the sitemap, posts that belong to the same thread but are distributed over several pages can be concatenated according to their timestamps. Then, for each thread, we employ a regression model to predict the time when the next post arrives. Based on this model, we develop an efficient crawler that is 260% faster than some state-of-the-art methods in terms of fetching newly generated content, while also ensuring a high coverage ratio. Experimental results on 18 different forums show promising performance of our crawler in terms of Coverage, Bandwidth utilization, and Timeliness.
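The per-thread idea in the abstract (concatenate a thread's posts by timestamp, then regress to estimate when the next post arrives) can be illustrated with a minimal sketch. The abstract does not state the features or the form of the regression model, so everything below is an assumption made for illustration: a plain least-squares line fitted to the gaps between consecutive posts, and the hypothetical helper names `merge_thread_posts`, `predict_next_post`, and `revisit_order`. It is not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple


def merge_thread_posts(pages: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the post timestamps of all pages of one thread, in time order."""
    return sorted(t for page in pages for t in page)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # A single data point: fall back to a constant model.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def predict_next_post(timestamps: Sequence[float]) -> float:
    """Estimate the arrival time of the next post in a thread.

    Assumption: the gap between consecutive posts is modelled as a linear
    function of the gap index. The paper's actual model may differ.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two posts to estimate a gap")
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    a, b = fit_line(list(range(len(gaps))), gaps)
    next_gap = max(a + b * len(gaps), 0.0)  # never predict a time in the past
    return timestamps[-1] + next_gap


def revisit_order(threads: Dict[str, Sequence[Sequence[float]]]) -> List[str]:
    """Order thread ids by predicted next-post time, soonest first."""
    return sorted(
        threads,
        key=lambda tid: predict_next_post(merge_thread_posts(threads[tid])),
    )


if __name__ == "__main__":
    # Hypothetical data: timestamps (in hours) of posts, grouped by thread page.
    forum = {
        "slowing-thread": [[0, 10], [30, 60]],
        "steady-thread": [[0, 5, 10, 15]],
    }
    for tid in revisit_order(forum):
        print(tid, predict_next_post(merge_thread_posts(forum[tid])))
```

Scheduling at the thread level rather than the page level is what makes the sketch list-wise: one prediction covers every page of a thread, so a crawler would not need to weight each page separately.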



Cited By

Koloveas P, Chantzios T, Alevizopoulou S, Skiadopoulos S, Tryfonopoulos C (2021). inTIME: A Machine Learning-Based Framework for Gathering and Leveraging Web Data to Cyber-Threat Intelligence. Electronics 10:7 (818). Online publication date: 30-Mar-2021.
https://doi.org/10.3390/electronics10070818
Koloveas P, Chantzios T, Tryfonopoulos C, Skiadopoulos S (2019). A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence. 2019 IEEE World Congress on Services (SERVICES) (3-8). Online publication date: Jul-2019.
https://doi.org/10.1109/SERVICES.2019.00016
Pavkovic M, Protic J (2019). SInFo – Structure-Driven Incremental Forum Crawler That Optimizes User-Generated Content Retrieval. IEEE Access 7 (126941-126961). Online publication date: 2019.
https://doi.org/10.1109/ACCESS.2019.2939872

Index Terms

Incorporating site-level knowledge for incremental crawling of web forums: a list-wise strategy
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
  2. Information systems applications
    1. Data mining
      1. Clustering



Information & Contributors

Information

Published In


KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

June 2009

1426 pages

ISBN:9781605584959

DOI:10.1145/1557019

General Chairs: John Elder (Elder Research, Inc., USA), Françoise Soulié Fogelman (KXEN, France)

Program Chairs: Peter Flach (University of Bristol, UK), Mohammed Zaki (RPI, USA)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

KDD09

KDD09: The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

June 28 - July 1, 2009

Paris, France

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%



Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 13
Total Downloads: 619
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 09 Jan 2025

Citations

Cited By

Lanin V (2018). Tools for Internet Competitive Intelligence Based on Ontology. 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT) (1-4). Online publication date: Oct-2018.
https://doi.org/10.1109/ICAICT.2018.8747159
Liu W, Yan H, Xiao J (2018). Automatically extracting user reviews from forum sites. Computers & Mathematics with Applications 62:7 (2779-2792). Online publication date: 31-Dec-2018.
https://dl.acm.org/doi/10.1016/j.camwa.2011.07.044
Noerlina, Wulandhari L, Sasmoko, Muqsith A, Alamsyah M (2017). Corruption Cases Mapping Based on Indonesia’s Corruption Perception Index. Journal of Physics: Conference Series 801 (012019). Online publication date: 23-Mar-2017.
https://doi.org/10.1088/1742-6596/801/1/012019
Pirnau M (2015). Tool for monitoring Web sites for emergency-related posts and post analysis. 2015 International Conference on Speech Technology and Human-Computer Dialogue (SpeD) (1-6). Online publication date: Oct-2015.
https://doi.org/10.1109/SPED.2015.7343102
Pirnau M (2015). Considerations on the functions and importance of a web crawler. 2015 7th International Conference on Electronics, Computers and Artificial Intelligence (ECAI) (Y-17-Y-22). Online publication date: Jun-2015.
https://doi.org/10.1109/ECAI.2015.7301171
Park S, Lee Y (2014). Implementation of a distributed web community crawler. The 16th Asia-Pacific Network Operations and Management Symposium (1-6). Online publication date: Sep-2014.
https://doi.org/10.1109/APNOMS.2014.6996586
Węckowski D (2013). Crawling Data-Intensive Web Sources Using Structure Information. Business Information Systems Workshops (196-207). Online publication date: 2013.
https://doi.org/10.1007/978-3-642-41687-3_19
