Focused crawling for the hidden web

Panagiotis Liakos¹,
Alexandros Ntoulas¹,²,
Alexandros Labrinidis³ &
…
Alex Delis¹

Abstract

A constantly growing amount of high-quality information resides in databases and is guarded behind forms that users fill out and submit. The Hidden Web comprises all these information sources, which conventional web crawlers are incapable of discovering. To excavate and make available meaningful data from the Hidden Web, previous work has focused on query generation techniques that aim to download all the content of a given Hidden Web site at minimum cost. However, there are circumstances where only a specific part of such a site is of interest. For example, a politics portal should not have to waste bandwidth or processing power retrieving sports articles just because they reside in databases that also contain documents relevant to politics. In such cases, we need to make the best use of our resources by downloading only the portion of the Hidden Web site that we are interested in. We investigate how to build a focused Hidden Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset related to the corresponding area. To this end, we present an approach that progresses iteratively and analyzes the returned results in order to extract terms that capture the essence of the topic of interest. We propose a number of different crawling policies and evaluate them experimentally with data from four popular sites. In all cases, our approach downloads most of the content sought using significantly fewer queries than existing approaches.
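
The iterative loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `SITE`, the `search` form, the overlap-based `is_relevant` check, and the "most frequent on-topic term" selection rule are all assumptions made for illustration, standing in for a real search interface, a real topic classifier, and the crawling policies the paper proposes.

```python
from collections import Counter

# Toy stand-in for a Hidden Web site (hypothetical data): documents are
# reachable only through the keyword search function below.
SITE = {
    1: "election senate vote policy",
    2: "senate budget policy debate",
    3: "football match goal league",
    4: "parliament vote debate coalition",
    5: "league transfer goal striker",
    6: "coalition budget minister parliament",
}


def search(term):
    """Hypothetical search form: ids of documents containing the term."""
    return {doc_id for doc_id, text in SITE.items() if term in text.split()}


def is_relevant(text, topic_words):
    """Crude relevance test (an assumption): any overlap with topic words."""
    return bool(set(text.split()) & topic_words)


def focused_crawl(seed, topic_words, max_queries):
    downloaded = {}          # doc_id -> text of on-topic pages kept so far
    term_counts = Counter()  # term frequencies over on-topic pages only
    issued = []              # queries sent to the site, in order
    term = seed
    while term is not None and len(issued) < max_queries:
        issued.append(term)
        for doc_id in search(term) - set(downloaded):
            text = SITE[doc_id]
            if is_relevant(text, topic_words):
                downloaded[doc_id] = text
                term_counts.update(text.split())
        # Assumed policy: next query is the most frequent on-topic term
        # that has not been issued yet (ties broken alphabetically).
        candidates = [(n, t) for t, n in term_counts.items() if t not in issued]
        term = max(candidates)[1] if candidates else None
    return downloaded, issued
```

Calling `focused_crawl("senate", {"senate", "vote", "policy", "parliament"}, max_queries=6)` on this toy site retrieves the four politics documents while the two sports documents are never fetched, because candidate terms are drawn only from pages already judged on-topic.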

Notes

www.informatik.uni-trier.de/~ley/db/
www.imdb.com/?
http://www.dmoz.org
http://stackexchange.com/
The New York Times Annotated Corpus, Linguistic Data Consortium, Philadelphia, http://catalog.ldc.upenn.edu/LDC2008T19
http://istc-bigdata.org/index.php/our-research-data-sets/

References

Álvarez, M., Raposo, J., Pan, A., Cacheda, F., Bellas, F., Carneiro, V.: Deepbot: A focused crawler for accessing hidden web content. In: Proceedings of the 3rd International Workshop on Data Engineering Issues in E-commerce and Services (EC), pp. 18–25, San Diego (2007)
Barbosa, L., Freire, J.: Siphoning hidden-web data through keyword-based interfaces. In: SBBD, pp. 309–321. Distrito Federal, Brasil (2004)
Barbosa, L., Freire, J.: Searching for hidden-web databases. In: Proceedings of the 8th International WebDB, pp. 1–6, Baltimore (2005)
Barbosa, L., Freire, J.: An adaptive crawler for locating hidden-web entry points. In: Proceedings of the 16th International Conference on World Wide Web (WWW), pp. 441–450. Banff, Canada (2007)
Bergholz, A., Chidlovskii, B.: Crawling for domain-specific hidden web resources. In: Proceedings of the 4th International Conference on Web Information Systems Engineering (WISE), pp. 125–133, Roma (2003)
Bergman, M.K.: The deep web: Surfacing hidden value. J. Electron. Publ. 7(1), 1–17 (2001)
Cafarella, M.J., Madhavan, J., Halevy, A.: Web-scale extraction of structured data. SIGMOD Rec. 37(4), 55–61 (2009)
Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: A new approach to topic-specific web resource discovery. In: Proceedings of the 8th International Conference on World Wide Web (WWW), pp. 1623–1640, Toronto (1999)
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L., Gori, M.: Focused crawling using context graphs. In: Proceedings of the 26th International Conference on Very Large Data Bases (VLDB), pp. 527–534, Cairo (2000)
Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2–3), 131–163 (1997)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
He, B., Patel, M., Zhang, Z., Chang, K.C.-C.: Accessing the deep web: A survey. Commun. ACM 50(5), 94–101 (2007)
Ipeirotis, P.G., Gravano, L.: Distributed search over the hidden web: Hierarchical database sampling and selection. In: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), pp. 394–405, Hong Kong (2002)
Ipeirotis, P.G., Gravano, L., Sahami, M.: Probe, count, and classify: Categorizing hidden web databases. SIGMOD Rec. 30, 67–78 (2001)
Liakos, P., Ntoulas, A.: Topic-sensitive hidden-web crawling. In: Proceedings of the 13th International Conference on Web Information Systems Engineering (WISE), pp. 538–551, Paphos (2012)
Lim, T.-S., Loh, W.-Y., Shih, Y.-S.: A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach. Learn. 40(3), 203–228 (2000)
Lu, J., Wang, Y., Liang, J., Chen, J., Liu, J.: An approach to deep web crawling by sampling. In: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 718–724, New South Wales (2008)
Madhavan, J., Ko, D., Kot, Ł., Ganapathy, V., Rasmussen, A., Halevy, A.: Google’s deep web crawl. Proc. VLDB Endow. 1(2), 1241–1252 (2008)
McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action, 2nd edn. Manning Publications Co., Greenwich (2010)
Noh, S., Choi, Y., Seo, H., Choi, K., Jung, G.: An intelligent topic-specific crawler using degree of relevance. In: IDEAL, volume 3177 of Lecture Notes in Computer Science, pp. 491–498 (2004)
Ntoulas, A., Zerfos, P., Cho, J.: Downloading textual hidden web content through keyword queries. In: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), pp. 100–109, Denver (2005)
Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods, pp. 185–208. MIT Press, Cambridge (1999)
Raghavan, S., Garcia-Molina, H.: Crawling the hidden web. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma (2001)
Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc., New York (1986)
Schonhofen, P.: Identifying document topics using the Wikipedia category network. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 456–462, Hong Kong (2006)
Wang, Y., Lu, J., Chen, J.: Crawling deep web using a new set covering algorithm. In: Proceedings of the 5th International Conference on Advanced Data Mining and Applications (ADMA), pp. 326–337, Beijing (2009)
Wu, P., Wen, J.-R., Liu, H., Ma, W.-Y.: Query selection techniques for efficient crawling of structured web sources, p. 47, Atlanta (2006)
Wu, W., Yu, C., Doan, A., Meng, W.: An interactive clustering-based approach to integrating source query interfaces on the deep web. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 95–106, Paris (2004)
Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P., Koudas, N., Papadias, D.: Query by document. In: Proceedings of the 2nd ACM International Conference on Web Search and Data Mining (WSDM), pp. 34–43, Barcelona (2009)
Zhang, Z., He, B., Chang, K.C.-C.: Understanding web query interfaces: Best-effort parsing with hidden syntax. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 107–118, Paris (2004)

Acknowledgements

This work has been partially supported by the SocWeb and Sucre FP7 EU projects. A preliminary version of this work appeared in the Proceedings of the 13th International Conference on Web Information Systems Engineering [15].

Author information

Authors and Affiliations

University of Athens, 15784, Athens, Greece
Panagiotis Liakos, Alexandros Ntoulas & Alex Delis
Zynga, San Francisco, CA, 94103, USA
Alexandros Ntoulas
University of Pittsburgh, Pittsburgh, PA, 15260, USA
Alexandros Labrinidis

Corresponding author

Correspondence to Panagiotis Liakos.

About this article

Cite this article

Liakos, P., Ntoulas, A., Labrinidis, A. et al. Focused crawling for the hidden web. World Wide Web 19, 605–631 (2016). https://doi.org/10.1007/s11280-015-0349-x

Received: 24 June 2014
Revised: 19 February 2015
Accepted: 16 April 2015
Published: 21 May 2015
Issue Date: July 2016
DOI: https://doi.org/10.1007/s11280-015-0349-x
