Abstract
Web blocks such as navigation menus, advertisements, headers, and footers are key components of Web pages: they define not only a page's appearance but also how humans interact with its different parts. For machines, however, classifying and interacting with these blocks is a surprisingly hard task. Yet Web block classification has varied applications in wrapper induction, assistance to visually impaired people, Web adaptation, Web page topic clustering, and Web search. Our system for Web block classification, BERyL, classifies Web blocks automatically by combining machine learning with declarative, model-driven feature extraction based on Datalog rules. BERyL uses refined feature sets for individual block types, with some features tuned to a specific block type, and these carefully selected features yield accurate classification for all the block types we have observed so far. At the same time, BERyL avoids the high cost of feature engineering by extracting features through a model-driven rather than programmatic approach. This not only reduces the time spent on feature engineering; the declarative approach also allows for semi-automatic optimisation of the feature extraction system. We validate these claims through an evaluation on a selected range of Web blocks.
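The combination described in the abstract, declaratively specified features feeding a learned classifier with a feature set chosen per block type, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Block` fields, the feature names, the thresholds, and the nearest-centroid classifier are hypothetical, and plain Python rule tables stand in for Datalog rules. It does not reproduce the system's actual rules or learning algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical, simplified view of a rendered Web block."""
    top: float        # vertical offset as a fraction of page height (0..1)
    height: float     # block height as a fraction of page height (0..1)
    link_count: int
    word_count: int


# Features are declared as named rules (data), not hand-written extraction
# code. Names and thresholds are illustrative assumptions.
FEATURE_RULES: Dict[str, Callable[[Block], float]] = {
    "link_density": lambda b: b.link_count / max(b.word_count, 1),
    "near_top": lambda b: 1.0 if b.top < 0.15 else 0.0,
    "near_bottom": lambda b: 1.0 if b.top + b.height > 0.9 else 0.0,
}

# Each block type selects its own feature set (also an assumption).
FEATURES_BY_TYPE: Dict[str, List[str]] = {
    "navigation_menu": ["link_density", "near_top"],
    "footer": ["link_density", "near_bottom"],
}

Model = Tuple[List[float], List[float]]  # (positive centroid, negative centroid)


def extract(block: Block, block_type: str) -> List[float]:
    """Evaluate the rules selected for this block type."""
    return [FEATURE_RULES[name](block) for name in FEATURES_BY_TYPE[block_type]]


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit(block_type: str, labelled: List[Tuple[Block, bool]]) -> Model:
    """Nearest-centroid stand-in for the learning step."""
    positives = [extract(b, block_type) for b, is_type in labelled if is_type]
    negatives = [extract(b, block_type) for b, is_type in labelled if not is_type]
    return centroid(positives), centroid(negatives)


def predict(block_type: str, model: Model, block: Block) -> bool:
    """True if the block lies closer to the positive centroid."""
    vector = extract(block, block_type)

    def squared_distance(center: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(vector, center))

    positive_centroid, negative_centroid = model
    return squared_distance(positive_centroid) < squared_distance(negative_centroid)
```

Adding a block type in this sketch means adding one entry to `FEATURES_BY_TYPE` and, if needed, new rules to `FEATURE_RULES`; the extraction and learning functions stay unchanged, which is the kind of saving in feature-engineering effort that a declarative specification aims for.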
This work was supported by the EPSRC programme grant EP/M025268/1 “VADA: Value Added Data Systems – Principles and Architecture”.
Copyright information
© 2018 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Kravchenko, A. (2018). BERyL: A System for Web Block Classification. In: Gavrilova, M., Tan, C. (eds) Transactions on Computational Science XXXIII. Lecture Notes in Computer Science, vol 10990. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58039-4_4
DOI: https://doi.org/10.1007/978-3-662-58039-4_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58038-7
Online ISBN: 978-3-662-58039-4
eBook Packages: Computer Science (R0)