Abstract
Traditional Web information retrieval methods can only return a ranked list of Web pages and only allow search terms in the query. In contrast, we have developed a novel learning framework for retrieving precise information blocks from Web pages given a query, which may contain search terms as well as prior information such as the layout format of the data. This problem involves two challenging sub-tasks. The first is information block detection, in which a Web page is automatically segmented into blocks. The second is finding the information blocks relevant to the query. Existing page segmentation methods, which use only visual layout information or only content information, do not consider the query, so the resulting segmentation may conflict with the information need expressed by the query. Our framework models the query and the block features via a probabilistic graphical model to capture both keyword information and prior information. Fisher Kernel, which can effectively incorporate the graphical model, is then employed to accomplish the two sub-tasks in a unified manner, optimizing the final goal of block retrieval performance. We have conducted experiments on benchmark datasets and real-world data. Comparisons with existing methods have been conducted to evaluate the effectiveness of our framework.
Notes
Details of the DOM can be found at http://www.w3.org/DOM/
VIPS can be obtained from http://www.cad.zju.edu.cn/home/dengcai/VIPS/VIPS.html
Acknowledgements
The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. 14203414 and Project No. UGC/FDS11/E06/14).
Cite this article
Wong, TL., Xie, H., Lam, W. et al. A learning framework for information block search based on probabilistic graphical models and Fisher Kernel. Int. J. Mach. Learn. & Cyber. 9, 1473–1487 (2018). https://doi.org/10.1007/s13042-017-0657-9
DOI: https://doi.org/10.1007/s13042-017-0657-9