A learning framework for information block search based on probabilistic graphical models and Fisher Kernel

Tak-Lam Wong ORCID: orcid.org/0000-0002-4581-9912¹,
Haoran Xie¹,
Wai Lam² &
…
Fu Lee Wang³

Abstract

Traditional Web information retrieval methods return only a ranked list of Web pages and accept only search terms in the query. In contrast, we have developed a novel learning framework for retrieving precise information blocks from Web pages given a query, which may contain search terms together with prior information such as the layout format of the data. The problem involves two challenging sub-tasks. The first is information block detection, in which a Web page is automatically segmented into blocks. The second is finding the information blocks relevant to the query. Existing page segmentation methods, which use only visual layout information or only content information, do not consider the query and can therefore produce a segmentation that conflicts with the information need expressed by the query. Our framework models the query and the block features via a probabilistic graphical model to capture both keyword information and prior information. A Fisher Kernel, which can effectively incorporate the graphical model, is then employed to accomplish the two sub-tasks in a unified manner, optimizing the final goal of block retrieval performance. We have conducted experiments on benchmark datasets and real-world data, and compared our framework with existing methods to evaluate its effectiveness.
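
To illustrate how a Fisher Kernel can be derived from a generative model, the following minimal sketch uses a toy model of independent Bernoulli features. It is not the graphical model of this article, which the abstract does not specify. The binary features (for example, "block contains a query term" or "block lies inside a table"), the parameter vector `theta`, and the function names are all assumptions made for illustration. The kernel follows the standard definition: the Fisher score is the gradient of the log-likelihood with respect to the parameters, and the kernel is the inner product of two scores weighted by the inverse Fisher information.

```python
from typing import List, Sequence


def fisher_score(x: Sequence[int], theta: Sequence[float]) -> List[float]:
    """Gradient of log P(x | theta) for independent Bernoulli features.

    d/d theta_i [x_i log theta_i + (1 - x_i) log(1 - theta_i)]
        = (x_i - theta_i) / (theta_i * (1 - theta_i))
    """
    return [(xi - t) / (t * (1.0 - t)) for xi, t in zip(x, theta)]


def fisher_kernel(x: Sequence[int], y: Sequence[int],
                  theta: Sequence[float]) -> float:
    """K(x, y) = U_x^T I^{-1} U_y for the toy Bernoulli model.

    The Fisher information of this model is diagonal with entries
    1 / (theta_i * (1 - theta_i)), so its inverse has entries
    theta_i * (1 - theta_i).
    """
    ux = fisher_score(x, theta)
    uy = fisher_score(y, theta)
    return sum(a * b * t * (1.0 - t) for a, b, t in zip(ux, uy, theta))


if __name__ == "__main__":
    # Hypothetical parameters: each feature is present with probability 0.5.
    theta = [0.5, 0.5]
    block_a = [1, 0]  # hypothetical block: first feature on, second off
    block_b = [0, 1]  # hypothetical block: first feature off, second on
    print(fisher_kernel(block_a, block_a, theta))
    print(fisher_kernel(block_a, block_b, theta))
```

In this sketch, two blocks whose features deviate from the model's expectation in the same direction receive a positive kernel value, and blocks that deviate in opposite directions receive a negative one. Such a kernel could then be supplied to a discriminative learner such as a support vector machine.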

Notes

Details of the DOM can be found at http://www.w3.org/DOM/
VIPS can be obtained from http://www.cad.zju.edu.cn/home/dengcai/VIPS/VIPS.html

Acknowledgements

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. 14203414 and Project No. UGC/FDS11/E06/14).

Author information

Authors and Affiliations

Department of Mathematics and Information Technology, The Education University of Hong Kong, Lo Ping Road, Tai Po, N.T., Hong Kong
Tak-Lam Wong & Haoran Xie
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Wai Lam
School of Computing and Information Sciences, Caritas Institute of Higher Education, Tseung Kwan O, N.T., Hong Kong
Fu Lee Wang

Corresponding author

Correspondence to Tak-Lam Wong.

About this article

Cite this article

Wong, TL., Xie, H., Lam, W. et al. A learning framework for information block search based on probabilistic graphical models and Fisher Kernel. Int. J. Mach. Learn. & Cyber. 9, 1473–1487 (2018). https://doi.org/10.1007/s13042-017-0657-9

Received: 03 April 2016
Accepted: 02 March 2017
Published: 28 March 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s13042-017-0657-9
