article

Information extraction for deep web using repetitive subject pattern

Authors:

Wachirawut Thamviset,

Sartra Wongthanavasu

World Wide Web, Volume 17, Issue 5

Pages 1109 - 1139

https://doi.org/10.1007/s11280-013-0248-y

Published: 01 September 2014

Abstract

In this paper, we propose an information extraction (IE) system for extracting data records from semi-structured documents on the Deep Web using a technique called Repetitive Subject Pattern. The technique is based on the hypothesis that each data record in a web page has a subject item, and that the repetitive pattern of the subject items can be used to identify the boundaries of the data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of the data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator very flexible: when the automatic process generates a wrong wrapper, the user can provide a new sample subject string to generate a better one. As a result, the system can operate in both semi-supervised and unsupervised modes. Experiments show that the proposed technique generates very high-quality wrappers, with both recall and precision close to 100% when tested on a number of datasets.
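
The four tasks listed in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation. It is a deliberately simplified stand-in that assumes the "wrapper" is just the tag path of the subject node, and it covers only the semi-supervised case, where the user supplies the sample subject string (automatic subject recognition, task 2, is not attempted). The names `PathCollector`, `parse_page`, `generate_wrapper`, `extract_records`, and the `SAMPLE` page are all hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Task 1 (simplified): flatten the page into (tag_path, text) pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (tag_path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def parse_page(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def generate_wrapper(nodes, subject):
    """Task 3 (simplified): the wrapper is the tag path of the subject node.

    Assumption: the subject string is given by the user and matches a text
    node exactly.
    """
    for path, text in nodes:
        if text == subject:
            return path
    raise ValueError("subject string not found in page")


def extract_records(nodes, wrapper_path):
    """Task 4 (simplified): each node on the wrapper path starts a new record."""
    records = []
    for path, text in nodes:
        if path == wrapper_path:
            records.append([text])      # a repeated subject marks a boundary
        elif records:
            records[-1].append(text)    # other nodes belong to the open record
    return records


# Hypothetical sample page with three data records.
SAMPLE = """
<html><body>
<h1>Results</h1>
<ul>
<li><b>Alpha Camera</b><span>$120</span></li>
<li><b>Beta Lens</b><span>$80</span></li>
<li><b>Gamma Tripod</b><span>$35</span></li>
</ul>
</body></html>
"""

nodes = parse_page(SAMPLE)
wrapper = generate_wrapper(nodes, "Beta Lens")   # user-supplied sample subject
records = extract_records(nodes, wrapper)
print(wrapper)
print(records)
```

A single sample subject ("Beta Lens") is enough here because every record's subject sits at the same tag path, which is the repetition the hypothesis relies on. Real pages would likely need a richer pattern than an exact path match, and text after the last record would have to be trimmed.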

Published In

World Wide Web Volume 17, Issue 5

September 2014

352 pages

ISSN:1386-145X

Copyright © 2014 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States

Cited By

Wang Q, Kanagal B, Garg V, Sivakumar D, Zhu W, Tao D, Cheng X, Cui P, Rundensteiner E, Carmel D, He Q, Xu Yu J (2019). Constructing a Comprehensive Events Database from the Web. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 229-238. Online publication date: 3-Nov-2019.
https://dl.acm.org/doi/10.1145/3357384.3357986

Yuliang W, Qi Z, Fang L, Xixian H, Guodong X, Bailing W (2019). A novel approach for Web page modeling in personal information extraction. World Wide Web 22:2, pp. 603-620. Online publication date: 1-Mar-2019.
https://dl.acm.org/doi/10.1007/s11280-018-0631-9

Di Castro D, Gamzu I, Grabovitch-Zuyev I, Lewin-Eytan L, Pundir A, Sahoo N, Viderman M, Champin P, Gandon F, Médini L, Lalmas M, Ipeirotis P (2018). Automated Extractions for Machine Generated Mail. Companion Proceedings of the The Web Conference 2018, pp. 655-662. Online publication date: 23-Apr-2018.
https://dl.acm.org/doi/10.1145/3184558.3186582

Avigdor-Elgrabli N, Cwalinski M, Di Castro D, Gamzu I, Grabovitch-Zuyev I, Lewin-Eytan L, Maarek Y, Mukhopadhyay S, Zhai C, Bertino E, Crestani F, Mostafa J, Tang J, Si L, Zhou X, Chang Y, Li Y, Sondhi P (2016). Structural Clustering of Machine-Generated Mail. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 217-226. Online publication date: 24-Oct-2016.
https://dl.acm.org/doi/10.1145/2983323.2983350

Omari A, Kimelfeld B, Yahav E, Shoham S, Krishnapuram B, Shah M, Smola A, Aggarwal C, Shen D, Rastogi R (2016). Lossless Separation of Web Pages into Layout Code and Data. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1805-1814. Online publication date: 13-Aug-2016.
https://dl.acm.org/doi/10.1145/2939672.2939858

Omari A, Shoham S, Yahav E, Dillon L, Visser W, Williams L (2016). Cross-supervised synthesis of web-crawlers. Proceedings of the 38th International Conference on Software Engineering, pp. 368-379. Online publication date: 14-May-2016.
https://dl.acm.org/doi/10.1145/2884781.2884842

Foley J, Bendersky M, Josifovski V, Baeza-Yates R, Lalmas M, Moffat A, Ribeiro-Neto B (2015). Learning to Extract Local Events from the Web. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 423-432. Online publication date: 9-Aug-2015.
https://dl.acm.org/doi/10.1145/2766462.2767739
