Information extraction for deep web using repetitive subject pattern

Published: 01 September 2014 Publication History

Abstract

In this paper, we propose an information extraction (IE) system for extracting data records from semi-structured documents on the Deep Web using a proposed technique called Repetitive Subject Pattern. The technique is based on the hypothesis that every data record in a web page has a subject item, and that the repetitive pattern of the subject items can be used to identify the boundaries of the data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator flexible: when the automatic process generates a wrong wrapper, the user can provide a new sample subject string to generate a better one. As a result, the system can operate in either unsupervised or semi-supervised mode. Experiments show that the proposed technique generates high-quality wrappers, with both recall and precision close to 100% when tested on a number of datasets.
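
The four tasks can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the subject string is supplied by the user (the semi-supervised mode), it treats the wrapper as nothing more than the tag path of the subject item, and it assigns every text node between two subject items to the earlier record. The names PathCollector, text_nodes, generate_wrapper, extract_records and SAMPLE are hypothetical, and the sample page is made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Task 1 (simplified): walk the page and record, for every non-empty
    text node, the path of open tags leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_nodes(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.items


def generate_wrapper(sample_html, subject):
    """Tasks 2 and 3 (simplified): locate the given subject string and
    return its tag path, which serves as the wrapper in this sketch."""
    for path, text in text_nodes(sample_html):
        if subject in text:
            return path
    raise ValueError("subject string not found in sample page")


def extract_records(html, wrapper):
    """Task 4 (simplified): each text node on the wrapper path opens a new
    record; other text nodes are attached to the record currently open."""
    records = []
    for path, text in text_nodes(html):
        if path == wrapper:
            records.append([text])
        elif records:
            records[-1].append(text)
    return records


# Made-up sample page with two records.
SAMPLE = """
<ul>
  <li><a>Deep Web Basics</a><span>2014</span><span>$10</span></li>
  <li><a>Wrapper Induction</a><span>2012</span><span>$12</span></li>
</ul>
"""

wrapper = generate_wrapper(SAMPLE, "Deep Web Basics")
print(wrapper)                            # ul/li/a
print(extract_records(SAMPLE, wrapper))
# [['Deep Web Basics', '2014', '$10'], ['Wrapper Induction', '2012', '$12']]
```

In this simplified form, any text that follows the last record (a page footer, for example) would be appended to that record, so a practical wrapper would also need an end-of-region condition.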

    Information & Contributors

    Information

    Published In

    World Wide Web, Volume 17, Issue 5
    September 2014
    352 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Author Tags

    1. Information extraction
    2. Subject pattern
    3. Unsupervised learning
    4. Web content mining
    5. Web data extraction
    6. Wrapper induction

    Cited By

    • (2019) Constructing a Comprehensive Events Database from the Web. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 229-238. doi:10.1145/3357384.3357986. Online publication date: 3-Nov-2019
    • (2019) A novel approach for Web page modeling in personal information extraction. World Wide Web 22:2, pp. 603-620. doi:10.1007/s11280-018-0631-9. Online publication date: 1-Mar-2019
    • (2018) Automated Extractions for Machine Generated Mail. Companion Proceedings of the The Web Conference 2018, pp. 655-662. doi:10.1145/3184558.3186582. Online publication date: 23-Apr-2018
    • (2016) Structural Clustering of Machine-Generated Mail. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 217-226. doi:10.1145/2983323.2983350. Online publication date: 24-Oct-2016
    • (2016) Lossless Separation of Web Pages into Layout Code and Data. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1805-1814. doi:10.1145/2939672.2939858. Online publication date: 13-Aug-2016
    • (2016) Cross-supervised synthesis of web-crawlers. Proceedings of the 38th International Conference on Software Engineering, pp. 368-379. doi:10.1145/2884781.2884842. Online publication date: 14-May-2016
    • (2015) Learning to Extract Local Events from the Web. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 423-432. doi:10.1145/2766462.2767739. Online publication date: 9-Aug-2015