Finding and Extracting Data Records from Web Pages

Published: 01 April 2010

Abstract

Many HTML pages are generated by software that queries an underlying database and fills in a template with the results. In the process, the metainformation describing the data structure is lost, so automated programs cannot process the data as powerfully as they could process information taken directly from a database. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method needs only a single input page. It first identifies the data region of interest in the page. It then partitions that region into records using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, it extracts the attributes of the data records using a method based on multiple string alignment. We have tested our techniques on a large number of real web sources, obtaining high precision and recall values.
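
The record-partitioning step can be illustrated with a minimal sketch. It is not the method of the article, whose details the abstract does not give. It assumes that each candidate subtree is flattened into its pre-order sequence of tag names, that similarity is one minus a length-normalized edit distance, and that clustering is a greedy single pass. The toy `(tag, children)` node representation, the function names, and the `0.7` threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Toy DOM node: (tag name, list of child nodes). This is an assumption for
# the sketch; a real page would be parsed with an HTML parser first.
Node = Tuple[str, list]


def tag_sequence(node: Node) -> List[str]:
    """Flatten a subtree into its pre-order sequence of tag names."""
    tag, children = node
    seq = [tag]
    for child in children:
        seq.extend(tag_sequence(child))
    return seq


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two tag sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster_subtrees(subtrees: List[Node], threshold: float = 0.7) -> List[List[int]]:
    """Greedy single-pass clustering of subtrees by tag-sequence similarity.

    Each subtree joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns clusters as lists of indices into `subtrees`.
    """
    sequences = [tag_sequence(s) for s in subtrees]
    clusters: List[List[int]] = []
    for idx, seq in enumerate(sequences):
        for cluster in clusters:
            if similarity(seq, sequences[cluster[0]]) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters
```

In this sketch, sibling subtrees that share most of their tag structure (for example, product entries that differ by one optional field) land in the same cluster, while structurally different siblings such as separators form their own clusters. A production implementation would likely compare trees rather than flattened sequences and would choose the threshold empirically.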

    Published In

    Journal of Signal Processing Systems, Volume 59, Issue 1
    Apr 2010
    132 pages
    ISSN: 1939-8018
    EISSN: 1939-8115

    Publisher

    Kluwer Academic Publishers

    United States

    Author Tags

    1. Automatic data extraction
    2. Data mining
    3. Hidden web
    4. Web
    5. Web mining

    Qualifiers

    • Article

    Cited By

    • (2024) All in One Place: Ensuring Usable Access to Online Shopping Items for Blind Users. Proceedings of the ACM on Human-Computer Interaction, 8:EICS (1-25). DOI: 10.1145/3664639. Online publication date: 17-Jun-2024
    • (2023) AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind Users. Proceedings of the 28th International Conference on Intelligent User Interfaces (32-45). DOI: 10.1145/3581641.3584049. Online publication date: 27-Mar-2023
    • (2022) On validating web information extraction proposals. Expert Systems with Applications: An International Journal, 199:C. DOI: 10.1016/j.eswa.2022.116700. Online publication date: 1-Aug-2022
    • (2021) Semantic table-of-contents for efficient web screen reading. Proceedings of the 36th Annual ACM Symposium on Applied Computing (1941-1949). DOI: 10.1145/3412841.3442066. Online publication date: 22-Mar-2021
    • (2020) Towards Personalized Annotation of Webpages for Efficient Screen-Reader Interaction. Proceedings of the 31st ACM Conference on Hypertext and Social Media (111-116). DOI: 10.1145/3372923.3404815. Online publication date: 13-Jul-2020
    • (2020) iTOC: Enabling Efficient Non-Visual Interaction with Long Web Documents. 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (3799-3806). DOI: 10.1109/SMC42975.2020.9282972. Online publication date: 11-Oct-2020
    • (2019) Auto-Suggesting Browsing Actions for Personalized Web Screen Reading. Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization (252-260). DOI: 10.1145/3320435.3320460. Online publication date: 7-Jun-2019
    • (2019) Feel-It. Proceedings of the 16th International Web for All Conference (1-2). DOI: 10.1145/3315002.3332441. Online publication date: 13-May-2019
    • (2018) SteeringWheel. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (1-13). DOI: 10.1145/3173574.3173594. Online publication date: 21-Apr-2018
    • (2017) Speed-Dial. Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (110-119). DOI: 10.1145/3132525.3132531. Online publication date: 19-Oct-2017
