[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1060745.1060761acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
Article

Web data extraction based on partial tree alignment

Published: 10 May 2005 Publication History

Abstract

This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items/fields from them and put the data in a database table. This problem has been studied by several researchers. However, existing methods still have some serious limitations. The first class of methods is based on machine learning, which requires human labeling of many examples from each Web site that one is interested in extracting data from. The process is time consuming due to the large number of sites and pages on the Web. The second class of algorithms is based on automatic pattern discovery. These methods are either inaccurate or make many assumptions. This paper proposes a new method to perform the task automatically. It consists of two steps, (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields. This approach enables very accurate alignment of multiple data records. Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately.

References

[1]
Arasu, A. and Garcia-Molina, H. Extracting Structured Data from Web Pages. SIGMOD-03, 2003.
[2]
Baeza-Yates, R. Algorithms for string matching: A survey. ACM SIGIR Forum, 23(3-4):34--58, 1989.
[3]
Barton, G., Sternberg, M. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 1987, 327--337.
[4]
Bar-Yossef, Z. and Rajagopalan, S. Template Detection via Data Mining and its Applications, WWW 2002, 2002.
[5]
Buttler, D., Liu, L., Pu, C. A fully automated extraction system for the World Wide Web. IEEE ICDCS-21, 2001.
[6]
Carrillo, H., Lipman, D. The multiple sequence alignment problem in biology. SIAM J. Applied Math., 1988;48(5).
[7]
Chakrabarti, S. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, 2002.
[8]
Chang, C. and Lui, S-L. IEPAD: Information extraction based on pattern discovery. WWW-10, 2001.
[9]
Chen, H.-H., Tsai, S.-C., and Tsai, J.-H. Mining tables from large scale html texts. COLING-00, 2000.
[10]
Chen, W. New algorithm for ordered tree-to-tree correction problem. Journal of Algorithms, 40:135-158, 2001.
[11]
Cohen, W., Hurst, M., and Jensen, L. A flexible learning system for wrapping tables and lists in HTML documents. WWW-2002, 2002.
[12]
Crescenzi, V., Mecca, G. and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. VLDB-01, 2001.
[13]
Doorenbos, R., Etzioni, O., Weld, D. A scalable comparison shopping agent for the World Wide Web. Agents-97, 1997.
[14]
Embley, D., Jiang, Y and Ng, Y. "Record-boundary discovery in Web documents." SIGMOD-99, 1999.
[15]
Gusfield, D. Algorithms on strings, tree, and sequence, Cambridge. 1997.
[16]
Hammer, J., Garcia-Molina, H., Cho, J., Aranha, R., and Crespo, A. Extracting semi-structured information from the Web. Workshop on Manag. of Semi-structured Data, 1997.
[17]
Hogeweg, P., Hesper, B. The alignment of sets of sequences and the construction of phylogenetic trees: An integrated method. J. Mol. Evol., 20, 175--186 (1984).
[18]
Hsu, C.-N. and Dung, M.-T. Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems. 23(8): 521--538, 1998.
[19]
Kushmerick, N. Wrapper induction: efficiency and expressiveness. Artificial Intelligence, 118:15--68, 2000.
[20]
Lerman, K., Getoor L., Minton, S. and Knoblock, C. "Using the Structure of Web Sites for Automatic Segmentation of Tables." SIGMOD-04, 2004.
[21]
Liu, B., Grossman, R. and Zhai, Y. "Mining data records from Web pages." KDD-03, 2003.
[22]
Meng, X., Lu, H., Wang, H., and Gu, M. Schema-Guided Wrapper Generator. ICDE-02, 2002.
[23]
Muslea, I., Minton, S. and Knoblock, C. "A hierarchical approach to wrapper induction." Agents-99, 1999.
[24]
Notredame, C. Recent progresses in multiple sequence alignment: a survey. Technical Report. 2002.
[25]
Pinto, D., McCallum, A., Wei, X. and Bruce, W. Table Extraction Using Conditional Random Fields. SIGIR-03.
[26]
Ramaswamy, L., Ivengar, A., Liu, L., and Douglis, F. Automatic detection of fragments in dynamically generated Web pages. WWW-04, 2004.
[27]
Reis, D. Golgher, P., Silva, A., Laender, A. Automatic Web news extraction using tree edit distance, WWW-04, 2004.
[28]
Rosenfeld, B., Feldman, R., Aumann, Y. Structural extraction from visual layout of documents. CIKM-02, 2002.
[29]
Song, R., Liu, H., Wen, J.-R., Ma, W.-Y. Learning block importance models for Web pages. WWW-04, 2004.
[30]
Tai, K. The tree-to-tree correction problem. J. ACM, 26(3):422--433, 1979
[31]
Valiente, G. Tree edit distance and common subtrees. Research Report LSI-02-20-R, Universitat Politecnica de Catalunya, Barcelona, Spain, 2002.
[32]
Wang, J., Shapiro, J., Shasha, D., Zhang, K., Currey, K. An algorithm for finding the largest approximately common substructures of two trees. IEEE PAMI, 20(8), 1998.
[33]
Wang, Y., Hu, J. A machine learning based approach for table detection on the Web. WWW-2002.
[34]
Wang, J.-Y., and Lochovsky, F. Data extraction and label assignment for Web databases. WWW-03, 2003.
[35]
Yang, W. Identifying syntactic differences between two programs. Softw. Pract. Exper., 21(7):739--755, 1991.
[36]
Zhang, K., Statman, R., Shasha, D. On the editing distance between unordered labeled trees. Information Processing Letters, 42(3):133-139, 1992.

Cited By

View all
  • (2023)Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-PagesProceedings of the VLDB Endowment10.14778/3611479.361151116:11(3098-3110)Online publication date: 24-Aug-2023
  • (2023)EDREW - Enhanced Data Representation for Extraction in WebProceedings of the 29th Brazilian Symposium on Multimedia and the Web10.1145/3617023.3617055(230-237)Online publication date: 23-Oct-2023
  • (2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
  • Show More Cited By

Index Terms

  1. Web data extraction based on partial tree alignment

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      WWW '05: Proceedings of the 14th international conference on World Wide Web
      May 2005
      781 pages
      ISBN:1595930469
      DOI:10.1145/1060745
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 10 May 2005

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. data extraction
      2. data record extraction
      3. wrapper

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)17
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 23 Dec 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2023)Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-PagesProceedings of the VLDB Endowment10.14778/3611479.361151116:11(3098-3110)Online publication date: 24-Aug-2023
      • (2023)EDREW - Enhanced Data Representation for Extraction in WebProceedings of the 29th Brazilian Symposium on Multimedia and the Web10.1145/3617023.3617055(230-237)Online publication date: 23-Oct-2023
      • (2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
      • (2023)BAGEL: An Approach to Automatically Detect Navigation-Based Web Accessibility Barriers for Keyboard UsersProceedings of the 2023 CHI Conference on Human Factors in Computing Systems10.1145/3544548.3580749(1-17)Online publication date: 19-Apr-2023
      • (2022)Web Record Extraction with InvariantsProceedings of the VLDB Endowment10.14778/3574245.357427616:4(959-972)Online publication date: 1-Dec-2022
      • (2022)Customizable Tabular Access to Web Data Records for Convenient Low-vision Screen Magnifier InteractionACM Transactions on Accessible Computing10.1145/351704415:2(1-22)Online publication date: 19-May-2022
      • (2022)On validating web information extraction proposalsExpert Systems with Applications: An International Journal10.1016/j.eswa.2022.116700199:COnline publication date: 1-Aug-2022
      • (2022)Text Preparation and Similarity ComputationMachine Learning for Text10.1007/978-3-030-96623-2_2(19-32)Online publication date: 10-Feb-2022
      • (2021)The smallest extraction problemProceedings of the VLDB Endowment10.14778/3476249.347629314:11(2445-2458)Online publication date: 27-Oct-2021
      • (2021)Tabular Functional Block Detection with Embedding-based Agglomerative Cell ClusteringProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482484(1744-1753)Online publication date: 26-Oct-2021
      • Show More Cited By

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media