Article

Finding and extracting data records from web pages

Authors:

Fidel Cacheda

EUC'07: Proceedings of the 2007 international conference on Embedded and ubiquitous computing

Pages 466 - 478

Published: 17 December 2007

Abstract

Many HTML pages are generated by software programs that query an underlying database and fill in a template with the results. In these situations the metainformation about the data structure is lost, so automated software programs cannot process the data as powerfully as they can process information from databases. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method needs only an input page. It starts by identifying the data region of interest in the page. This region is then partitioned into records by a clustering method that groups similar subtrees in the DOM tree of the page. Finally, the attributes of the data records are extracted by a method based on multiple string alignment. We have tested our techniques on a large number of real web sources, obtaining high precision and recall values.
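The record-partitioning step described above can be illustrated with a small sketch. The abstract does not state which similarity measure or clustering algorithm the authors use, so everything here is an assumption made for illustration: each top-level subtree of a data region is flattened to its sequence of tag names, similarity is a normalized Levenshtein distance over those sequences, and clustering is a greedy single pass against a threshold. The names `TagSequencer`, `similarity`, `cluster_subtrees` and the threshold value 0.7 are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagSequencer(HTMLParser):
    """Flattens each top-level subtree of a fragment into a list of tag names."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.subtrees = []  # one tag-name sequence per top-level subtree

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.subtrees.append([])  # a new top-level subtree starts here
        self.subtrees[-1].append(tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed measure: 1 minus the edit distance normalized by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster_subtrees(subtrees, threshold=0.7):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for idx, seq in enumerate(subtrees):
        for cluster in clusters:
            if similarity(subtrees[cluster[0]], seq) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical data region: two product-like records followed by a footer.
html = (
    "<div><a>Book A</a><span>10</span></div>"
    "<div><a>Book B</a><span>12</span><i>new</i></div>"
    "<p>footer</p>"
)
parser = TagSequencer()
parser.feed(html)
clusters = cluster_subtrees(parser.subtrees)
print(clusters)  # [[0, 1], [2]]
```

In this toy input the two `div` subtrees differ by one optional element and land in the same cluster, while the footer forms its own. The sketch covers only the grouping of similar subtrees; it does not attempt data-region detection or the multiple string alignment used for attribute extraction.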

Cited By

Lee, H., Uddin, S., Ashok, V.: TableView: Enabling Efficient Access to Web Data Records for Screen-Magnifier Users. In: Proceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility, pp. 1-12 (2020). Online publication date: 26-Oct-2020.
https://dl.acm.org/doi/10.1145/3373625.3417030

Álvarez, M., Pan, A., Raposo, J., Bellas, F., Cacheda, F.: Finding and Extracting Data Records from Web Pages. Journal of Signal Processing Systems 59(1), 123-137 (2018). Online publication date: 27-Dec-2018.
https://dl.acm.org/doi/10.1007/s11265-008-0270-y

Index Terms

Finding and extracting data records from web pages
1. Information systems
  1. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published In

EUC'07: Proceedings of the 2007 international conference on Embedded and ubiquitous computing

December 2007

769 pages

ISBN: 3540770917

Editors:
Tei-Wei Kuo (National Taiwan University, Taiwan, Republic of China)
Edwin Sha (University of Texas at Dallas, Richardson, TX)
Minyi Guo (The University of Aizu, Aizu-Wakamatsu City, Japan)
Laurence T. Yang (St. Francis Xavier University, Antigonish, NS, Canada)
Zili Shao (The Hong Kong Polytechnic University, Kowloon, Hong Kong)

In-Cooperation

National Taiwan University

Publisher

Springer-Verlag

Berlin, Heidelberg
