Article, DOI: 10.5555/1780745.1780797

Finding and extracting data records from web pages

Published: 17 December 2007

Abstract

Many HTML pages are generated by software that queries an underlying database and fills in a template with the results. In the process, the metainformation about the data structure is lost, so automated programs cannot process this data as powerfully as they could process information taken directly from a database. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method requires only one input page. It starts by identifying the data region of interest in the page. That region is then partitioned into records using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, the attributes of the data records are extracted using a method based on multiple string alignment. We have tested our techniques on a large number of real web sources, obtaining high precision and recall values.
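
The record-partitioning step can be illustrated with a small sketch. The abstract does not specify the similarity measure or the clustering algorithm, so everything below is an assumption made for illustration: each candidate subtree is flattened to a sequence of tag names, similarity is derived from the Levenshtein edit distance between those sequences, and clustering is a greedy single pass with a hypothetical threshold of 0.7. The names edit_distance, similarity, and cluster_subtrees are illustrative and are not taken from the paper.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster_subtrees(subtrees: List[Sequence[str]],
                     threshold: float = 0.7) -> List[List[int]]:
    """Greedy clustering (an assumed strategy, not the paper's):
    each subtree joins the first cluster whose first member is
    at least `threshold` similar, otherwise it starts a new cluster."""
    clusters: List[List[int]] = []
    for idx, seq in enumerate(subtrees):
        for cluster in clusters:
            if similarity(subtrees[cluster[0]], seq) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical child subtrees of a data region, flattened to tag sequences.
subtrees = [
    ["div", "a", "span", "span"],
    ["div", "a", "span", "span", "img"],
    ["hr"],
    ["div", "a", "span", "span"],
]
clusters = cluster_subtrees(subtrees)
print(clusters)
```

In this toy input the three record-like subtrees end up in one cluster and the separator in another. A real implementation would compare tree structure rather than flat tag lists, and would then align the text of the clustered records to separate attributes.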


Cited By

  • (2020) TableView: Enabling Efficient Access to Web Data Records for Screen-Magnifier Users. Proceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility, pp. 1-12. DOI: 10.1145/3373625.3417030. Online publication date: 26-Oct-2020
  • (2018) Finding and Extracting Data Records from Web Pages. Journal of Signal Processing Systems 59:1, pp. 123-137. DOI: 10.1007/s11265-008-0270-y. Online publication date: 27-Dec-2018


    Published In

    EUC'07: Proceedings of the 2007 international conference on Embedded and ubiquitous computing
    December 2007
    769 pages
    ISBN: 3540770917
    Editors: Tei-Wei Kuo, Edwin Sha, Minyi Guo, Laurence T. Yang, Zili Shao

    In-Cooperation

    • National Taiwan University

    Publisher

    Springer-Verlag

    Berlin, Heidelberg


    Qualifiers

    • Article
