Extracting structured data from Web pages

Published: 09 June 2003

Abstract

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
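The input/output contract described in the abstract (a set of template-generated pages in, extracted values out) can be illustrated with a toy sketch. This is not the paper's algorithm: it is a deliberately crude heuristic that assumes whitespace tokenization, that every template token occurs exactly once on every page, and that data values never reuse a template token. The function names and the sample pages are hypothetical.

```python
from collections import Counter


def tokenize(page):
    # Assumption: whitespace-separated tokens are good enough for the toy pages.
    return page.split()


def template_tokens(pages):
    """Return tokens that occur exactly once in every page.

    A crude stand-in for "text contributed by the template".
    """
    counts = [Counter(tokenize(p)) for p in pages]
    common = set(counts[0])
    for c in counts[1:]:
        common &= set(c)
    return {t for t in common if all(c[t] == 1 for c in counts)}


def extract_values(page, anchors):
    """Split a page at the anchor tokens and return the text between them."""
    values, current = [], []
    for tok in tokenize(page):
        if tok in anchors:
            if current:
                values.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical pages generated from one template with two fields.
pages = [
    "<b>Title:</b> Red Fox <b>Author:</b> Ann Lee",
    "<b>Title:</b> Blue Whale <b>Author:</b> Bob Ray",
    "<b>Title:</b> Green Frog <b>Author:</b> Cy Dow",
]

anchors = template_tokens(pages)
rows = [extract_values(p, anchors) for p in pages]
for row in rows:
    print(row)
```

On these three pages the heuristic picks `<b>Title:</b>` and `<b>Author:</b>` as template tokens and returns one (title, author) pair per page. It fails as soon as a data value happens to appear once on every page or a template token repeats, which is the kind of ambiguity a real extraction algorithm has to resolve.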

Published In

SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data
June 2003
702 pages
ISBN: 158113634X
DOI: 10.1145/872757
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Conference

SIGMOD/PODS03

Acceptance Rates

SIGMOD '03 Paper Acceptance Rate 53 of 342 submissions, 15%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
