Record-boundary discovery in Web documents

Authors:

Y.-K. Ng

SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data

Pages 467 - 478

https://doi.org/10.1145/304182.304223

Published: 01 June 1999

Abstract

Extraction of information from unstructured or semistructured Web documents often requires a recognition and delimitation of records. (By “record” we mean a group of information relevant to some entity.) Without first chunking documents that contain multiple records according to record boundaries, extraction of record information will not likely succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (runs linearly for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted).
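
The pipeline outlined in the abstract (tag tree, record subtree, candidate separator tags, consensus) can be illustrated with a minimal Python sketch. Several parts of it are assumptions, not details from the abstract: the record subtree is taken to be the node with the highest fan-out, only two stand-in heuristics are used (tag count and the spread of text sizes between occurrences) in place of the paper's five, and the consensus is a simple sum of rank positions. All class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from statistics import pstdev

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []  # child nodes in document order
        self.chars = 0      # text length; nonzero only for "#text" leaves


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested tags; text runs become '#text' leaves."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            leaf = Node("#text", self.current)
            leaf.chars = len(data.strip())
            self.current.children.append(leaf)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def total_chars(node):
    return node.chars + sum(total_chars(c) for c in node.children)


def tag_children(node):
    return [c for c in node.children if c.tag != "#text"]


def highest_fanout(root):
    # Assumption: the records sit under the node with the most child tags.
    return max(walk(root), key=lambda n: len(tag_children(n)))


def rank_by_count(subtree):
    # Stand-in heuristic 1: more frequent child tags rank higher.
    counts = Counter(c.tag for c in tag_children(subtree))
    return [tag for tag, _ in counts.most_common()]


def rank_by_spread(subtree):
    # Stand-in heuristic 2: a tag whose occurrences are followed by evenly
    # sized runs of text (low standard deviation) ranks higher.
    spreads = {}
    for tag in dict.fromkeys(c.tag for c in tag_children(subtree)):
        sizes = []
        for child in subtree.children:
            if child.tag == tag:
                sizes.append(0)  # a new interval starts at each occurrence
            if sizes:
                sizes[-1] += total_chars(child)
        spreads[tag] = pstdev(sizes)
    return sorted(spreads, key=spreads.get)


def consensus_separator(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    subtree = highest_fanout(builder.root)
    rankings = [rank_by_count(subtree), rank_by_spread(subtree)]
    # Stand-in combination: lowest summed rank wins; ties go to document order.
    return min(rankings[0], key=lambda t: sum(r.index(t) for r in rankings))


PAGE = """
<html><body><h1>Listings</h1>
<div>
<hr><b>Alice Smith</b> died on May 1. <br> Services on Friday at noon.
<hr><b>Bob Jones</b> died on May 2. <br> Services on Monday at ten.
<hr><b>Carol White</b> died on May 3. <br> Services on Sunday at two.
</div></body></html>
"""

print(consensus_separator(PAGE))
```

On this made-up page the `<div>` has the highest fan-out, and `<hr>` comes out as the consensus separator. With only two heuristics, `<hr>` and `<b>` tie and document order decides, which suggests why combining several independent heuristics is useful for telling such candidates apart.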

Published In

SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data

June 1999

604 pages

ISBN:1581130848

DOI:10.1145/304182

Chairmen:
Susan B. Davidson, Univ. of Pennsylvania, Philadelphia
Christos Faloutsos, Carnegie Mellon Univ., Pittsburgh

ACM SIGMOD Record Volume 28, Issue 2
June 1999
599 pages
ISSN:0163-5808
DOI:10.1145/304181
Chairmen:
Susan Davidson, Univ. of Pennsylvania
Christos Faloutsos, Carnegie Mellon Univ.

Copyright © 1999 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS99

SIGMOD/PODS99: International Conference on Management of Data and Symposium on Principles of Database Systems

May 31 - June 3, 1999

Philadelphia, Pennsylvania, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 173
Total Downloads: 1,125
Downloads (Last 12 months): 101
Downloads (Last 6 weeks): 10

Reflects downloads up to 24 Dec 2024

Cited By

Chen Z, Meng W, Dragut E (2022) Web Record Extraction with Invariants. Proceedings of the VLDB Endowment 16:4 (959-972). DOI: 10.14778/3574245.3574276. Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.14778/3574245.3574276
Kiesel J, Kneist F, Meyer L, Komlossy K, Stein B, Potthast M, d'Aquin M, Dietze S, Hauff C, Curry E, Cudre Mauroux P (2020) Web Page Segmentation Revisited. Proceedings of the 29th ACM International Conference on Information & Knowledge Management (3047-3054). DOI: 10.1145/3340531.3412782. Online publication date: 19-Oct-2020
https://dl.acm.org/doi/10.1145/3340531.3412782
Pillai A, Schnebly J, Sengupta S (2018) Development of Web-based Automated System for Cyber Analytic Applications. 2018 9th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON) (866-871). DOI: 10.1109/UEMCON.2018.8796792. Online publication date: Nov-2018
https://doi.org/10.1109/UEMCON.2018.8796792
Paul S, Mitra A, Dey S (2017) Issues and Challenges in Web Crawling for Information Extraction. Bio-Inspired Computing for Information Retrieval Applications (93-121). DOI: 10.4018/978-1-5225-2375-8.ch004. Online publication date: 2017
https://doi.org/10.4018/978-1-5225-2375-8.ch004
Schulz A, Lassig J, Gaedke M (2016) Practical Web Data Extraction: Are We There Yet? - A Short Survey. 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI) (562-567). DOI: 10.1109/WI.2016.0096. Online publication date: Oct-2016
https://doi.org/10.1109/WI.2016.0096
Chu X, He Y, Chakrabarti K, Ganjam K, Sellis T, Davidson S, Ives Z (2015) TEGRA. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (1713-1728). DOI: 10.1145/2723372.2723725. Online publication date: 27-May-2015
https://dl.acm.org/doi/10.1145/2723372.2723725
Thamviset W, Wongthanavasu S (2014) Bottom-up region extractor for semi-structured web pages. 2014 International Computer Science and Engineering Conference (ICSEC) (284-289). DOI: 10.1109/ICSEC.2014.6978209. Online publication date: Jul-2014
https://doi.org/10.1109/ICSEC.2014.6978209
Bing L, Lam W, Wong T (2013) Robust detection of semi-structured web records using a DOM structure-knowledge-driven model. ACM Transactions on the Web 7:4 (1-32). DOI: 10.1145/2508434. Online publication date: 1-Nov-2013
https://dl.acm.org/doi/10.1145/2508434
Sleiman H, Corchuelo R (2013) A Survey on Region Extractors from Web Documents. IEEE Transactions on Knowledge and Data Engineering 25:9 (1960-1981). DOI: 10.1109/TKDE.2012.135. Online publication date: 1-Sep-2013
https://dl.acm.org/doi/10.1109/TKDE.2012.135
Vijendran A, Deepa C (2013) LBDA: A novel framework for extracting content from web pages. 2013 International Conference on Advanced Computing and Communication Systems (1-7). DOI: 10.1109/ICACCS.2013.6938748. Online publication date: Dec-2013
https://doi.org/10.1109/ICACCS.2013.6938748
