
A Case-Based Recognition of Semantic Structures in HTML Documents

An Automated Transformation from HTML to XML

  • Conference paper
Intelligent Data Engineering and Automated Learning — IDEAL 2002 (IDEAL 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2412)

Abstract

The recognition and extraction of semantic/logical structures in HTML documents are important and difficult tasks for intelligent document processing. In this paper, we show that alignment is appropriate for recognizing the characteristic semantic/logical structures of a series of HTML documents within a framework of case-based reasoning. That is, given a series of HTML documents and a sample transformation of one HTML document into an XML format, alignment can identify the semantic/logical structures in the remaining HTML documents of the series by matching the text-block sequence of each remaining document against that of the sample transformation. Several important properties of texts, such as continuity and sequentiality, can be exploited naturally by the alignment. Alignment can significantly improve the ability of the case-based transformation method, which transforms a spatial/temporal series of HTML documents into machine-readable XML formats. Through experimental evaluations, we show that the case-based method with alignment achieved a highly accurate transformation of HTML documents into XML.
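
The matching step described in the abstract can be illustrated with a small sketch. The abstract does not specify the similarity measure, the gap cost, or the exact alignment algorithm, so everything below is an assumption made for illustration: a standard global (Needleman-Wunsch style) alignment, a character-level similarity from Python's `difflib`, a hypothetical gap penalty `GAP`, and invented seminar-announcement data. It is not the authors' implementation. Blocks of a new document that align with a labelled block of the sample case inherit its XML tag; blocks left unaligned are skipped.

```python
from difflib import SequenceMatcher
from xml.sax.saxutils import escape

GAP = -0.2  # assumed gap penalty; the abstract does not give one


def similarity(a, b):
    """Stand-in block similarity in [0, 1] (character-level, case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(sample, target, gap=GAP):
    """Globally align two text-block sequences.

    Returns (i, j) index pairs in document order; None marks a gap.
    """
    n, m = len(sample), len(target)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = score[i - 1][0] + gap, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = score[0][j - 1] + gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + similarity(sample[i - 1], target[j - 1]), "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            # max() keeps the first best option, so ties prefer a match.
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right cell to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            i, j = i - 1, j - 1
            pairs.append((i, j))
        elif move == "up":
            i -= 1
            pairs.append((i, None))
        else:
            j -= 1
            pairs.append((None, j))
    return pairs[::-1]


def transfer_labels(sample, sample_labels, target):
    """Give each aligned target block the XML tag of its sample partner."""
    labels = [None] * len(target)
    for i, j in align(sample, target):
        if i is not None and j is not None:
            labels[j] = sample_labels[i]
    return labels


def to_xml(target, labels, root="document"):
    """Emit labelled blocks as XML elements; unlabelled blocks are dropped."""
    lines = [f"<{root}>"]
    for text, label in zip(target, labels):
        if label is not None:
            lines.append(f"  <{label}>{escape(text)}</{label}>")
    lines.append(f"</{root}>")
    return "\n".join(lines)


# Hypothetical sample case: text blocks of one page and their XML tags.
sample = ["Seminar on Data Mining", "Date: 12 May 2002",
          "Speaker: A. Tanaka", "Room 301"]
sample_labels = ["title", "date", "speaker", "place"]

# Another page of the same series, with one extra unrelated block.
target = ["Seminar on Web Wrappers", "Sponsored link", "Date: 19 May 2002",
          "Speaker: B. Suzuki", "Room 204"]

labels = transfer_labels(sample, sample_labels, target)
print(to_xml(target, labels))
```

Because the alignment preserves block order, it makes use of the sequentiality property mentioned in the abstract: a block is matched not only by its own content but also by where it sits relative to its neighbours. A real system would need a similarity measure and gap cost tuned to the document series.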

This research was supported partly by a Grant-in-Aid from The Ministry of Education, Science and Culture of Japan, and also by The Telecommunications Advancement Foundation (TAF).




Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Umehara, M., Iwanuma, K., Nabeshima, H. (2002). A Case-Based Recognition of Semantic Structures in HTML Documents. In: Yin, H., Allinson, N., Freeman, R., Keane, J., Hubbard, S. (eds) Intelligent Data Engineering and Automated Learning — IDEAL 2002. IDEAL 2002. Lecture Notes in Computer Science, vol 2412. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45675-9_24


  • DOI: https://doi.org/10.1007/3-540-45675-9_24


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44025-3

  • Online ISBN: 978-3-540-45675-9

  • eBook Packages: Springer Book Archive
