On Discovering Concept Entities from Web Sites

  • Conference paper
Computational Science and Its Applications – ICCSA 2005 (ICCSA 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3481)


Abstract

A web site usually contains a large number of concept entities, each consisting of one or more web pages connected by hyperlinks. In order to discover these concept entities for more expressive web site queries and other applications, the web unit mining problem has been proposed. Web unit mining aims to determine web pages that constitute a concept entity and classify concept entities into categories. Nevertheless, the performance of an existing web unit mining algorithm, iWUM, suffers as it may create more than one web unit (incomplete web units) from a single concept entity. This paper presents a new web unit mining algorithm, kWUM, which incorporates site-specific knowledge to discover and handle incomplete web units by merging them together and assigning correct labels. Experiments show that the overall accuracy has been significantly improved.
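The merging step described above can be illustrated with a minimal sketch. It is not the kWUM algorithm itself, whose details the abstract does not give. It assumes, purely for illustration, that the site-specific knowledge is a rule saying pages under the same first URL path segment belong to one concept entity, and that each candidate web unit carries a classifier confidence score. The names `WebUnit`, `first_segment`, and `merge_incomplete_units` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class WebUnit:
    """A candidate concept entity: a set of page URLs plus a category label."""
    pages: set[str]
    label: str
    score: float = 0.0  # hypothetical classifier confidence for `label`


def first_segment(url: str) -> str:
    """Assumed site-specific rule: the first path segment identifies the entity."""
    return urlparse(url).path.strip("/").split("/")[0]


def merge_incomplete_units(units, key_of=first_segment):
    """Merge units that map to the same entity key and relabel the result.

    The merged label is the one with the highest summed confidence among
    the fragments (a simple weighted vote, chosen only for illustration).
    """
    groups = defaultdict(list)
    for unit in units:
        # Assumes every page of a unit maps to the same key, so any page will do.
        groups[key_of(min(unit.pages))].append(unit)

    merged = []
    for key in sorted(groups):
        group = groups[key]
        pages = set().union(*(u.pages for u in group))
        votes = defaultdict(float)
        for u in group:
            votes[u.label] += u.score
        # Sorting the labels first makes tie-breaking deterministic.
        label = max(sorted(votes), key=votes.get)
        merged.append(WebUnit(pages=pages, label=label, score=votes[label]))
    return merged
```

With this rule, two fragments built from `/~alice/index.html` and `/~alice/pubs.html` would collapse into a single web unit carrying the more confident of their two labels, while a unit under a different path segment stays separate.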




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yin, M., Goh, D.HL., Lim, EP. (2005). On Discovering Concept Entities from Web Sites. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2005. ICCSA 2005. Lecture Notes in Computer Science, vol 3481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424826_125


  • DOI: https://doi.org/10.1007/11424826_125

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25861-2

  • Online ISBN: 978-3-540-32044-9

  • eBook Packages: Computer Science (R0)
