Abstract
A web site usually contains a large number of concept entities, each consisting of one or more web pages connected by hyperlinks. The web unit mining problem was proposed to discover these concept entities, enabling more expressive web site queries and other applications. Web unit mining aims to determine which web pages constitute a concept entity and to classify concept entities into categories. However, the performance of an existing web unit mining algorithm, iWUM, suffers because it may create more than one web unit from a single concept entity, yielding incomplete web units. This paper presents a new web unit mining algorithm, kWUM, which incorporates site-specific knowledge to discover incomplete web units, merge them, and assign correct labels. Experiments show that kWUM significantly improves overall accuracy.
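To make the idea of merging incomplete web units concrete, here is a minimal Python sketch. It is not the kWUM algorithm: the abstract does not say what site-specific knowledge is used, so the sketch assumes a hypothetical rule (web units whose first page lives in the same URL directory belong to one concept entity) and a hypothetical labeling rule (a page-weighted majority vote). The names `directory_key` and `merge_incomplete_units`, and the unit representation as a dict with `pages` and `label`, are illustrative assumptions.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse
import posixpath


def directory_key(url):
    """Assumed site-specific rule: a page is identified by its URL directory."""
    path = urlparse(url).path
    if path.endswith("/"):
        return path.rstrip("/")
    return posixpath.dirname(path)


def merge_incomplete_units(units):
    """Merge web units that share a directory key and relabel the result.

    Each unit is a dict {"pages": [url, ...], "label": str}; the first page
    is treated as the unit's representative page (an assumption).
    """
    groups = defaultdict(list)
    for unit in units:
        groups[directory_key(unit["pages"][0])].append(unit)

    merged = []
    for group in groups.values():
        pages = []
        votes = Counter()
        for unit in group:
            pages.extend(unit["pages"])
            # Weight each fragment's label by the number of pages it covers.
            votes[unit["label"]] += len(unit["pages"])
        merged.append({"pages": pages, "label": votes.most_common(1)[0][0]})
    return merged


if __name__ == "__main__":
    units = [
        {"pages": ["http://example.edu/courses/cs101/index.html",
                   "http://example.edu/courses/cs101/syllabus.html"],
         "label": "course"},
        {"pages": ["http://example.edu/courses/cs101/homework.html"],
         "label": "other"},
        {"pages": ["http://example.edu/people/smith/index.html"],
         "label": "faculty"},
    ]
    for unit in merge_incomplete_units(units):
        print(unit["label"], len(unit["pages"]))
```

In this toy input, the two fragments under the same course directory collapse into one unit that takes the label backed by more pages, while the unrelated unit is left alone. A real system would replace the directory rule with whatever site-specific knowledge is actually available.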
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yin, M., Goh, D.HL., Lim, EP. (2005). On Discovering Concept Entities from Web Sites. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2005. ICCSA 2005. Lecture Notes in Computer Science, vol 3481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424826_125
DOI: https://doi.org/10.1007/11424826_125
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25861-2
Online ISBN: 978-3-540-32044-9
eBook Packages: Computer Science (R0)