Abstract
Despite the wide diversity of web pages, the pages residing within a particular organization are, in most cases, organized into a semantically hierarchical structure. For example, the website of a computer science department contains pages about its people, courses, and research; the people pages are further categorized into faculty, staff, and students, and the research pages branch into different areas. Uncovering such hierarchical structures could give users a convenient means of comprehensive navigation and accelerate other web mining tasks. In this study, we extract a similarity matrix among pages from in-page and cross-page link structures, and on this basis develop a density-based clustering algorithm that hierarchically groups densely linked web pages into semantic clusters. Our experiments show that this method is efficient and effective, and that it sheds light on mining and exploring web structures.
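To make the general idea concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes a toy site in which each page is described only by the set of pages it links to, scores page pairs by the cosine overlap of those link sets (a bibliographic-coupling-style measure chosen here for illustration), and swaps in a much simpler technique for the density-based step: connected components at successively stricter similarity thresholds. All names (`toy_links`, `link_similarity`, `clusters_at`, `hierarchy`) and the threshold values are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy site: page -> set of pages it links to.
toy_links = {
    "faculty_a": {"people", "faculty", "research"},
    "faculty_b": {"people", "faculty", "research"},
    "student_a": {"people", "students"},
    "student_b": {"people", "students"},
    "course_a": {"courses"},
    "course_b": {"courses"},
}

def link_similarity(out_links):
    """Cosine overlap of link-target sets for every pair of pages."""
    sim = {}
    for a, b in combinations(sorted(out_links), 2):
        shared = len(out_links[a] & out_links[b])
        denom = sqrt(len(out_links[a]) * len(out_links[b]))
        sim[(a, b)] = shared / denom if denom else 0.0
    return sim

def clusters_at(pages, sim, threshold):
    """Connected components when only pairs with sim >= threshold are joined.

    This is a simplified stand-in for a density-based grouping step.
    """
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), score in sim.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for p in pages:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())

def hierarchy(out_links, thresholds):
    """Map each threshold to its clustering; stricter thresholds give finer levels."""
    sim = link_similarity(out_links)
    pages = sorted(out_links)
    return {t: clusters_at(pages, sim, t) for t in sorted(thresholds)}

levels = hierarchy(toy_links, [0.3, 0.9])
for threshold, clusters in levels.items():
    print(threshold, clusters)
```

In this toy data the looser threshold groups the faculty and student pages together as "people" and keeps the course pages apart, while the stricter threshold separates faculty from students, which mirrors the department example in the abstract.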
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lin, C.X., Yu, Y., Han, J., Liu, B. (2010). Hierarchical Web-Page Clustering via In-Page and Cross-Page Link Structures. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6119. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13672-6_22
DOI: https://doi.org/10.1007/978-3-642-13672-6_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13671-9
Online ISBN: 978-3-642-13672-6
eBook Packages: Computer Science (R0)