Hierarchical Web-Page Clustering via In-Page and Cross-Page Link Structures

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6119)

Abstract

Despite the wide diversity of web pages, pages residing within a particular organization are, in most cases, organized in semantically hierarchical structures. For example, the website of a computer science department contains pages about its people, courses, and research; pages about people are categorized into faculty, staff, and students, and pages about research divide into different areas. Uncovering such hierarchical structures could give users a convenient means of comprehensive navigation and accelerate other web mining tasks. In this study, we extract a similarity matrix among pages via in-page and cross-page link structures. Based on this matrix, we develop a density-based clustering algorithm that hierarchically groups densely linked web pages into semantic clusters. Our experiments show that this method is efficient and effective, and that it sheds light on mining and exploring web structures.
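
To make the idea of link-based similarity followed by hierarchical grouping concrete, here is a minimal Python sketch. It is not the paper's algorithm, which the abstract does not specify. It assumes a cosine similarity over shared out-link targets (cross-page links only) and substitutes a simple stand-in for density-based clustering: connected components of the similarity graph at increasingly strict thresholds. The function names, the toy `out_links` data, and the threshold values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def link_similarity(out_links):
    """Cosine similarity between pages, based on shared out-link targets."""
    sim = {}
    for a, b in combinations(sorted(out_links), 2):
        la, lb = set(out_links[a]), set(out_links[b])
        if la and lb:
            sim[(a, b)] = len(la & lb) / sqrt(len(la) * len(lb))
        else:
            sim[(a, b)] = 0.0
    return sim


def clusters_at(pages, sim, threshold):
    """Connected components of the graph whose edges have similarity >= threshold."""
    parent = {p: p for p in pages}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), score in sim.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for p in pages:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


def hierarchy(out_links, thresholds):
    """Cluster at each threshold; stricter thresholds give finer clusters."""
    sim = link_similarity(out_links)
    pages = sorted(out_links)
    return {t: clusters_at(pages, sim, t) for t in sorted(thresholds)}


# Hypothetical department site: each page maps to the pages it links to.
out_links = {
    "faculty/a": ["people", "faculty", "home"],
    "faculty/b": ["people", "faculty", "home"],
    "student/c": ["people", "students", "home"],
    "course/x": ["courses", "home"],
    "course/y": ["courses", "home"],
}

levels = hierarchy(out_links, [0.3, 0.6, 0.9])
for threshold, clusters in levels.items():
    print(threshold, clusters)
```

On this toy input, the loosest threshold merges all pages into one cluster, the middle one separates people pages from course pages, and the strictest one splits faculty pages from the student page, which mirrors the department example in the abstract. A real implementation would also need the in-page link structure that the abstract mentions, which this sketch ignores.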

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lin, C.X., Yu, Y., Han, J., Liu, B. (2010). Hierarchical Web-Page Clustering via In-Page and Cross-Page Link Structures. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6119. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13672-6_22

  • DOI: https://doi.org/10.1007/978-3-642-13672-6_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13671-9

  • Online ISBN: 978-3-642-13672-6

  • eBook Packages: Computer Science, Computer Science (R0)
