Abstract
Clustering is a challenging topic in the area of Web data management. Various forms of clustering are required in a wide range of applications, including finding mirrored Web pages, detecting copyright violations, and reporting search results in a structured way. Clustering can be performed either once offline (independently of search queries) or online (on the results of search queries). Important efforts have focused on mining Web access logs and on clustering search engine results on the fly. Online methods based on link structure and text have been applied successfully to finding pages on related topics. This paper presents an overview of the most popular methodologies and implementations for clustering either Web users or Web sources, and surveys the current status and future trends of clustering on the Web.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vakali, A., Pokorný, J., Dalamagas, T. (2004). An Overview of Web Data Clustering Practices. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_59
DOI: https://doi.org/10.1007/978-3-540-30192-9_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23305-3
Online ISBN: 978-3-540-30192-9
eBook Packages: Computer Science (R0)