An Overview of Web Data Clustering Practices

Athena Vakali,
Jaroslav Pokorný &
Theodore Dalamagas

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3268)

Included in the following conference series:

International Conference on Extending Database Technology

Abstract

Clustering is a challenging topic in the area of Web data management. Various forms of clustering are required in a wide range of applications, including finding mirrored Web pages, detecting copyright violations, and reporting search results in a structured way. Clustering can be performed either once offline (independently of search queries) or online (on the results of search queries). Important efforts have focused on mining Web access logs and on clustering search engine results on the fly. Online methods based on link structure and text have been applied successfully to finding pages on related topics. This paper presents an overview of the most popular methodologies and implementations for clustering either Web users or Web sources, and surveys the current status and future trends of clustering on the Web.

Author information

Authors and Affiliations

Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece
Athena Vakali
Faculty of Mathematics and Physics, Charles University, 118 00, Praha 1, Czech Republic
Jaroslav Pokorný
School of Electr. and Comp. Engineering, National Technical University of Athens, Zographou, 15773, Athens, Greece
Theodore Dalamagas

Editor information

Editors and Affiliations

Sidonia Systems, Grubmühl 20, D-82131, Stockdorf, Germany
Wolfgang Lindner
Università di Milano, Italy
Marco Mesiti
Functional Genomics Center Zurich (FGCZ), UZH / ETH Zurich, Winterthurerstrasse 190, CH–8057, Zurich, Switzerland
Can Türker
Computer Science Department, University of Crete, Greece, and Institute of Computer Science, FORTH-ICS, Greece
Yannis Tzitzikas
Aristotle University of Thessaloniki
Athena I. Vakali

About this paper

Cite this paper

Vakali, A., Pokorný, J., Dalamagas, T. (2004). An Overview of Web Data Clustering Practices. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_59

DOI: https://doi.org/10.1007/978-3-540-30192-9_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23305-3
Online ISBN: 978-3-540-30192-9
eBook Packages: Computer Science, Computer Science (R0)
