More Web Proxy on the site http://driver.im/

article

A Monotonic On-Line Linear Algorithm for Hierarchical Agglomerative Classification

Authors:

Andreea B. Dragut,

Codrin M. NichitiuAuthors Info & Claims

Information Technology and Management, Volume 5, Issue 1-2

Pages 111 - 141

https://doi.org/10.1023/B:ITEM.0000008078.09272.89

Published: 01 January 2004 Publication History

Abstract

We start from an algorithm for on-line linear hierarchical classification for multidimensional data, using a centroid aggregation criterion. After evoking some real-life on-line settings where it can be used, we analyze it mathematically, in the framework of the Lance–Williams algorithms, proving that it does not have some useful properties: it is not monotonic, nor space-conserving. In order to use its on-line capabilities, we modify it and show that it becomes monotonic. While still not having the internal similarity-external dissimilarity property, the worst case classifications of the new algorithm are correctable with an additional small computational effort, on the overall taking O(n⋅k) time for n points and k classes. Experimental study confirm the theoretical improvements upon the initial algorithm. A theoretical and experimental comparison to other algorithms from the literature, shows that it is among the fastest and performs well.

References

[1]

{1} C.C. Aggarwal and P.S. Yu, Outlier detection for high dimensional data, in: SIGMOD 2001 Electronic Proceedings, ed. W.G. Aref (2001). Available to SIGMOD members, http://www.acm.org/sigmod/ sigmod01/eproceedings.]]

Digital Library

[2]

{2} R. Agrawal et al., Automatic subspace clustering of high dimensional data for data mining applications, in: Proceedings of the 1998 ACM-SIGMOD International Conference Management of Data, Seattle Washington (1998) pp. 94-105.]]

Digital Library

[3]

{3} M.R. Anderberg, Cluster analysis for applications, Probability and Mathematical Statistics 19 (1973).]]

[4]

{4} A. Arasu et al., Characterizing memory requirements for queries over continuous data stream, in: Proceedings of the 2002 ACM Symposium on Principles of Database Systems (2002).]]

Digital Library

[5]

{5} T.S. Arthanary and Y. Dodge, Mathematical Programming in Statistics (Wiley, New York, 1988).]]

[6]

{6} B. Babcock et al., Models and issues in data stream systems, Invited paper in Proceedings of the 2002 ACM Symposium on Principles of Database Systems (2002).]]

Digital Library

[7]

{7} A. Berson, S. Smith and K. Thearling, Building Data Mining Applications for CRM (McGraw Hill, New York, 2000).]]

Digital Library

[8]

{8} P.S. Bradley, U. Fayyad and C. Reina, Scaling clustering algorithms to large databases, in: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, eds. R. Agrawal and P. Stolorz (AAAI Press, Menlo Park, CA, 1998) pp. 9-15.]]

[9]

{9} L. Breiman, Classification and Regression Trees (Wadsworth & Brooks/Cole, Pacific Grove, CA, 1984).]]

[10]

{10} A. Carlyle, Developing organized information displays for voluminous works: A study of user clustering behavior, Information Processing and Management 37 (2001) 677-699.]]

[11]

{11} M. Charikar et al., Incremental clustering and dynamic information retrieval, in: Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC'97) (1997) pp. 626-634.]]

Digital Library

[12]

{12} Z. Chen, Space-conserving agglomerative algorithms, Journal of Classification 13 (1996) 157-168.]]

[13]

{13} A.B. Dragut, Characterization of a set of algorithms verifying the internal similarity-external dissimilarity property, Mathematical Reports 4 (2001) 225-232.]]

[14]

{14} A.B. Dragut, On the validity of a hierarchical structure for a modified Zupan algorithm, to appear in Mathematical Reports (2002).]]

[15]

{15} M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining (AAAI/MIT Press, 1996) pp. 226-231.]]

[16]

{16} D. Eppstein, Fast hierarchical clustering and other applications of dynamic closest pairs, in: Proceedings of 9th Annual ACM-SIAM Symposium Discrete Algorithms (1998) pp. 619-628.]]

Digital Library

[17]

{17} V. Estivill-Castro and J. Yang, Categorizing visitors dynamically by fast and robust clustering of access logs, in: Proceedings of the 1st Asia-Pacific Conference on Web Intelligence (WI'2001), Lecture Notes in Artificial Intelligence, Vol. 2198 (Springer, Berlin, 2001) pp. 498-507.]]

Digital Library

[18]

{18} B.S. Everitt, Cluster Analysis, 3rd edn. (Halsted Press, 1993).]]

[19]

{19} D. Fasulo, An analysis of recent work on clustering algorithms, Technical Report 01-03-02, University of Washington, Department of Computer Science and Engineering (1999).]]

[20]

{20} A. Feelders, H. Daniels and M. Holsheimer, Methodological and practical aspects of data mining, Information and Management 37 (2000) 271-281.]]

Digital Library

[21]

{21} J. Felsenstein, PHYLIP, Phylogeny Inference Package version 3.572c, Distributed by the author, University of Washington, Department of Genetics, Seattle, Washington (1995).]]

[22]

{22} W.J. Frawley, G. Piatestky-Shapiro and C.J. Matheus, Knowledge discovery in databases: An overview, in: Knowledge Discovery in Databases, eds. G. Piatestky-Shapiro and W.J. Frawley (AAAI Press/MIT Press, 1991).]]

[23]

{23} U. Gather, R. Fried and M. Imhoff, Online classification of states in intensive care, Festschrift in honor to Hans-Hermann Bock's 60th birthday, in: Data Analysis, Classification, and Applications, eds. W. Gaul, O. Opitz and M. Schader (Springer, Berlin, 2000) pp. 413-428.]]

[24]

{24} A. Guenoche, Order distance associated with a hierarchy, Journal of Classification 14 (1997) 101-115.]]

[25]

{25} S. Guha, R. Rastogi and K. Shim, CURE, an efficient clustering algorithm for large databases, in: SIGMOD Conference 1998 (1998) pp. 73-84.]]

Digital Library

[26]

{26} D.J. Hand, Discrimination and Classification (Wiley, New York, 1992).]]

[27]

{27} A.K. Jain and R.C. Dubes, Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, NJ, 1988).]]

Digital Library

[28]

{28} G. Karypis, E.H. Han and V. Kumar, Multilevel refinement for hierarchical clustering, Technical Report TR-99-020, Department of Computer Science, University of Minnesota, Minneapolis (1999).]]

[29]

{29} L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis (Wiley, New York, 1990).]]

[30]

{30} G.N. Lance and W.T. Williams, A general theory of classificatory sorting strategies 1. Hierarchical systems, The Computer Journal 9 (1966) 373-382.]]

[31]

{31} G.N. Lance and W.T. Williams, A general theory of classificatory sorting strategies 1. Clustering systems, The Computer Journal 10 (1967) 271-277.]]

[32]

{32} R.C.T. Lee, Clustering analysis and its applications, in: Advances in Information System Science, ed. J.T. Tou (Plenum, New York, 1981) pp. 80-150.]]

[33]

{33} I.C. Lerman, Classification et Analyse Ordinale des Données (Dunod, 1981).]]

[34]

{34} B.M.E. Moret, Decision trees and diagrams, Computing Surveys 14 (1982) 593-623.]]

Digital Library

[35]

{35} R.T. Ng and J. Han, Efficient and effective clustering methods for spatial data mining, in: Proceedings of the 20th VLDB Conference, Santiago, Chile (1994) pp. 144-155.]]

Digital Library

[36]

{36} L. O'Callaghan et al., Streaming-data algorithms for high-quality clustering, in: Proceedings of the 18th International Conference on Data Engineering (ICDE) (2002).]]

Digital Library

[37]

{37} M. Perkowits and O. Etzioni, Adaptive web sites: An AI challenge, in: Proceedings of the International Joint Conference on Artif. International, Nagoya, Japan (1998) pp. 16-23.]]

[38]

{38} V. Stefanescu, Some fundamental aspects of hierarchical classification theory, in: Proceedings DIANA 3rd Conference on Discriminant Analysis, Cluster Analysis, Bechyne, Czechoslovakia (June 1990) pp. 251-258.]]

[39]

{39} J.R. Quinlan, Programs for Machine Learning (Morgan Kaufmann, San Mateo, CA, 1993).]]

Digital Library

[40]

{40} S. Salzberg et al., Decision Trees for Automated Identification of Cosmic Ray Hits in Hubble Space Telescope Images, preprint (John Hopkins University, Baltimore, MD, 1995).]]

[41]

{41} C. Shahabi et al., Knowledge discovery from users web-page navigation, in: Proceedings of the IEEE RIDE97 Workshop (1997).]]

Digital Library

[42]

{42} C. Shahabi, Feature matrices: A model for efficient and anonymous web usage mining, EC-Web (2001).]]

Digital Library

[43]

{43} C. Shahabi and F. Banaei-Kashani, Efficient and anonymous web usage mining for web personalization, INFORMS Journal on Computing (Special issue on data mining) (2002).]]

Digital Library

[44]

{44} J.D. Thompson, D.G. Higgins and T.J. Gibson, CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice, Nucleic Acids Research 22 (1994) 4673-4680.]]

[45]

{45} W. Wang, J. Yang and R. Muntz, STING: A statistical information grid approach to spatial data mining, in: Proceedings of the 23rd International Conference on Very Large Data Bases (1997) pp. 186- 195.]]

Digital Library

[46]

{46} W. Wong and A.W. Fu, Incremental document clustering for web page classification, in: IEEE 2000 International Conference on Information Society in the 21st Century: Emerging Technologies and New Challenges (IS2000), Japan (November 5-8, 2000).]]

[47]

{47} J. Xiao, Y. Zhang, X. Jia and T. Li, Measuring similarity of interests for clustering web-users, in: Proceedings of the 12th Australian Database Conference on ADC 2001 (2001) pp. 107-114.]]

Digital Library

[48]

{48} M. Zait and H. Messatfa, A comparative study of clustering methods, Future Generation Computer Systems 13 (1997) 149-159.]]

Digital Library

[49]

{49} T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: A new data clustering algorithm and its applications, Data Mining and Knowledge Discovery 1 (1997) 141-182.]]

Digital Library

[50]

{50} A. Zhou et al., A hybrid approach to clustering in very large databases, in: Proceedings of the 5th Pacific-Asia Conference PAKDD 2001, Advances in Knowledge Discovery and Data Mining, eds. D. Cheung, G.J. Williams and Q. Li, Lecture Notes on Computer Science, Vol. 2035 (Springer, Berlin, 2001) pp. 519-524.]]

Digital Library

[51]

{51} J. Zupan, A new approach to the binary tree based heuristics, Anal. Chem. Acta 112 (1980) 337-342.]]

[52]

{52} J. Zupan, Clustering of Large Data Sets (Wiley, New York, 1983).]]

Index Terms

A Monotonic On-Line Linear Algorithm for Hierarchical Agglomerative Classification

Recommendations

Kernel Hierarchical Agglomerative Clustering
ICPRAM 2014: Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods

Clustering algorithms, as unsupervised analysis tools, are useful for exploring data structure and have owned

great success in many disciplines. For most of the clustering algorithms like k-means, determining the number

of the clusters is a crucial step ...
Particle Swarm Optimization Based Hierarchical Agglomerative Clustering
WI-IAT '10: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 02

Clustering- an important data mining task, which groups the data on the basis of similarities among the data, can be divided into two broad categories, partitional clustering and hierarchal. We combine these two methods and propose a novel clustering ...
Scalable Hierarchical Agglomerative Clustering
KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining

The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Information Technology and Management

Information Technology and Management Volume 5, Issue 1-2

January-April 2004

200 pages

ISSN:1385-951X

Issue’s Table of Contents

Copyright © Copyright © 2004 Kluwer Academic Publishers.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 January 2004

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 18 Jan 2025

Other Metrics

View Author Metrics

Citations

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents