[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

A Monotonic On-Line Linear Algorithm for Hierarchical Agglomerative Classification

Published: 01 January 2004 Publication History

Abstract

We start from an algorithm for on-line linear hierarchical classification for multidimensional data, using a centroid aggregation criterion. After evoking some real-life on-line settings where it can be used, we analyze it mathematically, in the framework of the Lance–Williams algorithms, proving that it does not have some useful properties: it is not monotonic, nor space-conserving. In order to use its on-line capabilities, we modify it and show that it becomes monotonic. While still not having the internal similarity-external dissimilarity property, the worst case classifications of the new algorithm are correctable with an additional small computational effort, on the overall taking O(nk) time for n points and k classes. Experimental study confirm the theoretical improvements upon the initial algorithm. A theoretical and experimental comparison to other algorithms from the literature, shows that it is among the fastest and performs well.

References

[1]
{1} C.C. Aggarwal and P.S. Yu, Outlier detection for high dimensional data, in: SIGMOD 2001 Electronic Proceedings, ed. W.G. Aref (2001). Available to SIGMOD members, http://www.acm.org/sigmod/ sigmod01/eproceedings.]]
[2]
{2} R. Agrawal et al., Automatic subspace clustering of high dimensional data for data mining applications, in: Proceedings of the 1998 ACM-SIGMOD International Conference Management of Data, Seattle Washington (1998) pp. 94-105.]]
[3]
{3} M.R. Anderberg, Cluster analysis for applications, Probability and Mathematical Statistics 19 (1973).]]
[4]
{4} A. Arasu et al., Characterizing memory requirements for queries over continuous data stream, in: Proceedings of the 2002 ACM Symposium on Principles of Database Systems (2002).]]
[5]
{5} T.S. Arthanary and Y. Dodge, Mathematical Programming in Statistics (Wiley, New York, 1988).]]
[6]
{6} B. Babcock et al., Models and issues in data stream systems, Invited paper in Proceedings of the 2002 ACM Symposium on Principles of Database Systems (2002).]]
[7]
{7} A. Berson, S. Smith and K. Thearling, Building Data Mining Applications for CRM (McGraw Hill, New York, 2000).]]
[8]
{8} P.S. Bradley, U. Fayyad and C. Reina, Scaling clustering algorithms to large databases, in: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, eds. R. Agrawal and P. Stolorz (AAAI Press, Menlo Park, CA, 1998) pp. 9-15.]]
[9]
{9} L. Breiman, Classification and Regression Trees (Wadsworth & Brooks/Cole, Pacific Grove, CA, 1984).]]
[10]
{10} A. Carlyle, Developing organized information displays for voluminous works: A study of user clustering behavior, Information Processing and Management 37 (2001) 677-699.]]
[11]
{11} M. Charikar et al., Incremental clustering and dynamic information retrieval, in: Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC'97) (1997) pp. 626-634.]]
[12]
{12} Z. Chen, Space-conserving agglomerative algorithms, Journal of Classification 13 (1996) 157-168.]]
[13]
{13} A.B. Dragut, Characterization of a set of algorithms verifying the internal similarity-external dissimilarity property, Mathematical Reports 4 (2001) 225-232.]]
[14]
{14} A.B. Dragut, On the validity of a hierarchical structure for a modified Zupan algorithm, to appear in Mathematical Reports (2002).]]
[15]
{15} M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining (AAAI/MIT Press, 1996) pp. 226-231.]]
[16]
{16} D. Eppstein, Fast hierarchical clustering and other applications of dynamic closest pairs, in: Proceedings of 9th Annual ACM-SIAM Symposium Discrete Algorithms (1998) pp. 619-628.]]
[17]
{17} V. Estivill-Castro and J. Yang, Categorizing visitors dynamically by fast and robust clustering of access logs, in: Proceedings of the 1st Asia-Pacific Conference on Web Intelligence (WI'2001), Lecture Notes in Artificial Intelligence, Vol. 2198 (Springer, Berlin, 2001) pp. 498-507.]]
[18]
{18} B.S. Everitt, Cluster Analysis, 3rd edn. (Halsted Press, 1993).]]
[19]
{19} D. Fasulo, An analysis of recent work on clustering algorithms, Technical Report 01-03-02, University of Washington, Department of Computer Science and Engineering (1999).]]
[20]
{20} A. Feelders, H. Daniels and M. Holsheimer, Methodological and practical aspects of data mining, Information and Management 37 (2000) 271-281.]]
[21]
{21} J. Felsenstein, PHYLIP, Phylogeny Inference Package version 3.572c, Distributed by the author, University of Washington, Department of Genetics, Seattle, Washington (1995).]]
[22]
{22} W.J. Frawley, G. Piatestky-Shapiro and C.J. Matheus, Knowledge discovery in databases: An overview, in: Knowledge Discovery in Databases, eds. G. Piatestky-Shapiro and W.J. Frawley (AAAI Press/MIT Press, 1991).]]
[23]
{23} U. Gather, R. Fried and M. Imhoff, Online classification of states in intensive care, Festschrift in honor to Hans-Hermann Bock's 60th birthday, in: Data Analysis, Classification, and Applications, eds. W. Gaul, O. Opitz and M. Schader (Springer, Berlin, 2000) pp. 413-428.]]
[24]
{24} A. Guenoche, Order distance associated with a hierarchy, Journal of Classification 14 (1997) 101-115.]]
[25]
{25} S. Guha, R. Rastogi and K. Shim, CURE, an efficient clustering algorithm for large databases, in: SIGMOD Conference 1998 (1998) pp. 73-84.]]
[26]
{26} D.J. Hand, Discrimination and Classification (Wiley, New York, 1992).]]
[27]
{27} A.K. Jain and R.C. Dubes, Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, NJ, 1988).]]
[28]
{28} G. Karypis, E.H. Han and V. Kumar, Multilevel refinement for hierarchical clustering, Technical Report TR-99-020, Department of Computer Science, University of Minnesota, Minneapolis (1999).]]
[29]
{29} L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis (Wiley, New York, 1990).]]
[30]
{30} G.N. Lance and W.T. Williams, A general theory of classificatory sorting strategies 1. Hierarchical systems, The Computer Journal 9 (1966) 373-382.]]
[31]
{31} G.N. Lance and W.T. Williams, A general theory of classificatory sorting strategies 1. Clustering systems, The Computer Journal 10 (1967) 271-277.]]
[32]
{32} R.C.T. Lee, Clustering analysis and its applications, in: Advances in Information System Science, ed. J.T. Tou (Plenum, New York, 1981) pp. 80-150.]]
[33]
{33} I.C. Lerman, Classification et Analyse Ordinale des Données (Dunod, 1981).]]
[34]
{34} B.M.E. Moret, Decision trees and diagrams, Computing Surveys 14 (1982) 593-623.]]
[35]
{35} R.T. Ng and J. Han, Efficient and effective clustering methods for spatial data mining, in: Proceedings of the 20th VLDB Conference, Santiago, Chile (1994) pp. 144-155.]]
[36]
{36} L. O'Callaghan et al., Streaming-data algorithms for high-quality clustering, in: Proceedings of the 18th International Conference on Data Engineering (ICDE) (2002).]]
[37]
{37} M. Perkowits and O. Etzioni, Adaptive web sites: An AI challenge, in: Proceedings of the International Joint Conference on Artif. International, Nagoya, Japan (1998) pp. 16-23.]]
[38]
{38} V. Stefanescu, Some fundamental aspects of hierarchical classification theory, in: Proceedings DIANA 3rd Conference on Discriminant Analysis, Cluster Analysis, Bechyne, Czechoslovakia (June 1990) pp. 251-258.]]
[39]
{39} J.R. Quinlan, Programs for Machine Learning (Morgan Kaufmann, San Mateo, CA, 1993).]]
[40]
{40} S. Salzberg et al., Decision Trees for Automated Identification of Cosmic Ray Hits in Hubble Space Telescope Images, preprint (John Hopkins University, Baltimore, MD, 1995).]]
[41]
{41} C. Shahabi et al., Knowledge discovery from users web-page navigation, in: Proceedings of the IEEE RIDE97 Workshop (1997).]]
[42]
{42} C. Shahabi, Feature matrices: A model for efficient and anonymous web usage mining, EC-Web (2001).]]
[43]
{43} C. Shahabi and F. Banaei-Kashani, Efficient and anonymous web usage mining for web personalization, INFORMS Journal on Computing (Special issue on data mining) (2002).]]
[44]
{44} J.D. Thompson, D.G. Higgins and T.J. Gibson, CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice, Nucleic Acids Research 22 (1994) 4673-4680.]]
[45]
{45} W. Wang, J. Yang and R. Muntz, STING: A statistical information grid approach to spatial data mining, in: Proceedings of the 23rd International Conference on Very Large Data Bases (1997) pp. 186- 195.]]
[46]
{46} W. Wong and A.W. Fu, Incremental document clustering for web page classification, in: IEEE 2000 International Conference on Information Society in the 21st Century: Emerging Technologies and New Challenges (IS2000), Japan (November 5-8, 2000).]]
[47]
{47} J. Xiao, Y. Zhang, X. Jia and T. Li, Measuring similarity of interests for clustering web-users, in: Proceedings of the 12th Australian Database Conference on ADC 2001 (2001) pp. 107-114.]]
[48]
{48} M. Zait and H. Messatfa, A comparative study of clustering methods, Future Generation Computer Systems 13 (1997) 149-159.]]
[49]
{49} T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: A new data clustering algorithm and its applications, Data Mining and Knowledge Discovery 1 (1997) 141-182.]]
[50]
{50} A. Zhou et al., A hybrid approach to clustering in very large databases, in: Proceedings of the 5th Pacific-Asia Conference PAKDD 2001, Advances in Knowledge Discovery and Data Mining, eds. D. Cheung, G.J. Williams and Q. Li, Lecture Notes on Computer Science, Vol. 2035 (Springer, Berlin, 2001) pp. 519-524.]]
[51]
{51} J. Zupan, A new approach to the binary tree based heuristics, Anal. Chem. Acta 112 (1980) 337-342.]]
[52]
{52} J. Zupan, Clustering of Large Data Sets (Wiley, New York, 1983).]]

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Information Technology and Management
Information Technology and Management  Volume 5, Issue 1-2
January-April 2004
200 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 January 2004

Author Tags

  1. Lance–Williams algorithms
  2. linear on-line hierarchical classification

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 0
    Total Downloads
  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 18 Jan 2025

Other Metrics

Citations

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media