Fast algorithms for projected clustering

Published: 01 June 1999

Abstract

The clustering problem is well known in the database literature for its numerous applications in problems such as customer segmentation, classification and trend analysis. Unfortunately, all known algorithms tend to break down in high dimensional spaces because of the inherent sparsity of the points. In such high dimensional spaces not all dimensions may be relevant to a given cluster. One way of handling this is to pick the closely correlated dimensions and find clusters in the corresponding subspace. Traditional feature selection algorithms attempt to achieve this. The weakness of this approach is that in typical high dimensional data mining applications different sets of points may cluster better for different subsets of dimensions. The number of dimensions in each such cluster-specific subspace may also vary. Hence, it may be impossible to find a single small subset of dimensions for all the clusters. We therefore discuss a generalization of the clustering problem, referred to as the projected clustering problem, in which the subsets of dimensions selected are specific to the clusters themselves. We develop an algorithmic framework for solving the projected clustering problem, and test its performance on synthetic data.
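The central idea, that each cluster has its own subset of relevant dimensions, can be illustrated with a small sketch. This is not the algorithmic framework developed in the paper, which the abstract does not describe. It assumes cluster membership is already known and simply picks, for each cluster, the dimensions along which its points are most tightly packed. All function names, the spread measure, and the synthetic data parameters are hypothetical choices made for illustration.

```python
import random

# Illustrative sketch only: cluster membership is assumed to be known,
# and the number of dimensions to keep per cluster is supplied by the caller.

def per_dimension_spread(points):
    """Mean absolute deviation from the centroid, one value per dimension."""
    n, d = len(points), len(points[0])
    centroid = [sum(p[j] for p in points) / n for j in range(d)]
    return [sum(abs(p[j] - centroid[j]) for p in points) / n for j in range(d)]

def select_dimensions(points, num_dims):
    """Return the indices of the num_dims dimensions with the smallest spread."""
    spread = per_dimension_spread(points)
    ranked = sorted(range(len(spread)), key=lambda j: spread[j])
    return sorted(ranked[:num_dims])

def projected_distance(p, q, dims):
    """Average absolute difference over the chosen dimensions only."""
    return sum(abs(p[j] - q[j]) for j in dims) / len(dims)

def make_cluster(rng, size, total_dims, tight_dims, center):
    """Synthetic cluster: tight around `center` on tight_dims, uniform noise elsewhere."""
    points = []
    for _ in range(size):
        point = [rng.uniform(0.0, 100.0) for _ in range(total_dims)]
        for j in tight_dims:
            point[j] = rng.gauss(center, 1.0)
        points.append(point)
    return points

rng = random.Random(0)
# Two clusters in a 6-dimensional space, each correlated on different dimensions.
cluster_a = make_cluster(rng, 200, 6, tight_dims=[0, 1], center=20.0)
cluster_b = make_cluster(rng, 200, 6, tight_dims=[2, 3, 4], center=70.0)

dims_a = select_dimensions(cluster_a, 2)
dims_b = select_dimensions(cluster_b, 3)
print(dims_a, dims_b)
```

On this synthetic data the two clusters should end up with different dimension subsets of different sizes, so no single global feature selection would serve both. Distances between points of one cluster are then measured with `projected_distance` over that cluster's own dimensions, which ignores the noisy coordinates that dominate a full-dimensional distance.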

Information & Contributors

Information

Published In

ACM SIGMOD Record, Volume 28, Issue 2
June 1999
599 pages
ISSN: 0163-5808
DOI: 10.1145/304181
  • ACM Conferences
    SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
    June 1999
    604 pages
    ISBN: 1581130848
    DOI: 10.1145/304182
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1999
Published in SIGMOD Volume 28, Issue 2

Qualifiers

  • Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 448
  • Downloads (Last 6 weeks): 66
Reflects downloads up to 25 Dec 2024

Citations

Cited By

  • (2024) High-Dimensional Projected Clustering for Learner Competency Analysis in Medical Training Programs. IEEE Access 12, pp. 171807-171823. DOI: 10.1109/ACCESS.2024.3496318. Online publication date: 2024
  • (2024) A parameter free relative density based biclustering method for identifying non-linear feature relations. Heliyon 10:15, e34736. DOI: 10.1016/j.heliyon.2024.e34736. Online publication date: Aug-2024
  • (2024) An integrated intrusion detection framework based on subspace clustering and ensemble learning. Computers and Electrical Engineering 115:C. DOI: 10.1016/j.compeleceng.2024.109113. Online publication date: 2-Jul-2024
  • (2024) A comprehensive review of clustering techniques in artificial intelligence for knowledge discovery: Taxonomy, challenges, applications and future prospects. Advanced Engineering Informatics 62, 102799. DOI: 10.1016/j.aei.2024.102799. Online publication date: Oct-2024
  • (2024) Cluster analysis. Fundamentals of Data Science, pp. 181-214. DOI: 10.1016/B978-0-32-391778-0.00016-8. Online publication date: 2024
  • (2024) A Novel Hierarchical High-Dimensional Unsupervised Active Learning Method. International Journal of Computational Intelligence Systems 17:1. DOI: 10.1007/s44196-024-00601-w. Online publication date: 24-Jul-2024
  • (2024) Clustering graph data: the roadmap to spectral techniques. Discover Artificial Intelligence 4:1. DOI: 10.1007/s44163-024-00102-x. Online publication date: 22-Jan-2024
  • (2023) Big Data y diferentes enfoques de clustering subespacial: De la promoción en redes sociales al mapeo genómico. Salud, Ciencia y Tecnología 3, 413. DOI: 10.56294/saludcyt2023413. Online publication date: 19-Jun-2023
  • (2023) Semi-supervised Projected Subspace Clustering. 2023 5th International Conference on Frontiers Technology of Information and Computer (ICFTIC), pp. 1021-1024. DOI: 10.1109/ICFTIC59930.2023.10456306. Online publication date: 17-Nov-2023
  • (2023) Approximation algorithms for orthogonal line centers. Discrete Applied Mathematics 338:C, pp. 69-76. DOI: 10.1016/j.dam.2023.05.014. Online publication date: 30-Oct-2023
