Abstract
In this paper, we present a particle swarm optimizer (PSO) to solve the variable weighting problem in projected clustering of high-dimensional data. Many subspace clustering algorithms fail to yield good cluster quality because they do not employ an efficient search strategy. Focusing on soft projected clustering, we design a suitable k-means-type objective function in which changes in variable weights are reflected exponentially. We also transform the original constrained variable weighting problem into a problem with bound constraints, using a normalized representation of variable weights, and we use a particle swarm optimizer to minimize the objective function and search for a global optimum of the variable weighting problem in clustering. Our experimental results on both synthetic and real data show that the proposed algorithm greatly improves cluster quality and is much less dependent on the initial cluster centroids. In an application to text clustering, we show that the algorithm can easily be adapted to other similarity measures, such as the extended Jaccard coefficient for text data, and remains very effective.
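To make the approach concrete, the following is a minimal sketch, in Python/NumPy, of the kind of PSO-driven variable weighting the abstract describes. It is not the authors' implementation: the objective form (per-cluster weights normalized onto the simplex and raised to an exponent beta > 1, so weight changes are amplified exponentially), the bound-constrained particle encoding, and all names and hyper-parameters (pso_weight_clustering, beta, inertia, c1, c2) are illustrative assumptions.

    import numpy as np

    def weighted_kmeans_cost(X, centroids, weights, beta=8.0, eps=1e-12):
        # Map the bound-constrained weights onto the simplex (the normalized
        # representation mentioned in the abstract), then amplify them with beta > 1.
        w = weights / (weights.sum(axis=1, keepdims=True) + eps)
        wb = w ** beta
        # Per-dimension weighted squared distances of every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2 * wb[None, :, :]).sum(axis=2)
        labels = d2.argmin(axis=1)
        return d2[np.arange(len(X)), labels].sum(), labels

    def fitness(X, weights, k, beta=8.0, iters=5, seed=0):
        # Evaluate one particle: a few k-means-style passes under its weight matrix.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            cost, labels = weighted_kmeans_cost(X, centroids, weights, beta)
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return cost

    def pso_weight_clustering(X, k, n_particles=20, iters=50, beta=8.0,
                              inertia=0.72, c1=1.49, c2=1.49, seed=0):
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        pos = rng.random((n_particles, k, d))      # particles: K x D weights in [0, 1]
        vel = np.zeros_like(pos)
        pbest = pos.copy()
        pbest_val = np.array([fitness(X, p, k, beta) for p in pos])
        g = int(pbest_val.argmin())
        gbest, gbest_val = pbest[g].copy(), pbest_val[g]
        for _ in range(iters):
            r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
            vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
            pos = np.clip(pos + vel, 0.0, 1.0)     # keep the bound constraints
            for i in range(n_particles):
                val = fitness(X, pos[i], k, beta)
                if val < pbest_val[i]:
                    pbest[i], pbest_val[i] = pos[i].copy(), val
                    if val < gbest_val:
                        gbest, gbest_val = pos[i].copy(), val
        # Return the best normalized weight matrix and its objective value.
        return gbest / gbest.sum(axis=1, keepdims=True), gbest_val

A call such as pso_weight_clustering(X, k=3) would return one normalized K x D weight matrix whose rows sum to one and can be read as the relevance of each dimension to each cluster. For text data, the squared-distance term in the sketch could be replaced by one minus the extended Jaccard coefficient, in line with the adaptation the abstract mentions.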