Abstract
The problem of choosing the correct number of clusters is as old as cluster analysis itself. A number of authors have suggested various indexes to facilitate this crucial decision. One of the most extensive comparative studies of indexes was conducted by Milligan and Cooper (1985). The present study pursues the same goal under different conditions. In contrast to Milligan and Cooper's work, the emphasis here is on high-dimensional empirical binary data. Artificial binary data sets are constructed to reflect features typically encountered in real-world data in the field of marketing research. The simulation comprises 162 binary data sets, which are clustered by two different algorithms, yielding a recommendation on the number of clusters for each index under consideration. The index results are evaluated, and their performance is compared and analyzed.
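To make the procedure concrete, the following is a minimal sketch of how a single index can turn a set of clusterings into a recommended number of clusters. It is not the study's setup: the abstract names neither the indexes nor the two algorithms, so the choice of plain k-means and of the Calinski and Harabasz (1974) index is an assumption for illustration, and the toy data generator (`make_toy_binary_data`, three segments with one deterministic bit flip per respondent) is hypothetical.

```python
def make_toy_binary_data(n_per_segment=24, block=4, n_segments=3):
    """Hypothetical toy data: each segment says 'yes' on its own block of items;
    respondent i deviates from the segment profile on item i % n_items."""
    n_items = block * n_segments
    data = []
    for s in range(n_segments):
        profile = [1 if s * block <= v < (s + 1) * block else 0
                   for v in range(n_items)]
        for i in range(n_per_segment):
            row = profile[:]
            row[i % n_items] ^= 1  # deterministic one-bit "noise"
            data.append(row)
    return data


def sqdist(a, b):
    """Squared Euclidean distance (equals Hamming distance for 0/1 vectors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(data, k):
    """Deterministic start: first point, then repeatedly the point farthest
    from all centers chosen so far (an assumption, not the paper's method)."""
    centers = [list(data[0])]
    while len(centers) < k:
        nxt = max(data, key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(data, k, max_iter=100):
    """Plain k-means; returns labels and the cluster means."""
    centers = farthest_first_init(data, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sqdist(p, centers[j]))
                      for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:  # keep the old center if a cluster runs empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def calinski_harabasz(data, labels, centers):
    """CH(k) = [B / (k - 1)] / [W / (n - k)], with B the between-cluster and
    W the within-cluster sum of squares."""
    n, k = len(data), len(centers)
    grand_mean = [sum(col) / n for col in zip(*data)]
    within = sum(sqdist(p, centers[l]) for p, l in zip(data, labels))
    between = sum(labels.count(j) * sqdist(centers[j], grand_mean)
                  for j in range(k))
    return (between / (k - 1)) / (within / (n - k))


data = make_toy_binary_data()
scores = {}
for k in range(2, 7):  # candidate numbers of clusters
    labels, centers = kmeans(data, k)
    scores[k] = calinski_harabasz(data, labels, centers)

# The index "recommends" the candidate k with the highest value.
recommended_k = max(scores, key=scores.get)
print(recommended_k, scores)
```

On this deliberately clean toy example the index recovers the three constructed segments. A comparative study of the kind described above would repeat this step for every index, every data set, and both algorithms, and then compare how often each index recommends the true number of clusters.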
References
Aldenderfer, M.S., & Blashfield, R.K. (1996). Cluster analysis. London, U.K.: Sage Publications.
Andrews, D.F. (1972). Plots of high-dimensional data. Biometrics, 28, 125–136.
Arabie, P., & Hubert, L.J. (1996). Clustering and classification (pp. 5–63). River Edge, NJ: World Scientific.
Arratia, R., & Lander, E.S. (1990). The distribution of clusters in random graphs. Advances in Applied Mathematics, 11, 36–48.
Baker, F.B., & Hubert, L.J. (1975). Measuring the power of hierarchical cluster analysis. Journal of the American Statistical Association, 70, 31–38.
Ball, G.H., & Hall, D.J. (1965). ISODATA, A novel method of data analysis and pattern classification (Tech. Rep. NTIS No. AD 699616). Menlo Park, CA: Stanford Research Institute.
Baroni-Urbani, C., & Buser, M.W. (1976). Similarity of binary data. Systematic Zoology, 25, 251–259.
Baulieu, F. (1989). A classification of presence/absence based dissimilarity coefficients. Journal of Classification, 6, 233–246.
Calinski, R.B., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3, 1–27.
Cheetham, H., & Hazel, J. (1969). Binary (presence-absence) similarity coefficients. Journal of Paleontology, 43, 1130–1136.
Cox, D. (1970). The analysis of binary data. London, U.K.: Chapman and Hall.
Davies, D.L., & Bouldin, D.W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 224–227.
Dolnicar, S., Grabler, K., & Mazanec, J. (2000). A tale of three cities: Perceptual charting for analysing destination images. In A. Woodside (Ed.), Consumer psychology of tourism, hospitality and leisure (pp. 39–62). London, U.K.: CAB International.
Dolnicar, S., Leisch, F., Weingessel, A., Buchta, C., & Dimitriadou, E. (1998). A comparison of several cluster algorithms on artificial binary data scenarios from tourism marketing (Working Paper 7, SFB). Wien, Austria: Adaptive Information Systems. (http://www.wu-wien.ac.at/am)
Edwards, A.W.F., & Cavalli-Sforza, L. (1965). A method for cluster analysis. Biometrics, 21, 362–375.
Formann, A.K. (1984). Die Latent-Class-Analyse: Einführung in die Theorie und Anwendung [Latent class analysis: Introduction into theory and application]. Weinheim, Germany: Beltz.
Friedman, H.P., & Rubin, J. (1967). On some invariant criteria for grouping data. Journal of the American Statistical Association, 62, 1159–1178.
Fritzke, B. (1997). Some competitive learning methods. Unpublished manuscript [On-line draft document available at http://www.ki.inf.tu-dresden.de/ fritzke/JavaPaper/t.html or http://www.neuroinformatik.ruhr-unibochum.de/ini/VDM/research/gsn/].
Fukunaga, K., & Koontz, W.L.G. (1970). A criterion and an algorithm for grouping data. IEEE Transactions on Computers, C-19, 917–923.
Gower, J.C. (1985). Measures of similarity, dissimilarity, and distance. In S. Kotz & N.L. Johnson (Eds.), Encyclopedia of Statistical Sciences, Vol. 5 (pp. 397–405). New York, NY: Wiley.
Green, P.E., Tull, D.S., & Albaum, G. (1988). Research for Marketing Decisions (5th ed., The Prentice Hall Series in Marketing). Englewood Cliffs, NJ: Prentice-Hall.
Hall, D.J., Duda, R.O., Huffman, D.A., & Wolf, E.E. (1973). Development of new pattern recognition methods (Tech. Rep. NTIS No. AD 7726141). Los Angeles, CA: Aerospace Research Laboratories.
Hartigan, J.A. (1975). Clustering algorithms. New York, NY: Wiley.
Hubalek, L. (1982). Coefficients of association and similarity, based on binary (presence-absence) data: An evaluation. Biological Reviews, 57, 669–689.
Hubert, L.J., & Levin, J.R. (1976). A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 83, 1072–1080.
Kaufmann, H., & Pape, H. (1996). Multivariate statistische Verfahren (2nd ed.) [Multivariate statistical methods]. Berlin: Walter de Gruyter.
Li, X., & Dubes, R.C. (1989). A probabilistic measure of similarity for binary data in pattern recognition. Pattern Recognition, 22(4), 397–409.
Linde, Y., Buzo, A., & Gray, R.M. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, COM-28(1), 84–95.
Marriott, F.H.C. (1971). Practical problems in a method of cluster analysis. Biometrics, 27, 501–514.
McCutcheon, A.L. (1987). Latent class analysis (Sage University Paper series on Quantitative Applications in the Social Sciences, Series No. 07-064). Beverly Hills, CA: Sage Publications.
Milligan, G.W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.
Milligan, G.W. (1981). A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46, 187–199.
Milligan, G.W., & Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
Orloci, L. (1967). An agglomerative method of classification of plant communities. Journal of Ecology, 55, 193–206.
Ramaswamy, W., Chatterjee, R., & Cohen, S.H. (1996). Joint segmentation on distinct interdependent bases with categorical data. Journal of Marketing Research, 33, 337–350.
Ratkowsky, D.A., & Lance, G.N. (1978). A criterion for determining the number of groups in a classification. Australian Computer Journal, 10, 115–117.
Rost, J. (1996). Testtheorie, Testkonstruktion [Theory and construction of tests]. Bern: Verlag Hans Huber.
Sarle, W.S. (1983). Cubic clustering criterion (Tech. Rep. A-108). Research Triangle Park, NC: SAS Institute.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Scott, A.J., & Symons, M.J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387–397.
Thorndike, R.L. (1953). Who belongs in the family? Psychometrika, 18, 267–276.
Wedel, M., & Kamakura, W.A. (1998). Market segmentation: Conceptual and methodological foundations (pp. 89–92). Boston/Dordrecht/London: Kluwer Academic.
Wolfe, J.H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, 5, 329–350.
Xu, L. (1997). Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Letters, 18, 1167–1178.
Yang, M.-S., & Yu, K.F. (1990). On stochastic convergence theorems for the fuzzy c-means clustering procedure. International Journal of General Systems, 16, 397–411.
Additional information
Author names are listed in alphabetical order.
This research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 (“Adaptive Information Systems and Modeling in Economics and Management Science”).
The authors would like to thank the anonymous reviewers and especially the associate editor for their helpful comments and suggestions.
Cite this article
Dimitriadou, E., Dolničar, S. & Weingessel, A. An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67, 137–159 (2002). https://doi.org/10.1007/BF02294713
DOI: https://doi.org/10.1007/BF02294713