Abstract
This paper analyses the application of Simplified Silhouette to the evaluation of k-means clustering validity and compares it with the k-means Cost Function and the original Silhouette. We conclude that, for a given dataset, the k-means Cost Function is the most valid and efficient measure for evaluating k-means clusterings that share the same k value, whereas Simplified Silhouette is more suitable than the original Silhouette for selecting the best result among k-means clusterings with different k values.
Notes
- 1.
Here we assume that the cluster contains at least two different data points. Otherwise, a(i) is set to 0 and sil(i) is 1.
- 2.
In preparing our experiments we tested two initialisation methods for k-means: random initialisation and the well-known k-means++ algorithm. The initialisation method made no difference to our results, so in this paper we report only the results obtained with random initialisation.
- 3.
If other methods, such as k-means++, are used to select the initial centroids, the desired k value is very likely to be obtained for all of the synthetic datasets.
- 4.
Due to time and resource limitations, Simplified Silhouette has not been fully explored in this paper; for example, real industrial datasets were not available. Nevertheless, this work is an attempt to evaluate internal measures for a specific clustering algorithm. Evaluation methods should be assessed, selected and even designed for specific algorithms or conditions, rather than applying the same set of general methods to every situation.
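The three measures compared in this paper, together with the singleton convention from Note 1, can be illustrated with a minimal Python sketch. It assumes Euclidean distance and the commonly used definitions: the Cost Function is the sum of squared distances to the assigned centroid, the original Silhouette uses mean pairwise distances, and Simplified Silhouette replaces these with point-to-centroid distances. The function names, the toy data and the fallback value of 0 when both distances are zero are illustrative assumptions, not the implementation used in the experiments.

```python
import math


def centroids_of(points, labels):
    """Return {label: centroid}, where a centroid is the mean of its cluster."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: tuple(sum(axis) / len(ps) for axis in zip(*ps))
            for lab, ps in groups.items()}


def kmeans_cost(points, labels):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    cents = centroids_of(points, labels)
    return sum(math.dist(p, cents[lab]) ** 2 for p, lab in zip(points, labels))


def _ratio(a, b):
    # (b - a) / max(a, b); returning 0.0 when a == b == 0 is an assumption.
    top = max(a, b)
    return (b - a) / top if top > 0 else 0.0


def silhouette_scores(points, labels):
    """Original Silhouette per point (requires at least two clusters)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        sums, counts = {}, {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if j != i:
                sums[lab] = sums.get(lab, 0.0) + math.dist(p, q)
                counts[lab] = counts.get(lab, 0) + 1
        # Singleton cluster: a(i) is set to 0, so sil(i) becomes 1 (Note 1).
        a = sums[own] / counts[own] if own in counts else 0.0
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sums[lab] / counts[lab] for lab in sums if lab != own)
        scores.append(_ratio(a, b))
    return scores


def simplified_silhouette_scores(points, labels):
    """Simplified Silhouette per point, using distances to centroids only."""
    cents = centroids_of(points, labels)
    scores = []
    for p, own in zip(points, labels):
        a = math.dist(p, cents[own])  # distance to own centroid
        b = min(math.dist(p, c) for lab, c in cents.items() if lab != own)
        scores.append(_ratio(a, b))
    return scores


# Hypothetical toy data: two pairs of points and one singleton cluster.
points = [(0, 0), (0, 2), (10, 0), (10, 2), (5, 20)]
labels = [0, 0, 1, 1, 2]

cost = kmeans_cost(points, labels)
sil = silhouette_scores(points, labels)
ssil = simplified_silhouette_scores(points, labels)
print(cost, sum(sil) / len(sil), sum(ssil) / len(ssil))
```

In this sketch the original Silhouette needs a distance between every pair of points, while Simplified Silhouette needs only one distance per point and centroid, which is why the latter is cheaper to compute on large datasets.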
Acknowledgement
The authors wish to acknowledge the support of Enterprise Ireland through the Innovation Partnership Programme SmartSeg 2 and the ADAPT Research Centre. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Funds.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Wang, F., Franco-Penya, HH., Kelleher, J.D., Pugh, J., Ross, R. (2017). An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2017. Lecture Notes in Computer Science, vol 10358. Springer, Cham. https://doi.org/10.1007/978-3-319-62416-7_21
DOI: https://doi.org/10.1007/978-3-319-62416-7_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62415-0
Online ISBN: 978-3-319-62416-7
eBook Packages: Computer Science, Computer Science (R0)