Abstract
We herein introduce a new method of interpretable clustering that uses unsupervised binary trees. It is a three-stage procedure, the first stage of which entails a series of recursive binary splits to reduce the heterogeneity of the data within the new subsamples. During the second stage (pruning), consideration is given to whether adjacent nodes can be aggregated. Finally, during the third stage (joining), similar clusters are joined together, even if they do not share the same parent originally. Consistency results are obtained, and the procedure is used on simulated and real data sets.
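The following is a minimal sketch of the forward (growing) stage only, assuming that within-node heterogeneity is measured by the within-node sum of squares (the deviance reduction analysed in the Appendix). The function names and the stopping fraction mindev are illustrative choices, not the authors' implementation; minsize mirrors the node-size parameter appearing in the proof of Theorem 3.1.

import numpy as np

def deviance(A):
    # within-node sum of squares: sum of ||x - mean(A)||^2 over the rows of A
    return ((A - A.mean(axis=0)) ** 2).sum()

def best_split(A, minsize):
    # exhaustive search for the axis-aligned split maximising the reduction
    # deviance(A) - deviance(left) - deviance(right), subject to a minimum node size
    best = (None, None, -np.inf)              # (coordinate, threshold, reduction)
    parent = deviance(A)
    for j in range(A.shape[1]):
        for a in np.unique(A[:, j])[:-1]:     # candidate thresholds on coordinate j
            left, right = A[A[:, j] <= a], A[A[:, j] > a]
            if len(left) < minsize or len(right) < minsize:
                continue
            reduction = parent - deviance(left) - deviance(right)
            if reduction > best[2]:
                best = (j, a, reduction)
    return best

def grow(A, minsize=10, mindev=0.05):
    # forward stage: keep splitting while an admissible split removes more than
    # a fraction mindev of the node's heterogeneity (illustrative stopping rule)
    j, a, reduction = best_split(A, minsize)
    if j is None or reduction <= mindev * deviance(A):
        return {"leaf": True, "n": len(A), "center": A.mean(axis=0)}
    return {"leaf": False, "coord": j, "threshold": a,
            "left": grow(A[A[:, j] <= a], minsize, mindev),
            "right": grow(A[A[:, j] > a], minsize, mindev)}

The leaves of the tree returned by grow play the role of the initial clusters. The pruning stage then considers aggregating sibling nodes, and the joining stage merges similar clusters even when they do not share the same parent (see the abstract above); neither backward stage is sketched here.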
Appendix
1.1 Proof of Lemma 3.1
We first observe that, because \(t_l\) and \(t_r\) are disjoint, \(\mathbb{E}(X_{t_l \cup t_r}) = \gamma \mu_l + (1 - \gamma) \mu_r\), where \(\gamma = P(\mathbf{X} \in t_l \mid \mathbf{X} \in t_l \cup t_r)\). For \(j = 1, \ldots, p\) we write \(M^{(j)}_{2i} = \mathbb{E}(X_{t_i}(j)^2) = \frac{1}{P(\mathbf{X} \in t_i)} \int_{t_i} x(j)^2\, dF(x)\), \(i = l, r\), where \(F\) stands for the distribution function of the vector \(\mathbf{X}\).
It then follows that \(\mathbb{E}(X_{t_l \cup t_r}(j)^2) = \gamma M^{(j)}_{2l} + (1 - \gamma) M^{(j)}_{2r}\), and therefore that
\[
\mathrm{Var}\big(X_{t_l \cup t_r}(j)\big) = \gamma\, \mathrm{Var}\big(X_{t_l}(j)\big) + (1 - \gamma)\, \mathrm{Var}\big(X_{t_r}(j)\big) + \gamma (1 - \gamma) \big(\mu_l(j) - \mu_r(j)\big)^2 .
\]
Summing over \(j = 1, \ldots, p\) gives the desired result.
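For readability, the conditional-variance algebra behind the last display can be spelled out; the following uses only the two identities stated above (a routine computation, no additional assumption).

\begin{align*}
\mathrm{Var}\big(X_{t_l \cup t_r}(j)\big)
&= \mathbb{E}\big(X_{t_l \cup t_r}(j)^2\big) - \big(\mathbb{E}\, X_{t_l \cup t_r}(j)\big)^2 \\
&= \gamma M^{(j)}_{2l} + (1-\gamma) M^{(j)}_{2r} - \big(\gamma \mu_l(j) + (1-\gamma) \mu_r(j)\big)^2 \\
&= \gamma \big(M^{(j)}_{2l} - \mu_l(j)^2\big) + (1-\gamma)\big(M^{(j)}_{2r} - \mu_r(j)^2\big) + \gamma \mu_l(j)^2 + (1-\gamma) \mu_r(j)^2 - \big(\gamma \mu_l(j) + (1-\gamma) \mu_r(j)\big)^2 \\
&= \gamma\, \mathrm{Var}\big(X_{t_l}(j)\big) + (1-\gamma)\, \mathrm{Var}\big(X_{t_r}(j)\big) + \gamma(1-\gamma) \big(\mu_l(j) - \mu_r(j)\big)^2 .
\end{align*}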
1.2 Proof of Theorem 3.1
Let \(\mathbb{T}\) be the family of polygons in \(\mathbb{R}^p\) with faces orthogonal to the axes, and fix \(i \in \{1, \ldots, p\}\) and \(t \in \mathbb{T}\). For \(a \in \mathbb{R}\) let \(t_l = \{ x \in t : x(i) \le a\}\) and \(t_r = t \setminus t_l\). We define \(r(t,i,a) = R(t) - R(t_l) - R(t_r)\) and its empirical counterpart \(r_n(t,i,a) = R_n(t) - R_n(t_l) - R_n(t_r)\).
We start by showing the uniform convergence
\[
\sup_{a \in \mathbb{R}}\, \sup_{t \in \mathbb{T}} \big| r_n(t,i,a) - r(t,i,a) \big| \rightarrow 0 \quad a.s. \qquad (8)
\]
By Lemma 3.1,
\[
r(t,i,a) = \frac{\alpha_{t_l}\, \alpha_{t_r}}{\alpha_{t}}\, \big\Vert \mu_l(a) - \mu_r(a) \big\Vert^2, \qquad (9)
\]
where \(\alpha_A = P(\mathbf{X} \in A)\) and \(\mu_j(a) = \mathbb{E}(X_{t_j})\), \(j = l, r\). Then the pairs \((i_{jn}, a_{jn})\) and \((i_j, a_j)\) are the arguments that maximise the right-hand side of Eq. (9) with respect to the measures \(P_n\) and \(P\), respectively. We observe that the right-hand side of Eq. (9) equals
\[
\int_t \Vert x \Vert^2\, dP(x) - \frac{1}{P(t)} \Big\Vert \int_t x\, dP(x) \Big\Vert^2 - \sum_{j=l,r} \left( \int_{t_j} \Vert x \Vert^2\, dP(x) - \frac{1}{P(t_j)} \Big\Vert \int_{t_j} x\, dP(x) \Big\Vert^2 \right).
\]
In order to prove Eq. (8) it is sufficient to show that:
1. \(\sup_{a \in \mathbb{R}} \sup_{t \in \mathbb{T}} \vert P_n(t_j) - P(t_j) \vert \rightarrow 0\) a.s., \(j = l, r\);
2. \(\sup_{a \in \mathbb{R}} \sup_{t \in \mathbb{T}} \big\vert \int_{t_j} \Vert x \Vert^2\, dP_n(x) - \int_{t_j} \Vert x \Vert^2\, dP(x) \big\vert \rightarrow 0\) a.s., \(j = l, r\);
3. \(\sup_{a \in \mathbb{R}} \sup_{t \in \mathbb{T}} \big\vert \int_{t_j} x(i)\, dP_n(x) - \int_{t_j} x(i)\, dP(x) \big\vert \rightarrow 0\) a.s., \(j = l, r\), \(i = 1, \ldots, p\).
Since \(\mathbb{T}\) is a Vapnik–Chervonenkis class, (1) holds. Now observe that the conditions for uniform convergence over families of sets still hold when dealing with finite signed measures. Therefore, applying them to the finite measure \(\Vert x \Vert^2\, dP(x)\) and to the finite signed measure \(x(i)\, dP(x)\), we obtain that (2) and (3) also hold.
Since
we have that
a.s.
In the first step of the algorithm, \(t = \mathbb{R}^p\), and we obtain \(i_{n1} = i_1\) for \(n\) large enough and \(a_{n1} \rightarrow a_1\) a.s. In the next step, the empirical procedure begins to work with \(t_{nl}\) and \(t_{nr}\), while the population algorithm does so with \(t_l\) and \(t_r\). However, we have that
\[
\sup_{a \in \mathbb{R}} \big| r_n(t_{nj}, i, a) - r(t_j, i, a) \big| \le \sup_{a \in \mathbb{R}} \big| r_n(t_{nj}, i, a) - r(t_{nj}, i, a) \big| + \sup_{a \in \mathbb{R}} \big| r(t_{nj}, i, a) - r(t_j, i, a) \big| \qquad (12)
\]
for \(j = l, r\).
We already know that the first term on the right-hand side of Eq. (12) converges to zero almost surely. In order to show that the second term also converges to zero, it is sufficient to show that:
1. \(\sup_{a \in \mathbb{R}} \vert P(t_{nj}) - P(t_j) \vert \rightarrow 0\) a.s., \(j = l, r\);
2. \(\sup_{a \in \mathbb{R}} \big\vert \int_{t_j} \Vert x \Vert^2\, dP(x) - \int_{t_{nj}} \Vert x \Vert^2\, dP(x) \big\vert \rightarrow 0\) a.s., \(j = l, r\);
3. \(\sup_{a \in \mathbb{R}} \big\vert \int_{t_j} x(i)\, dP(x) - \int_{t_{nj}} x(i)\, dP(x) \big\vert \rightarrow 0\) a.s., \(j = l, r\), \(i = 1, \ldots, p\),
which follows from the assumption that \(\Vert x \Vert ^2 f(x)\) is bounded. This concludes the proof since \(minsize/n \rightarrow \tau \).
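Before turning to Theorem 3.2, a small numerical illustration of the quantity maximised above may be useful. It assumes that \(R_n(t)\) is the empirical within-node sum of squares scaled by \(1/n\) (the reading consistent with Lemma 3.1); under that assumption, \(r_n(t,i,a)\) coincides with the empirical version of the right-hand side of Eq. (9). The data, the split coordinate and the threshold below are arbitrary; the snippet only checks the identity and is not part of the proof.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))     # here the node t is the whole sample, so alpha_t = 1
i, a = 0, 0.2                     # illustrative split: coordinate i at threshold a

left, right = X[X[:, i] <= a], X[X[:, i] > a]
n, nl, nr = len(X), len(left), len(right)

def emp_dev(A, n):
    # assumed empirical heterogeneity R_n: (1/n) * sum of ||x - mean(A)||^2 over A
    return ((A - A.mean(axis=0)) ** 2).sum() / n

# r_n(t, i, a) = R_n(t) - R_n(t_l) - R_n(t_r)
lhs = emp_dev(X, n) - emp_dev(left, n) - emp_dev(right, n)

# empirical analogue of (alpha_{t_l} alpha_{t_r} / alpha_t) * ||mu_l - mu_r||^2, with alpha_t = 1
rhs = (nl / n) * (nr / n) * np.sum((left.mean(axis=0) - right.mean(axis=0)) ** 2)

assert np.isclose(lhs, rhs)       # the two expressions agree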
1.3 Proof of Theorem 3.2
We need to show that we have consistency in both steps of the backward algorithm.
(i) Convergence of the pruning step. Let \(\{t^*_{1n}, \ldots, t^*_{mn}\}\) be the output of the forward algorithm. The partition produced by the pruning step converges to its population counterpart as a consequence of:
- the conclusions of Theorem 3.1;
- the fact that the random variables \(W_{lr}\), \(\tilde{d}_l\) and \(\tilde{d}_r\) are positive;
- the uniform convergence of the empirical quantile function to its population version;
- the Lebesgue dominated convergence theorem.
(ii) Convergence of the joining step. The proof is essentially the same as that of (i).
Cite this article
Fraiman, R., Ghattas, B. & Svarc, M. Interpretable clustering using unsupervised binary trees. Adv Data Anal Classif 7, 125–145 (2013). https://doi.org/10.1007/s11634-013-0129-3