Abstract
We herein introduce a new method of interpretable clustering that uses unsupervised binary trees. It is a three-stage procedure, the first stage of which entails a series of recursive binary splits to reduce the heterogeneity of the data within the new subsamples. During the second stage (pruning), consideration is given to whether adjacent nodes can be aggregated. Finally, during the third stage (joining), similar clusters are joined together, even if they do not share the same parent originally. Consistency results are obtained, and the procedure is used on simulated and real data sets.
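The following is a minimal sketch of the forward (growing) stage only, assuming that within-node heterogeneity is measured by the within-node sum of squares (the deviance reduction analysed in the Appendix). The function names and the stopping fraction mindev are illustrative choices, not the authors' implementation; minsize mirrors the node-size parameter appearing in the proof of Theorem 3.1.

import numpy as np

def deviance(A):
    # within-node sum of squares: sum of ||x - mean(A)||^2 over the rows of A
    return ((A - A.mean(axis=0)) ** 2).sum()

def best_split(A, minsize):
    # exhaustive search for the axis-aligned split maximising the reduction
    # deviance(A) - deviance(left) - deviance(right), subject to a minimum node size
    best = (None, None, -np.inf)              # (coordinate, threshold, reduction)
    parent = deviance(A)
    for j in range(A.shape[1]):
        for a in np.unique(A[:, j])[:-1]:     # candidate thresholds on coordinate j
            left, right = A[A[:, j] <= a], A[A[:, j] > a]
            if len(left) < minsize or len(right) < minsize:
                continue
            reduction = parent - deviance(left) - deviance(right)
            if reduction > best[2]:
                best = (j, a, reduction)
    return best

def grow(A, minsize=10, mindev=0.05):
    # forward stage: keep splitting while an admissible split removes more than
    # a fraction mindev of the node's heterogeneity (illustrative stopping rule)
    j, a, reduction = best_split(A, minsize)
    if j is None or reduction <= mindev * deviance(A):
        return {"leaf": True, "n": len(A), "center": A.mean(axis=0)}
    return {"leaf": False, "coord": j, "threshold": a,
            "left": grow(A[A[:, j] <= a], minsize, mindev),
            "right": grow(A[A[:, j] > a], minsize, mindev)}

The leaves of the tree returned by grow play the role of the initial clusters. The pruning stage then considers aggregating sibling nodes, and the joining stage merges similar clusters even when they do not share the same parent (see the abstract above); neither backward stage is sketched here.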
Appendix
1.1 Proof of Lemma 3.1
We first observe that, because \(t_l\) and \(t_r\) are disjoint, \(\mathbb{E}(X_{t_l \cup t_r}) = \gamma \mu_l + (1 - \gamma) \mu_r\), where \(\gamma = P(\mathbf{X} \in t_l \mid \mathbf{X} \in t_l \cup t_r)\). For \(j = 1, \ldots, p\) we write \(M^{(j)}_{2i} = \mathbb{E}(X_{t_i}(j)^2) = \frac{1}{P(\mathbf{X} \in t_i)} \int_{t_i} x(j)^2\, dF(x)\), \(i = l, r\), where \(F\) stands for the distribution function of the vector \(\mathbf{X}\).
It then follows that \(\mathbb{E}(X_{t_l \cup t_r}(j)^2) = \gamma M^{(j)}_{2l} + (1 - \gamma) M^{(j)}_{2r}\), and therefore that
\[
\mathrm{Var}\big(X_{t_l \cup t_r}(j)\big) = \gamma\, \mathrm{Var}\big(X_{t_l}(j)\big) + (1 - \gamma)\, \mathrm{Var}\big(X_{t_r}(j)\big) + \gamma (1 - \gamma) \big(\mu_l(j) - \mu_r(j)\big)^2 .
\]
Summing over \(j = 1, \ldots, p\) gives the desired result.
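For readability, the conditional-variance algebra behind the last display can be spelled out; the following uses only the two identities stated above (a routine computation, no additional assumption).

\begin{align*}
\mathrm{Var}\big(X_{t_l \cup t_r}(j)\big)
&= \mathbb{E}\big(X_{t_l \cup t_r}(j)^2\big) - \big(\mathbb{E}\, X_{t_l \cup t_r}(j)\big)^2 \\
&= \gamma M^{(j)}_{2l} + (1-\gamma) M^{(j)}_{2r} - \big(\gamma \mu_l(j) + (1-\gamma) \mu_r(j)\big)^2 \\
&= \gamma \big(M^{(j)}_{2l} - \mu_l(j)^2\big) + (1-\gamma)\big(M^{(j)}_{2r} - \mu_r(j)^2\big) + \gamma \mu_l(j)^2 + (1-\gamma) \mu_r(j)^2 - \big(\gamma \mu_l(j) + (1-\gamma) \mu_r(j)\big)^2 \\
&= \gamma\, \mathrm{Var}\big(X_{t_l}(j)\big) + (1-\gamma)\, \mathrm{Var}\big(X_{t_r}(j)\big) + \gamma(1-\gamma) \big(\mu_l(j) - \mu_r(j)\big)^2 .
\end{align*}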
1.2 Proof of Theorem 3.1
Let \(\mathbb{T}\) be the family of polygons in \(\mathbb{R}^p\) with faces orthogonal to the axes, and fix \(i \in \{1, \ldots, p\}\) and \(t \in \mathbb{T}\). For \(a \in \mathbb{R}\) let \(t_l = \{ x \in t : x(i) \le a\}\) and \(t_r = t \setminus t_l\). We define \(r(t,i,a) = R(t) - R(t_l) - R(t_r)\) and its empirical counterpart \(r_n(t,i,a) = R_n(t) - R_n(t_l) - R_n(t_r)\).
We start by showing the uniform convergence
\[
\sup_{a \in \mathbb{R}}\, \sup_{t \in \mathbb{T}} \big| r_n(t,i,a) - r(t,i,a) \big| \rightarrow 0 \quad a.s. \qquad (8)
\]
By Lemma 3.1,
\[
r(t,i,a) = \frac{\alpha_{t_l}\, \alpha_{t_r}}{\alpha_{t}}\, \big\Vert \mu_l(a) - \mu_r(a) \big\Vert^2, \qquad (9)
\]
where \(\alpha_A = P(\mathbf{X} \in A)\) and \(\mu_j(a) = \mathbb{E}(X_{t_j})\), \(j = l, r\). Then the pairs \((i_{jn}, a_{jn})\) and \((i_j, a_j)\) are the arguments that maximise the right-hand side of Eq. (9) with respect to the measures \(P_n\) and \(P\), respectively. We observe that the right-hand side of Eq. (9) equals
\[
\int_t \Vert x \Vert^2\, dP(x) - \frac{1}{P(t)} \Big\Vert \int_t x\, dP(x) \Big\Vert^2 - \sum_{j=l,r} \left( \int_{t_j} \Vert x \Vert^2\, dP(x) - \frac{1}{P(t_j)} \Big\Vert \int_{t_j} x\, dP(x) \Big\Vert^2 \right).
\]
In order to prove Eq. (8) it is sufficient to show that:
1. \(\sup_{a \in \mathbb{R}} \sup_{t \in \mathbb{T}} \vert P_n(t_j) - P(t_j) \vert \rightarrow 0\) a.s., \(j = l, r\);
2. \(\sup_{a \in \mathbb{R}} \sup_{t \in \mathbb{T}} \big\vert \int_{t_j} \Vert x \Vert^2\, dP_n(x) - \int_{t_j} \Vert x \Vert^2\, dP(x) \big\vert \rightarrow 0\) a.s., \(j = l, r\);
3. \(\sup_{a \in \mathbb{R}} \sup_{t \in \mathbb{T}} \big\vert \int_{t_j} x(i)\, dP_n(x) - \int_{t_j} x(i)\, dP(x) \big\vert \rightarrow 0\) a.s., \(j = l, r\), \(i = 1, \ldots, p\).
Since \(\mathbb{T}\) is a Vapnik–Chervonenkis class, (1) holds. Now observe that the conditions for uniform convergence over families of sets still hold when dealing with finite signed measures. Therefore, applying them to the finite measure \(\Vert x \Vert^2\, dP(x)\) and to the finite signed measure \(x(i)\, dP(x)\), we obtain that (2) and (3) also hold.
Since
we have that
a.s.
In the first step of the algorithm, \(t = \mathbb{R}^p\), and we obtain \(i_{n1} = i_1\) for \(n\) large enough and \(a_{n1} \rightarrow a_1\) a.s. In the next step, the empirical procedure begins to work with \(t_{nl}\) and \(t_{nr}\), while the population algorithm does so with \(t_l\) and \(t_r\). However, we have that
\[
\sup_{a \in \mathbb{R}} \big| r_n(t_{nj}, i, a) - r(t_j, i, a) \big| \le \sup_{a \in \mathbb{R}} \big| r_n(t_{nj}, i, a) - r(t_{nj}, i, a) \big| + \sup_{a \in \mathbb{R}} \big| r(t_{nj}, i, a) - r(t_j, i, a) \big| \qquad (12)
\]
for \(j = l, r\).
We already know that the first term on the right-hand side of Eq. (12) converges to zero almost surely. In order to show that the second term also converges to zero, it is sufficient to show that:
1. \(\sup_{a \in \mathbb{R}} \vert P(t_{nj}) - P(t_j) \vert \rightarrow 0\) a.s., \(j = l, r\);
2. \(\sup_{a \in \mathbb{R}} \big\vert \int_{t_j} \Vert x \Vert^2\, dP(x) - \int_{t_{nj}} \Vert x \Vert^2\, dP(x) \big\vert \rightarrow 0\) a.s., \(j = l, r\);
3. \(\sup_{a \in \mathbb{R}} \big\vert \int_{t_j} x(i)\, dP(x) - \int_{t_{nj}} x(i)\, dP(x) \big\vert \rightarrow 0\) a.s., \(j = l, r\), \(i = 1, \ldots, p\),
which follows from the assumption that \(\Vert x \Vert ^2 f(x)\) is bounded. This concludes the proof since \(minsize/n \rightarrow \tau \).
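Before turning to Theorem 3.2, a small numerical illustration of the quantity maximised above may be useful. It assumes that \(R_n(t)\) is the empirical within-node sum of squares scaled by \(1/n\) (the reading consistent with Lemma 3.1); under that assumption, \(r_n(t,i,a)\) coincides with the empirical version of the right-hand side of Eq. (9). The data, the split coordinate and the threshold below are arbitrary; the snippet only checks the identity and is not part of the proof.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))     # here the node t is the whole sample, so alpha_t = 1
i, a = 0, 0.2                     # illustrative split: coordinate i at threshold a

left, right = X[X[:, i] <= a], X[X[:, i] > a]
n, nl, nr = len(X), len(left), len(right)

def emp_dev(A, n):
    # assumed empirical heterogeneity R_n: (1/n) * sum of ||x - mean(A)||^2 over A
    return ((A - A.mean(axis=0)) ** 2).sum() / n

# r_n(t, i, a) = R_n(t) - R_n(t_l) - R_n(t_r)
lhs = emp_dev(X, n) - emp_dev(left, n) - emp_dev(right, n)

# empirical analogue of (alpha_{t_l} alpha_{t_r} / alpha_t) * ||mu_l - mu_r||^2, with alpha_t = 1
rhs = (nl / n) * (nr / n) * np.sum((left.mean(axis=0) - right.mean(axis=0)) ** 2)

assert np.isclose(lhs, rhs)       # the two expressions agree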
1.3 Proof of Theorem 3.2
We need to show that we have consistency in both steps of the backward algorithm.
(i) Convergence of the pruning step. Let \(\{t^*_{1n}, \ldots, t^*_{mn}\}\) be the output of the forward algorithm. The partition produced by the pruning step converges to its population counterpart as a consequence of:
- the conclusions of Theorem 3.1;
- the fact that the random variables \(W_{lr}\), \(\tilde{d}_l\) and \(\tilde{d}_r\) are positive;
- the uniform convergence of the empirical quantile function to its population version;
- the Lebesgue dominated convergence theorem.
(ii) Convergence of the joining step. The proof is essentially the same as that of (i).
Cite this article
Fraiman, R., Ghattas, B. & Svarc, M. Interpretable clustering using unsupervised binary trees. Adv Data Anal Classif 7, 125–145 (2013). https://doi.org/10.1007/s11634-013-0129-3