Abstract
We present methods for the estimation of level sets, a level set tree, and a volume function of a multivariate density function. The methods are such that the computation is feasible and estimation is statistically efficient in moderate dimensional cases (\(d\approx 8\)) and for moderate sample sizes (\(n\approx \) 50,000). We apply kernel estimation together with an adaptive partition of the sample space. We illustrate how level set trees can be applied in cluster analysis and in flow cytometry.
Similar content being viewed by others
Notes
The Voronoi cell of a point \(p \in S =\{ X_1,\ldots ,X_n\}\) is \( V_p = \{ x \in \mathbf{R}^d : D(x,p) \le D(x,q) \text{ for } \text{ all } q \in S \} , \) where \(D(x,y) = \Vert x-y\Vert \) is the Euclidean distance in \(\mathbf{R}^d\). A Voronoi cell is a convex polyhedron: it is the intersection of the half-spaces of points at least as close to p as to q, taken over all \(q\in S\).
The Delaunay triangulation of \(S =\{ X_1,\ldots ,X_n\}\) is \( \text{ Delaunay }(S) = \{ \sigma \subset S : \bigcap _{p\in \sigma } V_p \ne \emptyset \} ; \) this is the collection of those tuples \(\sigma \) of \(d+1\) points whose Voronoi cells touch each other. Sets \(\sigma \in \text{ Delaunay }(S)\) are considered as the vertices of a simplex. Thus, a Delaunay triangulation is a collection of d-simplices. A d-simplex is the convex hull of its \(d+1\) vertices. In the two dimensional case (\(d=2\)) the simplices are triangles and in the three dimensional case (\(d=3\)) the simplices are tetrahedrons.
When the kernel function is the standard normal density, then the support of the kernel estimator is the whole space \(\mathbf{R}^d\). When the kernel function is the Bartlett density \(C(1-\Vert x\Vert ^2)_+\), then the support is a union of balls, and when the kernel function is the Epanechnikov product density \(C\prod _{i=1}^d (1-x_i^2)_+\), then the support is a union of rectangles.
A Reeb graph of a function (or a contour tree of a function) is graph whose leaf vertices represent the local minima or maxima and each interior vertex represent the joining or splitting of the contours of the function. Carr et al. (2003) give the following procedure for the computation of a Reeb graph. First a level set tree and a lower level set tree are calculated. A lower level set tree is analogous to a level set tree, but its nodes correspond to the lower level sets \( \{ x \in \mathbf{R}^d : f(x) \le \lambda \} . \) Second, the level set tree and the lower level set tree are pruned so that only the leaf nodes and the nodes with more than one child are left. Third, the pruned level set tree (split tree) and the pruned lower level set tree (join tree) are combined to obtain the Reeb graph.
Given a finite set \(\mathcal{{X}}\) of points in \(\mathbf{R}^d\) and a query point x, the k-d-algorithm finds the point in \(\mathcal{{X}}\) closest to x. The k-d-algorithm uses a k-d-tree, which is a similar binary tree as we use to represent an adaptive partition.
Note that when observations are in \(\mathbf{R}^d\) it is possible that only d of the observations are on the boundary, which happens when d observations are on the corners of the smallest rectangle containing the observations.
The unit simplex is defined as the convex hull of the origin and the vertices \(e_1,\ldots ,e_d\), where \(e_i\in \mathbf{R}^d\) has 1 in the ith position and 0 in the other positions. Klemelä (2004b) used a simulation example where the simplex was such that all edges have the same length. The current simulation example was used in Stuetzle and Nugent (2010) and Menardi and Azzalini (2014) with the sample size about 500.
The modes are located at (1 / 2, 0, 0, 0), \((-1/2,0,0,0)\), \((0,\sqrt{3}/2,0,0)\), \((0,1/(2\sqrt{3}),\sqrt{2/3},0)\), and \((0,1/(2\sqrt{3}),1/(2\sqrt{6}),\sqrt{15/24})\).
References
Aaron C (2013) Estimation of the support of the density and its boundary using random polyhedrons. Technical report, Université Blaise Pascal
Aghaeepour N (2010) FlowMeans: non-parametric flow cytometry data gating. R package version 1(16):
Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH (2013) Critical assessment of automated flow cytometry data analysis techniques. Nat Methods 10(3):228–238
Aghaeepour N, Nikolic R, Hoos HH, Brinkman RR (2011) Rapid cell population identification in flow cytometry data. Cytom Part A J Int Soc Anal Cytol 79(1):6–13
Azzallini A, Torelli N (2007) Clustering via nonparametric density estimation. Stat Comput 17:71–80
Baíllo A, Cuesta-Albertos JA, Cuevas A (2001) Convergence rates in nonparametric estimation of level sets. Stat Probab Lett 53:27–35
Baíllo A, Cuevas A, Justel A (2000) Set estimation and nonparametric detection. Can J Stat 28:765–782
Bashashati A, Brinkman RR (2009) A survey of flow cytometry data analysis methods. Adv Bioinform 2009:584–603
Biau G, Cadre B, Pelletier B (2007) A graph-based estimator of the number of clusters. ESAIM Probab Stat 11:272–280
Breiman L, Friedman J, Olshen R, Stone CJ (1984) Classification and regression trees. Chapman and Hall, New York
Burman P, Polonik W (2009) Multivariate mode hunting: data analytic tools with measures of significance. J Multivar Anal 100:1198–1218
Cadre B (2006) Kernel estimation of density level sets. J Multivar Anal 97(4):999–1023
Carr H, Snoeyink J, Axen U (2003) Computing contour trees in any dimension. Comput Geom Theory Appl 24(2):75–94
Chaudhuri K, Dasgupta S (2010) Rates of convergence for the cluster tree. In: Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A (eds) Advances in neural information processing systems 23. Curran Associates, Vancouver, pp 343–351
Cuevas A, Febreiro M, Fraiman R (2000) Estimating the number of clusters. Can J Stat 28:367–382
Cuevas A, Febreiro M, Fraiman R (2001) Cluster analysis: a further approach based on density estimation. Comput Stat Data Anal 36:441–459
Cuevas A, Febreiro M, Fraiman R (2006) Plug-in estimation of general level sets. Aust N Z J Stat 48:7–19
Cuevas A, Fraiman R (1997) A plug-in approach to support estimation. Ann Stat 25:2300–2312
Devroye L, Wise GL (1980) Detection of abnormal behavior via nonparametric estimation of the support. SIAM J Appl Math 38:480–488
Duong T, Cowling A, Koch I, Wand MP (2008) Feature significance for multivariate kernel density estimation. Comput Stat Data Anal 52(9):4225–4242
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 97:611–631
Ge Y, Sealfon SC (2012) Flowpeaks: a fast unsupervised clustering for flow cytometry data via k-means and density peak finding. Bioinformatics 28(15):2052–2058
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Hartigan JA (1987) Estimation of a convex density cluster in two dimensions. J Am Stat Assoc 82:267–270
Holmström L, Karttunen K, Klemelä J (2014) Estimation of level set trees with adaptive partitions: supplementary material. Technical report, University of Oulu
Indyk P (2004) Nearest neighbors in high-dimensional spaces. In: Goodman JE, O’Rourke J (eds) Handbook of discrete and computational geometry. Chapman & Hall/CRC, Boca Raton, pp 877–892
Karttunen K, Holmström, L, Klemelä J (2014) Level set trees with enhanced marginal density visualization. In: In proceedings of the 5th international conference on information visualization theory and applications, (IVAPP 2014), Lisbon, Portugal, pp 210–217
Kent BP, Rinaldo A, Verstynen T (2013) DeBaCl: a Python package for interactive DEnsity-BAsed CLustering. J Stat Softw (submitted). arXiv:1307.8136
Klemelä J (2004a) Complexity penalized support estimation. J Multivar Anal 88:274–297
Klemelä J (2004b) Visualization of multivariate density estimates with level set trees. J Comput Graph Stat 13(3):599–620
Klemelä J (2005) Algorithms for the manipulation of level sets of nonparametric density estimates. Comput Stat 20:349–368
Klemelä J (2006) Visualization of multivariate density estimates with shape trees. J Comput Graph Stat 15(2):372–397
Klemelä J (2009) Smoothing of multivariate data: density estimation and visualization. Wiley, New York
Korostelev AP, Tsybakov AB (1993) Minimax theory of image reconstruction (Lecture notes in statistics), vol 82. Springer, Berlin
Kpotufe S, von Luxburg U (2011) Pruning nearest neighbor cluster trees. In: Proceedings of the 28th international conference on machine learning, vol 105, pp 225–232
Lo K, Brinkman RR, Gottardo R (2008) Automated gating of flow cytometry data via robust model-based clustering. Cytom Part A J Int Soc Anal Cytol 73:321–332
Maier M, Hein M, von Luxburg U (2009) Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theor Comput Sci 410(19):1749–1764
Mammen E, Tsybakov AB (1995) Asymptotical minimax recovery of sets with smooth boundaries. Ann Stat 23:502–524
McMullen P (1970) The maximum numbers of faces of a convex polytope. Mathematika 17:179–184
Melamed MR, Lindmo T, Mendelsohn ML (1990) Flow cytometry and sorting, 2nd edn. Wiley, New York
Menardi G, Azzalini A (2014) An advacement in clustering via nonparametric density estimation. Stat Comput 24:753–767
Morgan JN, Sonquist JA (1963) Problems in the analysis of survey data, and a proposal. J Am Stat Assoc 58:415–434
Müller DW, Sawitzki G (1991) Excess mass estimates and tests of multimodality. J Am Stat Assoc 86:738–746
Naumann U, Luta G, Wand MP (2010) The curvHDR method for gating flow cytometry samples. BMC Bioinform 11(44). doi:10.1186/1471-2105-11-44
Nolan D (1991) The excess-mass ellipsoid. J Multivar Anal 39:348–371
O’Neill K, Aghaeepour N, Spidlen J, Brinkman R (2013) Flow cytometry bioinformatics. PLoS Comput Biol 9(12):e1003365. doi:10.1371/journal.pcbi.1003365
Ooi H (2002) Density visualization and mode hunting using trees. J Comput Graph Stat 11:328–347
Polonik W (1995) Measuring mass concentration and estimating density contour clusters—an excess mass approach. Ann Stat 23:855–881
Reeb G (1946) Sur les points singuliers d’une forme de pfaff completement integrable ou d’une fonction numerique. C R Acad Sci Paris 222:847–849
Rigollet P, Vert R (2009) Optimal rates for plug-in estimators of density level sets. Bernoulli 15:1154–1178
Rinaldo A, Wasserman L (2010) Generalized density clustering. Ann Stat 38:2678–2722
Scott DW (1992) Multivariate density estimation: theory, practice, and visualization. Wiley, New York
Shapiro HM (2003) Practical flow cytometry, 4th edn. Wiley, New York
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London
Singh A, Scott C, Nowak R (2009) Adaptive Hausdorff estimation of density level sets. Ann Stat 37:2760–2782
Steinwart I (2015) Fully adaptive density-based clustering. Ann Stat 43:2132–2167
Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif 20(5):25–47
Stuetzle W, Nugent R (2010) A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 19:397–418
Tarjan RE (1976) Efficiency of a good but not linear set union algorithm. J ACM 22:215–225
Tsybakov AB (1997) On nonparametric estimation of density level sets. Ann Stat 25:948–969
Walther G (1997) Granulometric smoothing. Ann Stat 25:2273–2299
Walther G, Zimmerman N, Moore W, Parks D, Meehan S, Belitskaya I, Pan J, Herzenberg L (2009) Automatic clustering of flow cytometry data with density-based merging. Adv Bioinform 2009:686–759
Zomorodian A (2012) Topological data analysis. In: Zomorodian A (ed) Advances in applied and computational topology, vol 70. American Mathematical Society, Providence, pp 1–40
Author information
Authors and Affiliations
Corresponding author
Additional information
The authors gratefully acknowledge the TEKES funding under project 24301335.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Holmström, L., Karttunen, K. & Klemelä, J. Estimation of level set trees using adaptive partitions. Comput Stat 32, 1139–1163 (2017). https://doi.org/10.1007/s00180-016-0702-2
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00180-016-0702-2