Abstract
Existing distance metric learning methods require optimisation to learn a feature space in which to transform the data, which makes them computationally expensive on large datasets. In classification tasks, they make use of class information to learn an appropriate feature space. In this paper, we present a simple supervised dissimilarity measure which does not require learning or optimisation. It uses class information to measure the dissimilarity of two data instances directly in the input space. It is a supervised version of an existing data-dependent dissimilarity measure called \(m_\mathrm{e}\). Our empirical results in k-NN and LVQ classification tasks show that the proposed simple supervised dissimilarity measure generally produces predictive accuracy better than, or at least as good as, existing state-of-the-art supervised and unsupervised dissimilarity measures.
Notes
Used to explain the behaviour of Random Forest, RF-based similarity aims to track the sign of the margin of \(x\) (defined as \(P(+1|x)-P(-1|x)\), where \(+1\) and \(-1\) are the two class labels) [4]. In contrast, iForest-based similarity aims to measure the similarity of two points such that two points in a sparse region are more similar than two points of the same inter-point distance in a dense region [22].
Path length was used by iForest [14] as the anomaly score for anomaly detection, and path length is a proxy for mass in mass estimation (see Section 4 in Ting et al. [19]). Mass-based dissimilarity [20], mentioned earlier, is an extension of mass estimation which is implemented using completely random trees such as those in iForest. Though based on RF, the path length-based similarity of [28] can be viewed as a variant of mass-based dissimilarity implemented using classification trees rather than completely random trees.
The source code for ClustRF is at http://www.eecs.qmul.ac.uk/~xiatian/project_robust_graphs/index.html.
References
Aryal S (2017) A data-dependent dissimilarity measure: an effective alternative to distance measures. PhD thesis, Monash University, Clayton
Aryal S, Ting KM, Haffari G, Washio T (2014) \(m_p\)-dissimilarity: a data dependent dissimilarity measure. In: Proceedings of the IEEE international conference on data mining, IEEE, pp 707–712
Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506
Breiman L (2000) Some infinity theory for predictor ensembles. Technical Report 577, Statistics Department, University of California, Berkeley
Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
Davies A, Ghahramani Z (2014) The random forest kernel and creating other kernels for big data from random partitions. arXiv:1402.4293
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Deza MM, Deza E (2009) Encyclopedia of distances. Springer, Berlin
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. SIGKDD Explor Newsl 11(1):10–18
Kohonen T (1995) Learning vector quantization. Springer, Berlin, pp 175–189
Krumhansl CL (1978) Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463
Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364
Liu FT, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the eighth IEEE international conference on data mining, pp 413–422
Macqueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297
Nebel D, Hammer B, Frohberg K, Villmann T (2015) Median variants of learning vector quantization for learning of dissimilarity data. Neurocomputing 169:295–305
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656
Ting KM, Zhou G-T, Liu FT, Tan SC (2013) Mass estimation. Mach Learn 90(1):127–160
Ting KM, Zhu Y, Carman M, Zhu Y, Washio T, Zhou Z-H (2019) Lowest probability mass neighbour algorithms: relaxing the metric constraint in distance-based neighbourhood algorithms. Mach Learn 108(2):331–376
Ting KM, Zhu Y, Carman M, Zhu Y, Zhou Z-H (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 1205–1214
Ting KM, Zhu Y, Zhou Z-H (2018) Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 2329–2337
Tversky A (1977) Features of similarity. Psychol Rev 84(4):327–352
Wang F, Sun J (2015) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Discov 29(2):534–564
Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
Yang L (2006) Distance metric learning: a comprehensive survey, Technical report, Michigan State University
Zadeh PH, Hosseini R, Sra S (2016) Geometric mean metric learning. In: Proceedings of the 33rd international conference on international conference on machine learning, vol 48, pp 2464–2471
Zhu X, Loy CC, Gong S (2014) Constructing robust affinity graphs for spectral clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1450–1457
Appendix A: Algorithms to construct isolation forest
The algorithm to construct an isolation forest and the algorithm to build an isolation tree, as used in \(m_\mathrm{e}\) and \(d_\mathrm{e}\), are provided in Algorithms 1 and 2. Note that after the trees are created, the entire dataset D is passed through each isolation tree to record data mass (for \(m_\mathrm{e}\)) and class entropy (for \(d_\mathrm{e}\)) in each node.
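To make the procedure concrete, the following is a minimal Python sketch of this construction, assuming a generic iForest implementation: completely random trees are grown on subsamples (in the spirit of Algorithms 1 and 2), and the whole dataset D is then passed through every tree to record, at each node, the data mass and the class counts from which class entropy can be computed. All names (build_iforest, build_itree, record_mass_and_classes, sample_size, etc.) are illustrative and not the authors' implementation.

```python
# Sketch only: completely random isolation trees plus a post-construction pass
# of the full dataset to record per-node mass and class counts.
import math
import random


class Node:
    def __init__(self, attr=None, split=None, left=None, right=None):
        self.attr, self.split = attr, split   # split attribute and split value
        self.left, self.right = left, right   # children (None for a leaf)
        self.mass = 0                         # number of training points reaching this node
        self.class_counts = {}                # class label -> count, for class entropy


def build_itree(sample, height, max_height):
    # Sketch of Algorithm 2: split on a random attribute at a random value,
    # recursing until the node is a singleton or the height limit is reached.
    if height >= max_height or len(sample) <= 1:
        return Node()
    attr = random.randrange(len(sample[0]))
    values = [x[attr] for x in sample]
    lo, hi = min(values), max(values)
    if lo == hi:
        return Node()
    split = random.uniform(lo, hi)
    left = build_itree([x for x in sample if x[attr] < split], height + 1, max_height)
    right = build_itree([x for x in sample if x[attr] >= split], height + 1, max_height)
    return Node(attr, split, left, right)


def build_iforest(D, num_trees=100, sample_size=256):
    # Sketch of Algorithm 1: each tree is built on a random subsample of D.
    max_height = math.ceil(math.log2(sample_size))
    return [build_itree(random.sample(D, min(sample_size, len(D))), 0, max_height)
            for _ in range(num_trees)]


def record_mass_and_classes(tree, D, labels):
    # After construction, pass the *entire* dataset through the tree and record,
    # at every node on a point's path, the mass and the class distribution.
    for x, y in zip(D, labels):
        node = tree
        while node is not None:
            node.mass += 1
            node.class_counts[y] = node.class_counts.get(y, 0) + 1
            if node.left is None:   # reached a leaf
                break
            node = node.left if x[node.attr] < node.split else node.right


def class_entropy(node):
    # Shannon entropy of the class distribution recorded at a node.
    total = sum(node.class_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in node.class_counts.values() if c > 0)


# Example usage (illustrative):
#   forest = build_iforest(X, num_trees=100, sample_size=256)
#   for t in forest:
#       record_mass_and_classes(t, X, y)
```

Note that in this sketch the trees themselves remain completely random and unsupervised; only the per-node statistics recorded in the final pass carry the class information, which is consistent with the paper's point that no learning or optimisation is required.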
Cite this article
Wells, J.R., Aryal, S. & Ting, K.M. Simple supervised dissimilarity measure: Bolstering iForest-induced similarity with class information without learning. Knowl Inf Syst 62, 3203–3216 (2020). https://doi.org/10.1007/s10115-020-01454-3