Abstract
Existing distance metric learning methods require optimisation to learn a feature space in which to transform the data, which makes them computationally expensive on large datasets. In classification tasks, they make use of class information to learn an appropriate feature space. In this paper, we present a simple supervised dissimilarity measure which does not require learning or optimisation. It uses class information to measure the dissimilarity of two data instances directly in the input space. It is a supervised version of an existing data-dependent dissimilarity measure called \(m_\mathrm{e}\). Our empirical results in k-NN and LVQ classification tasks show that the proposed simple supervised dissimilarity measure generally produces predictive accuracy better than, or at least as good as, existing state-of-the-art supervised and unsupervised dissimilarity measures.
Notes
Used to explain the behaviour of Random Forest, RF-based similarity aims to track the sign of the margin of \(x\) (defined as \(P(+1|x)-P(-1|x)\), where \(+1\) and \(-1\) are the two class labels) [4]. In contrast, iForest-based similarity aims to measure the similarity of two points such that two points in a sparse region are more similar than two points of the same inter-point distance in a dense region [22].
Path length was used by iForest [14] as the anomaly score for anomaly detection, and path length is a proxy for mass in mass estimation (see Section 4 in Ting et al. [19]). Mass-based dissimilarity [20], mentioned earlier, is an extension of mass estimation which is implemented using completely random trees such as those in iForest. Though based on RF, the path length-based similarity of [28] can be viewed as a variant of mass-based dissimilarity implemented using classification trees rather than completely random trees.
The source code for ClustRF is at http://www.eecs.qmul.ac.uk/~xiatian/project_robust_graphs/index.html.
References
Aryal S (2017) A data-dependent dissimilarity measure: an effective alternative to distance measures. PhD thesis, Monash University, Clayton
Aryal S, Ting KM, Haffari G, Washio T (2014) \(m_p\)-dissimilarity: a data dependent dissimilarity measure. In: Proceedings of the IEEE international conference on data mining, IEEE, pp 707–712
Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506
Breiman L (2000) Some infinity theory for predictor ensembles. Technical Report 577, Statistics Department, University of California, Berkeley
Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
Davies A, Ghahramani Z (2014) The random forest kernel and creating other kernels for big data from random partitions. arXiv:1402.4293
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Deza MM, Deza E (2009) Encyclopedia of distances. Springer, Berlin
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. SIGKDD Explor Newsl 11(1):10–18
Kohonen T (1995) Learning vector quantization. Springer, Berlin, pp 175–189
Krumhansl CL (1978) Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463
Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364
Liu FT, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the eighth IEEE international conference on data mining, pp 413–422
Macqueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297
Nebel D, Hammer B, Frohberg K, Villmann T (2015) Median variants of learning vector quantization for learning of dissimilarity data. Neurocomputing 169:295–305
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656
Ting KM, Zhou G-T, Liu FT, Tan SC (2013) Mass estimation. Mach Learn 90(1):127–160
Ting KM, Zhu Y, Carman M, Zhu Y, Washio T, Zhou Z-H (2019) Lowest probability mass neighbour algorithms: relaxing the metric constraint in distance-based neighbourhood algorithms. Mach Learn 108(2):331–376
Ting KM, Zhu Y, Carman M, Zhu Y, Zhou Z-H (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 1205–1214
Ting KM, Zhu Y, Zhou Z-H (2018) Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 2329–2337
Tversky A (1977) Features of similarity. Psychol Rev 84(4):327–352
Wang F, Sun J (2015) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Discov 29(2):534–564
Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
Yang L (2006) Distance metric learning: a comprehensive survey, Technical report, Michigan State University
Zadeh PH, Hosseini R, Sra S (2016) Geometric mean metric learning. In: Proceedings of the 33rd international conference on international conference on machine learning, vol 48, pp 2464–2471
Zhu X, Loy CC, Gong S (2014) Constructing robust affinity graphs for spectral clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1450–1457
Appendix A: Algorithms to construct isolation forest
The algorithm to construct an isolation forest and the algorithm to build an isolation tree, as used in \(m_\mathrm{e}\) and \(d_\mathrm{e}\), are provided in Algorithms 1 and 2. Note that after the trees are created, the entire dataset D is passed through each isolation tree to record data mass (for \(m_\mathrm{e}\)) and class entropy (for \(d_\mathrm{e}\)) in each node.
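To make the procedure concrete, the following is a minimal Python sketch of this construction, assuming a generic iForest implementation: completely random trees are grown on subsamples (in the spirit of Algorithms 1 and 2), and the whole dataset D is then passed through every tree to record, at each node, the data mass and the class counts from which class entropy can be computed. All names (build_iforest, build_itree, record_mass_and_classes, sample_size, etc.) are illustrative and not the authors' implementation.

```python
# Sketch only: completely random isolation trees plus a post-construction pass
# of the full dataset to record per-node mass and class counts.
import math
import random


class Node:
    def __init__(self, attr=None, split=None, left=None, right=None):
        self.attr, self.split = attr, split   # split attribute and split value
        self.left, self.right = left, right   # children (None for a leaf)
        self.mass = 0                         # number of training points reaching this node
        self.class_counts = {}                # class label -> count, for class entropy


def build_itree(sample, height, max_height):
    # Sketch of Algorithm 2: split on a random attribute at a random value,
    # recursing until the node is a singleton or the height limit is reached.
    if height >= max_height or len(sample) <= 1:
        return Node()
    attr = random.randrange(len(sample[0]))
    values = [x[attr] for x in sample]
    lo, hi = min(values), max(values)
    if lo == hi:
        return Node()
    split = random.uniform(lo, hi)
    left = build_itree([x for x in sample if x[attr] < split], height + 1, max_height)
    right = build_itree([x for x in sample if x[attr] >= split], height + 1, max_height)
    return Node(attr, split, left, right)


def build_iforest(D, num_trees=100, sample_size=256):
    # Sketch of Algorithm 1: each tree is built on a random subsample of D.
    max_height = math.ceil(math.log2(sample_size))
    return [build_itree(random.sample(D, min(sample_size, len(D))), 0, max_height)
            for _ in range(num_trees)]


def record_mass_and_classes(tree, D, labels):
    # After construction, pass the *entire* dataset through the tree and record,
    # at every node on a point's path, the mass and the class distribution.
    for x, y in zip(D, labels):
        node = tree
        while node is not None:
            node.mass += 1
            node.class_counts[y] = node.class_counts.get(y, 0) + 1
            if node.left is None:   # reached a leaf
                break
            node = node.left if x[node.attr] < node.split else node.right


def class_entropy(node):
    # Shannon entropy of the class distribution recorded at a node.
    total = sum(node.class_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in node.class_counts.values() if c > 0)


# Example usage (illustrative):
#   forest = build_iforest(X, num_trees=100, sample_size=256)
#   for t in forest:
#       record_mass_and_classes(t, X, y)
```

Note that in this sketch the trees themselves remain completely random and unsupervised; only the per-node statistics recorded in the final pass carry the class information, which is consistent with the paper's point that no learning or optimisation is required.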
Cite this article
Wells, J.R., Aryal, S. & Ting, K.M. Simple supervised dissimilarity measure: Bolstering iForest-induced similarity with class information without learning. Knowl Inf Syst 62, 3203–3216 (2020). https://doi.org/10.1007/s10115-020-01454-3