Abstract
Large scale online kernel learning aims to build an efficient and scalable kernel-based predictive model incrementally from a sequence of potentially infinite data points. Current state-of-the-art approaches focus on improving efficiency through approximation, chiefly by (1) limiting the number of support vectors and (2) using an approximate feature map; they typically employ a kernel whose feature map has intractable dimensionality. While these approaches can deal with large scale datasets efficiently, they do so by compromising predictive accuracy because of the approximation. We offer an alternative approach that puts the kernel used at the heart of the method: it creates a sparse and finite-dimensional feature map of a kernel called Isolation Kernel. With this new approach, achieving the above aim of large scale online kernel learning becomes remarkably simple: use Isolation Kernel instead of a kernel whose feature map has intractable dimensionality. We show that, using Isolation Kernel, large scale online kernel learning can be achieved efficiently without sacrificing accuracy.
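To make the idea concrete, the following minimal sketch (ours, not the authors' released implementation) shows one instantiation of such a sparse, finite-dimensional feature map, using the nearest-neighbour (Voronoi) partitioning of Qin et al. (2019); the function name and the defaults for `psi` (points per partitioning) and `t` (number of partitionings) are illustrative assumptions.

```python
import numpy as np

def isolation_kernel_features(X, D, psi=256, t=100, seed=0):
    """Map each row of X to a sparse binary vector of length t * psi.

    Each of the t partitionings is induced by psi points sampled from the
    reference data D; a point is encoded by a one-hot block of size psi
    indicating its nearest sampled point (its Voronoi cell).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    features = np.zeros((n, t * psi), dtype=np.int8)
    for i in range(t):
        # sample psi reference points to induce one partitioning
        centres = D[rng.choice(D.shape[0], size=psi, replace=False)]
        # assign every row of X to its nearest sampled point
        nearest = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2).argmin(axis=1)
        features[np.arange(n), i * psi + nearest] = 1
    return features

# Tiny usage example
rng = np.random.default_rng(1)
D = rng.random((1000, 5))
X = rng.random((3, 5))
Z = isolation_kernel_features(X, D, psi=16, t=10)
print(Z.shape, Z.sum(axis=1))  # (3, 160); each row has exactly 10 ones
```

With this construction, every mapped point has exactly `t` non-zero entries out of `t * psi` dimensions, and the kernel value of two points is the (normalised) dot product of their mapped vectors.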
Notes
The code is available at https://github.com/LIBOL/KOL.
As pointed out in Ting et al. (2018), the Laplacian kernel can be expressed as \(\exp (-\gamma \sum ^d_{\jmath =1} |x_\jmath - y_\jmath |) = \psi ^{-\frac{1}{d} \sum ^d_{\jmath =1} |x_\jmath - y_\jmath |}\), where \(\gamma = \frac{\log (\psi )}{d}\). The Laplacian kernel has been shown to be competitive with the Gaussian kernel in SVM in a recent study (Ting et al. 2018).
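A quick numerical check of this identity (our own illustration, not part of the paper's experiments) is given below; the values of `d` and `psi` are arbitrary.

```python
import numpy as np

# Minimal numerical check of the identity in this note:
#   exp(-gamma * sum_j |x_j - y_j|) == psi ** (-(1/d) * sum_j |x_j - y_j|)
# with gamma = log(psi) / d.
rng = np.random.default_rng(0)
d, psi = 10, 256
x, y = rng.random(d), rng.random(d)

gamma = np.log(psi) / d
l1 = np.abs(x - y).sum()

laplacian = np.exp(-gamma * l1)
psi_form = psi ** (-l1 / d)
assert np.isclose(laplacian, psi_form)
```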
The two-class conversions from the original class labels were done for three multi-class datasets: mnist: \(\{3, 4, 6, 7, 9\}\) and \(\{0, 1, 2, 5, 8\}\). smallNORB: \(\{1,4\}\) and \(\{0, 2, 3\}\). cifar-10: \(\{ 0, 2, 3, 4, 5\}\) and \(\{1, 6, 7, 8, 9\}\).
This simulation is more realistic than the previous online experiments, which assume that the class label of each point is available immediately after a prediction is made to enable a model update (Lu et al. 2016). In practice, the algorithm can be put in training mode whenever class labels are available, either for part of a block or for the entire block.
Note that the first points in the accuracy plots can swing wildly because each is the accuracy on the first data block.
The setting \(b=0\) is used in the experiment in this section; it appears to be the best setting for the datasets used here. \(b>0\) not only produces poorer accuracy on some datasets, but also yields a larger number of features.
LIBSVM appears to perform slightly worse than LIBLINEAR on some datasets with IK\(_a\). This is due to the different heuristics used in their optimisation procedures; otherwise, LIBSVM and LIBLINEAR should yield exactly the same accuracy, since the same feature map of Isolation Kernel is used.
One study attributed the poorer SVM accuracy, in comparison with the Nyström method, to the data independence of the feature generation (Yang et al. 2012).
A sparse representation yields vectors having many zero values, where a feature with zero value means that the feature is irrelevant.
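As a small illustration (ours, assuming the one-hot-per-partitioning encoding of the Isolation Kernel feature map sketched earlier), each mapped point has exactly `t` non-zero entries out of `t * psi` dimensions, so a sparse matrix format stores only those `t` positions:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustration: an Isolation Kernel feature vector has exactly t non-zero
# entries out of t * psi dimensions, so a sparse format stores only t indices.
t, psi = 100, 256
rng = np.random.default_rng(0)
dense = np.zeros(t * psi, dtype=np.int8)
dense[np.arange(t) * psi + rng.integers(0, psi, size=t)] = 1  # one '1' per partitioning
sparse = csr_matrix(dense)
print(sparse.nnz, dense.size)  # 100 non-zero values out of 25600 dimensions
```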
Product Quantization (an improvement over vector quantization) aims to reduce storage and retrieval time for conducting approximate nearest neighbour search.
References
Breiman L (2000) Some infinity theory for predictor ensembles. Tech Rep 577, Statistics Dept, University of California, Berkeley
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Cavallanti G, Cesa-Bianchi N, Gentile C (2007) Tracking the best hyperplane with a simple budget perceptron. Mach Learn 69(2):143–167
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27
Chang YW, Hsieh CJ, Chang KW, Ringgaard M, Lin CJ (2010) Training and testing low-degree polynomial data mappings via linear SVM. J Mach Learn Res 11:1471–1490
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874
Felix XY, Suresh AT, Choromanski KM, Holtmann-Rice DN, Kumar S (2016) Orthogonal random features. In: Advances in Neural Information Processing Systems, pp 1975–1983
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42
Huang P, Avron H, Sainath TN, Sindhwani V, Ramabhadran B (2014) Kernel methods match deep neural networks on TIMIT. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, pp 205–209
Jose C, Goyal P, Aggrwal P, Varma M (2013) Local deep kernel learning for efficient non-linear SVM prediction. In: Proceedings of the 30th International Conference on Machine Learning, pp III–486–III–494
Kafai M, Eshghi K (2019) CROification: accurate kernel classification with the efficiency of sparse linear SVM. IEEE Trans Pattern Anal Mach Intell 41(1):34–48
Kivinen J, Smola AJ, Williamson RC (2001) Online learning with kernels. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, pp 785–792
Liu FT, Ting KM, Zhou ZH (2008) Isolation forest. In: Proceedings of the IEEE International Conference on Data Mining, pp 413–422
Lu J, Hoi SCH, Wang J, Zhao P, Liu ZY (2016) Large scale online kernel learning. J Mach Learn Res 17(1):1613–1655
Maji S, Berg AC, Malik J (2013) Efficient classification for additive kernel SVMs. IEEE Trans Pattern Anal Mach Intell 35(1):66–77
Musco C, Musco C (2017) Recursive sampling for the Nyström method. In: Advances in Neural Information Processing Systems, pp 3833–3845
Qin X, Ting KM, Zhu Y, Lee VCS (2019) Nearest-neighbour-induced isolation similarity and its impact on density-based clustering. In: Proceedings of The Thirty-Third AAAI Conference on Artificial Intelligence, pp 4755–4762
Rahimi A, Recht B (2007) Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp 1177–1184
Rakotomamonjy A, Bach FR, Canu S, Grandvalet Y (2008) SimpleMKL. J Mach Learn Res 9(Nov):2491–2521
Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
Shen Y, Chen T, Giannakis G (2018) Online ensemble multi-kernel learning adaptive to non-stationary and adversarial environments. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp 2037–2046
Ting KM, Zhu Y, Zhou ZH (2018) Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 2329–2337
Ting KM, Xu BC, Washio T, Zhou ZH (2020) Isolation distributional kernel: a new tool for kernel based anomaly detection. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 198–206
Vedaldi A, Zisserman A (2012a) Efficient additive kernels via explicit feature maps. IEEE Trans Pattern Anal Mach Intell 34(3):480–492
Vedaldi A, Zisserman A (2012b) Sparse kernel approximations for efficient classification and detection. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp 2320–2327
Viet Le Q, Sarlos T, Smola AJ (2013) Fastfood: approximate kernel expansions in loglinear time. In: Proceedings of the 30th International Conference on Machine Learning, pp III–244–III–252
Wang Z, Vucetic S (2010) Online passive-aggressive algorithms on a budget. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 908–915
Wang Z, Crammer K, Vucetic S (2012) Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale SVM training. J Mach Learn Res 13(1):3103–3131
Williams CKI, Seeger M (2001) Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems 13. MIT Press, Cambridge, pp 682–688
Wu J, Ding L, Liao S (2017) Predictive Nyström method for kernel methods. Neurocomputing 234:116–125
Xu BC, Ting KM, Zhou ZH (2019) Isolation set-kernel and its application to multi-instance learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 941–949
Xu BC, Ting KM, Jiang Y (2021) Isolation graph kernel. In: Proceedings of The Thirty-Fifth AAAI Conference on Artificial Intelligence, pp 10487–10495
Yang J, Sindhwani V, Fan Q, Avron H, Mahoney M (2014) Random Laplace feature maps for semigroup kernels on histograms. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp 971–978
Yang T, Li YF, Mahdavi M, Jin R, Zhou ZH (2012) Nyström method vs random fourier features: a theoretical and empirical comparison. In: Advances in Neural Information Processing Systems, pp 476–484
Zadeh P, Hosseini R, Sra S (2016) Geometric mean metric learning. In: Proceedings of the International Conference on Machine Learning, pp 2464–2471
Acknowledgements
Kai Ming Ting is supported by the Natural Science Foundation of China (62076120). Takashi Washio is supported by JST CREST Grant Number JPMJCR1666 and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 20K21815. A discussion with Jaakko Peltonen at the Shonan Meeting in October 2018 provided the initial impetus for creating the feature map of Isolation Kernel described in this paper.
Additional information
Responsible editor: Jingrui He.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ting, K.M., Wells, J.R. & Washio, T. Isolation kernel: the X factor in efficient and effective large scale online kernel learning. Data Min Knowl Disc 35, 2282–2312 (2021). https://doi.org/10.1007/s10618-021-00785-1