Abstract
Large scale online kernel learning aims to build an efficient and scalable kernel-based predictive model incrementally from a sequence of potentially infinite data points. Current state-of-the-art approaches focus on improving efficiency through approximation, chiefly by (1) limiting the number of support vectors and (2) using an approximate feature map; they typically employ a kernel whose feature map has intractable dimensionality. While these approaches can deal with large scale datasets efficiently, they do so by compromising predictive accuracy because of the approximation. We offer an alternative approach that puts the kernel used at the heart of the method: it creates a sparse and finite-dimensional feature map of a kernel called Isolation Kernel. With this new approach, achieving the above aim of large scale online kernel learning becomes remarkably simple: use Isolation Kernel instead of a kernel whose feature map has intractable dimensionality. We show that, using Isolation Kernel, large scale online kernel learning can be achieved efficiently without sacrificing accuracy.
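To make the idea concrete, the following minimal sketch (ours, not the authors' released implementation) shows one instantiation of such a sparse, finite-dimensional feature map, using the nearest-neighbour (Voronoi) partitioning of Qin et al. (2019); the function name and the defaults for `psi` (points per partitioning) and `t` (number of partitionings) are illustrative assumptions.

```python
import numpy as np

def isolation_kernel_features(X, D, psi=256, t=100, seed=0):
    """Map each row of X to a sparse binary vector of length t * psi.

    Each of the t partitionings is induced by psi points sampled from the
    reference data D; a point is encoded by a one-hot block of size psi
    indicating its nearest sampled point (its Voronoi cell).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    features = np.zeros((n, t * psi), dtype=np.int8)
    for i in range(t):
        # sample psi reference points to induce one partitioning
        centres = D[rng.choice(D.shape[0], size=psi, replace=False)]
        # assign every row of X to its nearest sampled point
        nearest = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2).argmin(axis=1)
        features[np.arange(n), i * psi + nearest] = 1
    return features

# Tiny usage example
rng = np.random.default_rng(1)
D = rng.random((1000, 5))
X = rng.random((3, 5))
Z = isolation_kernel_features(X, D, psi=16, t=10)
print(Z.shape, Z.sum(axis=1))  # (3, 160); each row has exactly 10 ones
```

With this construction, every mapped point has exactly `t` non-zero entries out of `t * psi` dimensions, and the kernel value of two points is the (normalised) dot product of their mapped vectors.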
Notes
The code is available at https://github.com/LIBOL/KOL.
As pointed out in Ting et al. (2018), the Laplacian kernel can be expressed as \(\exp (-\gamma \sum ^d_{\jmath =1} |x_\jmath - y_\jmath |) = \psi ^{-\frac{1}{d} \sum ^d_{\jmath =1} |x_\jmath - y_\jmath |}\), where \(\gamma = \frac{\log (\psi )}{d}\). The Laplacian kernel has been shown to be competitive with the Gaussian kernel in SVM in a recent study (Ting et al. 2018).
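A quick numerical check of this identity (our own illustration, not part of the paper's experiments) is given below; the values of `d` and `psi` are arbitrary.

```python
import numpy as np

# Minimal numerical check of the identity in this note:
#   exp(-gamma * sum_j |x_j - y_j|) == psi ** (-(1/d) * sum_j |x_j - y_j|)
# with gamma = log(psi) / d.
rng = np.random.default_rng(0)
d, psi = 10, 256
x, y = rng.random(d), rng.random(d)

gamma = np.log(psi) / d
l1 = np.abs(x - y).sum()

laplacian = np.exp(-gamma * l1)
psi_form = psi ** (-l1 / d)
assert np.isclose(laplacian, psi_form)
```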
The two-class conversions from the original class labels were done for three multi-class datasets: mnist: \(\{3, 4, 6, 7, 9\}\) and \(\{0, 1, 2, 5, 8\}\). smallNORB: \(\{1,4\}\) and \(\{0, 2, 3\}\). cifar-10: \(\{ 0, 2, 3, 4, 5\}\) and \(\{1, 6, 7, 8, 9\}\).
This simulation is more realistic than the previous online experiments, which assume that the class label of each point is available immediately after a prediction is made to enable a model update (Lu et al. 2016). In practice, the algorithm can be put in training mode whenever class labels are available, either for part of a block or for the entire block.
Note that the first points in the accuracy plots can swing wildly because each is the accuracy on the first data block.
The setting \(b=0\) is used in the experiment in this section; it appears to be the best setting for the datasets used here. \(b>0\) not only produces poorer accuracy on some datasets, but also yields a larger number of features.
LIBSVM appears to perform slightly worse than LIBLINEAR on some datasets with IK\(_a\). This is due to the different heuristics used in their optimisation procedures; otherwise, LIBSVM and LIBLINEAR should yield exactly the same accuracy, since the same feature map of Isolation Kernel is used.
One study attributed the poorer SVM accuracy, in comparison with the Nyström method, to the data independence of the feature generation (Yang et al. 2012).
A sparse representation yields vectors having many zero values, where a feature with zero value means that the feature is irrelevant.
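As a small illustration (ours, assuming the one-hot-per-partitioning encoding of the Isolation Kernel feature map sketched earlier), each mapped point has exactly `t` non-zero entries out of `t * psi` dimensions, so a sparse matrix format stores only those `t` positions:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustration: an Isolation Kernel feature vector has exactly t non-zero
# entries out of t * psi dimensions, so a sparse format stores only t indices.
t, psi = 100, 256
rng = np.random.default_rng(0)
dense = np.zeros(t * psi, dtype=np.int8)
dense[np.arange(t) * psi + rng.integers(0, psi, size=t)] = 1  # one '1' per partitioning
sparse = csr_matrix(dense)
print(sparse.nnz, dense.size)  # 100 non-zero values out of 25600 dimensions
```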
Product Quantization (an improvement over vector quantization) aims to reduce storage and retrieval time for conducting approximate nearest neighbour search.
References
Breiman L (2000) Some infinity theory for predictor ensembles. Tech Rep 577, Statistics Dept, University of California, Berkeley
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Cavallanti G, Cesa-Bianchi N, Gentile C (2007) Tracking the best hyperplane with a simple budget perceptron. Mach Learn 69(2):143–167
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27
Chang YW, Hsieh CJ, Chang KW, Ringgaard M, Lin CJ (2010) Training and testing low-degree polynomial data mappings via linear SVM. J Mach Learn Res 11:1471–1490
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874
Felix XY, Suresh AT, Choromanski KM, Holtmann-Rice DN, Kumar S (2016) Orthogonal random features. In: Advances in Neural Information Processing Systems, pp 1975–1983
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42
Huang P, Avron H, Sainath TN, Sindhwani V, Ramabhadran B (2014) Kernel methods match deep neural networks on TIMIT. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, pp 205–209
Jose C, Goyal P, Aggrwal P, Varma M (2013) Local deep kernel learning for efficient non-linear SVM prediction. In: Proceedings of the 30th International Conference on Machine Learning, pp III–486–III–494
Kafai M, Eshghi K (2019) CROification: accurate kernel classification with the efficiency of sparse linear SVM. IEEE Trans Pattern Anal Mach Intell 41(1):34–48
Kivinen J, Smola AJ, Williamson RC (2001) Online learning with kernels. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, pp 785–792
Liu FT, Ting KM, Zhou ZH (2008) Isolation forest. In: Proceedings of the IEEE International Conference on Data Mining, pp 413–422
Lu J, Hoi SCH, Wang J, Zhao P, Liu ZY (2016) Large scale online kernel learning. J Mach Learn Res 17(1):1613–1655
Maji S, Berg AC, Malik J (2013) Efficient classification for additive kernel SVMs. IEEE Trans Pattern Anal Mach Intell 35(1):66–77
Musco C, Musco C (2017) Recursive sampling for the Nyström method. In: Advances in Neural Information Processing Systems, pp 3833–3845
Qin X, Ting KM, Zhu Y, Lee VCS (2019) Nearest-neighbour-induced isolation similarity and its impact on density-based clustering. In: Proceedings of The Thirty-Third AAAI Conference on Artificial Intelligence, pp 4755–4762
Rahimi A, Recht B (2007) Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp 1177–1184
Rakotomamonjy A, Bach FR, Canu S, Grandvalet Y (2008) SimpleMKL. J Mach Learn Res 9(Nov):2491–2521
Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
Shen Y, Chen T, Giannakis G (2018) Online ensemble multi-kernel learning adaptive to non-stationary and adversarial environments. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp 2037–2046
Ting KM, Zhu Y, Zhou ZH (2018) Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 2329–2337
Ting KM, Xu BC, Washio T, Zhou ZH (2020) Isolation distributional kernel: a new tool for kernel based anomaly detection. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 198–206
Vedaldi A, Zisserman A (2012a) Efficient additive kernels via explicit feature maps. IEEE Trans Pattern Anal Mach Intell 34(3):480–492
Vedaldi A, Zisserman A (2012b) Sparse kernel approximations for efficient classification and detection. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp 2320–2327
Viet Le Q, Sarlos T, Smola AJ (2013) Fastfood: approximate kernel expansions in loglinear time. In: Proceedings of the 30th International Conference on Machine Learning, pp III–244–III–252
Wang Z, Vucetic S (2010) Online passive-aggressive algorithms on a budget. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 908–915
Wang Z, Crammer K, Vucetic S (2012) Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale SVM training. J Mach Learn Res 13(1):3103–3131
Williams CKI, Seeger M (2001) Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems 13. MIT Press, Cambridge, pp 682–688
Wu J, Ding L, Liao S (2017) Predictive Nyström method for kernel methods. Neurocomputing 234:116–125
Xu BC, Ting KM, Zhou ZH (2019) Isolation set-kernel and its application to multi-instance learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 941–949
Xu BC, Ting KM, Jiang Y (2021) Isolation graph kernel. In: Proceedings of The Thirty-Fifth AAAI Conference on Artificial Intelligence, pp 10487–10495
Yang J, Sindhwani V, Fan Q, Avron H, Mahoney M (2014) Random Laplace feature maps for semigroup kernels on histograms. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp 971–978
Yang T, Li YF, Mahdavi M, Jin R, Zhou ZH (2012) Nyström method vs random fourier features: a theoretical and empirical comparison. In: Advances in Neural Information Processing Systems, pp 476–484
Zadeh P, Hosseini R, Sra S (2016) Geometric mean metric learning. In: Proceedings of the International Conference on Machine Learning, pp 2464–2471
Acknowledgements
Kai Ming Ting is supported by the Natural Science Foundation of China (62076120). Takashi Washio is supported by JST CREST Grant Number JPMJCR1666 and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 20K21815. A discussion with Jaakko Peltonen at the Shonan Meeting in October 2018 provided the initial impetus for creating the feature map of Isolation Kernel described in this paper.
Additional information
Responsible editor: Jingrui He.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ting, K.M., Wells, J.R. & Washio, T. Isolation kernel: the X factor in efficient and effective large scale online kernel learning. Data Min Knowl Disc 35, 2282–2312 (2021). https://doi.org/10.1007/s10618-021-00785-1