Abstract
The support vector machine (SVM) is an optimal-margin-based classification technique in machine learning. SVM is a binary linear classifier that has been extended to non-linear data using kernels and to multi-class data using techniques such as one-versus-one, one-versus-rest, Crammer-Singer SVM, Weston-Watkins SVM and the directed acyclic graph SVM (DAGSVM). An SVM with a linear kernel is called a linear SVM, and one with a non-linear kernel a non-linear SVM. Linear SVM is an efficient technique for high-dimensional applications such as document classification, word-sense disambiguation and drug design, because on such data the test accuracy of linear SVM is close to that of non-linear SVM while its training is much faster. SVM has been evolving continuously since its inception, and researchers have proposed many problem formulations, solvers and strategies for solving SVM. Moreover, due to advancements in technology, data has taken the form of 'Big Data', which poses a challenge for machine learning: training a classifier on such large-scale data. In this paper, we present a review of the evolution of linear support vector machine classification, its solvers, strategies to improve those solvers, experimental results, current challenges and research directions.
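To make the linear SVM objective concrete, the following is an illustrative sketch (not the paper's code) of training a linear SVM by minimizing the L2-regularized hinge loss with stochastic sub-gradient descent, in the spirit of Pegasos-style primal solvers covered by the review; the toy dataset, regularization constant and step-size schedule are assumptions chosen only for demonstration.

```python
import random
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Minimize lam/2 * ||w||^2 + mean_i max(0, 1 - y_i * w.x_i) over w."""
    rng = random.Random(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(n), n):   # one pass in random order
            t += 1
            eta = 1.0 / (lam * t)           # decaying step size
            if y[i] * X[i].dot(w) < 1:      # margin violated: hinge is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                           # only the regularizer contributes
                w = (1 - eta * lam) * w
    return w

# Tiny linearly separable toy problem with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_linear_svm(X, y)
print(np.sign(X.dot(w)))
```

Because the data only enter through inner products `X[i].dot(w)`, each update costs time proportional to the number of non-zero features, which is why such primal solvers scale well to the sparse, high-dimensional data the abstract mentions.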
Notes
The datasets used in the experiments are available at: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Acknowledgements
The first author would like to thank the Ministry of Human Resource Development, Government of India, for providing a fellowship (University Grants Commission Senior Research Fellowship) to pursue his Ph.D. work.
Cite this article
Chauhan, V.K., Dahiya, K. & Sharma, A. Problem formulations and solvers in linear SVM: a review. Artif Intell Rev 52, 803–855 (2019). https://doi.org/10.1007/s10462-018-9614-6