Abstract
Statistical Machine Learning (SML) refers to a body of algorithms and methods that enable computers to discover important features of input data sets, which are often very large. This task of discovering features from data is what the keyword 'learning' in SML denotes. Theoretical justifications for the effectiveness of SML algorithms rest on sound principles from several disciplines, such as Computer Science and Statistics. The theoretical underpinnings justified specifically by statistical inference methods are collectively termed statistical learning theory. This paper reviews SML from a Bayesian decision theoretic point of view, arguing that many SML techniques are closely connected to making inference under the Bayesian paradigm. We discuss many important SML techniques, including supervised and unsupervised learning, deep learning, online learning and Gaussian processes, especially in the context of the very large data sets where they are often employed. We present a dictionary that maps the key concepts of SML between Computer Science and Statistics. We illustrate the SML techniques on three moderately large data sets, discussing many practical implementation issues along the way. The review is thus targeted especially at statisticians and computer scientists who aspire to understand and apply SML to moderately large to big data sets.
Acknowledgements
Sourish Das's research has been supported by an Infosys Foundation Grant and a TATA Trust Grant to CMI, and also by a UK Government-funded Commonwealth-Rutherford Scholarship (Grant No. RF 2017-123).
Cite this article
Sambasivan, R., Das, S. & Sahu, S.K. A Bayesian perspective of statistical machine learning for big data. Comput Stat 35, 893–930 (2020). https://doi.org/10.1007/s00180-020-00970-8