Recent Advances in Big Data Analytics

Daoji Li, Yinfei Kong, Zemin Zheng & Jianxin Pan

Abstract

Unprecedented advances in digital technology have produced a revolution that is transforming science and society. Big data are being generated rapidly in many disciplines, such as business, the sciences, engineering, medicine, biology, and the humanities. Such data often come with a large number of features, a large volume of observations, or both. The value of big data lies in effective analysis using statistical inference and machine learning methods that are computationally scalable and efficient. Many new statistical methods and tools for big data have appeared in recent years. In this chapter, we summarize some of these approaches to provide a selective overview of recent developments in theory, methods, and implementations for big data analytics. We focus on two types of big data: ultrahigh-dimensional data and massive data. The former refers to data in which the number of features may grow exponentially with the number of observations, while the latter refers to data in which the number of observations is huge and much larger than the number of features.
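
The two types of big data can be stated a little more formally. The following is only a sketch under assumed notation that does not appear in the abstract: \(n\) denotes the number of observations, \(p\) the number of features, and \(\alpha\) a hypothetical growth exponent.

\[
\text{ultrahigh-dimensional data:}\quad \log p = O(n^{\alpha}) \ \text{for some } \alpha > 0,
\qquad
\text{massive data:}\quad n \gg p .
\]

Under this reading, \(p\) may be as large as \(\exp(C n^{\alpha})\) for some constant \(C > 0\) in the first regime, whereas in the second regime \(p\) is small relative to \(n\).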

Acknowledgements

We sincerely thank Professors Zvi Drezner and Saïd Salhi for their kind invitation to write this article.

Author information

Authors and Affiliations

Department of Information Systems and Decision Sciences, California State University, Fullerton, CA, USA
Daoji Li & Yinfei Kong
International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, China
Zemin Zheng
School of Mathematics, The University of Manchester, Manchester, UK
Jianxin Pan

Corresponding author

Correspondence to Yinfei Kong.

Editor information

Editors and Affiliations

Kent Business School, University of Kent, Canterbury, UK
Saïd Salhi
Lancaster University, Lancaster, UK
John Boylan

About this chapter

Cite this chapter

Li, D., Kong, Y., Zheng, Z., Pan, J. (2022). Recent Advances in Big Data Analytics. In: Salhi, S., Boylan, J. (eds) The Palgrave Handbook of Operations Research. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-030-96935-6_25

DOI: https://doi.org/10.1007/978-3-030-96935-6_25
Published: 08 July 2022
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-030-96934-9
Online ISBN: 978-3-030-96935-6
eBook Packages: Business and Management; Business and Management (R0)
