Abstract
Unprecedented advances in digital technology have produced a revolution that is transforming science and society. Big data are being generated rapidly in many disciplines, including business, the sciences, engineering, medicine, biology, and the humanities. Such data are often characterized by a large number of features, a large volume of observations, or both. The value of big data lies in effective analysis using statistical inference and machine learning methods that are computationally scalable and efficient. Many new statistical methods and tools for big data have emerged in recent years. In this chapter, we provide a selective overview of recent developments in the theory, methods, and implementations of big data analytics. We focus on two types of big data: ultrahigh-dimensional data and massive data. The former refers to data in which the number of features may grow exponentially with the number of observations, while the latter refers to data in which the number of observations is huge and much larger than the number of features.
Acknowledgements
We sincerely thank Professors Zvi Drezner and Saïd Salhi for their kind invitation to write this article.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Li, D., Kong, Y., Zheng, Z., Pan, J. (2022). Recent Advances in Big Data Analytics. In: Salhi, S., Boylan, J. (eds) The Palgrave Handbook of Operations Research. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-030-96935-6_25
DOI: https://doi.org/10.1007/978-3-030-96935-6_25
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-030-96934-9
Online ISBN: 978-3-030-96935-6
eBook Packages: Business and Management (R0)