Abstract
Studies on missing data have increased over the past few decades. Missing data is an uncontrollable phenomenon that can occur during data collection in practically any research field. Numerous missing data imputation techniques are well documented in the literature. However, very few studies have systematically examined how a specific area has evolved while offering insight into the emerging imputation methods in that field. The primary objective of this paper is to provide a comprehensive review of studies on missing data imputation methods in classification problems from several viewpoints: (a) publication trends (by year, subject area, country, document language, and author), (b) keyword analysis, (c) the most cited documents, and (d) the most influential authors. The bibliometric analysis was conducted using VOSviewer and Harzing's Publish or Perish software, covering 430 journal articles indexed in Scopus from 1991 to June 2021. One finding is an emerging trend toward missing data imputation methods based on random forest and nearest neighbor. Overall, this research is a valuable resource for gaining insight, at a glance, into the available imputation techniques.
Availability of data and materials
The papers analyzed in this study are available in the Scopus database.
Acknowledgements
Not applicable.
Funding
This work was supported under the Collaborative Research Grant (CRG) scheme between Universiti Teknologi Malaysia (Q. K130000.2456.08G27) and Universiti Malaysia Perlis (9023-00013). This work was also funded by the Ministry of Higher Education, Malaysia, under the Fundamental Research Grant Scheme (FRGS/1/2021/TK0/UTM/02/45).
Author information
Contributions
FAA conducted the literature search and review, analyzed the data extracted from the Scopus database, and wrote the first draft of the manuscript. KRJ and WZAWM provided direction for the bibliometric review and critiqued the content. SM revised the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethical approval and consent to participate
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Adnan, F.A., Jamaludin, K.R., Wan Muhamad, W.Z.A. et al. A review of the current publication trends on missing data imputation over three decades: direction and future research. Neural Comput & Applic 34, 18325–18340 (2022). https://doi.org/10.1007/s00521-022-07702-7
DOI: https://doi.org/10.1007/s00521-022-07702-7