research-article

A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative

Published: 01 March 2023

Abstract

Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients’ predictors and outcomes of interest. However, these datasets often have missing values in a high proportion of cases, and removing those cases may introduce severe bias. Several multiple imputation algorithms have been proposed to recover the missing information under an assumed missingness mechanism. Each algorithm has strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, selecting each algorithm’s parameters and making data-related modeling choices are both crucial and challenging.
In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. Extensive experiments show that our approach can effectively identify the most promising and best-performing missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of the behavior of the different models and of how that behavior changed as we modified their parameters.
Our method is general and can be applied to different research fields and to datasets containing heterogeneous data types.
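
The abstract does not spell out how the evaluation is carried out, so the following is only a minimal sketch of one common way to compare missing-data strategies numerically, not the authors' framework. It assumes a fully observed toy dataset, removes values of a predictor under a missing-at-random mechanism, and compares a complete-case estimate and a multiply imputed estimate (pooled with Rubin's rules) against the estimate obtained on the full data. All function names (`fit_line`, `pool_rubin`, `ampute_mar`, `impute_once`, `evaluate`), the simulated data, and the toy bootstrap regression imputer are assumptions made for illustration.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Least-squares fit of y on x: (intercept, slope, slope variance, residual sd)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    sigma2 = rss / (n - 2)
    return intercept, slope, sigma2 / sxx, math.sqrt(sigma2)

def pool_rubin(estimates, variances):
    """Rubin's rules: pooled estimate and total variance over m imputed datasets."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)  # sample variance, divisor m - 1
    return q_bar, within + (1 + 1 / m) * between

def ampute_mar(x, y, rng, rate=0.4):
    """Set entries of x to None with a probability that depends only on y (MAR)."""
    my, sy = statistics.fmean(y), statistics.stdev(y)
    offset = math.log(rate / (1 - rate))
    out = []
    for a, b in zip(x, y):
        p = 1 / (1 + math.exp(-(offset + (b - my) / sy)))
        out.append(None if rng.random() < p else a)
    return out

def impute_once(x_miss, y, rng):
    """Toy imputer (assumption): regress x on y in a bootstrap of observed rows,
    then draw each missing x from the fitted line plus Gaussian noise."""
    observed = [(b, a) for a, b in zip(x_miss, y) if a is not None]
    boot = rng.choices(observed, k=len(observed))
    a0, a1, _, sd = fit_line([b for b, _ in boot], [a for _, a in boot])
    return [a if a is not None else a0 + a1 * b + rng.gauss(0, sd)
            for a, b in zip(x_miss, y)]

def evaluate(n=2000, m=20, seed=0):
    """Compare complete-case and multiply imputed slopes with the full-data slope."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [1.0 + 0.5 * a + rng.gauss(0, 1) for a in x]
    reference = fit_line(x, y)[1]  # estimate on the fully observed data
    x_miss = ampute_mar(x, y, rng)
    cc = [(a, b) for a, b in zip(x_miss, y) if a is not None]
    cc_slope = fit_line([a for a, _ in cc], [b for _, b in cc])[1]
    fits = [fit_line(impute_once(x_miss, y, rng), y) for _ in range(m)]
    mi_slope, mi_var = pool_rubin([f[1] for f in fits], [f[2] for f in fits])
    return {
        "reference_slope": reference,
        "complete_case_bias": cc_slope - reference,
        "mi_bias": mi_slope - reference,
        "mi_se": math.sqrt(mi_var),
    }

if __name__ == "__main__":
    for name, value in evaluate().items():
        print(f"{name}: {value:.4f}")
```

Repeating `evaluate` over many seeds, missingness rates, or numbers of imputations `m` would give distributions of these error measures per strategy, which is the kind of comparison the abstract describes at a high level.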

Cited By

  • (2024) Handling missing values and imbalanced classes in machine learning to predict consumer preference. Expert Systems with Applications: An International Journal 237:PC. DOI: 10.1016/j.eswa.2023.121694. Online publication date: 1-Feb-2024
  • (2024) Extremely missing numerical data in Electronic Health Records for machine learning can be managed through simple imputation methods considering informative missingness. Computer Methods and Programs in Biomedicine 242:C. DOI: 10.1016/j.cmpb.2023.107803. Online publication date: 1-Feb-2024
  • (2023) Enhancing Fairness and Accuracy in Machine Learning Through Similarity Networks. Cooperative Information Systems (3-20). DOI: 10.1007/978-3-031-46846-9_1. Online publication date: 30-Oct-2023

Published In

Journal of Biomedical Informatics, Volume 139, Issue C
Mar 2023
337 pages

Publisher

Elsevier Science

San Diego, CA, United States

Author Tags

  1. Multiple Imputation
  2. Evaluation framework
  3. Clinical informatics
  4. Diabetic patients
  5. COVID-19 severity assessment

Abbreviations

  1. EHR
  2. MI
  3. LR
  4. CS
  5. BMI
  6. N3C
  7. MCAR
  8. MAR
  9. MNAR
  10. FCS
  11. JM
  12. MIDA
  13. GAN
  14. GAIN
  15. CART
  16. RF
  17. pmm
  18. RB
  19. ER
  20. MSE
  21. SE
  22. CR
  23. OR
  24. ECMO
  25. SGLT2
  26. IPW
