Abstract
Images and structured tables are essential parts of real-world databases. Although tabular-image representation learning holds promise for creating new insights, it remains a challenging task, as tabular data is typically heterogeneous and incomplete, and presents significant modality disparities with images. Earlier works have mainly focused on simple modality fusion strategies in complete-data scenarios, without considering the missing-data issue, and are therefore limited in practice. In this paper, we propose TIP, a novel tabular-image pre-training framework for learning multimodal representations that are robust to incomplete tabular data. Specifically, TIP introduces a self-supervised learning (SSL) strategy that combines a masked tabular reconstruction task for tackling data missingness with image-tabular matching and contrastive learning objectives for capturing multimodal information. Moreover, TIP features a versatile tabular encoder tailored for incomplete, heterogeneous tabular data, as well as a multimodal interaction module for inter-modality representation learning. Experiments are performed on downstream multimodal classification tasks using both natural and medical image datasets. The results show that TIP outperforms state-of-the-art supervised and self-supervised image and multimodal methods in both complete and incomplete data scenarios. Our code is available at https://github.com/siyi-wind/TIP.
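To make the three pre-training objectives named in the abstract concrete, below is a minimal loss-level sketch assuming a standard PyTorch setup. The function names, tensor shapes, and the continuous-only reconstruction term are our illustrative assumptions, not the authors' implementation; in particular, the paper's tabular encoder also handles categorical columns and missing values, which this sketch omits. The authors' actual code is available at the repository linked above.

# Minimal PyTorch sketch of the three SSL objectives named in the abstract.
# All names and shapes here are illustrative assumptions, not the authors'
# implementation; see https://github.com/siyi-wind/TIP for the real code.
import torch
import torch.nn.functional as F

def image_tabular_contrastive_loss(img_emb, tab_emb, temperature=0.07):
    # InfoNCE over the batch: each image should match its own table row.
    img_emb = F.normalize(img_emb, dim=-1)
    tab_emb = F.normalize(tab_emb, dim=-1)
    logits = img_emb @ tab_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over rows (image->table) and columns (table->image).
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def masked_tabular_reconstruction_loss(pred, target, mask):
    # Reconstruct only masked (or missing) entries; mask is 1 where masked.
    # Simplification: treats every feature as continuous, whereas the paper's
    # encoder is designed for heterogeneous (categorical + continuous) columns.
    per_elem = F.mse_loss(pred, target, reduction='none')
    return (per_elem * mask).sum() / mask.sum().clamp(min=1)

def image_tabular_matching_loss(match_logits, is_match):
    # Binary classification: does this (image, table) pair belong together?
    return F.binary_cross_entropy_with_logits(match_logits, is_match.float())

if __name__ == "__main__":
    B, D, n_feat = 32, 128, 20                # batch, embedding dim, #features
    img_emb, tab_emb = torch.randn(B, D), torch.randn(B, D)
    pred, target = torch.randn(B, n_feat), torch.randn(B, n_feat)
    mask = (torch.rand(B, n_feat) < 0.3).float()
    is_match = torch.randint(0, 2, (B,))
    total = (image_tabular_contrastive_loss(img_emb, tab_emb)
             + masked_tabular_reconstruction_loss(pred, target, mask)
             + image_tabular_matching_loss(torch.randn(B), is_match))
    print(f"combined pre-training loss: {total.item():.3f}")

In the actual framework these losses would be computed on the outputs of the tabular encoder and the multimodal interaction module; the sketch only captures the loss-level structure of the SSL strategy.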
Acknowledgements
This research has been conducted using the UK Biobank Resource under Application Number 40616. The MR images presented in the figures are reproduced with the kind permission of UK Biobank ©. We also thank Paul Hager from the Lab for AI in Medicine at the Technical University of Munich for providing the pre-processing code for the UKBB dataset. DO’R is supported by the Medical Research Council (MC_UP_1605/13); National Institute for Health Research (NIHR) Imperial College Biomedical Research Centre; and the British Heart Foundation (RG/19/6/34387, RE/24/130023, CH/P/23/80008).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Du, S., Zheng, S., Wang, Y., Bai, W., O’Regan, D.P., Qin, C. (2025). TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15073. Springer, Cham. https://doi.org/10.1007/978-3-031-72633-0_27
DOI: https://doi.org/10.1007/978-3-031-72633-0_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72632-3
Online ISBN: 978-3-031-72633-0
eBook Packages: Computer Science; Computer Science (R0)