
TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Images and structured tables are essential parts of real-world databases. Though tabular-image representation learning is promising for creating new insights, it remains a challenging task, as tabular data is typically heterogeneous and incomplete, presenting significant modality disparities with images. Earlier works have mainly focused on simple modality fusion strategies in complete data scenarios, without considering the missing data issue, and thus are limited in practice. In this paper, we propose TIP, a novel tabular-image pre-training framework for learning multimodal representations robust to incomplete tabular data. Specifically, TIP investigates a novel self-supervised learning (SSL) strategy, including a masked tabular reconstruction task to tackle data missingness, and image-tabular matching and contrastive learning objectives to capture multimodal information. Moreover, TIP proposes a versatile tabular encoder tailored for incomplete, heterogeneous tabular data and a multimodal interaction module for inter-modality representation learning. Experiments are performed on downstream multimodal classification tasks using both natural and medical image datasets. The results show that TIP outperforms state-of-the-art supervised/SSL image/multimodal methods in both complete and incomplete data scenarios. Our code is available at https://github.com/siyi-wind/TIP.
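The pre-training objectives named in the abstract (a masked tabular reconstruction task plus an image-tabular contrastive objective) can be sketched as follows. This is a minimal illustrative sketch in numpy, assuming a standard symmetric InfoNCE formulation for the contrastive term and mean-squared error on masked entries for the reconstruction term; the function names, the `temperature` value, and the toy data are illustrative assumptions, not the paper's actual implementation (which is available at the linked repository).

```python
import numpy as np

def info_nce(img_emb, tab_emb, temperature=0.1):
    """Symmetric InfoNCE loss between paired image and tabular embeddings.

    Matched image/tabular pairs sit on the diagonal of the similarity
    matrix; all other entries in the batch act as negatives.
    """
    # L2-normalise so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    tab = tab_emb / np.linalg.norm(tab_emb, axis=1, keepdims=True)
    logits = img @ tab.T / temperature            # (N, N) similarity matrix
    n = len(img)
    # cross-entropy with the diagonal as the target class, in both directions
    log_sm_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_t2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    idx = np.arange(n)
    return -0.5 * (log_sm_i2t[idx, idx].mean() + log_sm_t2i[idx, idx].mean())

def masked_reconstruction_loss(x, x_hat, mask):
    """MSE computed only on the artificially masked tabular entries."""
    return ((x - x_hat) ** 2 * mask).sum() / max(mask.sum(), 1)

# Toy example: 4 paired samples with 8-dim embeddings
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
tab = img + 0.01 * rng.normal(size=(4, 8))       # nearly aligned pairs
contrastive = info_nce(img, tab)

x = rng.normal(size=(4, 6))                       # raw tabular features
mask = (rng.random((4, 6)) < 0.3).astype(float)   # entries hidden during pre-training
recon = masked_reconstruction_loss(x, np.zeros_like(x), mask)
```

In the actual framework the two terms would be produced by the tabular encoder and multimodal interaction module and summed into one pre-training loss; the sketch only shows the shape of each objective.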



Acknowledgements

This research has been conducted using the UK Biobank Resource under Application Number 40616. The MR images presented in the figures are reproduced with the kind permission of UK Biobank ©. We also thank Paul Hager from the Lab for AI in Medicine at the Technical University of Munich for providing the pre-processing code for the UKBB dataset. DO’R is supported by the Medical Research Council (MC_UP_1605/13); National Institute for Health Research (NIHR) Imperial College Biomedical Research Centre; and the British Heart Foundation (RG/19/6/34387, RE/24/130023, CH/P/23/80008).

Author information

Correspondence to Siyi Du or Chen Qin.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 11045 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Du, S., Zheng, S., Wang, Y., Bai, W., O’Regan, D.P., Qin, C. (2025). TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15073. Springer, Cham. https://doi.org/10.1007/978-3-031-72633-0_27

  • DOI: https://doi.org/10.1007/978-3-031-72633-0_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72632-3

  • Online ISBN: 978-3-031-72633-0

  • eBook Packages: Computer Science (R0)
