Code for paper: Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity
- Abstract
- Overview
- Environment setup
- Main code
- Special remarks
- Evaluation results
- Citation
- Acknowledgement
Multimodal machine learning with missing modalities is an increasingly relevant challenge arising in various applications such as healthcare. This paper extends current research on missing modalities to the low-data regime, i.e., a downstream task that suffers from both missing modalities and a limited sample size. This problem setting is particularly challenging and also practical, as it is often expensive to obtain full-modality data and sufficient annotated training samples. We propose to use retrieval-augmented in-context learning to address these two crucial issues by unleashing the potential of a transformer's in-context learning ability. Diverging from existing methods, which primarily belong to the parametric paradigm and often require sufficient training samples, our work exploits the value of the available full-modality data, offering a novel perspective on resolving this challenge. The proposed data-dependent framework exhibits a higher degree of sample efficiency and is empirically demonstrated to enhance the classification model's performance on both full- and missing-modality data in the low-data regime across various multimodal learning tasks. When only 1% of the training data are available, our proposed method demonstrates an average improvement of 6.1% over a recent strong baseline across various datasets and missing states. Notably, our method also reduces the performance gap between full-modality and missing-modality data compared with the baseline.
Overview of the proposed method (panel (a) assumes each sample contains data with two modalities).
Our experimental framework and environment configuration are based on missing-aware prompts; see that repository for details. We use four datasets for evaluation:
- MedFuse-I (see MedFuse for details)
- MedFuse-P (see MedFuse for details)
- UPMC Food-101 (see missing-aware prompts for details)
- Hateful Memes (see missing-aware prompts for details)
See Main_code for the implementation details of different ICL modules.
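As a rough, self-contained illustration of the retrieval step behind retrieval-augmented ICL (the function and variable names below are hypothetical, not this repository's API):

```python
import torch
import torch.nn.functional as F

def retrieve_neighbors(query_emb, pool_embs, k=5):
    """Return indices of the k nearest full-modality samples (cosine
    similarity) to serve as in-context examples for the query."""
    query = F.normalize(query_emb, dim=-1)  # (d,)
    pool = F.normalize(pool_embs, dim=-1)   # (n, d)
    sims = pool @ query                     # (n,) cosine similarities
    return sims.topk(k).indices

# Hypothetical usage: pool_embs are embeddings of the available full-modality
# samples; query_emb embeds a (possibly missing-modality) query sample.
pool_embs = torch.randn(100, 768)
query_emb = torch.randn(768)
neighbor_ids = retrieve_neighbors(query_emb, pool_embs, k=5)
```

The retrieved full-modality neighbors would then be supplied to the transformer as in-context examples alongside the query; refer to Main_code for the actual implementation.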
- The code is based on PyTorch Lightning. To ensure reproducibility of the experimental results, consider the following (a minimal setup sketch follows this list):
- set the global seed with `pl.seed_everything(seed)`
- replace operations that lack deterministic implementations, e.g., change
  `torch.nn.functional.interpolate(x, scale_factor=scale_factor, mode='bilinear', align_corners=True)`
  to
  `torch.nn.functional.interpolate(x, scale_factor=scale_factor)`
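A minimal sketch of such a reproducible setup, assuming a standard PyTorch Lightning workflow (the seed value and `Trainer` flag below are illustrative, not prescribed by this repository):

```python
import torch
import pytorch_lightning as pl

# Seed Python, NumPy, and PyTorch RNGs; workers=True also seeds dataloader workers.
pl.seed_everything(42, workers=True)

# Ask Lightning to force deterministic algorithms where PyTorch provides them.
trainer = pl.Trainer(deterministic=True)

x = torch.randn(1, 3, 8, 8)
# Bilinear upsampling with align_corners=True has no deterministic CUDA backward,
# so the default (nearest-neighbor) mode is used instead:
y = torch.nn.functional.interpolate(x, scale_factor=2)
```

With `deterministic=True`, PyTorch raises an error whenever a training step hits an operation without a deterministic implementation, which is how a non-deterministic `interpolate` call like the one above would be caught.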
- Duplicate samples in the MedFuse-I and MedFuse-P datasets. See Issues for the solution.
- A zero-sample error in the val and test splits of the MedFuse-I and MedFuse-P datasets. See Issues for the solution.
The evaluation results for all datasets under various missing-modality cases and sample sizes are shown in the tables below. Bold numbers indicate the best performance.
With sufficient target dataset size (notably for
If you use this code for your research, please consider citing:
@article{zhi2024borrowing,
  title={Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity},
  author={Zhi, Zhuo and Liu, Ziquan and Elbadawi, Moe and Daneshmend, Adam and Orlu, Mine and Basit, Abdul and Demosthenous, Andreas and Rodrigues, Miguel},
  journal={arXiv preprint arXiv:2403.09428},
  year={2024}
}
This code is based on ViLT.