
Multi-Source Augmentation and Composite Prompts for Visual Recognition with Missing Modality

Published: 07 June 2024
DOI: 10.1145/3652583.3658105

Abstract

In multimodal learning for visual recognition, missing modality is a common issue that can significantly degrade the performance and robustness of vision-language models. Most existing approaches consider only the situation where a single, fixed modality (either image or text) is missing and then apply a data augmentation method to recover the missing data. In reality, however, either the text or the image may be missing, and an augmentation method that is effective for one modality may be unsuitable for the other, so text and image require distinct augmentation methods. Other approaches aim to make vision-language models themselves robust to missing inputs, but because most of them involve significant modifications to complex model structures and require extensive retraining, they are impractical under limited computational resources. To address these limitations, we develop a Multi-source Augmentation and Composite Prompts method (MACP) that alleviates the performance degradation caused by missing modalities at both the data and model levels. At the data level, we design a multi-source data augmentation framework that integrates several augmentation methods with a data selector to restore the missing data of each image-text sample as faithfully as possible. At the model level, we design a method for generating prompt vectors that simultaneously indicate which modality is missing from the model input and which source supplied the augmented data. Through prompt tuning, these prompts enhance the ability of the vision-language model to handle different input types in low-resource settings. Experimental results on three vision-language datasets demonstrate the effectiveness of our approach in mitigating the impact of missing modalities. Code is available.
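The abstract only summarizes the composite-prompt mechanism, so the sketch below illustrates one plausible reading of it: two small learnable prompt banks, one indexed by the missing-modality case and one by the augmentation source chosen by the data selector, whose vectors are concatenated and prepended to the fused image-text tokens of a frozen backbone. This is a minimal PyTorch sketch, not the authors' released implementation; the class name CompositePrompt, the prompt length, and the case/source vocabularies are all illustrative assumptions.

```python
# Minimal sketch of composite prompts for missing-modality inputs.
# All sizes and names below are assumptions for illustration only.
import torch
import torch.nn as nn

MISSING_CASES = 3   # assumed: complete, text missing, image missing
NUM_SOURCES = 4     # assumed: e.g. captioner, retrieval, diffusion, none
PROMPT_LEN = 4      # assumed tokens per prompt segment
EMBED_DIM = 768     # hidden size of the (frozen) vision-language backbone

class CompositePrompt(nn.Module):
    """Builds prompt vectors that jointly encode which modality is missing
    and which augmentation source supplied the recovered data."""
    def __init__(self):
        super().__init__()
        self.missing_prompts = nn.Parameter(
            torch.randn(MISSING_CASES, PROMPT_LEN, EMBED_DIM) * 0.02)
        self.source_prompts = nn.Parameter(
            torch.randn(NUM_SOURCES, PROMPT_LEN, EMBED_DIM) * 0.02)

    def forward(self, tokens, missing_case, source_id):
        # tokens: (batch, seq_len, EMBED_DIM) fused image-text embeddings
        # missing_case, source_id: (batch,) integer indices
        prompts = torch.cat(
            [self.missing_prompts[missing_case],   # (batch, PROMPT_LEN, D)
             self.source_prompts[source_id]],      # (batch, PROMPT_LEN, D)
            dim=1)
        # Prepend the composite prompt to the token sequence; only the two
        # prompt banks are trained while the backbone stays frozen.
        return torch.cat([prompts, tokens], dim=1)

# Usage: sample 0 lacks text (restored by a captioner), sample 1 is complete.
prompter = CompositePrompt()
tokens = torch.randn(2, 40, EMBED_DIM)
missing_case = torch.tensor([1, 0])
source_id = torch.tensor([0, 3])
out = prompter(tokens, missing_case, source_id)
print(out.shape)  # torch.Size([2, 48, 768])
```

Because only the two small prompt banks receive gradients, a setup like this matches the low-resource prompt-tuning regime the abstract describes: the backbone is never retrained, yet the model is told both what is missing and where the replacement came from.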



Published In

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval
May 2024
1379 pages
ISBN: 9798400706196
DOI: 10.1145/3652583

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. data augmentation
  2. missing modality
  3. multimodal learning
  4. prompt tuning
  5. vision-language model

Qualifiers

  • Research-article

Funding Sources

  • The National Key R&D Program of China

Conference

ICMR '24

Acceptance Rates

Overall acceptance rate: 254 of 830 submissions (31%)
