DOI: 10.5555/3692070.3692288

Predictive dynamic fusion

Published: 21 July 2024

Abstract

Multimodal fusion is crucial in joint decision-making systems for rendering holistic judgments. Since multimodal data changes in open environments, dynamic fusion has emerged and achieved remarkable progress in numerous applications. However, most existing dynamic multimodal fusion methods lack theoretical guarantees and are prone to converging to suboptimal solutions, which makes them unreliable and unstable. To address this issue, we propose a Predictive Dynamic Fusion (PDF) framework for multimodal learning. We analyze multimodal fusion from a generalization perspective and theoretically derive the predictable Collaborative Belief (Co-Belief) with Mono- and Holo-Confidence, which provably reduces the upper bound of the generalization error. Accordingly, we further propose a relative calibration strategy that calibrates the predicted Co-Belief for potential uncertainty. Extensive experiments on multiple benchmarks confirm the superiority of our method. Our code is available at https://github.com/Yinan-Xia/PDF.
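To make the setting concrete, below is a minimal PyTorch sketch of confidence-weighted late fusion, the general family of dynamic fusion methods that PDF belongs to: each modality produces class logits plus a scalar belief score, and the fused prediction weights the per-modality logits by their normalized beliefs. The class and attribute names (ConfidenceWeightedFusion, conf_heads) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual PDF code.

```python
import torch
import torch.nn as nn

class ConfidenceWeightedFusion(nn.Module):
    """Illustrative confidence-weighted late fusion (not the official PDF code).

    Each modality has its own encoder and classifier; a small head predicts
    a scalar belief per sample, and the fused logits are the belief-weighted
    sum of the per-modality logits.
    """

    def __init__(self, encoders, feat_dims, num_classes):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.classifiers = nn.ModuleList(nn.Linear(d, num_classes) for d in feat_dims)
        self.conf_heads = nn.ModuleList(nn.Linear(d, 1) for d in feat_dims)  # per-modality belief

    def forward(self, inputs):
        logits, beliefs = [], []
        for x, enc, clf, head in zip(inputs, self.encoders, self.classifiers, self.conf_heads):
            feat = enc(x)
            logits.append(clf(feat))    # (B, C) per-modality prediction
            beliefs.append(head(feat))  # (B, 1) raw belief score
        weights = torch.softmax(torch.cat(beliefs, dim=1), dim=1)  # (B, M), sums to 1 per sample
        stacked = torch.stack(logits, dim=1)                       # (B, M, C)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)        # (B, C) fused logits
```

Because the weights are predicted per sample, the relative contribution of each modality adapts to the input at hand; this input-dependent weighting is the behavior that the paper's Co-Belief and relative calibration are designed to make provably reliable.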

Information

Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%
