DOI: 10.5555/3692070.3692288

Predictive dynamic fusion

Published: 21 July 2024

Abstract

Multimodal fusion is crucial in joint decision-making systems for rendering holistic judgments. Since multimodal data changes in open environments, dynamic fusion has emerged and achieved remarkable progress in numerous applications. However, most existing dynamic multimodal fusion methods lack theoretical guarantees and are prone to converging to suboptimal solutions, which makes them unreliable and unstable. To address this issue, we propose a Predictive Dynamic Fusion (PDF) framework for multimodal learning. We analyze multimodal fusion from a generalization perspective and theoretically derive the predictable Collaborative Belief (Co-Belief) with Mono- and Holo-Confidence, which provably reduces the upper bound of the generalization error. Accordingly, we further propose a relative calibration strategy that calibrates the predicted Co-Belief for potential uncertainty. Extensive experiments on multiple benchmarks confirm the superiority of our method. Our code is available at https://github.com/Yinan-Xia/PDF.
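To make the setting concrete, below is a minimal PyTorch sketch of confidence-weighted late fusion, the general family of dynamic fusion methods that PDF belongs to: each modality produces class logits plus a scalar belief score, and the fused prediction weights the per-modality logits by their normalized beliefs. The class and attribute names (ConfidenceWeightedFusion, conf_heads) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual PDF code.

```python
import torch
import torch.nn as nn

class ConfidenceWeightedFusion(nn.Module):
    """Illustrative confidence-weighted late fusion (not the official PDF code).

    Each modality has its own encoder and classifier; a small head predicts
    a scalar belief per sample, and the fused logits are the belief-weighted
    sum of the per-modality logits.
    """

    def __init__(self, encoders, feat_dims, num_classes):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.classifiers = nn.ModuleList(nn.Linear(d, num_classes) for d in feat_dims)
        self.conf_heads = nn.ModuleList(nn.Linear(d, 1) for d in feat_dims)  # per-modality belief

    def forward(self, inputs):
        logits, beliefs = [], []
        for x, enc, clf, head in zip(inputs, self.encoders, self.classifiers, self.conf_heads):
            feat = enc(x)
            logits.append(clf(feat))    # (B, C) per-modality prediction
            beliefs.append(head(feat))  # (B, 1) raw belief score
        weights = torch.softmax(torch.cat(beliefs, dim=1), dim=1)  # (B, M), sums to 1 per sample
        stacked = torch.stack(logits, dim=1)                       # (B, M, C)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)        # (B, C) fused logits
```

Because the weights are predicted per sample, the relative contribution of each modality adapts to the input at hand; this input-dependent weighting is the behavior that the paper's Co-Belief and relative calibration are designed to make provably reliable.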

Information

Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%
