DOI: 10.1109/IRI51335.2021.00034
Research Article

Hierarchical Multimodal Fusion Network with Dynamic Multi-task Learning

Published: 10 August 2021

Abstract

Real-world data often contain multiple modalities and non-exclusive labels. Multimodal fusion is a vital step in multimodal learning that integrates features from the various modalities in a shared vector space so that the classifier can use the fused vector to generate the final prediction score. Common multimodal fusion approaches rarely consider the cross-modality interactions that play an essential role in exploiting the inter-modality relationship and, subsequently, in creating the joint modality embedding. In this paper, we propose a hierarchical multimodal fusion framework with dynamic multi-task learning. It focuses on modeling the joint embedding space for all cross-modality interactions and on adjusting the task losses for optimal performance. The proposed model uses a novel hierarchical multimodal fusion network that learns the cross-modal interactions among all combinations of modalities and dynamically allocates the weight for each modality pair in a sample-aware fashion. Furthermore, a novel dynamic multi-task learning approach handles the multi-label problem by automatically adjusting the learning progress at both the task level and the sample level. We show that the proposed framework outperforms the baselines and several state-of-the-art methods. We also demonstrate the flexibility and modularity of the proposed hierarchical multimodal fusion and dynamic multi-task learning units, which can be applied to various types of networks.
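The abstract leaves the exact fusion and weighting formulas to the full paper. The PyTorch sketch below is only a minimal illustration, under assumed shapes and formulas, of the two ideas the abstract names: fusing every pair of modality embeddings with sample-aware pair weights, and re-weighting per-task losses dynamically in a multi-label setting. The module names, dimensions, gating network, and weighting rule here are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: sample-aware pairwise fusion + dynamically weighted
# multi-label loss. All design choices below are assumptions, not the paper's exact method.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwiseFusion(nn.Module):
    """Fuses every pair of modality embeddings and mixes the pair outputs
    with weights predicted per sample (sample-aware allocation)."""

    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        self.pairs = list(itertools.combinations(range(num_modalities), 2))
        # One small fusion head per modality pair (assumed concat + MLP; the paper
        # may use a different cross-modal interaction operator).
        self.pair_heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU()) for _ in self.pairs]
        )
        # Predicts one weight per pair from the concatenation of all modality features.
        self.weight_net = nn.Linear(num_modalities * dim, len(self.pairs))

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of (batch, dim) embeddings, one per modality.
        pair_embs = torch.stack(
            [head(torch.cat([feats[i], feats[j]], dim=-1))
             for head, (i, j) in zip(self.pair_heads, self.pairs)],
            dim=1,
        )  # (batch, num_pairs, dim)
        # Sample-aware weights over the modality pairs.
        w = F.softmax(self.weight_net(torch.cat(feats, dim=-1)), dim=-1)
        return (w.unsqueeze(-1) * pair_embs).sum(dim=1)  # (batch, dim) joint embedding


def dynamic_multitask_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Multi-label BCE where each task (label) is re-weighted by its current
    loss magnitude: a simple stand-in for task-level dynamic weighting."""
    per_task = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    ).mean(dim=0)                                  # (num_labels,) average loss per task
    task_w = (per_task / per_task.sum()).detach()  # harder tasks receive larger weight
    return (task_w * per_task).sum()
```

This sketch omits the sample-level adjustment the abstract also mentions; a fuller version would derive the weights from per-sample losses as training progresses rather than from the batch-averaged task losses alone.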


        Published In

        cover image Guide Proceedings
        2021 IEEE 22nd International Conference on Information Reuse and Integration for Data Science (IRI)
        Aug 2021
        452 pages

        Publisher

        IEEE Press

        Publication History

        Published: 10 August 2021
