research-article

Open access

Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis

Authors:

Liqiang Nie

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

Pages 15 - 23

https://doi.org/10.1145/3503161.3548211

Published: 10 October 2022

Abstract

Existing studies on multimodal sentiment analysis rely heavily on the textual modality and, as a result, learn spurious correlations between individual words and sentiment labels. This greatly hinders model generalization. To address this problem, we define the task of out-of-distribution (OOD) multimodal sentiment analysis, which aims to estimate and mitigate the adverse effect of the textual modality in order to achieve strong OOD generalization. To this end, we adopt causal inference, which inspects causal relationships via a causal graph. From the graph, we find that the spurious correlations are attributable to the direct effect of the textual modality on the model prediction, whereas the indirect effect is more reliable because it takes multimodal semantics into account. Inspired by this, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis, which captures the direct effect of the textual modality with an extra text model and estimates the indirect effect with a multimodal model. During inference, we first estimate the direct effect by counterfactual inference and then subtract it from the total effect of all modalities to obtain the indirect effect for reliable prediction. Extensive experiments demonstrate the effectiveness and generalization ability of the proposed framework.
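The inference step described in the abstract (subtracting the text-only direct effect from the total effect) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the two branches are fused or trained, so the function names `debiased_scores` and `predict`, the three-class label order, and the example logits below are all hypothetical.

```python
from typing import List, Sequence


def debiased_scores(total_effect: Sequence[float],
                    direct_effect: Sequence[float]) -> List[float]:
    """Return total effect minus the text-only direct effect, per class.

    total_effect:  class scores from the full (multimodal + text) prediction.
    direct_effect: class scores attributed to the text branch alone,
                   obtained in the counterfactual setting (assumed given).
    """
    if len(total_effect) != len(direct_effect):
        raise ValueError("score vectors must have the same length")
    return [t - d for t, d in zip(total_effect, direct_effect)]


def predict(scores: Sequence[float]) -> int:
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])


if __name__ == "__main__":
    # Hypothetical logits for classes [negative, neutral, positive].
    # The text branch strongly favours "positive", e.g. because of a
    # sentiment-laden word that is spuriously correlated with the label.
    total = [1.0, 0.5, 2.0]
    direct = [0.2, 0.1, 1.8]

    indirect = debiased_scores(total, direct)
    print("biased prediction:  ", predict(total))
    print("debiased prediction:", predict(indirect))
```

In this toy case the prediction changes once the text-only contribution is removed, which is the behaviour the framework aims for when the textual shortcut is misleading. How the direct effect is scaled before subtraction is a design choice that the abstract leaves open.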

Supplementary Material

MP4 File (mm22-fp1888.mp4)

Presentation video

Information & Contributors

Information

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

October 2022

7537 pages

ISBN: 9781450392037

DOI: 10.1145/3503161

General Chairs:
João Magalhães (NOVA University of Lisbon, Portugal)
Alberto del Bimbo (University of Florence, Italy)
Shin'ichi Satoh (National Institute of Informatics, Japan)
Nicu Sebe (University of Trento, Italy)

Program Chairs:
Xavier Alameda-Pineda (Inria, Grenoble, France)
Qin Jin (Renmin University of China, China)
Vincent Oria (New Jersey Institute of Technology, USA)
Laura Toni (University College London, UK)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Natural Science Foundation of China

Conference

MM '22

Sponsor:

SIGMM

MM '22: The 30th ACM International Conference on Multimedia

October 10 - 14, 2022

Lisboa, Portugal

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
