1 Introduction
In recent years, multimodal sentiment analysis has emerged as a prominent research direction. It aims to predict sentiments from multiple modalities, including audio, visual, and text signals. Compared to unimodal sentiment analysis, multimodal sentiment analysis can learn more comprehensive sentiment representations, leading to significant performance improvements. It has wide applications in various fields, such as human–computer interaction and autonomous driving.
Existing multimodal sentiment analysis methods utilize distinct encoders to extract unimodal representations and then focus on multimodal fusion and cross-modal alignment [4, 50]. For example, gated mechanisms [28, 34, 38], cross-modal attentions [21, 31], or graph neural networks [23, 41] are employed to fuse representations from different modalities, yielding a comprehensive multimodal representation. Knowledge distillation [16, 17] or contrastive learning [24, 26] is utilized to reduce the gap between modalities and learn a unified multimodal representation. Moreover, based on the Bidirectional Encoder Representations from Transformers (BERT) model and multimodal inputs, All-modalities-in-One BERT (AOBERT) [13] is pre-trained on Multimodal Masked Language Modeling (MMLM) and Alignment Prediction (AP) tasks to learn multimodal representations for multimodal sentiment analysis. Despite achieving commendable results, existing methods often rely on Pre-Trained Language Models (PLMs) (e.g., BERT [6] or XLNet [43]) to extract semantic information from textual data, lacking an in-depth understanding of the logical relationships within the text modality. Language (or text), as the embodiment of human wisdom, contains rich and intricate information. It can reflect human sentiments not only explicitly through the semantics of certain keywords or phrases but also implicitly through inherent logical relationships. In short, utilizing the logical reasoning information of text for multimodal sentiment analysis has not yet been explored.
For a complex task, humans typically decompose it into intermediate steps and address them sequentially, rather than solving the task in a black-box way. For example, given the question "If you have $50 in your wallet and someone gives you $30 more, how much money do you have in total?", instead of simply answering "$80," a complete thought process would involve steps such as: "I start with $50. I receive an additional $30. By combining these amounts, I arrive at a total of $50 \(+\) $30 \(=\) $80. Therefore, the answer is $80." This stepwise reasoning process is called Chain-of-Thought (CoT) [39].
With the development of Large Language Models (LLMs) and prompt learning [1], Wei et al. [39] proposed the first CoT prompting method for several reasoning tasks, where some CoT exemplars and a question are input into an LLM to generate a step-by-step reasoning process and answer. Lu et al. [20] introduced a similar method for science question-answering tasks. Unlike the former approach, which outputs the reasoning process before arriving at the answer, the latter method presents the answer first and then gives step-by-step explanations. MM-CoT [49] is the first to study CoT reasoning in different modalities, which fine-tunes the T5 pre-trained model [27] with reasoning steps annotated in the dataset. These methods [20, 39, 49] necessitate CoT annotations of some or all samples, termed few-shot or full-shot approaches, respectively. Recent studies have introduced zero-shot CoT prompting. For example, Kojima et al. [14] proposed the first zero-shot CoT prompting, where "Let's think step by step" is used to trigger LLMs to generate step-by-step reasoning processes. Similarly, Wang et al. [35] proposed Plan-and-Solve (PS) CoT prompting, employing "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step." as the trigger phrase. These task-agnostic sentences induce LLMs to autonomously generate reasoning processes without CoT exemplars, achieving satisfactory results in tasks such as arithmetic reasoning and common-sense reasoning [3].
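To make these trigger-based prompts concrete, the following sketch assembles the two zero-shot prompts described above and queries an LLM. The trigger phrases are quoted from the cited works; the helper name ask_llm, the model string, and the use of the OpenAI chat-completions API (pre-1.0 Python SDK) are illustrative assumptions rather than details of the cited methods.

```python
# Illustrative sketch of zero-shot CoT prompting; assumes the openai<1.0 Python SDK.
# The helper name and model string are assumptions, not part of the cited methods.
import openai

STANDARD_COT_TRIGGER = "Let's think step by step."
PS_COT_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)

def ask_llm(question: str, trigger: str, model: str = "gpt-3.5-turbo") -> str:
    """Append a task-agnostic trigger phrase so the LLM emits step-by-step reasoning."""
    prompt = f"Q: {question}\nA: {trigger}"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic (argmax) decoding
    )
    return response["choices"][0]["message"]["content"]
```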
However, applying CoT reasoning to multimodal sentiment analysis presents challenges. On one hand, few-shot or full-shot CoT prompting methods [20, 39, 49] require some or all CoT exemplars for prompting or fine-tuning. For the multimodal sentiment analysis task, there are no readily available reasoning exemplars, and manually annotating CoT exemplars is time-consuming and costly, making it challenging to ensure consistency and completeness in multi-step reasoning. On the other hand, existing zero-shot CoT prompting methods [14, 35] are designed for unimodal text data and could lead to hallucinations or incomplete reasoning processes. Therefore, utilizing CoT reasoning for multimodal sentiment analysis remains an open problem.
To address this issue, we introduce MM-PEAR-CoT reasoning for multimodal sentiment analysis, where PEAR stands for Preliminaries, quEstion, Answer, and Reason. Specifically, we propose a zero-shot CoT prompt named PEAR, comprising four parts: preliminaries, question, answer, and reason. The PEAR prompt is fed into an LLM to generate text-based reasoning steps and zero-shot sentiment predictions. Moreover, to alleviate irrational reasoning caused by hallucinations of the LLM, we design a Cross-Modal Filtering and Fusion (CMFF) module. The filtering submodule utilizes information from the audio-visual modality to suppress unreasonable steps, while the fusion submodule integrates high-level reasoning information and cross-modal complementary information during semantic representation learning to obtain multimodal representations. The multimodal representation and the text-based zero-shot sentiment prediction are then fused to obtain the final sentiment prediction.
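As a concrete illustration of the four-part structure, the minimal sketch below assembles a PEAR-style prompt for a single utterance. Only the ordering of the four parts follows our design; the wording of each part and the example utterance are illustrative assumptions, not the exact template used in our experiments.

```python
# Minimal sketch of a PEAR-style prompt; the wording of each part is illustrative,
# only the four-part structure (preliminaries, question, answer, reason) is fixed.
def build_pear_prompt(utterance: str) -> str:
    preliminaries = (
        "Preliminaries: Sentiment intensity is a real value in [-3, 3], where -3 is "
        "strongly negative, 0 is neutral, and 3 is strongly positive."
    )
    question = f'Question: What is the sentiment intensity of the utterance "{utterance}"?'
    answer = "Answer: First give the sentiment intensity score."
    reason = "Reason: Then explain the score step by step."
    return "\n".join([preliminaries, question, answer, reason])

print(build_pear_prompt("And it really is a fun movie to watch."))  # hypothetical example
```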
In summary, our contributions are as follows:
— We introduce the PEAR-CoT prompt to explore the implicit logical relationships in the text modality. The stepwise reasoning process provides an in-depth understanding of the text modality.
— We propose the CMFF module to learn a more comprehensive multimodal representation. It first utilizes the audio and visual modalities to alleviate hallucination in text-based CoT reasoning and then integrates high-level reasoning information and cross-modal complementary information during text semantic representation learning.
— Extensive experiments on two multimodal sentiment analysis benchmark datasets demonstrate the effectiveness of multimodal PEAR-CoT reasoning. To the best of our knowledge, this is the first work to apply CoT reasoning to multimodal sentiment analysis.
The remaining sections of this article are organized as follows. Section 2 provides a brief overview of related work in multimodal sentiment analysis and CoT prompting. Section 3 introduces the multimodal PEAR-CoT reasoning for multimodal sentiment analysis. Section 4 presents detailed experiments and analyses. The conclusion is summarized in Section 5.
4 Experiments
We conduct extensive experiments to address the following Research Questions (RQs):
— RQ1: How does MM-PEAR-CoT compare with existing multimodal sentiment analysis approaches?
— RQ2: Are all modules within MM-PEAR-CoT necessary and effective?
— RQ3: What impact does the position of the CMFF module have on multimodal sentiment analysis?
— RQ4: How well does MM-PEAR-CoT generalize across different text semantic backbones?
— RQ5: How well does MM-PEAR-CoT generalize across different reasoning backbones?
— RQ6: How generalizable is PEAR-CoT in the visual modality?
— RQ7: How does PEAR-CoT compare with existing zero-shot CoT reasoning methods?
— RQ8: How does the CMFF module alleviate hallucinations in the CoT?
4.1 Datasets and Metrics
4.1.1 Datasets.
Experiments are conducted on two multimodal sentiment analysis datasets, CMU-MOSI [46] and CMU-MOSEI [47], both collected from YouTube. CMU-MOSI consists of 2,199 short monologue video clips, with 1,284 samples in the training set, 229 in the validation set, and 686 in the test set. CMU-MOSEI consists of 22,856 video clips, with 16,326 samples in the training set, 1,871 in the validation set, and 4,659 in the test set, according to the standard dataset split. The dataset is gender balanced, and all sentence utterances are randomly chosen from various topics and monologue videos. The videos are transcribed and properly punctuated. Each video clip in both datasets is annotated with a sentiment score ranging from \(-\)3 to 3, where \(-\)3 and 3 indicate strongly negative and strongly positive sentiment, respectively.
4.1.2 Metrics.
To evaluate the performance of our models, we adopt the following metrics based on previous works: binary accuracy (Acc\({}_{2}\)), binary F1 score (F1), 7-class accuracy (Acc\({}_{7}\)), Mean Absolute Error (MAE), and the correlation between model predictions and human annotations (Corr).
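For reference, the sketch below computes these metrics from continuous predictions under one common protocol from prior work (binary metrics on the sign of non-zero labels; 7-class accuracy on scores rounded and clipped to the integers in [\(-\)3, 3]); our experiments follow the evaluation scripts of the cited baselines.

```python
# Hedged sketch of the evaluation metrics; the binarization/rounding conventions
# shown here are one common protocol from prior work, not a new definition.
import numpy as np
from sklearn.metrics import f1_score

def evaluate(preds: np.ndarray, labels: np.ndarray) -> dict:
    # Binary metrics are usually reported on samples with non-zero labels.
    nz = labels != 0
    bin_pred, bin_true = preds[nz] > 0, labels[nz] > 0
    acc2 = float(np.mean(bin_pred == bin_true))
    f1 = float(f1_score(bin_true, bin_pred, average="weighted"))
    # 7-class accuracy: round and clip scores to the integers in [-3, 3].
    cls_pred = np.clip(np.round(preds), -3, 3)
    cls_true = np.clip(np.round(labels), -3, 3)
    acc7 = float(np.mean(cls_pred == cls_true))
    mae = float(np.mean(np.abs(preds - labels)))
    corr = float(np.corrcoef(preds, labels)[0, 1])
    return {"Acc2": acc2, "F1": f1, "Acc7": acc7, "MAE": mae, "Corr": corr}
```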
4.2 Implementation Details
OpenAI GPT-3.5-turbo-0613 is utilized to generate CoTs. For a fair comparison, the unimodal encoders used in the experiments are consistent with the official ones. Specifically, we use the BERT [6] pre-trained model as the default encoder for the text modality. For the audio modality, we adopt the officially released COVAREP features [5], which are related to emotions and tone of speech and include 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmentation features, glottal source parameters, peak slope parameters, and maxima dispersion quotients [47]. For the visual modality, we use the officially released Facet features, which contain a 47-dimensional (on CMU-MOSI) or 35-dimensional (on CMU-MOSEI) facial representation. All feature sequences are aligned in the time dimension using the official toolkit.
The parameter settings for the two datasets are as follows. The initial learning rate is set to 0.0001, the dropout rate is 0.5, the number of epochs is 50, and the warmup proportion is set to 0.1. The maximum length of multimodal feature sequences is 50, and the maximum number of sentences in the CoT is 12. The number of heads in the MHA is 4. For the self-consistency strategy, we generate seven distinct CoTs: one chain is generated with the temperature set to 0 (argmax sampling), while the remaining six are generated with the temperature set to 0.7. The model is trained using the AdamW optimizer [19]. The entire framework is implemented using PyTorch [25] and runs on an NVIDIA TITAN RTX GPU.
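As an illustration of the self-consistency schedule described above, the sketch below generates the seven chains with the stated temperatures; generate_cot is a hypothetical wrapper around the GPT-3.5-turbo API, not a function of any released library.

```python
# Sketch of the self-consistency sampling schedule described above.
# `generate_cot` is a hypothetical wrapper around the GPT-3.5-turbo API.
from typing import Callable, List

def sample_cots(prompt: str,
                generate_cot: Callable[[str, float], str],
                num_chains: int = 7) -> List[str]:
    chains = [generate_cot(prompt, 0.0)]           # one chain with argmax sampling
    chains += [generate_cot(prompt, 0.7)           # remaining chains at temperature 0.7
               for _ in range(num_chains - 1)]
    return chains
```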
4.3 Experimental Results (RQ1)
As shown in Table 3, we conduct a fair comparison with the current state-of-the-art methods under the same experimental settings, including Low-rank Multimodal Fusion (LMF) [18], Multimodal Factorization Model (MFM) [32], MulT [31], Interaction Canonical Correlation Network (ICCN) [29], Modality-Invariant and -Specific Analysis (MISA) [10], Multimodal Adaptation Gate (MAG) [28], Cross-Modal BERT (CM-BERT) [42], Self-MM [44], MultiModal InfoMax (MMIM) [9], Feature-Disentangled Multimodal Emotion Recognition (FDMER) [40], HyCon [24], AOBERT [13], Multimodal Information Modulation (MIM) [48], MTMD [17], and DMD [16].
It can be seen that early methods overlooked the gap between different modalities and proposed diverse networks to directly fuse the complementary information across modalities, resulting in limited performance improvements [18, 31, 32]. Some studies fine-tuned the representation of the textual modality using the complementary information from the audio and visual modalities, thereby learning a multimodal sentiment representation [13, 28, 42]. Recent approaches leveraged knowledge distillation or contrastive learning to align representations across different modalities and facilitate knowledge transfer between them [16, 17, 24, 48]. In comparison, the proposed MM-PEAR-CoT method achieved the best results across all evaluation metrics on both datasets, demonstrating the robustness and effectiveness of our approach. For instance, MM-PEAR-CoT achieved a 2.2% improvement in binary classification accuracy on the CMU-MOSI dataset and a 1.7% improvement on the CMU-MOSEI dataset. Furthermore, in the zero-shot setting, our proposed PEAR-CoT method also obtained satisfactory results, even surpassing some supervised methods on the CMU-MOSI dataset. To the best of our knowledge, this study is the first to conduct zero-shot sentiment analysis on the CMU-MOSI and CMU-MOSEI datasets. Our method's competitive performance in both supervised and zero-shot settings makes it a promising solution for sentiment analysis tasks.
Overall, our approach has several advantages. (1) Existing methods focus on exploring the relationships between different modalities. In contrast, the proposed method leverages CoTs and LLMs to achieve an in-depth understanding of the textual modality, and the generated CoT reasoning also enhances the interpretability of sentiment analysis. (2) The proposed method integrates high-level logical reasoning information and cross-modal complementary information into the process of learning textual semantic representations, making the learned multimodal sentiment representations more compact and effective. (3) Current state-of-the-art methods such as HyCon [24], MTMD [17], and DMD [16] introduce contrastive learning or cross-modal distillation as auxiliary tasks, which increases model complexity and may lead to optimization challenges. In contrast, our proposed method relies solely on the sentiment prediction loss, which is more concise.
4.4 Ablation on Different Modules (RQ2)
Table 4 presents the results of module ablation experiments conducted on the CMU-MOSI and CMU-MOSEI datasets. Taking the results from CMU-MOSI as an example, when fine-tuning the BERT pre-trained model using only textual data, a baseline binary classification accuracy of 83.5% was achieved. Employing the CoT reasoning process of LLMs to enhance the semantic representation of the textual modality (Equation (3)) increased the accuracy to 86.0%. This demonstrates that high-level logical reasoning information can effectively aid in learning textual semantic information. The accuracy further increased to 87.2% when the CoT reasoning steps were filtered using audio-visual modality data (Equation (2)). This result underscores the necessity and effectiveness of suppressing irrational steps in the CoT. Finally, by incorporating audio-visual modality information into the process of learning textual semantic information (Equation (4)), an accuracy of 88.3% was attained, marking a 4.8% improvement over the baseline setting. Similar phenomena were observed on the CMU-MOSEI dataset. In conclusion, the ablation study indicates that an in-depth understanding of textual information contributes to more effective sentiment analysis, and cross-modal complementary information can further suppress unreasonable steps in the text-based CoT.
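For readers who prefer code, the sketch below shows one way the two CMFF submodules referenced by Equations (2)–(4) could be wired in PyTorch: audio-visual features attend over the CoT step embeddings to produce per-step weights that suppress unreasonable steps, and the filtered reasoning plus audio-visual cues are then fused into the text representation. The layer choices, dimensions, and residual wiring are illustrative assumptions rather than the exact published formulation.

```python
# Hedged sketch of the Cross-Modal Filtering and Fusion (CMFF) idea; the layer choices,
# dimensions, and residual wiring are assumptions, not the exact Equations (2)-(4).
import torch
import torch.nn as nn

class CMFFSketch(nn.Module):
    def __init__(self, d_model: int = 768, d_av: int = 64, num_heads: int = 4):
        super().__init__()
        self.av_proj = nn.Linear(d_av, d_model)  # project audio-visual features
        # Filtering: audio-visual queries score each CoT step.
        self.filter_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Fusion: text tokens attend over filtered reasoning and audio-visual cues.
        self.fuse_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens, cot_steps, av_feats):
        """text_tokens: (B, Lt, D); cot_steps: (B, Ls, D); av_feats: (B, La, d_av)."""
        av = self.av_proj(av_feats)
        # Filtering: attention of audio-visual queries over CoT steps yields per-step
        # scores; steps that conflict with the non-verbal evidence receive low weight.
        _, step_weights = self.filter_attn(query=av, key=cot_steps, value=cot_steps)
        step_scores = step_weights.mean(dim=1)                   # (B, Ls)
        filtered_steps = cot_steps * step_scores.unsqueeze(-1)   # reweight each step
        # Fusion: inject filtered reasoning and audio-visual cues into the text stream.
        context = torch.cat([filtered_steps, av], dim=1)
        fused, _ = self.fuse_attn(query=text_tokens, key=context, value=context)
        return self.norm(text_tokens + fused), step_scores

# Example shapes (illustrative): 2 samples, 50 text tokens, 12 CoT steps, 50 AV frames.
module = CMFFSketch()
out, scores = module(torch.randn(2, 50, 768), torch.randn(2, 12, 768), torch.randn(2, 50, 64))
```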
4.5 Ablation on Different Positions of CMFF (RQ3)
Figure 3 illustrates the binary accuracy of the CMFF module at different positions on the CMU-MOSI dataset. The numbers 1, 4, 8, and 12 represent adding the CMFF module after the 1st, 4th, 8th, and 12th Transformer encoding layers, respectively, while 0 indicates adding the CMFF module before the 1st Transformer encoding layer. Moreover, the results of adding the CMFF module after multiple Transformer encoding layers are also evaluated.
It can be observed that, under the default setting, adding the CMFF module before the 1st Transformer layer achieved the highest classification accuracy. As the CMFF module is added to higher layers, a decreasing trend in accuracy is evident. This suggests that incorporating high-level reasoning information and cross-modal complementary information earlier is more beneficial for learning textual semantic representations. When the CMFF module is added after several Transformer encoding layers, the accuracy decreases to 84.5%. This may be because the same information is introduced at different stages of representation learning, which, on one hand, leads to redundancy and, on the other hand, disrupts the normal process of multimodal representation learning.
4.6 Ablation on Different Text Semantic Backbones (RQ4)
To assess the generalization of the proposed MM-PEAR-CoT method across different text semantic backbones, we conducted evaluations not only on the default BERT model but also on the XLNet [43] and SentiLare [12] models. We compared our approach with the following state-of-the-art methods: MISA [10], Self-MM [44], MAG [28], and CENet [34]. The results are presented in Table 5.
It can be observed that whether using the XLNet backbone network or the more sentiment-relevant SentiLare network, our proposed method consistently outperforms the others under the same experimental settings. Specifically, MM-PEAR-CoT achieves a 1.8% improvement in binary classification accuracy compared to the current best method when utilizing the XLNet backbone network and a 1.2% improvement when using the SentiLare backbone network. This indicates the generalization and effectiveness of our proposed method across different text semantic backbone networks. Furthermore, MM-PEAR-CoT benefits from superior text semantic backbone networks, exhibiting stable performance improvements on better networks. For instance, on the CMU-MOSI dataset, MM-PEAR-CoT achieves binary classification accuracies of 88.3%, 90.2%, and 92.1% when using BERT, XLNet, and SentiLare backbone networks, respectively.
4.7 Ablation on Different Reasoning Backbones (RQ5)
To assess the generalization of the proposed method across different reasoning backbones, experiments were conducted not only on the default GPT-3.5 but also on the open-source LLaMA-70B [30] and the powerful GPT-4. The results of the proposed PEAR-CoT and MM-PEAR-CoT on the CMU-MOSI dataset are presented in Table 6.
Specifically, LLaMA-70B demonstrated relatively limited reasoning capability, achieving a 75.3% binary classification accuracy. The default GPT-3.5 obtained better results, impressively achieving an 84.5% accuracy in the binary classification task. With approximately 175 billion parameters, GPT-3.5 significantly surpasses LLaMA-70B, which has 70 billion parameters. Moreover, GPT-4 showed a further 0.8% increase in binary classification accuracy and a 0.014 improvement in the Pearson correlation coefficient over GPT-3.5. Building on PEAR-CoT, the MM-PEAR-CoT framework, which incorporates the CMFF module, achieved performance enhancements across all three LLM backbones. This suggests that the proposed approach is applicable across different reasoning backbones and can benefit from stronger ones.
4.8 Generalization of PEAR-CoT on the Visual Modality (RQ6)
In the aforementioned experiments, the PEAR-CoT prompts were applied to the text modality. Herein, we further assess the effectiveness and generalizability of the PEAR-CoT prompts within the visual modality. To achieve this, it is first necessary to describe the relevant visual modality information in natural language.
We considered two different types of information: environmental context and facial-related information. For the environmental background, descriptions of the surroundings were obtained using a video captioning model pre-trained on the Vatex dataset [37]. However, since the Vatex dataset was not collected for affective computing tasks, the generated descriptions might have limited emotion-related content. Regarding facial information, we collected, for each video, statistics of the three most and three least frequently occurring Action Units (AUs), along with the most and least likely facial expression categories. Subsequently, definitions of each AU were integrated into the Preliminaries part of the PEAR-CoT prompts. Finally, the generated visual PEAR-CoT prompts were fed into GPT-3.5 to predict sentiment intensity. The specific prompt template and examples are shown in Figure 4. It was observed that GPT can understand the relationship between specific AUs and sentiment polarity through the definitions of the AUs; for example, AUs 01, 02, and 05 are associated with positive emotions, while AUs 04, 06, 09, and 15 are associated with negative emotions. By combining environmental information and facial expression information, specific sentiment intensity predictions were obtained.
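To make the prompt construction concrete, the sketch below verbalizes the facial statistics described above and places AU definitions in the Preliminaries part. The AU glosses are standard FACS names, while the template wording and the example inputs are illustrative assumptions; the actual prompt is the one shown in Figure 4.

```python
# Sketch of verbalizing facial statistics for a visual PEAR-CoT prompt. AU glosses are
# standard FACS names; the template wording is illustrative (the real prompt is in Figure 4).
AU_DEFINITIONS = {
    "AU01": "inner brow raiser",
    "AU02": "outer brow raiser",
    "AU04": "brow lowerer",
    "AU05": "upper lid raiser",
    "AU06": "cheek raiser",
    "AU09": "nose wrinkler",
    "AU15": "lip corner depressor",
}

def build_visual_pear_prompt(frequent_aus, rare_aus, likely_expr, unlikely_expr):
    preliminaries = "Preliminaries: " + "; ".join(
        f"{au} denotes the {AU_DEFINITIONS.get(au, 'facial action unit')}"
        for au in frequent_aus + rare_aus
    ) + "."
    question = (
        "Question: In this video, the most frequent action units are "
        f"{', '.join(frequent_aus)} and the least frequent are {', '.join(rare_aus)}; "
        f"the most likely facial expression is {likely_expr} and the least likely is "
        f"{unlikely_expr}. What is the sentiment intensity in [-3, 3]?"
    )
    answer = "Answer: First give the sentiment intensity score."
    reason = "Reason: Then explain the score step by step."
    return "\n".join([preliminaries, question, answer, reason])

# Hypothetical example:
print(build_visual_pear_prompt(["AU06", "AU01"], ["AU15"], "happiness", "disgust"))
```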
To quantify the performance of the PEAR-CoT method in the visual modality, we compared it with a common Transformer [33] network. Specifically, a Transformer encoder layer is adopted to model the temporal dynamics of the visual modality and predict sentiment scores. The related results are shown in Table 7. It is important to note that the PEAR-CoT prompts were utilized in a zero-shot setting without being trained or fine-tuned on the dataset, whereas the Transformer was trained in a supervised setting. We have the following observations. First, the performance of both methods was quite limited, possibly because the visual modality contains limited emotion-related information. In comparison, the PEAR-CoT prompt method achieved comparable results, especially in terms of classification accuracy and F1 score. However, when describing the visual context in natural language, PEAR-CoT only considered the simplest statistical information, overlooking fine-grained temporal dynamics. This resulted in relatively limited performance, with a noticeable gap in the MAE and correlation coefficient metrics compared to the Transformer. Overall, PEAR-CoT remains a method with significant potential in the visual modality.
4.9 Ablation on Different Zero-shot CoT Promptings (RQ7)
To validate the effectiveness of the proposed PEAR-CoT prompt, we compared it against two different zero-shot CoT prompting methods: the standard zero-shot CoT prompt [14] and the PS zero-shot CoT prompt [35]. Table 8 presents the results of the three methods in zero-shot sentiment analysis on the CMU-MOSI dataset. It can be observed that PS-CoT yielded the poorest results, followed by the standard CoT, while the proposed PEAR-CoT achieved the best results among the three methods.
Figures 5 and 6 depict a weakly negative sample and a neutral sample, respectively. On one hand, PS-CoT was proposed for mathematical calculation tasks and is not suitable for the more subjective sentiment analysis task: evaluating sentiment scores for each part and then summing them for the final sentiment prediction is unreasonable, as shown in Figure 5. On the other hand, standard CoT reasoning may make errors in the early stages that affect the subsequent reasoning process, as illustrated in Figure 6. In contrast, the PEAR-CoT prompts first offer a global analysis or prediction result, followed by step-by-step explanations. Our approach reduces the risk of accumulating errors from inaccurate intermediate reasoning steps, thus acquiring the best results across multiple evaluation metrics.
4.10 Case Study about Hallucination Suppression (RQ8)
To further analyze the role of audio-visual modality information in the CMFF module for CoT reasoning steps, we calculated the attention score allocated to each step within the CoT reasoning process in Equation (2). Specifically, we calculated and ranked the attention scores obtained by each step within the cross-modal filtering submodule.
Figure 7 illustrates the prediction details for a weakly positive sample and a weakly negative sample.
For the positive sample, the appearance of the negative word "problem" and the negation expression "didn't love" misled the LLM into believing that the intensity of negative sentiment was greater than that of positive sentiment. However, the corresponding visual modality showed a happy expression, and the speaker's tone was light and cheerful. This indicates a contradiction between the LLM's reasoning process and the facts conveyed by the multimodal data. According to the multimodal information, the speaker was expressing a liking toward an object, but not to the extent of love. In this scenario, the attention scores reveal that the CMFF module focused more on the parts of the reasoning process that were consistent with the audio-visual modality, specifically the third step. Conversely, it paid less attention to the parts where the reasoning process conflicted with the audio-visual modality, namely the first, fourth, and fifth steps. Finally, with the text-based zero-shot prediction being negative and the audio-visual modality showing positive signals, MM-PEAR-CoT's prediction was 1.1 (weakly positive), closer to the annotation of 1.25.
For the negative sample, the presence of the word "orgasm" misled the LLM into perceiving a positive sentiment during the third step of reasoning. However, the corresponding visual modality displayed expressions of displeasure, and the speaker's tone conveyed disgust. This indicates a contradiction between the LLM's reasoning process and the facts conveyed by the multimodal data. In fact, according to the multimodal information, the speaker expressed disgust toward the excessive use of Computer-Generated Imagery in the climax of a movie. In this scenario, the attention scores indicate that the CMFF module focused more on the parts of the reasoning process that were consistent with the audio-visual modality, specifically the first and second steps. Conversely, it paid less attention to the parts where the reasoning process conflicted with the audio-visual modality, namely the third and fourth steps. Finally, with the text modality being predicted as positive and the audio-visual modality showing negative signals, MM-PEAR-CoT's prediction was \(-\)0.8 (weakly negative), closer to the annotation of \(-\)1.0.
Overall, the attention scores within the cross-modal filtering module demonstrate that MM-PEAR-CoT is capable of prioritizing reasoning processes that are more consistent with other modalities in the presence of conflicts between textual reasoning steps and multimodal information, thereby obtaining more accurate multimodal sentiment prediction results.
Despite the proposed method’s capabilities, it still makes incorrect predictions in certain instances. For example, consider the “jUzDDGyPkXU\(\_\)21” sample from CMU-MOSI, which is annotated as \(-\)1.0 (weakly negative). The corresponding text states, “The only actor who can really sell their lines is Erin Eckart.” This sentence does not display any overt negative sentiments. Thus, the text-based reasoning process does not involve steps associated with negative sentiments, leading to a zero-shot prediction of 2.0 (positive). However, the negative sentiments are conveyed through audio and visual modalities. This discrepancy results in the proposed method failing to suppress hallucinations by focusing on reasonable steps, finally leading to an erroneous prediction of 1.1 (weakly positive). This example highlights the challenges the proposed method faces when dealing with complex samples.
4.11 Discussion
4.11.1 Fairness.
MM-PEAR-CoT aims to introduce implicit reasoning information into the process of multimodal representation learning. For the multimodal sentiment analysis task, there are no readily available reasoning exemplars, and manually annotating CoT exemplars is time-consuming and costly, making it challenging to ensure consistency and completeness in multi-step reasoning. Thus, we leverage LLMs with reasoning capabilities to generate these CoTs. Furthermore, to ensure a fair comparison, we employ backbone networks consistent with current multimodal sentiment analysis methods, such as BERT, XLNet, and SentiLare, for semantic learning in the text modality. Notably, we do not utilize the powerful hidden-layer embeddings of LLMs. Extensive experiments demonstrate that the proposed method achieves superior results, further validating the effectiveness of CoT reasoning in multimodal sentiment analysis.
4.11.2 Limitation.
While we have successfully addressed a portion of the hallucination issue stemming from unimodal input by introducing the audio-visual modality, it is crucial to recognize that this problem is not entirely resolved. Exploring methods to mitigate hallucinations is a potential area for further research and investigation. Since our approach involves zero-shot prompting of the LLM to generate rationales, there is a potential risk of inheriting social biases from the LLM. These biases, encompassing cultural, ethical, and various other dimensions, may manifest in the generated rationales, potentially causing adverse effects on users. To address this concern in the future, potential solutions could include implementing constraints at each prompting stage or employing more advanced LLMs trained on unbiased resources.