1 Introduction
In recent years, multimodal sentiment analysis has emerged as a prominent research direction. It aims to predict sentiments from multiple modalities, including audio, visual, and text signals. Compared to unimodal sentiment analysis, multimodal sentiment analysis can learn more comprehensive sentiment representations, leading to significant performance improvements. It has wide applications in various fields, such as human–computer interaction and autonomous driving.
Existing multimodal sentiment analysis methods utilize distinct encoders to extract unimodal representations and then focus on multimodal fusion and cross-modal alignment [4, 50]. For example, gated mechanisms [28, 34, 38], cross-modal attentions [21, 31], or graph neural networks [23, 41] are employed to fuse representations from different modalities, yielding a comprehensive multimodal representation. Knowledge distillation [16, 17] or contrastive learning [24, 26] is utilized to reduce the gap between modalities and learn a unified multimodal representation. Moreover, based on the Bidirectional Encoder Representations from Transformers (BERT) model and multimodal inputs, All-modalities-in-One BERT (AOBERT) [13] is pre-trained on Multimodal Masked Language Modeling (MMLM) and Alignment Prediction (AP) tasks to learn multimodal representations for multimodal sentiment analysis. Despite achieving commendable results, existing methods often rely on Pre-Trained Language Models (PLMs) (e.g., BERT [6] or XLNet [43]) to extract semantic information from textual data, lacking an in-depth understanding of the logical relationships within the text modality. Language (or text), as the embodiment of human wisdom, contains rich and intricate information. It can reflect human sentiments not only explicitly through the semantics of certain keywords or phrases but also implicitly through inherent logical relationships. In short, utilizing the logical reasoning information of text for multimodal sentiment analysis has not yet been explored.
For a complex task, humans typically decompose it into intermediate steps and address them sequentially, rather than solving the task in a black-box way. For example, given the question "If you have $50 in your wallet and someone gives you $30 more, how much money do you have in total?", instead of simply answering "$80," a complete thought process would involve steps such as: "I start with $50. I receive an additional $30. By combining these amounts, I arrive at a total of $50 \(+\) $30 \(=\) $80. Therefore, the answer is $80." This stepwise reasoning process is called Chain-of-Thought (CoT) [39].
With the development of Large Language Models (LLMs) and prompt learning [1], Wei et al. [39] proposed the first CoT prompting method for several reasoning tasks, where some CoT exemplars and a question are input into an LLM to generate a step-by-step reasoning process and answer. Lu et al. [20] introduced a similar method for science question-answering tasks. Unlike the former approach, which outputs the reasoning process before arriving at the answer, the latter method presents the answer first and then gives step-by-step explanations. MM-CoT [49] is the first to study CoT reasoning in different modalities, which fine-tunes the T5 pre-trained model [27] with reasoning steps annotated in the dataset. These methods [20, 39, 49] necessitate CoT annotations of some or all samples, termed few-shot or full-shot approaches, respectively. Recent studies have introduced zero-shot CoT prompting. For example, Kojima et al. [14] proposed the first zero-shot CoT prompting, where "Let's think step by step" is used to trigger LLMs to generate step-by-step reasoning processes. Similarly, Wang et al. [35] proposed Plan-and-Solve (PS) CoT prompting, employing "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step." as the trigger phrase. These task-agnostic sentences induce LLMs to autonomously generate reasoning processes without CoT exemplars, achieving satisfactory results in tasks such as arithmetic reasoning and common-sense reasoning [3].
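To make these trigger-based prompts concrete, the following sketch assembles the two zero-shot prompts described above and queries an LLM. The trigger phrases are quoted from the cited works; the helper name ask_llm, the model string, and the use of the OpenAI chat-completions API (pre-1.0 Python SDK) are illustrative assumptions rather than details of the cited methods.

```python
# Illustrative sketch of zero-shot CoT prompting; assumes the openai<1.0 Python SDK.
# The helper name and model string are assumptions, not part of the cited methods.
import openai

STANDARD_COT_TRIGGER = "Let's think step by step."
PS_COT_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)

def ask_llm(question: str, trigger: str, model: str = "gpt-3.5-turbo") -> str:
    """Append a task-agnostic trigger phrase so the LLM emits step-by-step reasoning."""
    prompt = f"Q: {question}\nA: {trigger}"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic (argmax) decoding
    )
    return response["choices"][0]["message"]["content"]
```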
However, applying CoT reasoning to multimodal sentiment analysis presents challenges. On one hand, few-shot or full-shot CoT prompting methods [20, 39, 49] require some or all CoT exemplars for prompting or fine-tuning. For the multimodal sentiment analysis task, there are no readily available reasoning exemplars, and manually annotating CoT exemplars is time-consuming and costly, making it challenging to ensure consistency and completeness in multi-step reasoning. On the other hand, existing zero-shot CoT prompting methods [14, 35] are designed for unimodal text data and could lead to hallucinations or incomplete reasoning processes. Therefore, utilizing CoT reasoning for multimodal sentiment analysis remains an open problem.
To address this issue, we introduce MM-PEAR-CoT reasoning for multimodal sentiment analysis, where PEAR stands for Preliminaries, quEstion, Answer, and Reason. Specifically, we propose a zero-shot CoT prompt named PEAR, comprising four parts: preliminaries, question, answer, and reason. The PEAR prompt is fed into an LLM to generate text-based reasoning steps and zero-shot sentiment predictions. Moreover, to alleviate irrational reasoning caused by hallucinations of the LLM, we design a Cross-Modal Filtering and Fusion (CMFF) module. The filtering submodule utilizes information from the audio-visual modality to suppress unreasonable steps, while the fusion submodule integrates high-level reasoning information and cross-modal complementary information during semantic representation learning to obtain multimodal representations. The multimodal representation and the text-based zero-shot sentiment prediction are then fused to obtain the final sentiment prediction.
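As a concrete illustration of the four-part structure, the minimal sketch below assembles a PEAR-style prompt for a single utterance. Only the ordering of the four parts follows our design; the wording of each part and the example utterance are illustrative assumptions, not the exact template used in our experiments.

```python
# Minimal sketch of a PEAR-style prompt; the wording of each part is illustrative,
# only the four-part structure (preliminaries, question, answer, reason) is fixed.
def build_pear_prompt(utterance: str) -> str:
    preliminaries = (
        "Preliminaries: Sentiment intensity is a real value in [-3, 3], where -3 is "
        "strongly negative, 0 is neutral, and 3 is strongly positive."
    )
    question = f'Question: What is the sentiment intensity of the utterance "{utterance}"?'
    answer = "Answer: First give the sentiment intensity score."
    reason = "Reason: Then explain the score step by step."
    return "\n".join([preliminaries, question, answer, reason])

print(build_pear_prompt("And it really is a fun movie to watch."))  # hypothetical example
```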
In summary, our contributions are as follows:
— We introduce the PEAR-CoT prompt to explore the implicit logical relationships in the text modality. The stepwise reasoning process provides an in-depth understanding of the text modality.
— We propose the CMFF module to learn a more comprehensive multimodal representation. It first utilizes the audio and visual modalities to alleviate hallucination in text-based CoT reasoning and then integrates high-level reasoning information and cross-modal complementary information during text semantic representation learning.
— Extensive experiments on two multimodal sentiment analysis benchmark datasets demonstrate the effectiveness of multimodal PEAR-CoT reasoning. To the best of our knowledge, this is the first work to apply CoT reasoning to multimodal sentiment analysis.
The remaining sections of this article are organized as follows. Section 2 provides a brief overview of related work in multimodal sentiment analysis and CoT prompting. Section 3 introduces the multimodal PEAR-CoT reasoning for multimodal sentiment analysis. Section 4 presents detailed experiments and analyses. The conclusion is summarized in Section 5.
4 Experiments
We conduct extensive experiments to address the following Research Questions (RQs):
— RQ1: How does MM-PEAR-CoT compare with existing multimodal sentiment analysis approaches?
— RQ2: Are all modules within MM-PEAR-CoT necessary and effective?
— RQ3: What impact does the position of the CMFF module have on multimodal sentiment analysis?
— RQ4: How well does MM-PEAR-CoT generalize across different text semantic backbones?
— RQ5: How well does MM-PEAR-CoT generalize across different reasoning backbones?
— RQ6: How generalizable is PEAR-CoT in the visual modality?
— RQ7: How does PEAR-CoT compare with existing zero-shot CoT reasoning methods?
— RQ8: How does the CMFF module alleviate hallucinations in the CoT?
4.1 Datasets and Metrics
4.1.1 Datasets.
Experiments are conducted on two multimodal sentiment analysis datasets, CMU-MOSI [46] and CMU-MOSEI [47], both collected from YouTube. CMU-MOSI consists of 2,199 short monologue video clips, with 1,284 samples in the training set, 229 in the validation set, and 686 in the test set. CMU-MOSEI consists of 22,856 video clips, with 16,326 samples in the training set, 1,871 in the validation set, and 4,659 in the test set, according to the standard dataset split. The dataset is gender balanced, and all sentence utterances are randomly chosen from various topics and monologue videos. The videos are transcribed and properly punctuated. Each video clip in both datasets is annotated with a sentiment score ranging from \(-\)3 to 3, where \(-\)3 and 3 indicate strongly negative and strongly positive sentiment, respectively.
4.1.2 Metrics.
To evaluate the performance of our models, we adopt the following metrics based on previous works: binary accuracy (Acc\({}_{2}\)), binary F1 score (F1), 7-class accuracy (Acc\({}_{7}\)), Mean Absolute Error (MAE), and the correlation between model predictions and human annotations (Corr).
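For reference, the sketch below computes these metrics from continuous predictions under one common protocol from prior work (binary metrics on the sign of non-zero labels; 7-class accuracy on scores rounded and clipped to the integers in [\(-\)3, 3]); our experiments follow the evaluation scripts of the cited baselines.

```python
# Hedged sketch of the evaluation metrics; the binarization/rounding conventions
# shown here are one common protocol from prior work, not a new definition.
import numpy as np
from sklearn.metrics import f1_score

def evaluate(preds: np.ndarray, labels: np.ndarray) -> dict:
    # Binary metrics are usually reported on samples with non-zero labels.
    nz = labels != 0
    bin_pred, bin_true = preds[nz] > 0, labels[nz] > 0
    acc2 = float(np.mean(bin_pred == bin_true))
    f1 = float(f1_score(bin_true, bin_pred, average="weighted"))
    # 7-class accuracy: round and clip scores to the integers in [-3, 3].
    cls_pred = np.clip(np.round(preds), -3, 3)
    cls_true = np.clip(np.round(labels), -3, 3)
    acc7 = float(np.mean(cls_pred == cls_true))
    mae = float(np.mean(np.abs(preds - labels)))
    corr = float(np.corrcoef(preds, labels)[0, 1])
    return {"Acc2": acc2, "F1": f1, "Acc7": acc7, "MAE": mae, "Corr": corr}
```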
4.2 Implementation Details
OpenAI GPT-3.5-turbo-0613 is utilized to generate CoTs. For a fair comparison, the unimodal encoders used in the experiments are consistent with the official ones. Specifically, we use the BERT [6] pre-trained model as the default encoder for the text modality. For the audio modality, we adopt the officially released COVAREP features [5], which are related to emotions and tone of speech and include 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmentation features, glottal source parameters, peak slope parameters, and maxima dispersion quotients [47]. For the visual modality, we use the officially released Facet features, which contain a 47-dimensional (on CMU-MOSI) or 35-dimensional (on CMU-MOSEI) facial representation. All feature sequences are aligned in the time dimension using the official toolkit.
The parameter settings for the two datasets are as follows. The initial learning rate is set to 0.0001, the dropout rate is 0.5, the number of epochs is 50, and the warmup proportion is set to 0.1. The maximum length of multimodal feature sequences is 50, and the maximum number of sentences in the CoT is 12. The number of heads in the MHA is 4. For the self-consistency strategy, we generate seven distinct CoTs: one chain is generated with the temperature set to 0 (argmax sampling), while the remaining six are generated with the temperature set to 0.7. The model is trained using the AdamW optimizer [19]. The entire framework is implemented using PyTorch [25] and runs on an NVIDIA TITAN RTX GPU.
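As an illustration of the self-consistency schedule described above, the sketch below generates the seven chains with the stated temperatures; generate_cot is a hypothetical wrapper around the GPT-3.5-turbo API, not a function of any released library.

```python
# Sketch of the self-consistency sampling schedule described above.
# `generate_cot` is a hypothetical wrapper around the GPT-3.5-turbo API.
from typing import Callable, List

def sample_cots(prompt: str,
                generate_cot: Callable[[str, float], str],
                num_chains: int = 7) -> List[str]:
    chains = [generate_cot(prompt, 0.0)]           # one chain with argmax sampling
    chains += [generate_cot(prompt, 0.7)           # remaining chains at temperature 0.7
               for _ in range(num_chains - 1)]
    return chains
```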
4.3 Experimental Results (RQ1)
As shown in Table 3, we conduct a fair comparison with the current state-of-the-art methods under the same experimental settings, including Low-rank Multimodal Fusion (LMF) [18], Multimodal Factorization Model (MFM) [32], MulT [31], Interaction Canonical Correlation Network (ICCN) [29], Modality-Invariant and -Specific Analysis (MISA) [10], Multimodal Adaptation Gate (MAG) [28], Cross-Modal BERT (CM-BERT) [42], Self-MM [44], MultiModal InfoMax (MMIM) [9], Feature-Disentangled Multimodal Emotion Recognition (FDMER) [40], HyCon [24], AOBERT [13], Multimodal Information Modulation (MIM) [48], MTMD [17], and DMD [16].
It can be seen that early methods overlooked the gap between different modalities and proposed diverse networks to directly fuse the complementary information across modalities, resulting in limited performance improvements [18, 31, 32]. Some studies fine-tuned the representation of the textual modality using the complementary information from the audio and visual modalities, thereby learning a multimodal sentiment representation [13, 28, 42]. Recent approaches leveraged knowledge distillation or contrastive learning to align representations across different modalities and facilitate knowledge transfer between them [16, 17, 24, 48]. In comparison, the proposed MM-PEAR-CoT method achieved the best results across all evaluation metrics on both datasets, demonstrating the robustness and effectiveness of our approach. For instance, MM-PEAR-CoT achieved a 2.2% improvement in binary classification accuracy on the CMU-MOSI dataset and a 1.7% improvement on the CMU-MOSEI dataset. Furthermore, in the zero-shot setting, our proposed PEAR-CoT method also obtained satisfactory results, even surpassing some supervised methods on the CMU-MOSI dataset. To the best of our knowledge, this study is the first to conduct zero-shot sentiment analysis on the CMU-MOSI and CMU-MOSEI datasets. Our method's competitive performance in both supervised and zero-shot settings makes it a promising solution for sentiment analysis tasks.
Overall, our approach has several advantages. (1) Existing methods focus on exploring the relationships between different modalities. In contrast, the proposed method leverages CoTs and LLMs to achieve an in-depth understanding of the textual modality, and the generated CoT reasoning also enhances the interpretability of sentiment analysis. (2) The proposed method integrates high-level logical reasoning information and cross-modal complementary information into the process of learning textual semantic representations, making the learned multimodal sentiment representations more compact and effective. (3) Current state-of-the-art methods such as HyCon [24], MTMD [17], and DMD [16] introduce contrastive learning or cross-modal distillation as auxiliary tasks, which increases model complexity and may lead to optimization challenges. In contrast, our proposed method relies solely on the sentiment prediction loss, which is more concise.
4.4 Ablation on Different Modules (RQ2)
Table 4 presents the results of module ablation experiments conducted on the CMU-MOSI and CMU-MOSEI datasets. Taking the results from CMU-MOSI as an example, when fine-tuning the BERT pre-trained model using only textual data, a baseline binary classification accuracy of 83.5% was achieved. Employing the CoT reasoning process of LLMs to enhance the semantic representation of the textual modality (Equation (3)) increased the accuracy to 86.0%. This demonstrates that high-level logical reasoning information can effectively aid in learning textual semantic information. The accuracy further increased to 87.2% when the CoT reasoning steps were filtered using audio-visual modality data (Equation (2)). This result underscores the necessity and effectiveness of suppressing irrational steps in the CoT. Finally, by incorporating audio-visual modality information into the process of learning textual semantic information (Equation (4)), an accuracy of 88.3% was attained, marking a 4.8% improvement over the baseline setting. Similar phenomena were observed on the CMU-MOSEI dataset. In conclusion, the ablation study indicates that an in-depth understanding of textual information contributes to more effective sentiment analysis, and cross-modal complementary information can further suppress unreasonable steps in the text-based CoT.
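For readers who prefer code, the sketch below shows one way the two CMFF submodules referenced by Equations (2)–(4) could be wired in PyTorch: audio-visual features attend over the CoT step embeddings to produce per-step weights that suppress unreasonable steps, and the filtered reasoning plus audio-visual cues are then fused into the text representation. The layer choices, dimensions, and residual wiring are illustrative assumptions rather than the exact published formulation.

```python
# Hedged sketch of the Cross-Modal Filtering and Fusion (CMFF) idea; the layer choices,
# dimensions, and residual wiring are assumptions, not the exact Equations (2)-(4).
import torch
import torch.nn as nn

class CMFFSketch(nn.Module):
    def __init__(self, d_model: int = 768, d_av: int = 64, num_heads: int = 4):
        super().__init__()
        self.av_proj = nn.Linear(d_av, d_model)  # project audio-visual features
        # Filtering: audio-visual queries score each CoT step.
        self.filter_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Fusion: text tokens attend over filtered reasoning and audio-visual cues.
        self.fuse_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens, cot_steps, av_feats):
        """text_tokens: (B, Lt, D); cot_steps: (B, Ls, D); av_feats: (B, La, d_av)."""
        av = self.av_proj(av_feats)
        # Filtering: attention of audio-visual queries over CoT steps yields per-step
        # scores; steps that conflict with the non-verbal evidence receive low weight.
        _, step_weights = self.filter_attn(query=av, key=cot_steps, value=cot_steps)
        step_scores = step_weights.mean(dim=1)                   # (B, Ls)
        filtered_steps = cot_steps * step_scores.unsqueeze(-1)   # reweight each step
        # Fusion: inject filtered reasoning and audio-visual cues into the text stream.
        context = torch.cat([filtered_steps, av], dim=1)
        fused, _ = self.fuse_attn(query=text_tokens, key=context, value=context)
        return self.norm(text_tokens + fused), step_scores

# Example shapes (illustrative): 2 samples, 50 text tokens, 12 CoT steps, 50 AV frames.
module = CMFFSketch()
out, scores = module(torch.randn(2, 50, 768), torch.randn(2, 12, 768), torch.randn(2, 50, 64))
```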
4.5 Ablation on Different Positions of CMFF (RQ3)
Figure 3 illustrates the binary accuracy of the CMFF module at different positions on the CMU-MOSI dataset. The numbers 1, 4, 8, and 12 represent adding the CMFF module after the 1st, 4th, 8th, and 12th Transformer encoding layers, respectively, while 0 indicates adding the CMFF module before the 1st Transformer encoding layer. Moreover, the results of adding the CMFF module after multiple Transformer encoding layers are also evaluated.
It can be observed that, under the default setting, adding the CMFF module before the 1st Transformer layer achieved the highest classification accuracy. As the CMFF module is added to higher layers, a decreasing trend in accuracy is evident. This suggests that incorporating high-level reasoning information and cross-modal complementary information earlier is more beneficial for learning textual semantic representations. When the CMFF module is added after several Transformer encoding layers, the accuracy decreases to 84.5%. This may be because the same information is introduced at different stages of representation learning, which, on one hand, leads to redundancy and, on the other hand, disrupts the normal process of multimodal representation learning.
4.6 Ablation on Different Text Semantic Backbones (RQ4)
To assess the generalization of the proposed MM-PEAR-CoT method across different text semantic backbones, we conducted evaluations not only on the default BERT model but also on the XLNet [43] and SentiLare [12] models. We compared our approach with the following state-of-the-art methods: MISA [10], Self-MM [44], MAG [28], and CENet [34]. The results are presented in Table 5.
It can be observed that whether using the XLNet backbone network or the more sentiment-relevant SentiLare network, our proposed method consistently outperforms the others under the same experimental settings. Specifically, MM-PEAR-CoT achieves a 1.8% improvement in binary classification accuracy compared to the current best method when utilizing the XLNet backbone network and a 1.2% improvement when using the SentiLare backbone network. This indicates the generalization and effectiveness of our proposed method across different text semantic backbone networks. Furthermore, MM-PEAR-CoT benefits from superior text semantic backbone networks, exhibiting stable performance improvements on better networks. For instance, on the CMU-MOSI dataset, MM-PEAR-CoT achieves binary classification accuracies of 88.3%, 90.2%, and 92.1% when using BERT, XLNet, and SentiLare backbone networks, respectively.
4.7 Ablation on Different Reasoning Backbones (RQ5)
To assess the generalization of the proposed method across different reasoning backbones, experiments were conducted not only on the default GPT-3.5 but also on the open-source LLaMA-70B [30] and the powerful GPT-4. The results of the proposed PEAR-CoT and MM-PEAR-CoT on the CMU-MOSI dataset are presented in Table 6.
Specifically, LLaMA-70B demonstrated relatively limited reasoning capability, achieving a 75.3% binary classification accuracy. The default GPT-3.5 obtained better results, impressively achieving an 84.5% accuracy in the binary classification task. With approximately 175 billion parameters, GPT-3.5 significantly surpasses LLaMA-70B, which has 70 billion parameters. Moreover, GPT-4 showed a further 0.8% increase in binary classification accuracy and a 0.014 improvement in the Pearson correlation coefficient over GPT-3.5. Building on PEAR-CoT, the MM-PEAR-CoT framework, which incorporates the CMFF module, achieved performance enhancements across all three LLM backbones. This suggests that the proposed approach is applicable across different reasoning backbones and can benefit from stronger ones.
4.8 Generalization of PEAR-CoT on the Visual Modality (RQ6)
In the aforementioned experiments, the PEAR-CoT prompts were applied to the text modality. Herein, we further assess the effectiveness and generalizability of the PEAR-CoT prompts within the visual modality. To achieve this, it is first necessary to describe the relevant visual modality information in natural language.
We considered two different types of information: environmental context and facial-related information. For the environmental background, descriptions of the surroundings were obtained using a video captioning model pre-trained on the Vatex dataset [37]. However, since the Vatex dataset was not collected for affective computing tasks, the generated descriptions might have limited emotion-related content. Regarding facial information, we collected, for each video, statistics of the three most and three least frequently occurring Action Units (AUs), along with the most and least likely facial expression categories. Subsequently, definitions of each AU were integrated into the Preliminaries part of the PEAR-CoT prompts. Finally, the generated visual PEAR-CoT prompts were fed into GPT-3.5 to predict sentiment intensity. The specific prompt template and examples are shown in Figure 4. It was observed that GPT can understand the relationship between specific AUs and sentiment polarity through the definitions of the AUs; for example, AUs 01, 02, and 05 are associated with positive emotions, while AUs 04, 06, 09, and 15 are associated with negative emotions. By combining environmental information and facial expression information, specific sentiment intensity predictions were obtained.
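To make the prompt construction concrete, the sketch below verbalizes the facial statistics described above and places AU definitions in the Preliminaries part. The AU glosses are standard FACS names, while the template wording and the example inputs are illustrative assumptions; the actual prompt is the one shown in Figure 4.

```python
# Sketch of verbalizing facial statistics for a visual PEAR-CoT prompt. AU glosses are
# standard FACS names; the template wording is illustrative (the real prompt is in Figure 4).
AU_DEFINITIONS = {
    "AU01": "inner brow raiser",
    "AU02": "outer brow raiser",
    "AU04": "brow lowerer",
    "AU05": "upper lid raiser",
    "AU06": "cheek raiser",
    "AU09": "nose wrinkler",
    "AU15": "lip corner depressor",
}

def build_visual_pear_prompt(frequent_aus, rare_aus, likely_expr, unlikely_expr):
    preliminaries = "Preliminaries: " + "; ".join(
        f"{au} denotes the {AU_DEFINITIONS.get(au, 'facial action unit')}"
        for au in frequent_aus + rare_aus
    ) + "."
    question = (
        "Question: In this video, the most frequent action units are "
        f"{', '.join(frequent_aus)} and the least frequent are {', '.join(rare_aus)}; "
        f"the most likely facial expression is {likely_expr} and the least likely is "
        f"{unlikely_expr}. What is the sentiment intensity in [-3, 3]?"
    )
    answer = "Answer: First give the sentiment intensity score."
    reason = "Reason: Then explain the score step by step."
    return "\n".join([preliminaries, question, answer, reason])

# Hypothetical example:
print(build_visual_pear_prompt(["AU06", "AU01"], ["AU15"], "happiness", "disgust"))
```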
To quantify the performance of the PEAR-CoT method in the visual modality, we compared it with a common Transformer [33] network. Specifically, a Transformer encoder layer is adopted to model the temporal dynamics of the visual modality and predict sentiment scores. The related results are shown in Table 7. It is important to note that the PEAR-CoT prompts were utilized in a zero-shot setting without being trained or fine-tuned on the dataset, whereas the Transformer was trained in a supervised setting. We have the following observations. First, the performance of both methods was quite limited, possibly because the visual modality contains limited emotion-related information. In comparison, the PEAR-CoT prompt method achieved comparable results, especially in terms of classification accuracy and F1 score. However, when describing the visual context in natural language, PEAR-CoT only considered the simplest statistical information, overlooking fine-grained temporal dynamics. This resulted in relatively limited performance, with a noticeable gap in the MAE and correlation coefficient metrics compared to the Transformer. Overall, PEAR-CoT remains a method with significant potential in the visual modality.
4.9 Ablation on Different Zero-shot CoT Promptings (RQ7)
To validate the effectiveness of the proposed PEAR-CoT prompt, we compared it against two different zero-shot CoT prompting methods: the standard zero-shot CoT prompt [14] and the PS zero-shot CoT prompt [35]. Table 8 presents the results of the three methods in zero-shot sentiment analysis on the CMU-MOSI dataset. It can be observed that PS-CoT yielded the poorest results, followed by the standard CoT, while the proposed PEAR-CoT achieved the best results among the three methods.
Figures 5 and 6 depict a weakly negative sample and a neutral sample, respectively. On one hand, PS-CoT was proposed for mathematical calculation tasks and is not suitable for the more subjective sentiment analysis task: evaluating sentiment scores for each part and then summing them for the final sentiment prediction is unreasonable, as shown in Figure 5. On the other hand, standard CoT reasoning may make errors in the early stages that affect the subsequent reasoning process, as illustrated in Figure 6. In contrast, the PEAR-CoT prompts first offer a global analysis or prediction result, followed by step-by-step explanations. Our approach reduces the risk of accumulating errors from inaccurate intermediate reasoning steps, thus acquiring the best results across multiple evaluation metrics.
4.10 Case Study about Hallucination Suppression (RQ8)
To further analyze the role of audio-visual modality information in the CMFF module for CoT reasoning steps, we calculated the attention score allocated to each step within the CoT reasoning process in Equation (2). Specifically, we calculated and ranked the attention scores obtained by each step within the cross-modal filtering submodule.
Figure 7 illustrates the prediction details for a weakly positive sample and a weakly negative sample.
For the positive sample, the appearance of the negative word "problem" and the negation expression "didn't love" misled the LLM into believing that the intensity of negative sentiment was greater than that of positive sentiment. However, the corresponding visual modality showed a happy expression, and the speaker's tone was light and cheerful. This indicates a contradiction between the LLM's reasoning process and the facts conveyed by the multimodal data. According to the multimodal information, the speaker was expressing a liking toward an object, but not to the extent of love. In this scenario, the attention scores reveal that the CMFF module focused more on the parts of the reasoning process that were consistent with the audio-visual modality, specifically the third step. Conversely, it paid less attention to the parts where the reasoning process conflicted with the audio-visual modality, namely the first, fourth, and fifth steps. Finally, with the text-based zero-shot prediction being negative and the audio-visual modality showing positive signals, MM-PEAR-CoT's prediction was 1.1 (weakly positive), closer to the annotation of 1.25.
For the negative sample, the presence of the word "orgasm" misled the LLM into perceiving a positive sentiment during the third step of reasoning. However, the corresponding visual modality displayed expressions of displeasure, and the speaker's tone conveyed disgust. This indicates a contradiction between the LLM's reasoning process and the facts conveyed by the multimodal data. In fact, according to the multimodal information, the speaker expressed disgust toward the excessive use of Computer-Generated Imagery in the climax of a movie. In this scenario, the attention scores indicate that the CMFF module focused more on the parts of the reasoning process that were consistent with the audio-visual modality, specifically the first and second steps. Conversely, it paid less attention to the parts where the reasoning process conflicted with the audio-visual modality, namely the third and fourth steps. Finally, with the text modality being predicted as positive and the audio-visual modality showing negative signals, MM-PEAR-CoT's prediction was \(-\)0.8 (weakly negative), closer to the annotation of \(-\)1.0.
Overall, the attention scores within the cross-modal filtering module demonstrate that MM-PEAR-CoT is capable of prioritizing reasoning processes that are more consistent with other modalities in the presence of conflicts between textual reasoning steps and multimodal information, thereby obtaining more accurate multimodal sentiment prediction results.
Despite the proposed method’s capabilities, it still makes incorrect predictions in certain instances. For example, consider the “jUzDDGyPkXU\(\_\)21” sample from CMU-MOSI, which is annotated as \(-\)1.0 (weakly negative). The corresponding text states, “The only actor who can really sell their lines is Erin Eckart.” This sentence does not display any overt negative sentiments. Thus, the text-based reasoning process does not involve steps associated with negative sentiments, leading to a zero-shot prediction of 2.0 (positive). However, the negative sentiments are conveyed through audio and visual modalities. This discrepancy results in the proposed method failing to suppress hallucinations by focusing on reasonable steps, finally leading to an erroneous prediction of 1.1 (weakly positive). This example highlights the challenges the proposed method faces when dealing with complex samples.
4.11 Discussion
4.11.1 Fairness.
MM-PEAR-CoT aims to introduce implicit reasoning information into the process of multimodal representation learning. For the multimodal sentiment analysis task, there are no readily available reasoning exemplars, and manually annotating CoT exemplars is time-consuming and costly, making it challenging to ensure consistency and completeness in multi-step reasoning. Thus, we leverage LLMs with reasoning capabilities to generate these CoTs. Furthermore, to ensure a fair comparison, we employ backbone networks consistent with current multimodal sentiment analysis methods, such as BERT, XLNet, and SentiLare, for semantic learning in the text modality. Notably, we do not utilize the powerful hidden-layer embeddings of LLMs. Extensive experiments demonstrate that the proposed method achieves superior results, further validating the effectiveness of CoT reasoning in multimodal sentiment analysis.
4.11.2 Limitation.
While we have successfully addressed a portion of the hallucination issue stemming from unimodal input by introducing the audio-visual modality, it is crucial to recognize that this problem is not entirely resolved. Exploring methods to mitigate hallucinations is a potential area for further research and investigation. Since our approach involves zero-shot prompting of the LLM to generate rationales, there is a potential risk of inheriting social biases from the LLM. These biases, encompassing cultural, ethical, and various other dimensions, may manifest in the generated rationales, potentially causing adverse effects on users. To address this concern in the future, potential solutions could include implementing constraints at each prompting stage or employing more advanced LLMs trained on unbiased resources.