
Multimodal PEAR Chain-of-Thought Reasoning for Multimodal Sentiment Analysis

Published: 23 September 2024

Abstract

Multimodal sentiment analysis aims to predict sentiments from multimodal signals such as audio, video, and text. Existing methods often rely on Pre-trained Language Models (PLMs) to extract semantic information from textual data, lacking an in-depth understanding of the logical relationships within the text modality. This article introduces the Multimodal PEAR (Preliminaries, quEstion, Answer, Reason) Chain-of-Thought (MM-PEAR-CoT) reasoning for multimodal sentiment analysis. Inspired by the human thought process when solving complex problems, the PEAR-CoT prompt is first proposed to induce Large Language Models (LLMs) to generate text-based reasoning processes and zero-shot sentiment prediction results. However, text-based CoT reasoning is not always reliable and might contain irrational steps due to the hallucinations of LLMs. To address this, we further design the Cross-Modal Filtering and Fusion (CMFF) module. The filtering submodule utilizes audio and visual modalities to suppress irrational steps in the CoT, while the fusion submodule integrates high-level reasoning information and cross-modal complementary information in the process of semantic representation learning. Experimental results on two multimodal sentiment analysis benchmark datasets show that high-level reasoning information helps learn discriminative text representations, and cross-modal complementary information prevents the model from being misled by unreasonable steps in the CoT. MM-PEAR-CoT achieves the best results on both datasets, with improvements of 2.2% and 1.7% in binary classification accuracy on the CMU-MOSI and CMU-MOSEI datasets, respectively. To the best of our knowledge, this is the first study to apply CoT reasoning to multimodal sentiment analysis.

1 Introduction

In recent years, multimodal sentiment analysis has emerged as a prominent research direction. It aims to predict sentiments from multiple modalities, including audio, visual, and text signals. Compared to unimodal sentiment analysis, multimodal sentiment analysis can learn more comprehensive sentiment representations, leading to significant performance improvements. It has wide applications in fields such as human–computer interaction and autonomous driving.
Existing multimodal sentiment analysis methods utilize distinct encoders to extract unimodal representations and then focus on multimodal fusion and cross-modal alignment [4, 50]. For example, gated mechanisms [28, 34, 38], cross-modal attentions [21, 31], or graph neural networks [23, 41] are employed to fuse representations from different modalities, yielding a comprehensive multimodal representation. Knowledge distillation [16, 17] or contrastive learning [24, 26] is utilized to reduce the gap between modalities and learn a unified multimodal representation. Moreover, based on the Bidirectional Encoder Representations from Transformers (BERT) model and multimodal inputs, All-modalities-in-One BERT (AOBERT) [13] is pre-trained on Multimodal Masked Language Modeling (MMLM) and Alignment Prediction (AP) tasks to learn multimodal representations for multimodal sentiment analysis. Despite achieving commendable results, existing methods often rely on Pre-Trained Language Models (PLMs) (e.g., BERT [6] or XLNet [43]) to extract semantic information from textual data, lacking an in-depth understanding of the logical relationships within the text modality. Language (or text), as the embodiment of human wisdom, contains rich and intricate information. It can reflect human sentiments not only explicitly through the semantics of certain keywords or phrases but also implicitly through inherent logical relationships. In short, utilizing the logical reasoning information of text for multimodal sentiment analysis has not yet been explored.
For a complex task, humans typically decompose it into intermediate steps and address them sequentially, rather than solving the task in a black-box way. For example, consider the question, “If you have $50 in your wallet and someone gives you $30 more, how much money do you have in total?” Instead of simply answering “$80,” a complete thought process would involve steps such as: “I start with $50. I receive an additional $30. By combining these amounts, I arrive at a total of $50 \(+\) $30 \(=\) $80. Therefore, the answer is $80.” The stepwise reasoning process is called Chain-of-Thought (CoT) [39].
With the development of Large Language Models (LLMs) and prompt learning [1], Wei et al. [39] proposed the first CoT prompting method for several reasoning tasks, where some CoT exemplars and a question are input into an LLM to generate a step-by-step reasoning process and answer. Lu et al. [20] introduced a similar method for science question-answering tasks. Unlike the former approach, which outputs the reasoning process before arriving at the answer, the latter method presents the answer first and then gives step-by-step explanations. MM-CoT [49] is the first to study CoT reasoning in different modalities, which fine-tunes the T5 pre-trained model [27] with reasoning steps annotated in the dataset. These methods [20, 39, 49] necessitate CoT annotations of some or all samples, termed few-shot or full-shot approaches, respectively. Recent studies have introduced zero-shot CoT prompting. For example, Kojima et al. [14] proposed the first zero-shot CoT prompting, where “Let’s think step by step” is used to trigger LLMs to generate step-by-step reasoning processes. Similarly, Wang et al. [35] proposed a Plan-and-Solve (PS) CoT prompting, employing “Let’s first understand the problem and devise a plan to solve the problem. Then, let’s carry out the plan and solve the problem step by step.” as the trigger phrase. These task-agnostic sentences induce LLMs to autonomously generate reasoning processes without CoT exemplars, achieving satisfactory results in different tasks such as arithmetic reasoning and common-sense reasoning [3].
However, applying CoT reasoning to multimodal sentiment analysis presents challenges. On one hand, few-shot or full-shot CoT prompting methods [20, 39, 49] require some or all CoT exemplars for prompting or fine-tuning. For the multimodal sentiment analysis task, there are no readily available reasoning exemplars. Manually annotating CoT exemplars is time-consuming, costly, and challenging to ensure consistency and completeness in multi-step reasoning. On the other hand, existing zero-shot CoT prompting methods [14, 35] are designed for unimodal text data and could lead to hallucinations or incomplete reasoning processes. Therefore, utilizing the CoT reasoning for multimodal sentiment analysis remains an open problem.
To address this issue, we introduce Multimodal PEAR (Preliminaries, quEstion, Answer, and Reason) Chain-of-Thought (MM-PEAR-CoT) reasoning for multimodal sentiment analysis. Specifically, we propose a zero-shot CoT prompt named PEAR, comprising four parts: preliminaries, question, answer, and reason. The PEAR prompt is fed into an LLM to generate text-based reasoning steps and zero-shot sentiment predictions. Moreover, to alleviate irrational reasoning caused by the hallucinations of the LLM, we design a Cross-Modal Filtering and Fusion (CMFF) module. The filtering submodule utilizes information from the audio-visual modality to suppress unreasonable steps, while the fusion submodule integrates high-level reasoning information and cross-modal complementary information during semantic representation learning to obtain multimodal representations. The multimodal representation and the text-based zero-shot sentiment prediction are fused to obtain the final sentiment prediction.
In summary, our contributions are as follows:
We introduce the PEAR-CoT prompt to explore the implicit logical relationships in the text modality. The stepwise reasoning process provides an in-depth understanding of the text modality.
We propose the CMFF module to learn more comprehensive multimodal representations. It first utilizes the audio and visual modalities to alleviate hallucinations in text-based CoT reasoning and then integrates high-level reasoning information and cross-modal complementary information during text semantic representation learning.
Extensive experiments on two multimodal sentiment analysis benchmark datasets demonstrate the effectiveness of the multimodal PEAR-CoT reasoning. To the best of our knowledge, this is the first work to apply CoT reasoning to multimodal sentiment analysis.
The remaining sections of this article are organized as follows. Section 2 provides a brief overview of related work in multimodal sentiment analysis and CoT prompting. Section 3 introduces the multimodal PEAR-CoT reasoning for multimodal sentiment analysis. Section 4 presents detailed experiments and analyses. The conclusion is summarized in Section 5.

2 Related Work

2.1 Multimodal Sentiment Analysis

Currently, research in multimodal sentiment analysis has proposed various methods for multimodal fusion and cross-modal alignment [4, 50]. In terms of multimodal fusion methods, Zadeh et al. [45] introduced the Tensor Fusion Network to simultaneously model intra-modal and inter-modal dynamics. Tsai et al. [31] proposed the Multimodal Transformer (MulT) to capture complementary information between any two modalities within three modalities. Yang et al. [42] utilized masked multimodal attention on pre-trained BERT models to dynamically adjust the weight of words by integrating information from both text and audio modalities. Similarly, Rahman et al. [28] and Wang et al. [38] employed gating mechanisms to shift representations of the text modality by representations from the audio-visual modality. Hwang et al. [11] recalibrated representations of each modality using representations from other modalities to learn discriminative representations. Kim et al. [13] proposed MMLM and AP pre-training tasks, enabling pre-trained BERT models to learn effective multimodal representations. Moreover, some methods decouple representations of different modalities to learn modality-specific and modality-invariant representations before fusing them to obtain multimodal representations [10, 40]. These methods leverage complementary information between different modalities to achieve comprehensive multimodal representations.
Mai et al. [24] and Poklukar et al. [26] incorporated contrastive learning into cross-modal alignment, reducing the gap between modalities by aligning embeddings of different modalities. Li et al. [16] proposed Decoupled Multimodal Distillation (DMD), leveraging homogeneous and heterogeneous graph distillation on decoupled representations to effectively transfer knowledge across different modalities. Lin et al. [17] utilized feature-based and response-based knowledge distillation to learn robust multimodal sentiment representations. The aligned unimodal features are further fused in a unified representation space, enhancing the effectiveness of multimodal representations.
In conclusion, existing research in multimodal sentiment analysis heavily relies on PLMs for extracting text representations, neglecting the implicit context and logical relationships within the text modality. This limitation hampers the performance of multimodal sentiment analysis.

2.2 CoT Prompting

A CoT refers to a coherent flow of sentences that reveals the premises and conclusion of a reasoning problem, which clearly decomposes a multi-hop reasoning task into intermediate steps instead of solving the task in a black-box way [20]. The core idea of prompt learning is to guide or motivate the LLM to perform better on specific tasks by designing appropriate “prompts” without requiring extensive retraining or fine-tuning. With the increase of LLM parameters and the emergence of reasoning capabilities, the concept of “CoT prompting” has been introduced [39]. This approach merges CoT reasoning with prompt learning by feeding exemplars of CoT reasoning into LLMs as prompts, enabling the model’s output to include the desired reasoning steps. By allowing the model more time and space for computation, CoT prompting effectively improves the performance and interpretability of LLMs on complex problems. This study shows the enormous potential of CoT prompting on LLMs with about 100 billion parameters.
Existing methods for CoT prompting can be categorized into two types: full-shot/few-shot methods and zero-shot methods. In the case of full-shot/few-shot CoT prompting, manually annotated CoT exemplars are required. For instance, Wei et al. [39] fed a sequence of reasoning steps and the answer as prompts into LLMs. Similarly, Lu et al. [20] introduced both the answer and step-by-step explanations as prompts into LLMs. PaLM [2] and Generative Pre-trained Transformer (GPT)-3 [1] served as the backbone LLMs in these studies, with experimental results demonstrating the effectiveness of these few-shot approaches. As the first multimodal CoT reasoning method, MM-CoT [49] fine-tuned the T5 pre-trained model [27] with CoTs annotated in the dataset to integrate information from both visual and textual modalities, achieving better results in a full-shot experimental setup.
Regarding zero-shot CoT prompting, task-agnostic triggers are designed to induce LLMs to generate a step-by-step reasoning process. For example, Kojima et al. [14] first proposed the prompt, “Let’s think step by step.” Wang et al. [35] put forth the PS-CoT prompting, “Let’s first understand the problem and devise a plan to solve the problem. Then, let’s carry out the plan and solve the problem step by step.” PS-CoT explicitly guides LLMs to design a plan to decompose the entire task into smaller sub-tasks and execute each sub-task accordingly. Similarly, a three-stage reasoning framework has been proposed for implicit sentiment analysis, wherein the entire task is deconstructed into three steps to predict sentiment polarity [8]. These approaches, utilizing GPT-3 as the backbone network for the LLMs, have achieved the best results in multiple reasoning tasks.
For the task of multimodal sentiment analysis, existing CoT reasoning methods exhibit several limitations. Regarding the full-shot/few-shot approaches, there are no CoT annotations within the multimodal sentiment analysis datasets. Manually annotating CoT exemplars is time-consuming, costly, and challenging to ensure consistency and completeness in multi-step reasoning. Zero-shot approaches, meanwhile, consider only the textual modality; the audio and video modalities unique to multimodal sentiment analysis are ignored. Moreover, CoT reasoning based on the textual modality alone may lead to hallucinations or be incomplete. Research on mitigating hallucinations in LLMs is just beginning [3]. Most methods verify the correctness of the reasoning process and refine reasoning steps that may cause hallucinations [7, 15, 22]. Through verification and refinement, cascading errors and hallucinatory phenomena in the reasoning process are significantly reduced. However, these methods are designed for tasks involving only the textual modality. Therefore, for multimodal sentiment analysis tasks, it is necessary to propose CoT reasoning methods that are better suited to the characteristics of sentiment and multimodality.

3 Method

3.1 Overview

The main notations used in this article are summarized in Table 1.
Table 1.
Symbol | Explanation
\(X_{a},X_{v},X_{t}\) | The unimodal embeddings of audio, video, and text, respectively.
\(X_{av}\) | The bimodal embedding of audio and video.
\(X_{CoT}\) | The sentence-level embedding of the CoT.
\(X_{CoT}^{\prime}\) | The filtered embedding of the CoT.
\(X_{t}^{\prime}\) | The textual embedding enhanced by the CoT.
\(X_{avt}\) | The low-level trimodal embedding of audio, video, and text.
\(X_{avt}^{\prime}\) | The high-level trimodal embedding of audio, video, and text.
\(\hat{y}_{t}\) | The text-based zero-shot prediction result.
\(\hat{y}\) | The final prediction result.
Table 1. Summary of the Main Notations
The proposed multimodal PEAR-CoT reasoning for multimodal sentiment analysis is illustrated in Figure 1. To begin with, we construct PEAR-CoT prompts based on the text modality and input them into an LLM, such as GPT-3.5. Based on the self-consistency strategy, the most reasonable CoT is selected from multiple CoTs, obtaining text-based zero-shot sentiment predictions \(\hat{y}_{t}\) along with corresponding CoT embedding \(X_{CoT}\). Then, the audio and visual modalities are employed to suppress hallucinations within the CoT, resulting in filtered CoT embeddings \(X_{CoT}^{\prime}\). In the process of text semantic representation learning (e.g., in BERT), high-level reasoning information \(X_{CoT}^{\prime}\) and cross-modal complementary information \(X_{av}\) are introduced to learn multimodal sentiment representations \(X_{avt}\) and \(X_{avt}^{\prime}\). The final sentiment prediction results \(\hat{y}\) are obtained by integrating unimodal zero-shot predictions \(\hat{y}_{t}\) and multimodal sentiment representations \(X_{avt}^{\prime}\). The entire network is optimized through sentiment prediction loss.
Fig. 1.
Fig. 1. The proposed multimodal PEAR-CoT reasoning for multimodal sentiment analysis. For example, GPT-3.5 is adopted as the LLM for CoT reasoning, and BERT is adopted as the PLM for text semantic encoding. Best viewed zoomed in and in color. CLS, Classification; MLP, Multi-Layer Perceptron.

3.2 PEAR-CoT Prompt

Existing zero-shot CoT prompting methods [14, 20, 35] are proposed for tasks such as logical, mathematical, and common-sense reasoning tasks. These tasks feature a complete question-and-answer structure that can be directly input into LLMs for step-by-step reasoning analysis. However, multimodal sentiment analysis tasks do not possess a standard question-and-answer format, and the prediction of sentiment intensity is closely related to the annotation rules of the dataset. Therefore, it is necessary to design specialized CoT prompts for multimodal sentiment analysis tasks.

3.2.1 Prompt Construction.

The PEAR-CoT prompting includes four components: Preliminaries, quEstion, Answer, and Reason. The quEstion and Answer parts transform the task of predicting sentiment intensity into a question-and-answer structure in natural language, making it understandable for LLMs. The Preliminaries part encompasses foundational knowledge related to the task. It may include knowledge related to the dataset’s annotation to enable reasonable prediction within the LLM. It could also contain prior knowledge from the field of affective computing, improving the model’s understanding of sentiment context. The purpose of the Reason part is to trigger the reasoning ability of the LLM, leading it to output the reasoning or explanation steps progressively. In summary, the complete form of the PEAR-CoT prompting is as follows:
PEAR prompt
Preliminaries: Sentiment intensity scores range from -3.0 to 3.0: -3.0 = highly negative, -2.0 = negative, -1.0 = weakly negative, 0.0 = neutral, 1.0 = weakly positive, 2.0 = positive, 3.0 = highly positive.
Question: Given the sentence “***”, what is the most likely sentiment intensity score (ranging from -3.0 to 3.0)? Please provide reasons.
Answer:
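As an illustration, the prompt above could be assembled and sent to an LLM roughly as follows. The `openai` client usage, the model name, and the sampling settings are assumptions made for this sketch, not the authors' released code.

```python
# Sketch: building the PEAR prompt and querying an LLM (assumed OpenAI client usage).
from openai import OpenAI  # assumes the openai>=1.0 Python package

PRELIMINARIES = (
    "Sentiment intensity scores range from -3.0 to 3.0: -3.0 = highly negative, "
    "-2.0 = negative, -1.0 = weakly negative, 0.0 = neutral, 1.0 = weakly positive, "
    "2.0 = positive, 3.0 = highly positive."
)

def build_pear_prompt(sentence: str) -> str:
    """Assemble the Preliminaries / quEstion / Answer parts of the PEAR prompt."""
    question = (
        f'Given the sentence "{sentence}", what is the most likely sentiment '
        "intensity score (ranging from -3.0 to 3.0)? Please provide reasons."
    )
    return f"Preliminaries: {PRELIMINARIES}\nQuestion: {question}\nAnswer:"

def query_llm(prompt: str, temperature: float = 0.0) -> str:
    """Send the prompt to a chat model and return the generated CoT (illustrative call)."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # hypothetical stand-in for the GPT-3.5 backbone used in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content
```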

3.2.2 Analyzing the Output of LLM.

By analyzing the output of LLMs, we aim to obtain two components: zero-shot sentiment prediction results \(\hat{y}_{t}\) and sentence-level CoT embeddings \(X_{CoT}\).
In the standard CoT prompting approach [14], an additional prompt is required to extract sentiment prediction results. For instance, the prompt can be constructed as follows: “Therefore, the answer (ranging from \(-\)3.0 to 3.0) is….” The initial output of the LLM is then concatenated with this prompt and input into the LLM again to obtain the numerical answer. However, such a computational process is expensive. In contrast, we employ a heuristic algorithm to extract numerical values as zero-shot prediction results directly from the output of the LLM. Specifically, we extract numbers from either the first or the last sentence of the output. If multiple numbers are present, we choose the first one as the result. For example, if the sentiment score is “1.0 or 1.5,” we consider 1.0 as the predicted sentiment intensity. If no numbers are found in the output, we consider the sentiment intensity as 0.0, representing a neutral sentiment.
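A minimal sketch of this heuristic is given below, assuming simple punctuation-based sentence splitting and a regular expression for numbers; the clamping to \([-3.0, 3.0]\) is an added safeguard, not a detail stated in the text.

```python
import re

def extract_zero_shot_score(llm_output: str) -> float:
    """Heuristically extract a sentiment intensity from the LLM output.

    Looks for numbers in the first sentence, then the last sentence; takes the
    first number found, and falls back to 0.0 (neutral) if none is present.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", llm_output.strip()) if s.strip()]
    if not sentences:
        return 0.0
    number_pattern = re.compile(r"-?\d+(?:\.\d+)?")
    for candidate in (sentences[0], sentences[-1]):
        numbers = number_pattern.findall(candidate)
        if numbers:
            # If multiple numbers appear (e.g., "1.0 or 1.5"), keep the first one.
            return max(-3.0, min(3.0, float(numbers[0])))  # clamp is an added safeguard
    return 0.0  # no number found: treat as neutral
```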
As for the sentence-level CoT embeddings \(X_{CoT}\), we utilize a PLM, such as BERT, to directly extract sentence-level features from the LLM’s output.
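For concreteness, one way to obtain per-step sentence embeddings with Hugging Face Transformers is sketched below; the [CLS] pooling choice is an assumption, as the text does not specify the pooling strategy.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def encode_cot_sentences(cot_sentences: list) -> torch.Tensor:
    """Encode each CoT step into a sentence-level vector (here: the [CLS] embedding)."""
    batch = tokenizer(cot_sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state  # (num_steps, seq_len, 768)
    return hidden[:, 0]                          # (num_steps, 768): one [CLS] vector per step
```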

3.2.3 Self-Consistency.

The self-consistency strategy [36] is a technique designed to enhance the effectiveness of CoT reasoning. The core idea behind this method is that by independently reasoning multiple times and aggregating the results, the accuracy and reliability can be significantly improved. The self-consistency approach typically follows two steps:
Generating multiple CoTs: For a given question, the LLM first generates multiple possible CoTs (reasoning processes). Each chain attempts to explain how the answer is arrived at in a different manner.
Evaluation and aggregation: These CoTs are evaluated to find consistency among them. This involves comparing the answers derived from different chains and selecting the most frequently occurring answer as the final result.
The self-consistency approach can reduce errors or biases that might occur in a single CoT by integrating multiple perspectives, thereby enhancing the robustness and accuracy of CoT reasoning.
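A minimal sketch of this aggregation is shown below, assuming majority voting over the scores extracted by the heuristic in Section 3.2.2; how the representative ("most reasonable") chain is chosen among agreeing chains is an assumption.

```python
from collections import Counter

def self_consistent_prediction(cot_outputs: list) -> tuple:
    """Pick the most frequent extracted score and one supporting chain.

    `extract_zero_shot_score` is the heuristic sketched in Section 3.2.2.
    """
    scores = [extract_zero_shot_score(out) for out in cot_outputs]
    final_score, _ = Counter(scores).most_common(1)[0]
    # Keep one chain that agrees with the majority answer as the representative CoT.
    supporting_cot = next(out for out, s in zip(cot_outputs, scores) if s == final_score)
    return final_score, supporting_cot
```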

3.3 CMFF Module

The outputs of LLMs may contain hallucinations, and the CoTs based on the text modality might be incomplete. Therefore, as illustrated in Figure 2, we propose the CMFF module. This module leverages information from audio and visual modalities to suppress irrational steps in the CoT and integrate complementary information between modalities, learning discriminative multimodal sentiment representations.
Fig. 2.
Fig. 2. The proposed CMFF module (left) and the Multi-Head Attention (MHA) mechanism (right).
Specifically, we concatenate the embeddings of the audio and video modalities and map them to dimensions consistent with the text modality
\begin{align}X_{av}=\text{Activation}(\text{Linear}(X_{a}||X_{v}))\in\mathbb{R}^{T\times C},\end{align}
(1)
where \(||\) represents the concatenation operator, \(\text{Linear}(\cdot)\) is implemented through a fully connected layer, and \(\text{Activation}(\cdot)\) is the non-linear activation function. \(T\) is the length of the representation sequence and \(C\) is the dimension of the representation.
Then, we employ a cross-modal Multi-Head Attention (MHA) to filter the CoT embedding and suppress unreasonable steps in the reasoning process by MHA\({}_{1}\)
\begin{align}X_{CoT}^{\prime}=\text{MHA}_{1}(X_{av},X_{CoT},X_{CoT})\in\mathbb{R}^{T\times C}.\end{align}
(2)
As shown in Figure 2, in the MHA, a normalized attention matrix is learned based on the audio-visual embedding \(X_{av}\) and CoT embedding \(X_{CoT}\). It gives more attention to the reasonable steps within the CoT and less attention to the unreasonable ones.
To better utilize the semantic information and implicit logical relationships of the text modality, we incorporate the CoT embedding into the process of semantic representation learning. This operation occurs within the hidden layers of the PLM (e.g., BERT) rather than after the output layer. In this way, with the assistance of high-level logical reasoning information, semantic representation learning becomes more effective. Specifically, we use the filtered CoT embedding \(X_{CoT}^{\prime}\) to enhance the semantic representation of the text by MHA\({}_{2}\)
\begin{align}X_{t}^{\prime}=LN(X_{t}+\text{MHA}_{2}(X_{t},X_{CoT}^{\prime},X_{CoT}^{\prime})) \in\mathbb{R}^{T\times C}.\end{align}
(3)
In the absence of multimodal information, the learned representation distribution is solely determined by the text modality. Non-linguistic behaviors influence the meaning of the text, thereby impacting the representation distribution of the text modality. In essence, both linguistic and non-linguistic cues jointly determine the representation distribution. Therefore, we integrate complementary information between modalities to obtain a multimodal representation by MHA\({}_{3}\)
\begin{align}X_{avt}=LN(X_{t}^{\prime}+\text{MHA}_{3}(X_{t}^{\prime},X_{av},X_{av}))\in \mathbb{R}^{T\times C},\end{align}
(4)
where \(LN(\cdot)\) represents layer normalization to reduce the difference in input distribution of different layers and improve the robustness of the model.
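Equations (1)–(4) map onto standard PyTorch components; the sketch below is a simplified rendering in which the activation function, head count, and batch handling are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CMFF(nn.Module):
    """Cross-Modal Filtering and Fusion: Equations (1)-(4) in a simplified form."""

    def __init__(self, dim_a: int, dim_v: int, dim_t: int, num_heads: int = 4):
        super().__init__()
        self.av_proj = nn.Sequential(nn.Linear(dim_a + dim_v, dim_t), nn.ReLU())  # Eq. (1)
        self.mha1 = nn.MultiheadAttention(dim_t, num_heads, batch_first=True)     # CoT filtering
        self.mha2 = nn.MultiheadAttention(dim_t, num_heads, batch_first=True)     # CoT -> text
        self.mha3 = nn.MultiheadAttention(dim_t, num_heads, batch_first=True)     # audio-visual -> text
        self.ln1 = nn.LayerNorm(dim_t)
        self.ln2 = nn.LayerNorm(dim_t)

    def forward(self, x_a, x_v, x_t, x_cot):
        # x_a: (B, T, dim_a), x_v: (B, T, dim_v), x_t: (B, T, dim_t), x_cot: (B, T_cot, dim_t)
        x_av = self.av_proj(torch.cat([x_a, x_v], dim=-1))               # Eq. (1): bimodal embedding
        x_cot_f, _ = self.mha1(x_av, x_cot, x_cot)                        # Eq. (2): filtered CoT
        x_t_prime = self.ln1(x_t + self.mha2(x_t, x_cot_f, x_cot_f)[0])   # Eq. (3): CoT-enhanced text
        x_avt = self.ln2(x_t_prime + self.mha3(x_t_prime, x_av, x_av)[0]) # Eq. (4): trimodal embedding
        return x_avt
```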
Table 2 shows the detailed input and output of each layer in the CMFF module.
Table 2.
Layer | Input (Query) | Input (Key and Value) | Output
First MHA | \(X_{av}\in\mathbb{R}^{T\times C}\) | \(X_{CoT}\in\mathbb{R}^{T_{CoT}\times C}\) | \(X_{CoT}^{\prime}\in\mathbb{R}^{T\times C}\)
Second MHA | \(X_{t}\in\mathbb{R}^{T\times C}\) | \(X_{CoT}^{\prime}\in\mathbb{R}^{T\times C}\) | \(X_{t}^{\prime}\in\mathbb{R}^{T\times C}\)
Third MHA | \(X_{t}^{\prime}\in\mathbb{R}^{T\times C}\) | \(X_{av}\in\mathbb{R}^{T\times C}\) | \(X_{avt}\in\mathbb{R}^{T\times C}\)
Table 2. Details of the CMFF Module

3.4 Optimization Objective

As shown in Figure 1, the low-level multimodal representation \(X_{avt}\) is fed into the remaining encoding layers to model semantic information, and the embedding at the Classification (CLS) token is adopted as the high-level multimodal sentiment representation \(X_{avt}^{\prime}\). The final result is obtained by fusing the unimodal zero-shot prediction result with the multimodal sentiment representation
\begin{align}\hat{y}=\text{MLP}(\text{Concat}(X_{avt}^{\prime},\hat{y}_{t})).\end{align}
(5)
Here we use a two-layer perceptron as the predictor, and the number of nodes in the second layer is 1.
We employ the L1 loss function to optimize the entire network
\begin{align}\mathcal{L}(y,\hat{y})=\frac{1}{n}\sum_{i=1}^{n}|y_{i}-\hat{y_{i}}|.\end{align}
(6)
Here, \(n\) represents the number of samples, \(y_{i}\) denotes the ground truth, and \(\hat{y_{i}}\) corresponds to the prediction.
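Under the same assumptions, Equations (5) and (6) reduce to a two-layer MLP over the concatenated features and PyTorch's built-in L1 criterion; the hidden width below is an assumption.

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Two-layer MLP over [X'_avt ; y_t] producing a single sentiment score (Eq. (5))."""

    def __init__(self, dim_t: int, hidden: int = 128):  # hidden size is an assumption
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim_t + 1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_avt_cls: torch.Tensor, y_t: torch.Tensor) -> torch.Tensor:
        # x_avt_cls: (B, dim_t) CLS embedding; y_t: (B, 1) text-based zero-shot prediction
        return self.mlp(torch.cat([x_avt_cls, y_t], dim=-1)).squeeze(-1)

criterion = nn.L1Loss()  # Eq. (6): mean absolute error between prediction and label
```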

4 Experiments

We conduct extensive experiments to address the following Research Questions (RQs):
RQ1: How does MM-PEAR-CoT compare with existing multimodal sentiment analysis approaches?
RQ2: Are all modules within MM-PEAR-CoT necessary and effective?
RQ3: What impact does the position of the CMFF module have on multimodal sentiment analysis?
RQ4: How well does MM-PEAR-CoT generalize across different text semantic backbones?
RQ5: How well does MM-PEAR-CoT generalize across different reasoning backbones?
RQ6: How generalizable is PEAR-CoT in the visual modality?
RQ7: How does PEAR-CoT compare with existing zero-shot CoT reasoning methods?
RQ8: How does the CMFF module alleviate hallucinations in the CoT?

4.1 Datasets and Metrics

4.1.1 Datasets.

Experiments are conducted on two multimodal sentiment analysis datasets, CMU-MOSI [46] and CMU-MOSEI [47], which were collected from YouTube. CMU-MOSI consists of 2,199 short monologue video clips, with 1,284 samples in the training set, 229 samples in the validation set, and 686 samples in the test set, respectively. CMU-MOSEI consists of 22,856 video clips, with 16,326 samples in the training set, 1,871 samples in the validation set, and 4,659 samples in the test set, according to the standard dataset split. The dataset is gender balanced. All the sentence utterances are randomly chosen from various topics and monologue videos. The videos are transcribed and properly punctuated. Each video clip in both datasets is annotated with a sentiment score ranging from \(-\)3 to 3, where \(-\)3 and 3 indicate strongly negative and strongly positive sentiment, respectively.

4.1.2 Metrics.

To evaluate the performance of our models, we adopt the following metrics based on previous works: binary accuracy (Acc\({}_{2}\)), binary F1 score (F1), 7-class accuracy (Acc\({}_{7}\)), Mean Absolute Error (MAE), and the correlation between model predictions and human annotations (Corr).
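For reference, these metrics can be computed as sketched below; the binary variant here drops neutral samples (label \(=\) 0) and compares signs, which is one common convention rather than a detail stated in the text.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def mosi_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    """Standard CMU-MOSI/MOSEI regression metrics (binary convention is an assumption)."""
    mae = float(np.mean(np.abs(preds - labels)))
    corr = float(pearsonr(preds, labels)[0])
    # 7-class accuracy: round continuous scores into the integer classes -3..3.
    acc7 = float(np.mean(np.round(np.clip(preds, -3, 3)) == np.round(np.clip(labels, -3, 3))))
    mask = labels != 0                      # drop neutral labels for the binary metrics
    bin_pred, bin_true = preds[mask] > 0, labels[mask] > 0
    return {
        "Acc7": acc7,
        "Acc2": accuracy_score(bin_true, bin_pred),
        "F1": f1_score(bin_true, bin_pred, average="weighted"),
        "MAE": mae,
        "Corr": corr,
    }
```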

4.2 Implementation Details

OpenAI GPT-3.5-turbo-0613 is utilized to generate CoTs.1 For a fair comparison, the unimodal encoder used in the experiments is consistent with the official one. Specifically, we use the BERT [6] pre-trained model as the default encoder for the text modality. For the audio modality, we adopt the officially released COVAREP features [5]. The COVAREP features are related to emotions and tone of speech, including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmentation features, glottal source parameters, peak slope parameters, and maxima dispersion quotients [47]. For the visual modality, we use the officially released Facet features, which contain a 47-dimensional (on CMU-MOSI) or 35-dimensional (on CMU-MOSEI) facial representation. All feature sequences are aligned in the time dimension using the official toolkit.2
The parameter settings for the two datasets are as follows. The initial learning rate is set to 0.0001. The dropout rate is 0.5. The number of epochs is 50. The warmup proportion is set to 0.1. The maximum length for multimodal feature sequences is 50. The maximum number of sentences in the CoT is 12. The number of heads in the MHA is 4. For the self-consistency strategy, we generate seven distinct CoTs. One of these chains is generated with the temperature set to 0 (argmax sampling), while the remaining six are generated with the temperature set to 0.7. The model is trained using the AdamW optimizer [19]. The entire framework is implemented using PyTorch [25] and runs on the NVIDIA TITAN RTX GPU.
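The hyperparameters above translate into a configuration along the following lines; the warmup scheduler shape is an assumption, since the text only specifies the warmup proportion.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

config = {
    "lr": 1e-4, "dropout": 0.5, "epochs": 50, "warmup_proportion": 0.1,
    "max_seq_len": 50, "max_cot_sentences": 12, "num_heads": 4,
    "self_consistency_chains": 7,  # 1 chain at temperature 0.0 + 6 chains at 0.7
}

def build_optimizer(model: torch.nn.Module, steps_per_epoch: int):
    """AdamW with linear warmup over the first 10% of steps (scheduler shape is assumed)."""
    total_steps = config["epochs"] * steps_per_epoch
    warmup_steps = int(config["warmup_proportion"] * total_steps)
    optimizer = AdamW(model.parameters(), lr=config["lr"])
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return optimizer, LambdaLR(optimizer, lr_lambda)
```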

4.3 Experimental Results (RQ1)

As shown in Table 3, we conduct a fair comparison with the current state-of-the-art methods under the same experimental settings, including Low-rank Multimodal Fusion (LMF) [18], Multimodal Factorization Model (MFM) [32], MulT [31], Interaction Canonical Correlation Network (ICCN) [29], Modality-Invariant and -Specific Analysis (MISA) [10], Multimodal Adaptation Gate (MAG) [28], Cross-Modal BERT (CM-BERT) [42], Self-MM [44], MultiModal InfoMax (MMIM) [9], Feature-Disentangled Multimodal Emotion Recognition (FDMER) [40], HyCon [24], AOBERT [13], Multimodal Information Modulation (MIM) [48], Multi-Task Momentum Distillation (MTMD) [17], and DMD [16].
Table 3.
Methods | CMU-MOSI: Acc\({}_{7}\uparrow\) / Acc\({}_{2}\uparrow\) / F1\(\uparrow\) / MAE\(\downarrow\) / Corr\(\uparrow\) | CMU-MOSEI: Acc\({}_{7}\uparrow\) / Acc\({}_{2}\uparrow\) / F1\(\uparrow\) / MAE\(\downarrow\) / Corr\(\uparrow\)
MFM (2018) | 35.4 / 81.7 / 81.6 / 0.877 / 0.706 | 51.4 / 84.4 / 84.4 / 0.568 / 0.717
LMF (2018) | 33.2 / 82.5 / 82.5 / 0.917 / 0.695 | 48.0 / 82.0 / 82.2 / 0.623 / 0.677
MulT (2019) | 40.0 / 83.0 / 82.8 / 0.871 / 0.698 | 51.8 / 82.5 / 82.3 / 0.580 / 0.703
ICCN (2020) | 39.0 / 83.0 / 83.0 / 0.862 / 0.714 | 51.6 / 84.2 / 84.2 / 0.565 / 0.713
MISA (2020) | 42.3 / 83.4 / 83.6 / 0.783 / 0.761 | 52.2 / 85.5 / 85.3 / 0.555 / 0.756
MAG (2020) | 46.7 / 84.3 / 84.6 / 0.727 / 0.781 | 52.7 / 84.8 / 84.7 / 0.543 / 0.755
CM-BERT (2020) | 44.9 / 84.5 / 84.5 / 0.729 / 0.791 | 51.9\({}^{*}\) / 84.7\({}^{*}\) / 84.7\({}^{*}\) / 0.573\({}^{*}\) / 0.728\({}^{*}\)
Self-MM (2021) | 45.8 / 84.8 / 84.9 / 0.712 / 0.795 | 53.5 / 85.0 / 84.9 / 0.529 / 0.767
MMIM (2021) | 46.7 / 86.1 / 86.0 / 0.700 / 0.800 | 54.2 / 86.0 / 85.9 / 0.526 / 0.772
FDMER (2022) | 44.1 / 84.6 / 84.7 / 0.724 / 0.788 | 54.1 / 86.1 / 85.8 / 0.536 / 0.773
HyCon (2022) | 46.6 / 85.2 / 85.1 / 0.713 / 0.790 | 52.8 / 85.4 / 85.6 / 0.601 / 0.776
AOBERT (2023) | 40.2 / 85.6 / 86.4 / 0.856 / 0.700 | 54.5 / 86.2 / 85.9 / 0.515 / 0.765
MIM (2023) | 47.0 / 85.9 / 85.9 / 0.701 / 0.805 | 52.5 / 86.4 / 86.3 / 0.579 / 0.792
MTMD (2023) | 47.5 / 86.0 / 86.0 / 0.705 / 0.799 | 53.7 / 86.1 / 85.9 / 0.531 / 0.767
DMD (2023) | 45.6 / 86.0 / 86.0 / 0.710\({}^{*}\) / 0.792\({}^{*}\) | 54.5 / 86.6 / 86.6 / 0.537\({}^{*}\) / 0.771\({}^{*}\)
PEAR-CoT (Zero-shot) | 43.1 / 84.5 / 84.6 / 0.726 / 0.800 | 43.2 / 71.0 / 71.3 / 0.755 / 0.661
MM-PEAR-CoT (Supervised) | 48.1 / 88.3 / 88.2 / 0.647 / 0.842 | 56.0 / 88.3 / 88.4 / 0.486 / 0.798
\(\Delta\) | \(+\)0.6 / \(+\)2.2 / \(+\)1.8 / \(+\)0.053 / \(+\)0.037 | \(+\)1.5 / \(+\)1.7 / \(+\)1.8 / \(+\)0.029 / \(+\)0.006
Table 3. Comparison on the CMU-MOSI and CMU-MOSEI Datasets
\(\uparrow\) means the higher the better, while \(\downarrow\) means the lower the better. \({}^{*}\) means reproduced results. Bold means the best result, while italic means the second best result.
It can be seen that early methods overlooked the gap between different modalities and proposed diverse networks to directly fuse the complementary information across modalities, resulting in limited performance improvements [18, 31, 32]. Some studies fine-tuned the representation of the textual modality using the complementary information from audio and visual modalities, thereby learning a multi-modal sentiment representation [13, 28, 42]. Recent approaches leveraged knowledge distillation or contrastive learning to align representations across different modalities and facilitate knowledge transfer between them [16, 17, 24, 48]. In comparison, the proposed MM-PEAR-CoT method achieved the best results across all evaluation metrics on two datasets, demonstrating the robustness and effectiveness of our approach. For instance, MM-PEAR-CoT achieved a 2.2% improvement in binary classification accuracy on the CMU-MOSI dataset and a 1.7% improvement on the CMU-MOSEI dataset. Furthermore, in the zero-shot setting, our proposed PEAR-CoT method also obtained satisfactory results, even surpassing some methods in supervised settings on the CMU-MOSI dataset. To the best of our knowledge, this study is the first to conduct zero-shot sentiment analysis on the CMU-MOSI and CMU-MOSEI datasets. Our method’s competitive performance in both supervised and zero-shot settings makes it a promising solution for sentiment analysis tasks.
Overall, our approach has several advantages. (1) Existing methods focus on exploring the relationships between different modalities. In contrast, the proposed method leverages CoTs and LLMs to achieve an in-depth understanding of the textual modality. The generated CoT reasoning also enhances the interpretability of sentiment analysis. (2) The proposed method integrates high-level logical reasoning information and cross-modal complementary information into the process of learning textual semantic representations. This makes the learned multimodal sentiment representations more compact and effective. (3) Current state-of-the-art methods such as HyCon [24], MTMD [17], and DMD [16] introduce contrastive learning or cross-modal distillation as auxiliary tasks. This increases the complexity of the model and may lead to challenges in model optimization. In contrast, our proposed method relies solely on sentiment prediction loss, which is more concise.

4.4 Ablation on Different Modules (RQ2)

Table 4 presents the results of module ablation experiments conducted on the CMU-MOSI and CMU-MOSEI datasets. Taking the results from CMU-MOSI as an example, when fine-tuning BERT pre-trained models using only textual data, a baseline binary classification accuracy of 83.5% was achieved. Then, employing the CoT reasoning process of LLMs to enhance the semantic representation of textual modality (Equation (3)) increased the accuracy to 86.0%. This demonstrates that high-level logical reasoning information can effectively aid in learning textual semantic information. The accuracy further increased to 87.2% when the CoT reasoning steps were filtered using audio-visual modality data (Equation (2)). This result underscores the necessity and effectiveness of suppressing irrational steps in the CoT. Finally, by incorporating audio-visual modality information into the process of learning textual semantic information (Equation (4)), an accuracy of 88.3% was attained, marking a 4.8% improvement over the baseline setting. Similar phenomena were observed on the CMU-MOSEI dataset. In conclusion, the ablation study indicates that an in-depth understanding of textual information contributes to more effective sentiment analysis, and cross-modal complementary information can further suppress unreasonable steps in the CoT based on the textual modality.
Table 4.
Dataset | Finetune | CoT Reasoning | CoT Filtering | Cross-Modal Fusion | Acc\({}_{7}\) | Acc\({}_{2}\) | F1 | MAE | Corr
MOSI | \(\checkmark\) | | | | 42.8 | 83.5 | 83.6 | 0.752 | 0.775
MOSI | \(\checkmark\) | \(\checkmark\) | | | 45.0 | 86.0 | 85.9 | 0.771 | 0.793
MOSI | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | | 46.2 | 87.2 | 87.1 | 0.702 | 0.806
MOSI | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | 48.1 | 88.3 | 88.2 | 0.647 | 0.842
MOSEI | \(\checkmark\) | | | | 47.9 | 83.6 | 83.8 | 0.646 | 0.697
MOSEI | \(\checkmark\) | \(\checkmark\) | | | 53.2 | 85.9 | 86.1 | 0.542 | 0.755
MOSEI | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | | 55.1 | 87.3 | 87.2 | 0.502 | 0.789
MOSEI | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | 56.0 | 88.3 | 88.4 | 0.486 | 0.798
Table 4. Module Ablation on the CMU-MOSI and CMU-MOSEI Datasets
Bold means the best result.

4.5 Ablation on Different Positions of CMFF (RQ3)

Figure 3 illustrates the binary accuracy of the CMFF module at different positions on the CMU-MOSI dataset. The numbers 1, 4, 8, and 12 represent adding the CMFF module after the 1st, 4th, 8th, and 12th Transformer encoding layers, respectively, while 0 indicates adding the CMFF module before the 1st Transformer encoding layer. Moreover, the results of adding the CMFF module after multiple Transformer encoding layers are also evaluated.
Fig. 3.
Fig. 3. Binary accuracy of different positions on the CMU-MOSI dataset.
It can be observed that, under the default setting, adding the CMFF module before the 1st Transformer layer achieves the highest classification accuracy. As the CMFF module is moved to higher layers, a decreasing trend in accuracy is evident. This suggests that incorporating high-level reasoning information and cross-modal complementary information earlier is more beneficial for learning textual semantic representations. When the CMFF module is added after several Transformer encoding layers, the accuracy decreases to 84.5%. This may be because the same information is introduced at different stages of representation learning, which, on the one hand, leads to redundancy and, on the other, disrupts the normal process of multimodal representation learning.

4.6 Ablation on Different Text Semantic Backbones (RQ4)

To assess the generalization of the proposed MM-PEAR-CoT method across different text semantic backbones, we conducted evaluations not only on the default BERT model but also on the XLNet [43] and SentiLare [12] models. We compared our approach with the following state-of-the-art methods: MISA [10], Self-MM [44], MAG [28], and CENet [34]. The results are presented in Table 5.
Table 5.
Methods | XLNet: Acc\({}_{7}\uparrow\) / Acc\({}_{2}\uparrow\) / F1\(\uparrow\) / MAE\(\downarrow\) / Corr\(\uparrow\) | SentiLare: Acc\({}_{7}\uparrow\) / Acc\({}_{2}\uparrow\) / F1\(\uparrow\) / MAE\(\downarrow\) / Corr\(\uparrow\)
MISA | 43.3\({}^{*}\) / 86.3 / 86.4 / 0.712 / 0.803 | 47.7\({}^{*}\) / 88.5 / 88.6 / 0.605 / 0.848
Self-MM | 44.2\({}^{*}\) / 86.9 / 86.9 / 0.671 / 0.817 | 48.3\({}^{*}\) / 89.5 / 89.5 / 0.590 / 0.864
MAG | 45.6\({}^{*}\) / 87.3 / 87.3 / 0.678 / 0.819 | 48.8\({}^{*}\) / 89.6 / 89.6 / 0.573 / 0.867
CENet | 46.5\({}^{*}\) / 88.4 / 88.4 / 0.648 / 0.832 | 49.4\({}^{*}\) / 90.9 / 90.8 / 0.570 / 0.870
MM-PEAR-CoT | 50.0 / 90.2 / 90.2 / 0.620 / 0.865 | 50.9 / 92.1 / 92.0 / 0.596 / 0.898
Table 5. Text Semantic Backbone Ablation on the CMU-MOSI Dataset
Bold means the best result. Asterisk means the reproduced result.
It can be observed that whether using the XLNet backbone network or the more sentiment-relevant SentiLare network, our proposed method consistently outperforms the others under the same experimental settings. Specifically, MM-PEAR-CoT achieves a 1.8% improvement in binary classification accuracy compared to the current best method when utilizing the XLNet backbone network and a 1.2% improvement when using the SentiLare backbone network. This indicates the generalization and effectiveness of our proposed method across different text semantic backbone networks. Furthermore, MM-PEAR-CoT benefits from superior text semantic backbone networks, exhibiting stable performance improvements on better networks. For instance, on the CMU-MOSI dataset, MM-PEAR-CoT achieves binary classification accuracies of 88.3%, 90.2%, and 92.1% when using BERT, XLNet, and SentiLare backbone networks, respectively.

4.7 Ablation on Different Reasoning Backbones (RQ5)

To assess the generalization of the proposed method across different reasoning backbones, experiments were conducted not only on the default GPT-3.5 but also on the open-source LLaMA-70B [30] and the powerful GPT-4.3 The results of the proposed PEAR-CoT and MM-PEAR-CoT on the CMU-MOSI dataset are presented in Table 6.
Specifically, LLaMA-70B demonstrated relatively limited reasoning capability, achieving a 75.3% binary classification accuracy. The default GPT-3.5 achieved better results, impressively reaching an 84.5% accuracy in the binary classification task. With approximately 175 billion parameters, GPT-3.5 significantly surpasses LLaMA-70B, which has 70 billion parameters. Moreover, GPT-4 showed a further 0.8% increase in binary classification accuracy and a 0.014 improvement in Pearson correlation coefficient over GPT-3.5. Building on PEAR-CoT, the MM-PEAR-CoT framework, which incorporates the CMFF module, achieved performance gains across all three LLM backbones. This suggests that the proposed approach is applicable across different reasoning backbones and can benefit from stronger reasoning backbones.
Table 6.
Backbones | PEAR-CoT: Acc\({}_{7}\uparrow\) / Acc\({}_{2}\uparrow\) / F1\(\uparrow\) / MAE\(\downarrow\) / Corr\(\uparrow\) | MM-PEAR-CoT: Acc\({}_{7}\uparrow\) / Acc\({}_{2}\uparrow\) / F1\(\uparrow\) / MAE\(\downarrow\) / Corr\(\uparrow\)
LLaMA-70B | 35.9 / 75.3 / 75.2 / 0.998 / 0.687 | 46.8 / 86.3 / 86.3 / 0.684 / 0.816
GPT-3.5 (default) | 43.1 / 84.5 / 84.6 / 0.726 / 0.800 | 48.1 / 88.3 / 88.2 / 0.647 / 0.842
GPT-4 | 45.1 / 85.3 / 85.3 / 0.722 / 0.814 | 48.8 / 89.0 / 89.0 / 0.639 / 0.858
Table 6. Reasoning Backbone Ablation on the CMU-MOSI Dataset

4.8 Generalization of PEAR-CoT on the Visual Modality (RQ6)

In the aforementioned experiments, the PEAR-CoT prompts were applied on the text modality. Herein, we further assess the effectiveness and generalizability of the PEAR-CoT prompts within the visual modality. To achieve this, it is first necessary to describe the relevant visual modality information in natural language.
We considered two different types of information: environmental context and facial-related information. For environmental background, descriptions of the surroundings were obtained using a video captioning model pre-trained on the Vatex dataset.4 However, since the Vatex dataset [37] was not collected for affective computing tasks, the generated descriptions might have limited emotion-related content. Regarding facial information, statistics for the three most frequently and least frequently occurring Action Units (AUs), the most likely facial expression categories, and the least likely facial expression categories for each video were collected. Subsequently, definitions for each AU were integrated into the Preliminaries part of the PEAR-CoT prompts. Finally, the generated visual PEAR-CoT prompts were fed into GPT-3.5 to predict sentiment intensity. The specific prompt template and examples are shown in Figure 4. It was observed that GPT can understand the relationship between specific AUs and sentiment polarity through the definitions of AUs. For example, AUs 01, 02, and 05 are associated with positive emotions, while AUs 04, 06, 09, and 15 are associated with negative emotions. By combining environmental information and facial expression information, specific sentiment intensity predictions were obtained.
Fig. 4.
Fig. 4. PEAR-CoT prompting on the visual modality. The text with a gray background is the input to the LLM, and the rest is the output from the LLM.
To quantify the performance of the PEAR-CoT method in the visual modality, we compared it with a standard Transformer [33] network. Specifically, Transformer encoder layers are adopted to model the temporal dynamics of the visual modality and to predict sentiment scores. The related results are shown in Table 7. It is important to note that the PEAR-CoT prompts were utilized in a zero-shot setting without being trained/fine-tuned on the dataset, whereas the Transformer was trained in a supervised setting. We have the following observations. First, the performance of both methods was quite limited, possibly because the visual modality contains limited emotion-related information. In comparison, the PEAR-CoT prompt method achieved comparable results, especially in terms of classification accuracy and F1 score. However, when using natural language to describe the visual context, PEAR-CoT only considered the simplest statistical information, overlooking fine-grained temporal dynamics. This resulted in relatively limited performance, with a noticeable gap in MAE and correlation coefficient compared to the Transformer. Overall, PEAR-CoT remains a method with significant potential in the visual modality.
Table 7.
Method | Acc\({}_{7}\uparrow\) | Acc\({}_{2}\uparrow\) | F1\(\uparrow\) | MAE\(\downarrow\) | Corr\(\uparrow\)
PEAR-CoT (zero-shot) | 15.5 | 55.2 | 57.0 | 1.621 | 0.027
Transformer (supervised) | 16.9 | 57.1 | 57.2 | 1.387 | 0.121
Table 7. Results of the Visual Modality on the CMU-MOSI Dataset
Bold means the best result.

4.9 Ablation on Different Zero-shot CoT Promptings (RQ7)

To validate the effectiveness of the proposed PEAR-CoT prompt, we compared it against two different zero-shot CoT prompt methods: standard zero-shot CoT prompt [14] and PS zero-shot CoT prompt [35].
Table 8 presents the results of the three methods in zero-shot sentiment analysis on the CMU-MOSI dataset. It can be observed that PS-CoT yielded the poorest results, followed by the standard CoT. The proposed PEAR-CoT achieved the best results among the three methods. Figures 5 and 6 depict a weak negative sample and a neutral sample, respectively. On one hand, PS-CoT is proposed for mathematical calculation tasks and is not suitable for more subjective sentiment analysis tasks. Evaluating sentiment scores for each part and then summing them for the final sentiment prediction is unreasonable, as shown in Figure 5. On the other hand, for standard CoT, CoT reasoning may cause errors in the early stages and affect the subsequent reasoning process, as illustrated in Figure 6. In contrast, the PEAR-CoT prompts first offer a global analysis or prediction result, followed by step-by-step explanations. Our approach reduces the risk of accumulating errors due to inaccurate intermediate reasoning steps, thus acquiring the best results across multiple evaluation metrics.
Table 8.
Zero-shot CoT Prompt | Acc\({}_{7}\uparrow\) | Acc\({}_{2}\uparrow\) | F1\(\uparrow\) | MAE\(\downarrow\) | Corr\(\uparrow\)
PS-CoT | 30.9 | 79.0 | 79.0 | 1.122 | 0.644
Standard-CoT | 40.1 | 82.8 | 82.9 | 0.882 | 0.702
PEAR-CoT | 43.1 | 84.5 | 84.6 | 0.726 | 0.800
Table 8. CoT Prompt Ablation on the CMU-MOSI Dataset
Fig. 5.
Fig. 5. Case study on a weak negative sample. The text with a gray background is the input to the LLM, and the rest is the output from the LLM.
Fig. 6.
Fig. 6. Case study on a neutral sample. The text with a gray background is the input to the LLM, and the rest is the output from the LLM.

4.10 Case Study about Hallucination Suppression (RQ8)

To further analyze the role of audio-visual modality information in the CMFF module for CoT reasoning steps, we calculated the attention score allocated to each step within the CoT reasoning process in Equation (2). Specifically, we calculated and ranked the attention scores obtained by each step within the cross-modal filtering submodule. Figure 7 illustrates the prediction detail for both a weakly positive and a weakly negative sample.
Fig. 7.
Fig. 7. Case study about hallucination suppression.
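The per-step scores discussed in this case study could be read out of the filtering submodule roughly as follows, building on the CMFF sketch in Section 3.3; the use of `need_weights` and head-averaged attention weights is an assumption about how the scores are collected.

```python
import torch

@torch.no_grad()
def cot_step_attention(cmff, x_av, x_cot):
    """Per-step attention mass from the filtering MHA (Eq. (2)): average the attention
    weights over all audio-visual query positions, giving one score per CoT step."""
    _, attn = cmff.mha1(x_av, x_cot, x_cot, need_weights=True)  # attn: (B, T, T_cot)
    step_scores = attn.mean(dim=1)                              # (B, T_cot)
    ranking = step_scores.argsort(dim=-1, descending=True)      # most-attended steps first
    return step_scores, ranking
```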
For the positive sample, the appearance of the negative lexicon “problem” and the negation expression “didn’t love” misled the LLM to believe that the intensity of negative sentiment was greater than that of positive sentiment. However, the corresponding visual modality showed a happy expression, and the speaker’s tone was light and cheerful. This indicates a contradiction between the LLM’s reasoning process and the facts under multimodal data. According to the multimodal information, the speaker was expressing a liking toward an object, but not to the extent of love. In this scenario, the attention scores reveal that the CMFF module focused more on parts of the reasoning process that were consistent with the audio-visual modality, specifically the third step. Conversely, it paid less attention to the parts where the reasoning process conflicted with the audio-visual modality, namely the first, fourth, and fifth steps. Finally, with the text-based zero-shot prediction as negative and the audio-visual modality showing positive signals, the MM-PEAR-CoT’s prediction was 1.1 (weakly positive), closer to the annotation of 1.25.
For the negative sample, the presence of the word “orgasm” misled the LLM to perceive a positive sentiment during the third step of reasoning. However, the corresponding visual modality displayed expressions of displeasure, and the speaker’s tone conveyed disgust. This indicates a contradiction between the LLM’s reasoning process and the facts under multimodal data. In fact, according to the multimodal information, the speaker expressed disgust toward the excessive use of Computer-Generated Imagery in the climax of a movie. In this scenario, the attention scores indicate that the CMFF module focused more on parts of the reasoning process that were consistent with the audio-visual modality, specifically the first and second steps. Conversely, it paid less attention to the parts where the reasoning process conflicted with the audio-visual modality, namely the third and fourth steps. Finally, with the text modality being predicted as positive and the audio-visual modality showing negative signals, the MM-PEAR-CoT’s prediction was \(-\)0.8 (weakly negative), closer to the annotation of \(-\)1.0.
Overall, the attention scores within the cross-modal filtering module demonstrate that MM-PEAR-CoT is capable of prioritizing reasoning processes that are more consistent with other modalities in the presence of conflicts between textual reasoning steps and multimodal information, thereby obtaining more accurate multimodal sentiment prediction results.
Despite the proposed method’s capabilities, it still makes incorrect predictions in certain instances. For example, consider the “jUzDDGyPkXU\(\_\)21” sample from CMU-MOSI, which is annotated as \(-\)1.0 (weakly negative). The corresponding text states, “The only actor who can really sell their lines is Erin Eckart.” This sentence does not display any overt negative sentiments. Thus, the text-based reasoning process does not involve steps associated with negative sentiments, leading to a zero-shot prediction of 2.0 (positive). However, the negative sentiments are conveyed through audio and visual modalities. This discrepancy results in the proposed method failing to suppress hallucinations by focusing on reasonable steps, finally leading to an erroneous prediction of 1.1 (weakly positive). This example highlights the challenges the proposed method faces when dealing with complex samples.

4.11 Discussion

4.11.1 Fairness.

MM-PEAR-CoT aims to introduce implicit reasoning information into the process of multimodal representation learning. For the multimodal sentiment analysis task, there are no readily available reasoning exemplars. Manually annotating CoT exemplars is time-consuming, costly, and challenging to ensure consistency and completeness in multi-step reasoning. Thus, we leverage LLMs with reasoning capabilities to generate these CoTs. Furthermore, to ensure a fair comparison, we employ backbone networks consistent with current multimodal sentiment analysis methods, such as BERT, XLNet, and SentiLare, for semantic learning in the text modality. Notably, we do not utilize the powerful hidden layer embeddings of LLMs. Extensive experiments demonstrate that the proposed method achieves superior results, further validating the effectiveness of CoT reasoning in multimodal sentiment analysis.

4.11.2 Limitation.

While we have successfully addressed a portion of the hallucination issue stemming from unimodal input by introducing the audio-visual modality, it is crucial to recognize that this problem is not entirely resolved. Exploring methods to mitigate hallucinations is a potential area for further research and investigation. Since our approach involves zero-shot prompting of the LLM to generate rationales, there is a potential risk of inheriting social biases from the LLM. These biases, encompassing cultural, ethical, and various other dimensions, may manifest in the generated rationales, potentially causing adverse effects on users. To address this concern in the future, potential solutions could include implementing constraints at each prompting stage or employing more advanced LLMs trained on unbiased resources.

5 Conclusion

In this article, we propose multimodal PEAR-CoT reasoning for multimodal sentiment analysis. PEAR-CoT prompts are employed to guide LLMs in generating a progressive reasoning process, serving as a representation of logical relationships within the text. To suppress the hallucinations of LLMs, we further employ information from the audio-visual modality to filter the CoT. We integrate high-level logical information and cross-modal complementary information into the process of text semantic representation learning, obtaining multimodal representations for sentiment prediction. In addition to the text modality, we also showcase the potential of PEAR-CoT reasoning in the visual modality. To the best of our knowledge, this is the first study that applies CoT reasoning to multimodal sentiment analysis.

Footnotes

3
https://openai.com/blog/new-embedding-models-and-api-updates. In terms of the number of tokens spent, each reasoning requires about 125 input tokens and no more than 190 output tokens.

References

[1]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33. 1877–1901.
[2]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
[3]
Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. 2023. A survey of chain of thought reasoning: Advances, frontiers and future. arXiv:2309.15402. Retrieved from
[4]
Ringki Das and Thoudam D. Singh. 2023. Multimodal sentiment analysis: A survey of methods, trends, and challenges. ACM Computing Surveys 55, 13s (2023), 1–38.
[5]
Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP—A collaborative voice analysis repository for speech technologies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’14). IEEE, 960–964.
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from
[7]
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. arXiv:2309.11495. Retrieved from
[8]
Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. 2023. Reasoning implicit sentiment with chain-of-thought prompting. arXiv:2305.11255. Retrieved from
[9]
Wei Han, Hui Chen, and Soujanya Poria. 2021. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. arXiv:2109.00412. Retrieved from
[10]
Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. MISA: Modality-invariant and-specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia. 1122–1131.
[11]
Yewon Hwang and Jong-Hwan Kim. 2023. Self-supervised unimodal label generation strategy using recalibrated modality representations for multimodal sentiment analysis. In Proceedings of the Findings of the Association for Computational Linguistics (EACL ’23). 35–46.
[12]
Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, and Minlie Huang. 2019. SentiLARE: Sentiment-aware language representation learning with linguistic knowledge. arXiv:1911.02493. Retrieved from
[13]
Kyeonghun Kim and Sanghyun Park. 2023. AOBERT: All-modalities-in-One BERT for multimodal sentiment analysis. Information Fusion 92 (2023), 37–45.
[14]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 35. 22199–22213.
[15]
Deren Lei, Yaxi Li, Mingyu Wang, Vincent Yun, Emily Ching, and Eslam Kamal. 2023. Chain of natural language inference for reducing large language model ungrounded hallucinations. arXiv:2310.03951. Retrieved from
[16]
Yong Li, Yuanzhi Wang, and Zhen Cui. 2023. Decoupled multimodal distilling for emotion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6631–6640.
[17]
Ronghao Lin and Haifeng Hu. 2023. Multi-task momentum distillation for multimodal sentiment analysis. IEEE Transactions on Affective Computing 15 (2023), 549–565.
[18]
Zhun Liu, Ying Shen, Varun B. Lakshminarasimhan, Paul P. Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv:1806.00064. Retrieved from
[19]
Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv:1711.05101. Retrieved from
[20]
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 35. 2507–2521.
[21]
Fengmao Lv, Xiang Chen, Yanyong Huang, Lixin Duan, and Guosheng Lin. 2021. Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2554–2562.
[22]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 36. 46534–46594.
[23]
Sijie Mai, Songlong Xing, Jiaxuan He, Ying Zeng, and Haifeng Hu. 2023. Multimodal graph for unaligned multimodal sequence analysis via graph convolution and graph pooling. ACM Transactions on Multimedia Computing, Communications and Applications 19, 2 (2023), 1–24.
[24]
Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. 2022. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Transactions on Affective Computing 14, 1 (2022), 2276–2289.
[25]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 32, 8024–8035.
[26]
Petra Poklukar, Miguel Vasco, Hang Yin, Francisco S. Melo, Ana Paiva, and Danica Kragic. 2022. Geometric multimodal contrastive representation learning. In Proceedings of the International Conference on Machine Learning. PMLR, 17782–17800.
[27]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[28]
Wasifur Rahman, Md K. Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2359–2369.
[29]
Zhongkai Sun, Prathusha Sarma, William Sethares, and Yingyu Liang. 2020. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8992–8999.
[30]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. LLaMA 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. Retrieved from
[31]
Yao-Hung H. Tsai, Shaojie Bai, Paul P. Liang, J. Z. Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6558–6569.
[32]
Yao-Hung H. Tsai, Paul P. Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2018. Learning factorized multimodal representations. arXiv:1806.06176. Retrieved from
[33]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 30, 5998–6008.
[34]
Di Wang, Shuai Liu, Quan Wang, Yumin Tian, Lihuo He, and Xinbo Gao. 2022. Cross-modal enhancement network for multimodal sentiment analysis. IEEE Transactions on Multimedia 25 (2023), 4909–4921.
[35]
Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv:2305.04091. Retrieved from
[36]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171. Retrieved from
[37]
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Y. Wang. 2019b. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4581–4591.
[38]
Yansen Wang, Ying Shen, Zhun Liu, Paul P. Liang, Amir Zadeh, and Louis-Philippe Morency. 2019a. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7216–7223.
[39]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 35. 24824–24837.
[40]
Dingkang Yang, Shuai Huang, Haopeng Kuang, Yangtao Du, and Lihua Zhang. 2022. Disentangled representation learning for multimodal emotion recognition. In Proceedings of the 30th ACM International Conference on Multimedia. 1642–1651.
[41]
Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, Amir Zadeh, Soujanya Poria, and Louis-Philippe Morency. 2020. MTGAT: Multimodal temporal graph attention networks for unaligned human multimodal language sequences. arXiv:2010.11985.
[42]
Kaicheng Yang, Hua Xu, and Kai Gao. 2020. CM-BERT: Cross-modal BERT for text-audio sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia. 521–528.
[43]
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 32, 5754–5764.
[44]
Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu. 2021. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 10790–10797.
[45]
Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv:1707.07250. Retrieved from
[46]
Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv:1606.06259. Retrieved from
[47]
AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1 (Long Papers). 2236–2246.
[48]
Ying Zeng, Sijie Mai, Wenjun Yan, and Haifeng Hu. 2023. Multimodal reaction: Information modulation for cross-modal representation learning. IEEE Transactions on Multimedia 26 (2024), 2178–2191.
[49]
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. 2023. Multimodal chain-of-thought reasoning in language models. arXiv:2302.00923. Retrieved from
[50]
Sicheng Zhao, Guoli Jia, Jufeng Yang, Guiguang Ding, and Kurt Keutzer. 2021. Emotion recognition from multiple modalities: Fundamentals and methodologies. IEEE Signal Processing Magazine 38, 6 (2021), 59–73.

