CN116662924A - Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism - Google Patents
Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
- Publication number
- CN116662924A CN116662924A CN202310273760.9A CN202310273760A CN116662924A CN 116662924 A CN116662924 A CN 116662924A CN 202310273760 A CN202310273760 A CN 202310273760A CN 116662924 A CN116662924 A CN 116662924A
- Authority
- CN
- China
- Prior art keywords
- features
- attention mechanism
- image
- channel
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an aspect-level multimodal emotion analysis method based on a dual-channel and attention mechanism. Built on neural networks, the method extracts the emotion information contained in image features at multiple scales by combining aspect-word features with text features, and introduces a GCN (Graph Convolutional Network) into the aspect-level multimodal emotion analysis task, greatly improving the feature extraction and interactive fusion capability of the model. In the feature extraction layer, pre-trained encoders are used to extract aspect-word, text and image features; after the aspect-word and sentence features are fused bidirectionally in the attention mechanism layer, the final aspect-word and sentence feature representations are obtained. The image features are processed by an image feature extraction network built from a channel attention mechanism and a spatial attention mechanism, and finally the interactive fusion features of all modalities are dynamically extracted by the GCN module. In experiments, the performance indices of attention-based aspect-level multimodal emotion analysis on the datasets are improved.
Description
Technical Field
The invention belongs to the fields of natural language processing and emotion analysis, and in particular relates to an aspect-level multimodal emotion analysis method based on a dual-channel and attention mechanism.
Background
In recent years, the content published by users on various online platforms has grown rapidly. How to use artificial intelligence and other related technologies to mine the emotional tendency toward a certain aspect contained in this content has become a research hotspot.
Emotion expresses a person's attitude toward an objective thing, and emotional tendencies are usually conveyed in various ways such as body language, facial expressions and spoken or written words. Emotion analysis (Sentiment Analysis, SA), also known as Opinion Mining (OM), aims to extract opinions from large amounts of unstructured text and classify them as positive, neutral or negative emotion polarity. In the Internet age, social platforms such as Weibo, Zhihu and WeChat have flourished, and text and images have gradually become the main carriers through which users convey opinions and emotions about target aspects or entities in the online world. Aspect-based emotion analysis has therefore received extensive attention in academia and industry over the last decade.
Early work usually generated text features with machine-learning methods such as emotion dictionaries, dependency relations and statistical methods, but these traditional methods require considerable manual effort for feature selection and extraction, lack the association between aspect words and their sentence context, and show poor transferability and robustness. The success of deep learning in various natural language processing tasks has promoted the application of neural networks to aspect-level emotion analysis. By using various deep neural network models to learn and extract the feature correlations between aspect words and sentence contexts, model performance has gradually improved. Many deep network models, such as convolutional neural networks (Convolutional Neural Network, CNN), recurrent neural networks (Recurrent Neural Network, RNN), graph neural networks (Graph Neural Network, GNN) and attention mechanisms (Attention Mechanism), have been proposed, and text-based aspect-level emotion analysis has developed further.
As the content of many online platforms becomes more and more multimodal, predicting the emotion polarity of a target from information in other modalities is receiving increasing attention from researchers, and the achievements of deep learning in the image processing field provide a theoretical basis for aspect-level multimodal emotion analysis. Xu et al. first introduced image modality information into aspect-level emotion analysis, extracting image features with a CNN and text features with a Long Short-Term Memory (LSTM) network, and verified the feasibility of the proposed method with an interactive attention mechanism. Gu et al. then encoded text semantic information with a Bidirectional Gated Recurrent Unit (BiGRU) network and a multi-head self-attention mechanism, extracted image features with a ResNet-152 model and a capsule network, and used a multi-head attention network for multimodal interactive fusion, maximizing the contribution of each modality to emotion transmission and improving network performance. Yu et al. proposed a hierarchical interaction module to model the pairwise interactions between given aspect words, text information and image information and, to bridge the semantic gap between text features and image features, further proposed an auxiliary reconstruction module based on the auto-encoder idea, which improves model performance. However, existing models still have shortcomings: 1) channel information and spatial information in the image are not fully extracted during image feature extraction, so the emotion information in the image cannot be effectively combined with the aspect-word information; 2) information between modalities is not effectively fused, so model performance remains unsatisfactory. This work therefore focuses on the aspect-level multimodal emotion analysis task and presents a more effective model.
CN114936623A, an aspect-level emotion analysis method integrating multimodal data, first performs data preprocessing, adjusting text and image formats to fit the input requirements of the neural network; second, it extracts text features with a Bi-LSTM after word embedding and image features with a ResNet50 network; it then performs multimodal aspect extraction and alignment, extracting aspect terms from the text with a sequence-labelling method and implicitly aligning image regions with aspect words through a memory network augmented with attention and point-wise convolution operations; next, based on position-attentive text features and Gaussian modelling of explicit context positions, a memory network extracts aspect-term-sensitive text representations; multimodal data fusion is then carried out with a fusion discrimination matrix; finally, emotion classification is performed using the fused feature information. This method uses multimodal data for aspect-level emotion analysis, extracts complementary multimodal information and improves the accuracy of the emotion analysis task.
However, using averaged aspect-word vectors easily causes word-sense confusion and hinders the interaction of aspect words with sentence and image features. Moreover, during image feature extraction it ignores the auxiliary role that the semantic information of the sentence context can play. The above method is therefore limited in its ability to fuse multimodal data.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. An aspect-level multi-mode emotion analysis method based on a dual-channel and attention mechanism is provided. The technical scheme of the invention is as follows:
an aspect-level multi-modal emotion analysis method based on a dual-channel and attention mechanism comprises the following steps:
step 1: extracting hidden characteristic representations from sentence characteristics and aspect word characteristics in a data set by using a Bert pre-training encoder, and extracting picture characteristics by using a ResNet-152 pre-training network; an aspect word is a subsequence that belongs to a sentence;
step 2: calculating the feature correlation of sentence features and aspect word features through a multi-head attention mechanism, so that corresponding attention weighting is obtained between the high-similarity features; finally obtaining aspect word features guided by the text and text features guided by the aspect words;
step 3: weighting the original image characteristics by using the aspect word characteristics guided by the text, and obtaining the image channel characteristics through a channel attention mechanism;
step 4: weighting the image channel characteristics by using the text characteristics guided by the aspect words, and generating a spatial attention pattern by using the spatial relation of the characteristics in a spatial attention mechanism to obtain final characteristic representation of the image;
step 5: calculating a dynamic adjacency matrix by text features guided by aspect words and image features generated by channel attention and spatial attention; obtaining a final fusion feature representation using the aggregate capabilities and the messaging capabilities of the graph neural network;
step 6: and classifying the final fusion features, aspect word features and sentence features by using a pooling mechanism through a classification module.
Further, step 1 extracts hidden feature representations from sentence features and aspect word features in the dataset by using a Bert pre-training encoder, and extracts picture features by using a ResNet-152 pre-training network, specifically:
outputting text and aspect-word feature information through two Bert-based pre-trained text feature encoders; extracting image features with a pre-trained ResNet network; adopting pre-trained models provides better initialization parameters for the model, gives it better generalization after fine-tuning on the target task, and accelerates model convergence. The Bert pre-trained model yields sentence features H_S = Bert(E_S), H_S ∈ R^(t×d), and aspect-word features H_T = Bert(E_T), H_T ∈ R^(t×d), where t denotes the text length and d denotes the output feature dimension. The image features are denoted H_I = ResNet(E_I), H_I ∈ R^(c×w×h), where ResNet denotes the ResNet-152 model, c denotes the number of channels of the image feature, and w and h denote the width and height of the feature, respectively. E_T, E_S, E_I denote the original aspect words, sentence and image; H_T, H_S, H_I denote the aspect-word, sentence and image features extracted through the pre-trained networks.
Further, the step 2 adopts a multi-head attention mechanism to fuse related information between aspect word characteristics and sentence characteristics, and the specific method is as follows:
in order to obtain the interactive features between sentences and aspect words, a multi-head attention mechanism is adopted to calculate the similarity between the two, which effectively realizes feature fusion between them; the expressions are as follows:
MHA(Q, K, V) = softmax(QK^T / √d_k)V    (1)
H^(i) = LayerNorm(MHA(Q, K, V) + Q)    (2)
Y = LayerNorm(GELU(H^(i) W_1^T) W_2^T + H^(i))    (3)
where MHA denotes the multi-head attention mechanism, Q, K, V denote the input features, d_k is the scaling factor, H^(i) denotes the output of the i-th layer in the Transformer, LayerNorm denotes layer normalization, GELU is the activation function, and W_1^T, W_2^T denote trainable parameter matrices.
The aspect-word features and the sentence features are used in turn as the query matrix Q to compute the aspect-word features Y_T guided by the sentence features and the sentence features Y_S guided by the aspect-word features.
Further, the step 3 weights the original image features by using the aspect word features guided by the text, and obtains the image channel features through a channel attention mechanism, and the specific method is as follows:
in order to introduce aspect word features into an image, the aspect word features and the image features are fused through a multi-head self-attention mechanism, and the specific formula is as follows:
H_ca = MHA(H_I, Y_T, Y_T)    (4)
M_CH = σ(MLP(AvgPool(H_ca)) + MLP(MaxPool(H_ca)))    (5)
where, in the channel attention mechanism, the input H_ca is the aspect-word-guided image feature obtained through the multi-head attention mechanism, MLP denotes a multi-layer perceptron, AvgPool denotes average pooling, MaxPool denotes max pooling, σ denotes the ReLU activation function, and M_CH denotes the output of the channel attention.
Further, the step 4: the text feature guided by the aspect words is used for weighting the image channel feature, in a spatial attention mechanism, a spatial attention pattern is generated by using the spatial relation of the feature, and the final feature representation of the image is obtained, wherein the method comprises the following specific steps:
through a multi-head attention mechanism, the sentence features guided by the aspect words are used to weight the image features output by the channel attention mechanism, highlighting in the spatial attention mechanism the regions of the image features related to the aspect-word emotion; the specific formulas are:
H_sa = MHA(M_CH, Y_S, Y_S)    (6)
M_SP = σ(Conv(Concat(AvgPool(H_sa); MaxPool(H_sa))))    (7)
Equation (7) gives the implementation details of the spatial attention mechanism, where Concat denotes matrix concatenation, Conv denotes a convolution operation, and σ denotes the ReLU activation function. H_sa denotes the image features guided by the sentence features via the multi-head attention mechanism, and M_SP denotes the output of the spatial attention.
Further, the step 5 calculates a dynamic adjacency matrix by the text features guided by the aspect words and the image features generated by the channel attention and the space attention; the aggregation capability and the message transmission capability of the graph neural network are used to obtain a final fusion characteristic representation, which specifically comprises the following steps:
the sentence features and the image features are concatenated, and an attention matrix is obtained through a self-attention mechanism and used as the adjacency matrix of the GCN; first, the attention matrix can capture the correlated features between the sentence and image features, which makes the adjacency matrix more flexible, and second, it can adaptively adjust the importance of similar features between the sentence and the image. In the GCN, a graph G = {V, A} is given, where V is the set of all nodes in the graph, corresponding to the concatenated sentence and image features, and A is the adjacency matrix between all nodes; the weight A_ij depends on the similarity between nodes;
H_att = Concat(Y_S, M_SP)    (8)
A = MHA(H_att, H_att, H_att)    (9)
where H_att denotes the concatenation of the channel/spatial attention output M_SP and the aspect-word-guided sentence features Y_S, h_i^l is the feature output of node v_i at layer l, W^l is the trainable weight matrix of GCN layer l, and σ is the ReLU activation function. Since the GCN performs feature extraction and encoding between the associated nodes, the output H^l of all nodes at layer l is expressed as:
H^l = σ(A H^(l-1) W^l)    (10)
where n denotes the number of nodes.
Further, the step 6: the final fusion features, aspect word features and sentence features are classified by using a pooling mechanism through a classification module, and the specific steps are as follows:
for the aspect-word features and sentence features, since the [CLS] tag is added when the features are extracted with the pre-trained model, the final hidden state of this tag is taken as the collective representation of the aspect-word and sentence features, denoted h_[CLS]^T and h_[CLS]^S respectively. For the fusion output part, the l-th layer representation H^l of the GCN features is already a weighted sum over the node features, so its pooled result h^G is taken as the classification feature. The total output feature O after pooling and concatenation can be expressed as:
h^G = Pool(H^l)    (11)
O = Concat(h_[CLS]^T, h_[CLS]^S, h^G)    (12)
in the classification phase:
p(y|O) = softmax(W^T O)    (13)
where W is a trainable weight matrix; a cross-entropy loss function is used to compute the loss value Loss, with D and y^(j) denoting the number of training samples and the ground-truth label of the j-th sample, respectively.
The invention has the advantages and beneficial effects as follows:
the advantage of the invention is mainly that in step 3 of claim 1, the channels in the image can be regarded as feature extractors, the channel attention being directed to extracting important features in the image channels that are relevant to the aspect words. In order to integrate aspect word information into image channel features, both features are interacted with using a multi-head attention mechanism before channel attention, so that the guiding function of the aspect words in the channel attention is conveniently exerted. Then in step 4, the spatial attention mainly extracts the region features related to the aspect words in the image, and because the sentence features also have the region association related to the aspect words, the sentence features are introduced into the spatial attention mechanism, so that the spatial attention is guided to extract the region features related to the aspect words in the image. And gradually extracting deep features in the image, and enhancing contribution of the image features to emotion classification in subsequent multi-mode fusion.
Drawings
FIG. 1 is a flowchart of an aspect-level multi-modal emotion analysis method based on a dual channel and attention mechanism in accordance with a preferred embodiment of the present invention.
FIG. 2 is a framework diagram of an aspect-level multimodal emotion analysis model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
fig. 1 is a general flow chart of an aspect-level multi-modal emotion analysis method based on a dual-channel and attention mechanism according to the present invention, and is further described below with reference to fig. 1. The invention mainly comprises the following steps:
step 1: extracting hidden feature representations from sentence features and aspect word features in the dataset using a Bert pre-training encoder. The picture features are extracted using a ResNet-152 pre-training network.
Step 2: and calculating the feature correlation of the sentence features and the aspect word features through a multi-head attention mechanism, so that corresponding attention weighting is obtained between the features with high similarity. Finally, the aspect word features guided by the text and the text features guided by the aspect words are obtained.
Step 3: and weighting the original image characteristics by using the aspect word characteristics guided by the text, and obtaining the image channel characteristics through a channel attention mechanism.
Step 4: and weighting the image channel characteristics by using the text characteristics guided by the aspect words, and generating a spatial attention pattern by using the spatial relationship of the characteristics to obtain the final characteristic representation of the image.
Step 5: the text features guided by the aspect words and the image features generated by the channel attention and the spatial attention calculate a dynamic adjacency matrix. The aggregate and messaging capabilities of the graph neural network are used to derive a final fused feature representation.
Step 6: the final fused feature representation is classified by the classification module using a pooling mechanism.
FIG. 2 is a framework diagram of the aspect-level multimodal emotion analysis model, and the structural principle of the invention is further described with reference to FIG. 2. The model of the method of the invention has four layers, and the specific content of each layer is as follows:
(1) Modal feature extraction layer
A set of multimodal samples is known, which contains a sentence S = {w_1, w_2, …, w_n} of n words and an associated image I, as well as an aspect-word subsequence T of S. The aspect word T is also associated with an emotion label y. The invention feeds the sentence E_S and the aspect words E_T into two Bert encoders, respectively, to extract features, and the image E_I into the ResNet-152 network. For the input part of the Bert text encoder, the tag [CLS] is added at the beginning of the text and the tag [SEP] at the end of the text; the Bert pre-trained model finally yields the sentence features H_S ∈ R^(t×d) and the aspect-word features H_T ∈ R^(t×d), where t denotes the aspect-word or text length and d denotes the feature dimension. The image features are expressed as H_I ∈ R^(c×w×h), where c denotes the number of channels of the image feature and w and h denote the width and height of the feature, respectively.
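As a minimal illustration of this feature extraction layer (not the patented implementation itself), the following PyTorch sketch assumes the HuggingFace transformers and torchvision packages, the bert-base-uncased checkpoint and 224×224 input images; these concrete choices are assumptions rather than details given in the text.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchvision import models


class ModalFeatureExtractor(nn.Module):
    """Extracts H_S, H_T (BERT) and H_I (ResNet-152 convolutional trunk)."""

    def __init__(self, bert_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert_sentence = BertModel.from_pretrained(bert_name)  # sentence encoder
        self.bert_aspect = BertModel.from_pretrained(bert_name)    # aspect-word encoder
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        # drop avgpool/fc so the output stays a c x w x h map (2048 x 7 x 7 for 224x224 input)
        self.resnet_trunk = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, sentences, aspects, images):
        # the tokenizer adds [CLS] at the start and [SEP] at the end automatically
        s = self.tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
        a = self.tokenizer(aspects, return_tensors="pt", padding=True, truncation=True)
        H_S = self.bert_sentence(**s).last_hidden_state  # (batch, t, d)
        H_T = self.bert_aspect(**a).last_hidden_state    # (batch, t_a, d)
        H_I = self.resnet_trunk(images)                  # (batch, 2048, 7, 7)
        return H_S, H_T, H_I


# usage sketch
extractor = ModalFeatureExtractor()
H_S, H_T, H_I = extractor(["the pizza was great but service was slow"],
                          ["pizza"], torch.randn(1, 3, 224, 224))
```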
(2) Attention mechanism layer
In order to further extract the relevance between features and the modal interaction between text words and image features, the model adopts a multi-head attention mechanism to extract the potential relevance features between aspect word features and sentence features. In the image feature extraction, the channel attention mechanism fusing aspect word features and the space attention mechanism fusing text features are used, the image mode features are fused with the text mode features, and features which are in different scales and are related to the aspect words in the image mode are extracted, so that the GCN network can identify the adjacent relation between important nodes. The multi-mode characteristics are extracted more deeply through a message passing and aggregation mechanism of the GCN network.
1) Sentence and aspect word feature interactions
In order to obtain interactive features between sentences and aspect words, the associated information among different features is strengthened, and redundant information is filtered. The invention adopts a multi-head attention mechanism to calculate the similarity between the features of the two, and can effectively realize the feature fusion between the two. The expression of the multi-head attention mechanism is:
MHA(Q, K, V) = softmax(QK^T / √d_k)V    (1)
where T denotes the matrix transpose and MHA denotes the multi-head attention mechanism, which consists of three parts: Query (Q), Key (K) and Value (V); the attention values generated by the dot-product interaction between Q and K are mapped onto V. The scaling factor d_k is the feature dimension of each attention head.
The aspect word features and sentence features can obtain the fused output features of the aspect word features and the sentence features through a multi-head attention mechanism, and the aspect word features guided by the sentence features are subjected to linear transformation and residual connection to obtain the final output features, wherein the specific formula is as follows:
H^(i) = LayerNorm(MHA(Q, K, V) + Q)    (2)
Y = LayerNorm(GELU(H^(i) W_1^T) W_2^T + H^(i))    (3)
where LayerNorm denotes layer normalization, which keeps the distribution of the data features stable and accelerates model training, GELU denotes the activation function, and W_1^T, W_2^T denote trainable weight parameters. In the multi-head attention mechanism, the aspect-word features and the sentence features are used in turn as the query matrix Q, so that the aspect-word features Y_T guided by the sentence features and the sentence features Y_S guided by the aspect-word features can be calculated.
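A hedged sketch of formulas (1)–(3) in PyTorch is given below: cross multi-head attention followed by a GELU feed-forward block, each with residual connection and LayerNorm. The hidden size of 768, 8 attention heads, feed-forward width of 2048 and the shared weights for the two fusion directions are illustrative assumptions, not values specified in the text.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Cross multi-head attention + GELU feed-forward, each with residual + LayerNorm."""

    def __init__(self, d_model=768, n_heads=8, d_ff=2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, query, key_value):
        # attention weights follow the similarity between query and key_value tokens
        attn_out, _ = self.mha(query, key_value, key_value)
        h = self.norm1(query + attn_out)    # residual connection + LayerNorm
        return self.norm2(h + self.ffn(h))  # feed-forward + residual + LayerNorm


fusion = CrossModalFusion()
H_T, H_S = torch.randn(1, 4, 768), torch.randn(1, 20, 768)
Y_T = fusion(H_T, H_S)  # aspect-word features guided by the sentence
Y_S = fusion(H_S, H_T)  # sentence features guided by the aspect words
```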
2) Channel attention mechanism
Since each channel of image features can be regarded as a feature detector, in the channel attention mechanism, the features after average pooling and maximum pooling can be extracted through a layer of feedforward neural network, and important features related to aspect words in each channel of the image can be extracted. In order to identify emotion distribution related to an aspect word contained in a channel in an image channel, the invention introduces the aspect word feature in a channel attention mechanism, firstly, the aspect word feature and the image feature are fused through a multi-head attention mechanism, and the fused feature is used as input of the channel attention mechanism. The specific formula is as follows:
H_ca = MHA(H_I, Y_T, Y_T)    (4)
M_CH = σ(MLP(AvgPool(H_ca)) + MLP(MaxPool(H_ca)))    (5)
In the channel attention mechanism, the input H_ca is the aspect-word-guided image feature produced by the multi-head attention mechanism; the complete output also undergoes the linear transformation and residual connection shown in formulas (2) and (3), and H_ca is used here simply to keep the description concise. Equation (5) gives the implementation details of the channel attention mechanism, where MLP denotes a multi-layer perceptron containing the trainable weight parameters of the neural network, AvgPool denotes average pooling, MaxPool denotes max pooling, and σ denotes the ReLU activation function.
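The channel attention of formulas (4)–(5) could be sketched as follows; the projection of the 2048-channel ResNet map to the text dimension, the reduction ratio of 16 and the use of the gate to re-weight the feature map are assumptions introduced for illustration, and the ReLU gate follows the text (a sigmoid gate, as in CBAM, would be a common alternative).

```python
import torch
import torch.nn as nn


class AspectGuidedChannelAttention(nn.Module):
    """Formulas (4)-(5): aspect-guided cross-attention followed by channel attention."""

    def __init__(self, channels=2048, d_model=768, reduction=16):
        super().__init__()
        self.to_text_dim = nn.Linear(channels, d_model)  # image tokens -> text dimension
        self.to_img_dim = nn.Linear(d_model, channels)
        self.mha = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))

    def forward(self, H_I, Y_T):
        b, c, w, h = H_I.shape
        tokens = self.to_text_dim(H_I.flatten(2).transpose(1, 2))  # (b, w*h, d)
        fused, _ = self.mha(tokens, Y_T, Y_T)                      # aspect-guided image tokens (formula 4)
        H_ca = self.to_img_dim(fused).transpose(1, 2).reshape(b, c, w, h)
        avg = H_ca.mean(dim=(2, 3))                                # global average pooling over space
        mx = H_ca.amax(dim=(2, 3))                                 # global max pooling over space
        # ReLU gate as stated in the text
        M_CH = torch.relu(self.mlp(avg) + self.mlp(mx))            # channel weights (formula 5)
        return H_ca * M_CH[:, :, None, None]                       # channel-re-weighted image features
```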
3) Spatial attention mechanism
In the spatial attention mechanism, the spatial relationships of the features can be used to learn the distribution regions of the important emotion features related to the aspect words. In the sentence and aspect-word interaction stage, the text features guided by the aspect words already encode the positions of the important emotion features for the aspect words in the sentence. To enhance the extraction of image feature regions, the invention fuses the aspect-word-guided sentence features with the image features output by the channel attention mechanism, and then learns the distribution regions of the important emotion features in the spatial attention mechanism. The specific formulas are as follows:
H_sa = MHA(M_CH, Y_S, Y_S)    (6)
M_SP = σ(Conv(Concat(AvgPool(H_sa); MaxPool(H_sa))))    (7)
Equation (7) gives the implementation details of the spatial attention mechanism, where Concat denotes matrix concatenation, Conv denotes a convolution operation, and σ denotes the ReLU activation function.
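A corresponding sketch of the sentence-guided spatial attention of formulas (6)–(7), assuming a CBAM-style 7×7 convolution over the channel-pooled maps and the same illustrative dimensions as above:

```python
import torch
import torch.nn as nn


class SentenceGuidedSpatialAttention(nn.Module):
    """Formulas (6)-(7): sentence-guided cross-attention followed by spatial attention."""

    def __init__(self, channels=2048, d_model=768):
        super().__init__()
        self.to_text_dim = nn.Linear(channels, d_model)
        self.to_img_dim = nn.Linear(d_model, channels)
        self.mha = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # CBAM-style spatial gate (assumed size)

    def forward(self, M_CH, Y_S):
        b, c, w, h = M_CH.shape
        tokens = self.to_text_dim(M_CH.flatten(2).transpose(1, 2))  # (b, w*h, d)
        fused, _ = self.mha(tokens, Y_S, Y_S)                       # sentence-guided image tokens (formula 6)
        H_sa = self.to_img_dim(fused).transpose(1, 2).reshape(b, c, w, h)
        avg = H_sa.mean(dim=1, keepdim=True)                        # average pooling over channels
        mx = H_sa.amax(dim=1, keepdim=True)                         # max pooling over channels
        gate = torch.relu(self.conv(torch.cat([avg, mx], dim=1)))   # spatial weights (formula 7)
        return H_sa * gate                                          # spatially re-weighted features M_SP
```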
(3) GCN feature fusion layer
Sentence features are concatenated with image features, and an attention matrix is obtained through a self-attention mechanism and used as the adjacency matrix of the GCN. First, the attention matrix can learn the correlated features between the sentence and image features, which makes the adjacency matrix more flexible; second, it can adaptively adjust the importance of similar features between the sentence features and the image features. In the GCN, a graph G = {V, A} is given, where V is the set of all nodes in the graph, corresponding to the concatenated sentence and image features, and A is the adjacency matrix between all nodes; the weight A_ij depends on the similarity between nodes.
H_att = Concat(Y_S, M_SP)    (8)
A = MHA(H_att, H_att, H_att)    (9)
where h_i^l is the feature output of node v_i at layer l, W^l is the trainable weight matrix of GCN layer l, and σ is the ReLU activation function. Since the GCN performs feature extraction and encoding between the associated nodes, the output H^l of all nodes at layer l is expressed as:
H^l = σ(A H^(l-1) W^l)    (10)
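A sketch of the GCN fusion of formulas (8)–(10) follows; it assumes the spatial-attention output M_SP has already been flattened and projected to the text dimension so that sentence tokens and image regions can be concatenated as graph nodes, and it reuses the head-averaged self-attention weights as the adjacency matrix A.

```python
import torch
import torch.nn as nn


class DynamicGCNFusion(nn.Module):
    """Formulas (8)-(10): self-attention weights reused as the GCN adjacency matrix."""

    def __init__(self, d_model=768, n_layers=2):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.gcn_weights = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))

    def forward(self, Y_S, M_SP_tokens):
        # nodes: sentence tokens followed by image-region tokens (both already of dimension d_model)
        H_att = torch.cat([Y_S, M_SP_tokens], dim=1)  # formula (8), shape (b, n, d)
        _, A = self.mha(H_att, H_att, H_att)          # formula (9): head-averaged attention weights (b, n, n)
        H = H_att
        for W in self.gcn_weights:
            H = torch.relu(A @ W(H))                  # formula (10): H^l = ReLU(A H^{l-1} W^l)
        return H                                      # fused node representations
```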
(4) Output layer
For the aspect-word features and sentence features, since the [CLS] tag is added when the features are extracted with the pre-trained model, the final hidden state of this tag is taken as the collective representation of the aspect-word and sentence features, denoted h_[CLS]^T and h_[CLS]^S respectively. For the fusion output part, the l-th layer representation H^l of the GCN features is already a weighted sum over the node features, so its pooled result h^G is taken as the classification feature. The output feature can be expressed as:
h^G = Pool(H^l)    (11)
O = Concat(h_[CLS]^T, h_[CLS]^S, h^G)    (12)
the GCN output characteristics pass through a layer of feedforward neural network to finish classification tasks, and the specific formula is as follows:
p(y|O) = softmax(W^T O)    (13)
where W is a trainable weight matrix, and a cross-entropy loss function is used to compute the loss value Loss.
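Finally, a sketch of the output layer of formulas (11)–(13); mean pooling of the GCN output and a three-class head (positive, neutral, negative) are assumptions, and nn.CrossEntropyLoss combines the softmax of formula (13) with the cross-entropy loss described in the text.

```python
import torch
import torch.nn as nn


class SentimentHead(nn.Module):
    """Formulas (11)-(13): pool, concatenate, classify, cross-entropy loss."""

    def __init__(self, d_model=768, n_classes=3):
        super().__init__()
        self.classifier = nn.Linear(3 * d_model, n_classes)
        self.loss_fn = nn.CrossEntropyLoss()  # applies the softmax of formula (13) internally

    def forward(self, h_cls_aspect, h_cls_sentence, H_gcn, labels=None):
        h_fuse = H_gcn.mean(dim=1)                                     # pooled GCN fusion feature
        O = torch.cat([h_cls_aspect, h_cls_sentence, h_fuse], dim=-1)  # concatenated output feature
        logits = self.classifier(O)                                    # scores for the emotion classes
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return logits, loss


head = SentimentHead()
logits, loss = head(torch.randn(1, 768), torch.randn(1, 768),
                    torch.randn(1, 30, 768), labels=torch.tensor([2]))
```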
Experimental simulation
Table 1 Model performance comparison
As can be seen from Table 1, compared with other models, the proposed model obtains the best experimental results in classification accuracy and macro-averaged F1. Compared with the second-best TomBERT network model, it improves classification accuracy and macro-averaged F1 score by 1.35% and 1.25% respectively on TWITTER-2015, and by 1.38% and 1.53% respectively on TWITTER-2017. This is because the proposed method, by fusing the aspect-term features with the text features, extracts the deep semantic associations in the images that are related to the aspect terms, and further improves classification performance through the feature fusion capability of the graph neural network. The TomBERT model uses BERT to extract visual representations that are sensitive to aspect terms but does not adopt an effective feature fusion method. The MIMN model uses a multi-hop memory network to extract features; although it achieves multi-hop fusion of the bimodal features, it does not deeply extract the interactive features between text content and visual information.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.
Claims (7)
1. An aspect-level multi-mode emotion analysis method based on a dual-channel and attention mechanism is characterized by comprising the following steps of:
step 1: extracting hidden characteristic representations of sentences and aspect words in the data set by using a Bert pre-training encoder, and extracting picture characteristics by using a ResNet-152 pre-training network; an aspect word is a subsequence that belongs to a sentence;
step 2: calculating the feature correlation of sentence features and aspect word features through a multi-head attention mechanism, so that corresponding attention weighting is obtained between the high-similarity features; finally obtaining aspect word features guided by the text and text features guided by the aspect words;
step 3: weighting the original image characteristics by using the aspect word characteristics guided by the text, and obtaining the image channel characteristics through a channel attention mechanism;
step 4: weighting the image channel characteristics by using the text characteristics guided by the aspect words, and generating a spatial attention pattern by using the spatial relation of the characteristics in a spatial attention mechanism to obtain final characteristic representation of the image;
step 5: calculating a dynamic adjacency matrix by text features guided by aspect words and image features generated by channel attention and spatial attention; obtaining a final fusion feature representation using the aggregate capabilities and the messaging capabilities of the graph neural network;
step 6: and classifying the final fusion features, aspect word features and sentence features by using a pooling mechanism through a classification module.
2. The method for analyzing the emotion of the aspect level multimode based on the dual-channel and the attention mechanism according to claim 1, wherein the step 1 extracts the hidden characteristic representation from the sentence characteristics and the aspect word characteristics in the data set by using a Bert pre-training coder, and extracts the picture characteristics by using a ResNet-152 pre-training network, specifically:
outputting text and aspect-word feature information through two Bert-based pre-trained text feature encoders; extracting image features with a pre-trained ResNet network; adopting pre-trained models provides better initialization parameters for the model, gives it better generalization after fine-tuning on the target task and accelerates model convergence; the Bert pre-trained model yields sentence features H_S = Bert(E_S), H_S ∈ R^(t×d), and aspect-word features H_T = Bert(E_T), H_T ∈ R^(t×d), where t denotes the text length and d denotes the output feature dimension; the image features are denoted H_I = ResNet(E_I), H_I ∈ R^(c×w×h), where ResNet denotes the ResNet-152 model, c denotes the number of channels of the image feature, and w and h denote the width and height of the feature, respectively; E_T, E_S, E_I denote the original aspect words, sentence and image; H_T, H_S, H_I denote the aspect-word, sentence and image features extracted through the pre-trained networks.
3. The method for analyzing the emotion in multiple modes in aspect level based on dual channel and attention mechanism according to claim 1, wherein the step 2 adopts a multi-head attention mechanism, and fuses the related information between the aspect word characteristics and the sentence characteristics, and the specific method is as follows:
in order to obtain the interactive features between sentences and aspect words, a multi-head attention mechanism is adopted to calculate the similarity between the two, which effectively realizes feature fusion between them; the expressions are as follows:
MHA(Q, K, V) = softmax(QK^T / √d_k)V    (1)
H^(i) = LayerNorm(MHA(Q, K, V) + Q)    (2)
Y = LayerNorm(GELU(H^(i) W_1^T) W_2^T + H^(i))    (3)
where MHA denotes the multi-head attention mechanism, Q, K, V denote the input features, d_k is the scaling factor, H^(i) denotes the output of the i-th layer in the Transformer, LayerNorm denotes layer normalization, GELU is the activation function, and W_1^T, W_2^T denote trainable parameter matrices;
the aspect-word features and the sentence features are used in turn as the query matrix Q to compute the aspect-word features Y_T guided by the sentence features and the sentence features Y_S guided by the aspect-word features.
4. The method for analyzing the emotion in multiple modes in aspect level based on dual channel and attention mechanism according to claim 1, wherein said step 3 weights the original image features by using the aspect word features guided by the text, and obtains the image channel features by the channel attention mechanism, specifically comprising:
in order to introduce aspect word features into an image, the aspect word features and the image features are fused through a multi-head self-attention mechanism, and the specific formula is as follows:
H_ca = MHA(H_I, Y_T, Y_T)    (4)
M_CH = σ(MLP(AvgPool(H_ca)) + MLP(MaxPool(H_ca)))    (5)
where, in the channel attention mechanism, the input H_ca is the aspect-word-guided image feature obtained through the multi-head attention mechanism, MLP denotes a multi-layer perceptron, AvgPool denotes average pooling, MaxPool denotes max pooling, σ denotes the ReLU activation function, and M_CH denotes the output of the channel attention.
5. The method for analyzing the emotion of the aspect level multimode based on the dual-channel and attention mechanism according to claim 1, wherein the step 4 is as follows: the text feature guided by the aspect words is used for weighting the image channel feature, in a spatial attention mechanism, a spatial attention pattern is generated by using the spatial relation of the feature, and the final feature representation of the image is obtained, wherein the method comprises the following specific steps:
through a multi-head attention mechanism, the sentence features guided by the aspect words are used to weight the image features output by the channel attention mechanism, highlighting in the spatial attention mechanism the regions of the image features related to the aspect-word emotion; the specific formulas are:
H_sa = MHA(M_CH, Y_S, Y_S)    (6)
M_SP = σ(Conv(Concat(AvgPool(H_sa); MaxPool(H_sa))))    (7)
equation (7) gives the implementation details of the spatial attention mechanism, where Concat denotes matrix concatenation, Conv denotes a convolution operation, and σ denotes the ReLU activation function; H_sa denotes the image features guided by the sentence features via the multi-head attention mechanism, and M_SP denotes the output of the spatial attention.
6. The method for analyzing the emotion of the aspect level multimode based on the dual-channel and attention mechanism according to claim 5, wherein the step 5 calculates a dynamic adjacency matrix by text features guided by aspect words and image features generated by channel attention and spatial attention; the aggregation capability and the message transmission capability of the graph neural network are used to obtain a final fusion characteristic representation, which specifically comprises the following steps:
the sentence features and the image features are concatenated, and an attention matrix is obtained through a self-attention mechanism and used as the adjacency matrix of the GCN; first, the attention matrix can capture the correlated features between the sentence and image features, which makes the adjacency matrix more flexible, and second, it can adaptively adjust the importance of similar features between the sentence and the image; in the GCN, a graph G = {V, A} is given, where V is the set of all nodes in the graph, corresponding to the concatenated sentence and image features, and A is the adjacency matrix between all nodes, with the weight A_ij depending on the similarity between nodes;
H_att = Concat(Y_S, M_SP)    (8)
A = MHA(H_att, H_att, H_att)    (9)
where H_att denotes the concatenation of the channel/spatial attention output M_SP and the aspect-word-guided sentence features Y_S, h_i^l is the feature output of node v_i at layer l, W^l is the trainable weight matrix of GCN layer l, and σ is the ReLU activation function; since the GCN performs feature extraction and encoding between the associated nodes, the output H^l of all nodes at layer l is expressed as:
H^l = σ(A H^(l-1) W^l)    (10)
where n denotes the number of nodes.
7. The method for analyzing the emotion of the aspect level multimode based on the dual-channel and attention mechanism according to claim 6, wherein the following step 6: the final fusion features, aspect word features and sentence features are classified by using a pooling mechanism through a classification module, and the specific steps are as follows:
for the aspect-word features and sentence features, since the [CLS] tag is added when the features are extracted with the pre-trained model, the final hidden state of this tag is taken as the collective representation of the aspect-word and sentence features, denoted h_[CLS]^T and h_[CLS]^S respectively; for the fusion output part, the l-th layer representation H^l of the GCN features is already a weighted sum over the node features, so its pooled result h^G is taken as the classification feature; the total output feature O after pooling and concatenation can be expressed as:
h^G = Pool(H^l)    (11)
O = Concat(h_[CLS]^T, h_[CLS]^S, h^G)    (12)
in the classification phase:
p(y|O) = softmax(W^T O)    (13)
where W is a trainable weight matrix, and a cross-entropy loss function is used to compute the loss value Loss, with D and y^(j) denoting the number of training samples and the ground-truth label of the j-th sample, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310273760.9A CN116662924A (en) | 2023-03-20 | 2023-03-20 | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310273760.9A CN116662924A (en) | 2023-03-20 | 2023-03-20 | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116662924A true CN116662924A (en) | 2023-08-29 |
Family
ID=87708608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310273760.9A Pending CN116662924A (en) | 2023-03-20 | 2023-03-20 | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116662924A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117395164A (en) * | 2023-12-12 | 2024-01-12 | 烟台大学 | Network attribute prediction method and system for industrial Internet of things |
CN117395164B (en) * | 2023-12-12 | 2024-03-26 | 烟台大学 | Network attribute prediction method and system for industrial Internet of things |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
CN108733792B (en) | Entity relation extraction method | |
CN109918671A (en) | Electronic health record entity relation extraction method based on convolution loop neural network | |
Sharma et al. | A survey of methods, datasets and evaluation metrics for visual question answering | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN111985205A (en) | Aspect level emotion classification model | |
CN113705238A (en) | Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model | |
CN114896434B (en) | Hash code generation method and device based on center similarity learning | |
CN116975350A (en) | Image-text retrieval method, device, equipment and storage medium | |
CN114004220A (en) | Text emotion reason identification method based on CPC-ANN | |
Ishmam et al. | From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities | |
CN116662500A (en) | Method for constructing question-answering system based on BERT model and external knowledge graph | |
CN116187349A (en) | Visual question-answering method based on scene graph relation information enhancement | |
CN116701996A (en) | Multi-modal emotion analysis method, system, equipment and medium based on multiple loss functions | |
CN116522945A (en) | Model and method for identifying named entities in food safety field | |
CN112733764A (en) | Method for recognizing video emotion information based on multiple modes | |
CN111930981A (en) | Data processing method for sketch retrieval | |
CN116258147A (en) | Multimode comment emotion analysis method and system based on heterogram convolution | |
CN115730232A (en) | Topic-correlation-based heterogeneous graph neural network cross-language text classification method | |
CN116662924A (en) | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism | |
CN114048314A (en) | Natural language steganalysis method | |
CN113642630A (en) | Image description method and system based on dual-path characteristic encoder | |
CN118364111A (en) | Personality detection method based on text enhancement of large language model | |
Meng et al. | Regional bullying text recognition based on two-branch parallel neural networks | |
CN117093692A (en) | Multi-granularity image-text matching method and system based on depth fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||