CN116662924A - Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism - Google Patents
Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
- Publication number
- CN116662924A CN116662924A CN202310273760.9A CN202310273760A CN116662924A CN 116662924 A CN116662924 A CN 116662924A CN 202310273760 A CN202310273760 A CN 202310273760A CN 116662924 A CN116662924 A CN 116662924A
- Authority
- CN
- China
- Prior art keywords
- features
- attention mechanism
- image
- channel
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an aspect-level multimodal emotion analysis method based on a dual-channel and attention mechanism. Built on neural networks, the method extracts the emotion information contained in image features at multiple scales by combining aspect-word features with text features, and introduces a GCN (Graph Convolutional Network) into the aspect-level multimodal emotion analysis task, greatly improving the feature extraction and interactive fusion capability of the model. In the feature extraction layer, pre-trained encoders are used to extract aspect-word, text and image features; after the aspect-word and sentence features are fused bidirectionally in the attention mechanism layer, the final aspect-word and sentence feature representations are obtained. The image features are processed by an image feature extraction network built from a channel attention mechanism and a spatial attention mechanism, and finally the interactive fusion features of all modalities are dynamically extracted by the GCN module. In experiments, the performance indices of attention-based aspect-level multimodal emotion analysis on the datasets are improved.
Description
Technical Field
The invention belongs to the fields of natural language processing and emotion analysis, and in particular relates to an aspect-level multimodal emotion analysis method based on a dual-channel and attention mechanism.
Background
In recent years, the content published by users on various online platforms has grown rapidly. How to use artificial intelligence and other related technologies to mine the emotional tendency toward a certain aspect contained in this content has become a research hotspot.
Emotion expresses a person's attitude toward an objective thing, and emotional tendencies are usually conveyed in various ways such as body language, facial expressions and spoken or written words. Emotion analysis (Sentiment Analysis, SA), also known as Opinion Mining (OM), aims to extract opinions from large amounts of unstructured text and classify them as positive, neutral or negative emotion polarity. In the Internet age, social platforms such as Weibo, Zhihu and WeChat have flourished, and text and images have gradually become the main carriers through which users convey opinions and emotions about target aspects or entities in the online world. Aspect-based emotion analysis has therefore received extensive attention in academia and industry over the last decade.
Early work usually generated text features with machine-learning methods such as emotion dictionaries, dependency relations and statistical methods, but these traditional methods require considerable manual effort for feature selection and extraction, lack the association between aspect words and their sentence context, and show poor transferability and robustness. The success of deep learning in various natural language processing tasks has promoted the application of neural networks to aspect-level emotion analysis. By using various deep neural network models to learn and extract the feature correlations between aspect words and sentence contexts, model performance has gradually improved. Many deep network models, such as convolutional neural networks (Convolutional Neural Network, CNN), recurrent neural networks (Recurrent Neural Network, RNN), graph neural networks (Graph Neural Network, GNN) and attention mechanisms (Attention Mechanism), have been proposed, and text-based aspect-level emotion analysis has developed further.
As the content of many online platforms becomes more and more multimodal, predicting the emotion polarity of a target from information in other modalities is receiving increasing attention from researchers, and the achievements of deep learning in the image processing field provide a theoretical basis for aspect-level multimodal emotion analysis. Xu et al. first introduced image modality information into aspect-level emotion analysis, extracting image features with a CNN and text features with a Long Short-Term Memory (LSTM) network, and verified the feasibility of the proposed method with an interactive attention mechanism. Gu et al. then encoded text semantic information with a Bidirectional Gated Recurrent Unit (BiGRU) network and a multi-head self-attention mechanism, extracted image features with a ResNet-152 model and a capsule network, and used a multi-head attention network for multimodal interactive fusion, maximizing the contribution of each modality to emotion transmission and improving network performance. Yu et al. proposed a hierarchical interaction module to model the pairwise interactions between given aspect words, text information and image information and, to bridge the semantic gap between text features and image features, further proposed an auxiliary reconstruction module based on the auto-encoder idea, which improves model performance. However, existing models still have shortcomings: 1) channel information and spatial information in the image are not fully extracted during image feature extraction, so the emotion information in the image cannot be effectively combined with the aspect-word information; 2) information between modalities is not effectively fused, so model performance remains unsatisfactory. This work therefore focuses on the aspect-level multimodal emotion analysis task and presents a more effective model.
CN114936623A, an aspect-level emotion analysis method integrating multimodal data, first performs data preprocessing, adjusting text and image formats to fit the input requirements of the neural network; second, it extracts text features with a Bi-LSTM after word embedding and image features with a ResNet50 network; it then performs multimodal aspect extraction and alignment, extracting aspect terms from the text with a sequence-labelling method and implicitly aligning image regions with aspect words through a memory network augmented with attention and point-wise convolution operations; next, based on position-attentive text features and Gaussian modelling of explicit context positions, a memory network extracts aspect-term-sensitive text representations; multimodal data fusion is then carried out with a fusion discrimination matrix; finally, emotion classification is performed using the fused feature information. This method uses multimodal data for aspect-level emotion analysis, extracts complementary multimodal information and improves the accuracy of the emotion analysis task.
However, using averaged aspect-word vectors easily causes word-sense confusion and hinders the interaction of aspect words with sentence and image features. Moreover, during image feature extraction it ignores the auxiliary role that the semantic information of the sentence context can play. The above method is therefore limited in its ability to fuse multimodal data.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. An aspect-level multi-mode emotion analysis method based on a dual-channel and attention mechanism is provided. The technical scheme of the invention is as follows:
an aspect-level multi-modal emotion analysis method based on a dual-channel and attention mechanism comprises the following steps:
step 1: extracting hidden characteristic representations from sentence characteristics and aspect word characteristics in a data set by using a Bert pre-training encoder, and extracting picture characteristics by using a ResNet-152 pre-training network; an aspect word is a subsequence that belongs to a sentence;
step 2: calculating the feature correlation of sentence features and aspect word features through a multi-head attention mechanism, so that corresponding attention weighting is obtained between the high-similarity features; finally obtaining aspect word features guided by the text and text features guided by the aspect words;
step 3: weighting the original image characteristics by using the aspect word characteristics guided by the text, and obtaining the image channel characteristics through a channel attention mechanism;
step 4: weighting the image channel characteristics by using the text characteristics guided by the aspect words, and generating a spatial attention pattern by using the spatial relation of the characteristics in a spatial attention mechanism to obtain final characteristic representation of the image;
step 5: calculating a dynamic adjacency matrix by text features guided by aspect words and image features generated by channel attention and spatial attention; obtaining a final fusion feature representation using the aggregate capabilities and the messaging capabilities of the graph neural network;
step 6: and classifying the final fusion features, aspect word features and sentence features by using a pooling mechanism through a classification module.
Further, step 1 extracts hidden feature representations from sentence features and aspect word features in the dataset by using a Bert pre-training encoder, and extracts picture features by using a ResNet-152 pre-training network, specifically:
outputting text and aspect-word feature information through two Bert-based pre-trained text feature encoders; extracting image features with a pre-trained ResNet network; adopting pre-trained models provides better initialization parameters for the model, gives it better generalization after fine-tuning on the target task, and accelerates model convergence. The Bert pre-trained model yields sentence features H_S = Bert(E_S), H_S ∈ R^(t×d), and aspect-word features H_T = Bert(E_T), H_T ∈ R^(t×d), where t denotes the text length and d denotes the output feature dimension. The image features are denoted H_I = ResNet(E_I), H_I ∈ R^(c×w×h), where ResNet denotes the ResNet-152 model, c denotes the number of channels of the image feature, and w and h denote the width and height of the feature, respectively. E_T, E_S, E_I denote the original aspect words, sentence and image; H_T, H_S, H_I denote the aspect-word, sentence and image features extracted through the pre-trained networks.
Further, the step 2 adopts a multi-head attention mechanism to fuse related information between aspect word characteristics and sentence characteristics, and the specific method is as follows:
in order to obtain the interactive features between sentences and aspect words, a multi-head attention mechanism is adopted to calculate the similarity between the two, which effectively realizes feature fusion between them; the expressions are as follows:
MHA(Q, K, V) = softmax(QK^T / √d_k)V    (1)
H^(i) = LayerNorm(MHA(Q, K, V) + Q)    (2)
Y = LayerNorm(GELU(H^(i) W_1^T) W_2^T + H^(i))    (3)
where MHA denotes the multi-head attention mechanism, Q, K, V denote the input features, d_k is the scaling factor, H^(i) denotes the output of the i-th layer in the Transformer, LayerNorm denotes layer normalization, GELU is the activation function, and W_1^T, W_2^T denote trainable parameter matrices.
The aspect-word features and the sentence features are used in turn as the query matrix Q to compute the aspect-word features Y_T guided by the sentence features and the sentence features Y_S guided by the aspect-word features.
Further, the step 3 weights the original image features by using the aspect word features guided by the text, and obtains the image channel features through a channel attention mechanism, and the specific method is as follows:
in order to introduce aspect word features into an image, the aspect word features and the image features are fused through a multi-head self-attention mechanism, and the specific formula is as follows:
H_ca = MHA(H_I, Y_T, Y_T)    (4)
M_CH = σ(MLP(AvgPool(H_ca)) + MLP(MaxPool(H_ca)))    (5)
where, in the channel attention mechanism, the input H_ca is the aspect-word-guided image feature obtained through the multi-head attention mechanism, MLP denotes a multi-layer perceptron, AvgPool denotes average pooling, MaxPool denotes max pooling, σ denotes the ReLU activation function, and M_CH denotes the output of the channel attention.
Further, the step 4: the text feature guided by the aspect words is used for weighting the image channel feature, in a spatial attention mechanism, a spatial attention pattern is generated by using the spatial relation of the feature, and the final feature representation of the image is obtained, wherein the method comprises the following specific steps:
through a multi-head attention mechanism, the sentence features guided by the aspect words are used to weight the image features output by the channel attention mechanism, highlighting in the spatial attention mechanism the regions of the image features related to the aspect-word emotion; the specific formulas are:
H_sa = MHA(M_CH, Y_S, Y_S)    (6)
M_SP = σ(Conv(Concat(AvgPool(H_sa); MaxPool(H_sa))))    (7)
Equation (7) gives the implementation details of the spatial attention mechanism, where Concat denotes matrix concatenation, Conv denotes a convolution operation, and σ denotes the ReLU activation function. H_sa denotes the image features guided by the sentence features via the multi-head attention mechanism, and M_SP denotes the output of the spatial attention.
Further, the step 5 calculates a dynamic adjacency matrix by the text features guided by the aspect words and the image features generated by the channel attention and the space attention; the aggregation capability and the message transmission capability of the graph neural network are used to obtain a final fusion characteristic representation, which specifically comprises the following steps:
the sentence features and the image features are concatenated, and an attention matrix is obtained through a self-attention mechanism and used as the adjacency matrix of the GCN; first, the attention matrix can capture the correlated features between the sentence and image features, which makes the adjacency matrix more flexible, and second, it can adaptively adjust the importance of similar features between the sentence and the image. In the GCN, a graph G = {V, A} is given, where V is the set of all nodes in the graph, corresponding to the concatenated sentence and image features, and A is the adjacency matrix between all nodes; the weight A_ij depends on the similarity between nodes;
H_att = Concat(Y_S, M_SP)    (8)
A = MHA(H_att, H_att, H_att)    (9)
where H_att denotes the concatenation of the channel/spatial attention output M_SP and the aspect-word-guided sentence features Y_S, h_i^l is the feature output of node v_i at layer l, W^l is the trainable weight matrix of GCN layer l, and σ is the ReLU activation function. Since the GCN performs feature extraction and encoding between the associated nodes, the output H^l of all nodes at layer l is expressed as:
H^l = σ(A H^(l-1) W^l)    (10)
where n denotes the number of nodes.
Further, the step 6: the final fusion features, aspect word features and sentence features are classified by using a pooling mechanism through a classification module, and the specific steps are as follows:
for the aspect-word features and sentence features, since the [CLS] tag is added when the features are extracted with the pre-trained model, the final hidden state of this tag is taken as the collective representation of the aspect-word and sentence features, denoted h_[CLS]^T and h_[CLS]^S respectively. For the fusion output part, the l-th layer representation H^l of the GCN features is already a weighted sum over the node features, so its pooled result h^G is taken as the classification feature. The total output feature O after pooling and concatenation can be expressed as:
h^G = Pool(H^l)    (11)
O = Concat(h_[CLS]^T, h_[CLS]^S, h^G)    (12)
in the classification phase:
p(y|O) = softmax(W^T O)    (13)
where W is a trainable weight matrix; a cross-entropy loss function is used to compute the loss value Loss, with D and y^(j) denoting the number of training samples and the ground-truth label of the j-th sample, respectively.
The invention has the advantages and beneficial effects as follows:
the advantage of the invention is mainly that in step 3 of claim 1, the channels in the image can be regarded as feature extractors, the channel attention being directed to extracting important features in the image channels that are relevant to the aspect words. In order to integrate aspect word information into image channel features, both features are interacted with using a multi-head attention mechanism before channel attention, so that the guiding function of the aspect words in the channel attention is conveniently exerted. Then in step 4, the spatial attention mainly extracts the region features related to the aspect words in the image, and because the sentence features also have the region association related to the aspect words, the sentence features are introduced into the spatial attention mechanism, so that the spatial attention is guided to extract the region features related to the aspect words in the image. And gradually extracting deep features in the image, and enhancing contribution of the image features to emotion classification in subsequent multi-mode fusion.
Drawings
FIG. 1 is a flowchart of an aspect-level multi-modal emotion analysis method based on a dual channel and attention mechanism in accordance with a preferred embodiment of the present invention.
FIG. 2 is a framework diagram of an aspect-level multimodal emotion analysis model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
fig. 1 is a general flow chart of an aspect-level multi-modal emotion analysis method based on a dual-channel and attention mechanism according to the present invention, and is further described below with reference to fig. 1. The invention mainly comprises the following steps:
step 1: extracting hidden feature representations from sentence features and aspect word features in the dataset using a Bert pre-training encoder. The picture features are extracted using a ResNet-152 pre-training network.
Step 2: and calculating the feature correlation of the sentence features and the aspect word features through a multi-head attention mechanism, so that corresponding attention weighting is obtained between the features with high similarity. Finally, the aspect word features guided by the text and the text features guided by the aspect words are obtained.
Step 3: and weighting the original image characteristics by using the aspect word characteristics guided by the text, and obtaining the image channel characteristics through a channel attention mechanism.
Step 4: and weighting the image channel characteristics by using the text characteristics guided by the aspect words, and generating a spatial attention pattern by using the spatial relationship of the characteristics to obtain the final characteristic representation of the image.
Step 5: the text features guided by the aspect words and the image features generated by the channel attention and the spatial attention calculate a dynamic adjacency matrix. The aggregate and messaging capabilities of the graph neural network are used to derive a final fused feature representation.
Step 6: the final fused feature representation is classified by the classification module using a pooling mechanism.
FIG. 2 is a framework diagram of the aspect-level multimodal emotion analysis model, and the structural principle of the invention is further described with reference to FIG. 2. The model of the method of the invention has four layers, and the specific content of each layer is as follows:
(1) Modal feature extraction layer
A set of multimodal samples is known, which contains a sentence S = {w_1, w_2, …, w_n} of n words and an associated image I, as well as an aspect-word subsequence T of S. The aspect word T is also associated with an emotion label y. The invention feeds the sentence E_S and the aspect words E_T into two Bert encoders, respectively, to extract features, and the image E_I into the ResNet-152 network. For the input part of the Bert text encoder, the tag [CLS] is added at the beginning of the text and the tag [SEP] at the end of the text; the Bert pre-trained model finally yields the sentence features H_S ∈ R^(t×d) and the aspect-word features H_T ∈ R^(t×d), where t denotes the aspect-word or text length and d denotes the feature dimension. The image features are expressed as H_I ∈ R^(c×w×h), where c denotes the number of channels of the image feature and w and h denote the width and height of the feature, respectively.
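As a minimal illustration of this feature extraction layer (not the patented implementation itself), the following PyTorch sketch assumes the HuggingFace transformers and torchvision packages, the bert-base-uncased checkpoint and 224×224 input images; these concrete choices are assumptions rather than details given in the text.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchvision import models


class ModalFeatureExtractor(nn.Module):
    """Extracts H_S, H_T (BERT) and H_I (ResNet-152 convolutional trunk)."""

    def __init__(self, bert_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert_sentence = BertModel.from_pretrained(bert_name)  # sentence encoder
        self.bert_aspect = BertModel.from_pretrained(bert_name)    # aspect-word encoder
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        # drop avgpool/fc so the output stays a c x w x h map (2048 x 7 x 7 for 224x224 input)
        self.resnet_trunk = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, sentences, aspects, images):
        # the tokenizer adds [CLS] at the start and [SEP] at the end automatically
        s = self.tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
        a = self.tokenizer(aspects, return_tensors="pt", padding=True, truncation=True)
        H_S = self.bert_sentence(**s).last_hidden_state  # (batch, t, d)
        H_T = self.bert_aspect(**a).last_hidden_state    # (batch, t_a, d)
        H_I = self.resnet_trunk(images)                  # (batch, 2048, 7, 7)
        return H_S, H_T, H_I


# usage sketch
extractor = ModalFeatureExtractor()
H_S, H_T, H_I = extractor(["the pizza was great but service was slow"],
                          ["pizza"], torch.randn(1, 3, 224, 224))
```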
(2) Attention mechanism layer
In order to further extract the relevance between features and the modal interaction between text words and image features, the model adopts a multi-head attention mechanism to extract the potential relevance features between aspect word features and sentence features. In the image feature extraction, the channel attention mechanism fusing aspect word features and the space attention mechanism fusing text features are used, the image mode features are fused with the text mode features, and features which are in different scales and are related to the aspect words in the image mode are extracted, so that the GCN network can identify the adjacent relation between important nodes. The multi-mode characteristics are extracted more deeply through a message passing and aggregation mechanism of the GCN network.
1) Sentence and aspect word feature interactions
In order to obtain interactive features between sentences and aspect words, the associated information among different features is strengthened, and redundant information is filtered. The invention adopts a multi-head attention mechanism to calculate the similarity between the features of the two, and can effectively realize the feature fusion between the two. The expression of the multi-head attention mechanism is:
MHA(Q, K, V) = softmax(QK^T / √d_k)V    (1)
where T denotes the matrix transpose and MHA denotes the multi-head attention mechanism, which consists of three parts: Query (Q), Key (K) and Value (V); the attention values generated by the dot-product interaction between Q and K are mapped onto V. The scaling factor d_k is the feature dimension of each attention head.
The aspect word features and sentence features can obtain the fused output features of the aspect word features and the sentence features through a multi-head attention mechanism, and the aspect word features guided by the sentence features are subjected to linear transformation and residual connection to obtain the final output features, wherein the specific formula is as follows:
H^(i) = LayerNorm(MHA(Q, K, V) + Q)    (2)
Y = LayerNorm(GELU(H^(i) W_1^T) W_2^T + H^(i))    (3)
where LayerNorm denotes layer normalization, which keeps the distribution of the data features stable and accelerates model training, GELU denotes the activation function, and W_1^T, W_2^T denote trainable weight parameters. In the multi-head attention mechanism, the aspect-word features and the sentence features are used in turn as the query matrix Q, so that the aspect-word features Y_T guided by the sentence features and the sentence features Y_S guided by the aspect-word features can be calculated.
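A hedged sketch of formulas (1)–(3) in PyTorch is given below: cross multi-head attention followed by a GELU feed-forward block, each with residual connection and LayerNorm. The hidden size of 768, 8 attention heads, feed-forward width of 2048 and the shared weights for the two fusion directions are illustrative assumptions, not values specified in the text.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Cross multi-head attention + GELU feed-forward, each with residual + LayerNorm."""

    def __init__(self, d_model=768, n_heads=8, d_ff=2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, query, key_value):
        # attention weights follow the similarity between query and key_value tokens
        attn_out, _ = self.mha(query, key_value, key_value)
        h = self.norm1(query + attn_out)    # residual connection + LayerNorm
        return self.norm2(h + self.ffn(h))  # feed-forward + residual + LayerNorm


fusion = CrossModalFusion()
H_T, H_S = torch.randn(1, 4, 768), torch.randn(1, 20, 768)
Y_T = fusion(H_T, H_S)  # aspect-word features guided by the sentence
Y_S = fusion(H_S, H_T)  # sentence features guided by the aspect words
```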
2) Channel attention mechanism
Since each channel of image features can be regarded as a feature detector, in the channel attention mechanism, the features after average pooling and maximum pooling can be extracted through a layer of feedforward neural network, and important features related to aspect words in each channel of the image can be extracted. In order to identify emotion distribution related to an aspect word contained in a channel in an image channel, the invention introduces the aspect word feature in a channel attention mechanism, firstly, the aspect word feature and the image feature are fused through a multi-head attention mechanism, and the fused feature is used as input of the channel attention mechanism. The specific formula is as follows:
H_ca = MHA(H_I, Y_T, Y_T)    (4)
M_CH = σ(MLP(AvgPool(H_ca)) + MLP(MaxPool(H_ca)))    (5)
In the channel attention mechanism, the input H_ca is the aspect-word-guided image feature produced by the multi-head attention mechanism; the complete output also undergoes the linear transformation and residual connection shown in formulas (2) and (3), and H_ca is used here simply to keep the description concise. Equation (5) gives the implementation details of the channel attention mechanism, where MLP denotes a multi-layer perceptron containing the trainable weight parameters of the neural network, AvgPool denotes average pooling, MaxPool denotes max pooling, and σ denotes the ReLU activation function.
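The channel attention of formulas (4)–(5) could be sketched as follows; the projection of the 2048-channel ResNet map to the text dimension, the reduction ratio of 16 and the use of the gate to re-weight the feature map are assumptions introduced for illustration, and the ReLU gate follows the text (a sigmoid gate, as in CBAM, would be a common alternative).

```python
import torch
import torch.nn as nn


class AspectGuidedChannelAttention(nn.Module):
    """Formulas (4)-(5): aspect-guided cross-attention followed by channel attention."""

    def __init__(self, channels=2048, d_model=768, reduction=16):
        super().__init__()
        self.to_text_dim = nn.Linear(channels, d_model)  # image tokens -> text dimension
        self.to_img_dim = nn.Linear(d_model, channels)
        self.mha = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))

    def forward(self, H_I, Y_T):
        b, c, w, h = H_I.shape
        tokens = self.to_text_dim(H_I.flatten(2).transpose(1, 2))  # (b, w*h, d)
        fused, _ = self.mha(tokens, Y_T, Y_T)                      # aspect-guided image tokens (formula 4)
        H_ca = self.to_img_dim(fused).transpose(1, 2).reshape(b, c, w, h)
        avg = H_ca.mean(dim=(2, 3))                                # global average pooling over space
        mx = H_ca.amax(dim=(2, 3))                                 # global max pooling over space
        # ReLU gate as stated in the text
        M_CH = torch.relu(self.mlp(avg) + self.mlp(mx))            # channel weights (formula 5)
        return H_ca * M_CH[:, :, None, None]                       # channel-re-weighted image features
```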
3) Spatial attention mechanism
In the spatial attention mechanism, the spatial relationships of the features can be used to learn the distribution regions of the important emotion features related to the aspect words. In the sentence and aspect-word interaction stage, the text features guided by the aspect words already encode the positions of the important emotion features for the aspect words in the sentence. To enhance the extraction of image feature regions, the invention fuses the aspect-word-guided sentence features with the image features output by the channel attention mechanism, and then learns the distribution regions of the important emotion features in the spatial attention mechanism. The specific formulas are as follows:
H_sa = MHA(M_CH, Y_S, Y_S)    (6)
M_SP = σ(Conv(Concat(AvgPool(H_sa); MaxPool(H_sa))))    (7)
Equation (7) gives the implementation details of the spatial attention mechanism, where Concat denotes matrix concatenation, Conv denotes a convolution operation, and σ denotes the ReLU activation function.
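A corresponding sketch of the sentence-guided spatial attention of formulas (6)–(7), assuming a CBAM-style 7×7 convolution over the channel-pooled maps and the same illustrative dimensions as above:

```python
import torch
import torch.nn as nn


class SentenceGuidedSpatialAttention(nn.Module):
    """Formulas (6)-(7): sentence-guided cross-attention followed by spatial attention."""

    def __init__(self, channels=2048, d_model=768):
        super().__init__()
        self.to_text_dim = nn.Linear(channels, d_model)
        self.to_img_dim = nn.Linear(d_model, channels)
        self.mha = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # CBAM-style spatial gate (assumed size)

    def forward(self, M_CH, Y_S):
        b, c, w, h = M_CH.shape
        tokens = self.to_text_dim(M_CH.flatten(2).transpose(1, 2))  # (b, w*h, d)
        fused, _ = self.mha(tokens, Y_S, Y_S)                       # sentence-guided image tokens (formula 6)
        H_sa = self.to_img_dim(fused).transpose(1, 2).reshape(b, c, w, h)
        avg = H_sa.mean(dim=1, keepdim=True)                        # average pooling over channels
        mx = H_sa.amax(dim=1, keepdim=True)                         # max pooling over channels
        gate = torch.relu(self.conv(torch.cat([avg, mx], dim=1)))   # spatial weights (formula 7)
        return H_sa * gate                                          # spatially re-weighted features M_SP
```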
(3) GCN feature fusion layer
Sentence features are concatenated with image features, and an attention matrix is obtained through a self-attention mechanism and used as the adjacency matrix of the GCN. First, the attention matrix can learn the correlated features between the sentence and image features, which makes the adjacency matrix more flexible; second, it can adaptively adjust the importance of similar features between the sentence features and the image features. In the GCN, a graph G = {V, A} is given, where V is the set of all nodes in the graph, corresponding to the concatenated sentence and image features, and A is the adjacency matrix between all nodes; the weight A_ij depends on the similarity between nodes.
H_att = Concat(Y_S, M_SP)    (8)
A = MHA(H_att, H_att, H_att)    (9)
where h_i^l is the feature output of node v_i at layer l, W^l is the trainable weight matrix of GCN layer l, and σ is the ReLU activation function. Since the GCN performs feature extraction and encoding between the associated nodes, the output H^l of all nodes at layer l is expressed as:
H^l = σ(A H^(l-1) W^l)    (10)
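A sketch of the GCN fusion of formulas (8)–(10) follows; it assumes the spatial-attention output M_SP has already been flattened and projected to the text dimension so that sentence tokens and image regions can be concatenated as graph nodes, and it reuses the head-averaged self-attention weights as the adjacency matrix A.

```python
import torch
import torch.nn as nn


class DynamicGCNFusion(nn.Module):
    """Formulas (8)-(10): self-attention weights reused as the GCN adjacency matrix."""

    def __init__(self, d_model=768, n_layers=2):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.gcn_weights = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))

    def forward(self, Y_S, M_SP_tokens):
        # nodes: sentence tokens followed by image-region tokens (both already of dimension d_model)
        H_att = torch.cat([Y_S, M_SP_tokens], dim=1)  # formula (8), shape (b, n, d)
        _, A = self.mha(H_att, H_att, H_att)          # formula (9): head-averaged attention weights (b, n, n)
        H = H_att
        for W in self.gcn_weights:
            H = torch.relu(A @ W(H))                  # formula (10): H^l = ReLU(A H^{l-1} W^l)
        return H                                      # fused node representations
```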
(4) Output layer
For the aspect-word features and sentence features, since the [CLS] tag is added when the features are extracted with the pre-trained model, the final hidden state of this tag is taken as the collective representation of the aspect-word and sentence features, denoted h_[CLS]^T and h_[CLS]^S respectively. For the fusion output part, the l-th layer representation H^l of the GCN features is already a weighted sum over the node features, so its pooled result h^G is taken as the classification feature. The output feature can be expressed as:
h^G = Pool(H^l)    (11)
O = Concat(h_[CLS]^T, h_[CLS]^S, h^G)    (12)
the GCN output characteristics pass through a layer of feedforward neural network to finish classification tasks, and the specific formula is as follows:
p(y|O) = softmax(W^T O)    (13)
where W is a trainable weight matrix, and a cross-entropy loss function is used to compute the loss value Loss.
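Finally, a sketch of the output layer of formulas (11)–(13); mean pooling of the GCN output and a three-class head (positive, neutral, negative) are assumptions, and nn.CrossEntropyLoss combines the softmax of formula (13) with the cross-entropy loss described in the text.

```python
import torch
import torch.nn as nn


class SentimentHead(nn.Module):
    """Formulas (11)-(13): pool, concatenate, classify, cross-entropy loss."""

    def __init__(self, d_model=768, n_classes=3):
        super().__init__()
        self.classifier = nn.Linear(3 * d_model, n_classes)
        self.loss_fn = nn.CrossEntropyLoss()  # applies the softmax of formula (13) internally

    def forward(self, h_cls_aspect, h_cls_sentence, H_gcn, labels=None):
        h_fuse = H_gcn.mean(dim=1)                                     # pooled GCN fusion feature
        O = torch.cat([h_cls_aspect, h_cls_sentence, h_fuse], dim=-1)  # concatenated output feature
        logits = self.classifier(O)                                    # scores for the emotion classes
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return logits, loss


head = SentimentHead()
logits, loss = head(torch.randn(1, 768), torch.randn(1, 768),
                    torch.randn(1, 30, 768), labels=torch.tensor([2]))
```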
Experimental simulation
Table 1 Model performance comparison
As can be seen from Table 1, compared with other models, the proposed model obtains the best experimental results in classification accuracy and macro-averaged F1. Compared with the second-best TomBERT network model, it improves classification accuracy and macro-averaged F1 score by 1.35% and 1.25% respectively on TWITTER-2015, and by 1.38% and 1.53% respectively on TWITTER-2017. This is because the proposed method, by fusing the aspect-term features with the text features, extracts the deep semantic associations in the images that are related to the aspect terms, and further improves classification performance through the feature fusion capability of the graph neural network. The TomBERT model uses BERT to extract visual representations that are sensitive to aspect terms but does not adopt an effective feature fusion method. The MIMN model uses a multi-hop memory network to extract features; although it achieves multi-hop fusion of the bimodal features, it does not deeply extract the interactive features between text content and visual information.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.
Claims (7)
1. An aspect-level multi-mode emotion analysis method based on a dual-channel and attention mechanism is characterized by comprising the following steps of:
step 1: extracting hidden characteristic representations of sentences and aspect words in the data set by using a Bert pre-training encoder, and extracting picture characteristics by using a ResNet-152 pre-training network; an aspect word is a subsequence that belongs to a sentence;
step 2: calculating the feature correlation of sentence features and aspect word features through a multi-head attention mechanism, so that corresponding attention weighting is obtained between the high-similarity features; finally obtaining aspect word features guided by the text and text features guided by the aspect words;
step 3: weighting the original image characteristics by using the aspect word characteristics guided by the text, and obtaining the image channel characteristics through a channel attention mechanism;
step 4: weighting the image channel characteristics by using the text characteristics guided by the aspect words, and generating a spatial attention pattern by using the spatial relation of the characteristics in a spatial attention mechanism to obtain final characteristic representation of the image;
step 5: calculating a dynamic adjacency matrix by text features guided by aspect words and image features generated by channel attention and spatial attention; obtaining a final fusion feature representation using the aggregate capabilities and the messaging capabilities of the graph neural network;
step 6: and classifying the final fusion features, aspect word features and sentence features by using a pooling mechanism through a classification module.
2. The method for analyzing the emotion of the aspect level multimode based on the dual-channel and the attention mechanism according to claim 1, wherein the step 1 extracts the hidden characteristic representation from the sentence characteristics and the aspect word characteristics in the data set by using a Bert pre-training coder, and extracts the picture characteristics by using a ResNet-152 pre-training network, specifically:
outputting text and aspect-word feature information through two Bert-based pre-trained text feature encoders; extracting image features with a pre-trained ResNet network; adopting pre-trained models provides better initialization parameters for the model, gives it better generalization after fine-tuning on the target task and accelerates model convergence; the Bert pre-trained model yields sentence features H_S = Bert(E_S), H_S ∈ R^(t×d), and aspect-word features H_T = Bert(E_T), H_T ∈ R^(t×d), where t denotes the text length and d denotes the output feature dimension; the image features are denoted H_I = ResNet(E_I), H_I ∈ R^(c×w×h), where ResNet denotes the ResNet-152 model, c denotes the number of channels of the image feature, and w and h denote the width and height of the feature, respectively; E_T, E_S, E_I denote the original aspect words, sentence and image; H_T, H_S, H_I denote the aspect-word, sentence and image features extracted through the pre-trained networks.
3. The method for analyzing the emotion in multiple modes in aspect level based on dual channel and attention mechanism according to claim 1, wherein the step 2 adopts a multi-head attention mechanism, and fuses the related information between the aspect word characteristics and the sentence characteristics, and the specific method is as follows:
in order to obtain the interactive features between sentences and aspect words, a multi-head attention mechanism is adopted to calculate the similarity between the two, which effectively realizes feature fusion between them; the expressions are as follows:
MHA(Q, K, V) = softmax(QK^T / √d_k)V    (1)
H^(i) = LayerNorm(MHA(Q, K, V) + Q)    (2)
Y = LayerNorm(GELU(H^(i) W_1^T) W_2^T + H^(i))    (3)
where MHA denotes the multi-head attention mechanism, Q, K, V denote the input features, d_k is the scaling factor, H^(i) denotes the output of the i-th layer in the Transformer, LayerNorm denotes layer normalization, GELU is the activation function, and W_1^T, W_2^T denote trainable parameter matrices;
the aspect-word features and the sentence features are used in turn as the query matrix Q to compute the aspect-word features Y_T guided by the sentence features and the sentence features Y_S guided by the aspect-word features.
4. The method for analyzing the emotion in multiple modes in aspect level based on dual channel and attention mechanism according to claim 1, wherein said step 3 weights the original image features by using the aspect word features guided by the text, and obtains the image channel features by the channel attention mechanism, specifically comprising:
in order to introduce aspect word features into an image, the aspect word features and the image features are fused through a multi-head self-attention mechanism, and the specific formula is as follows:
H_ca = MHA(H_I, Y_T, Y_T)    (4)
M_CH = σ(MLP(AvgPool(H_ca)) + MLP(MaxPool(H_ca)))    (5)
where, in the channel attention mechanism, the input H_ca is the aspect-word-guided image feature obtained through the multi-head attention mechanism, MLP denotes a multi-layer perceptron, AvgPool denotes average pooling, MaxPool denotes max pooling, σ denotes the ReLU activation function, and M_CH denotes the output of the channel attention.
5. The method for analyzing the emotion of the aspect level multimode based on the dual-channel and attention mechanism according to claim 1, wherein the step 4 is as follows: the text feature guided by the aspect words is used for weighting the image channel feature, in a spatial attention mechanism, a spatial attention pattern is generated by using the spatial relation of the feature, and the final feature representation of the image is obtained, wherein the method comprises the following specific steps:
through a multi-head attention mechanism, the sentence features guided by the aspect words are used to weight the image features output by the channel attention mechanism, highlighting in the spatial attention mechanism the regions of the image features related to the aspect-word emotion; the specific formulas are:
H_sa = MHA(M_CH, Y_S, Y_S)    (6)
M_SP = σ(Conv(Concat(AvgPool(H_sa); MaxPool(H_sa))))    (7)
equation (7) gives the implementation details of the spatial attention mechanism, where Concat denotes matrix concatenation, Conv denotes a convolution operation, and σ denotes the ReLU activation function; H_sa denotes the image features guided by the sentence features via the multi-head attention mechanism, and M_SP denotes the output of the spatial attention.
6. The method for analyzing the emotion of the aspect level multimode based on the dual-channel and attention mechanism according to claim 5, wherein the step 5 calculates a dynamic adjacency matrix by text features guided by aspect words and image features generated by channel attention and spatial attention; the aggregation capability and the message transmission capability of the graph neural network are used to obtain a final fusion characteristic representation, which specifically comprises the following steps:
the sentence features and the image features are concatenated, and an attention matrix is obtained through a self-attention mechanism and used as the adjacency matrix of the GCN; first, the attention matrix can capture the correlated features between the sentence and image features, which makes the adjacency matrix more flexible, and second, it can adaptively adjust the importance of similar features between the sentence and the image; in the GCN, a graph G = {V, A} is given, where V is the set of all nodes in the graph, corresponding to the concatenated sentence and image features, and A is the adjacency matrix between all nodes, with the weight A_ij depending on the similarity between nodes;
H_att = Concat(Y_S, M_SP)    (8)
A = MHA(H_att, H_att, H_att)    (9)
where H_att denotes the concatenation of the channel/spatial attention output M_SP and the aspect-word-guided sentence features Y_S, h_i^l is the feature output of node v_i at layer l, W^l is the trainable weight matrix of GCN layer l, and σ is the ReLU activation function; since the GCN performs feature extraction and encoding between the associated nodes, the output H^l of all nodes at layer l is expressed as:
H^l = σ(A H^(l-1) W^l)    (10)
where n denotes the number of nodes.
7. The method for analyzing the emotion of the aspect level multimode based on the dual-channel and attention mechanism according to claim 6, wherein the following step 6: the final fusion features, aspect word features and sentence features are classified by using a pooling mechanism through a classification module, and the specific steps are as follows:
for the aspect-word features and sentence features, since the [CLS] tag is added when the features are extracted with the pre-trained model, the final hidden state of this tag is taken as the collective representation of the aspect-word and sentence features, denoted h_[CLS]^T and h_[CLS]^S respectively; for the fusion output part, the l-th layer representation H^l of the GCN features is already a weighted sum over the node features, so its pooled result h^G is taken as the classification feature; the total output feature O after pooling and concatenation can be expressed as:
h^G = Pool(H^l)    (11)
O = Concat(h_[CLS]^T, h_[CLS]^S, h^G)    (12)
in the classification phase:
p(y|O) = softmax(W^T O)    (13)
where W is a trainable weight matrix, and a cross-entropy loss function is used to compute the loss value Loss, with D and y^(j) denoting the number of training samples and the ground-truth label of the j-th sample, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310273760.9A CN116662924A (en) | 2023-03-20 | 2023-03-20 | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310273760.9A CN116662924A (en) | 2023-03-20 | 2023-03-20 | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116662924A true CN116662924A (en) | 2023-08-29 |
Family
ID=87708608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310273760.9A Pending CN116662924A (en) | 2023-03-20 | 2023-03-20 | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116662924A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117395164A (en) * | 2023-12-12 | 2024-01-12 | 烟台大学 | Network attribute prediction method and system for industrial Internet of things |
CN117395164B (en) * | 2023-12-12 | 2024-03-26 | 烟台大学 | Network attribute prediction method and system for industrial Internet of things |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
CN108733792B (en) | Entity relation extraction method | |
CN109918671A (en) | Electronic health record entity relation extraction method based on convolution loop neural network | |
Sharma et al. | A survey of methods, datasets and evaluation metrics for visual question answering | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN111985205A (en) | Aspect level emotion classification model | |
CN113705238A (en) | Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model | |
CN114896434B (en) | Hash code generation method and device based on center similarity learning | |
CN116975350A (en) | Image-text retrieval method, device, equipment and storage medium | |
CN114004220A (en) | Text emotion reason identification method based on CPC-ANN | |
Ishmam et al. | From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities | |
CN116662500A (en) | Method for constructing question-answering system based on BERT model and external knowledge graph | |
CN116187349A (en) | Visual question-answering method based on scene graph relation information enhancement | |
CN116701996A (en) | Multi-modal emotion analysis method, system, equipment and medium based on multiple loss functions | |
CN116522945A (en) | Model and method for identifying named entities in food safety field | |
CN112733764A (en) | Method for recognizing video emotion information based on multiple modes | |
CN111930981A (en) | Data processing method for sketch retrieval | |
CN116258147A (en) | Multimode comment emotion analysis method and system based on heterogram convolution | |
CN115730232A (en) | Topic-correlation-based heterogeneous graph neural network cross-language text classification method | |
CN116662924A (en) | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism | |
CN114048314A (en) | Natural language steganalysis method | |
CN113642630A (en) | Image description method and system based on dual-path characteristic encoder | |
CN118364111A (en) | Personality detection method based on text enhancement of large language model | |
Meng et al. | Regional bullying text recognition based on two-branch parallel neural networks | |
CN117093692A (en) | Multi-granularity image-text matching method and system based on depth fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||