Disclosure of Invention
The technical problem to be solved by the invention is to provide an image joint text emotion analysis method based on a modal fusion graph convolution network that is highly accurate, fast in its analysis process, and achieves better image-text emotion analysis.
The invention adopts the following technical scheme for solving the technical problems:
the image joint text emotion analysis method based on the modal fusion graph convolution network comprises the following steps:
step 1, acquiring images and text data containing emotion information of a user as a dataset, wherein the acquired image data and text data correspond to each other one-to-one; after labeling the paired image and text data, the dataset is divided into a training set and a testing set;
step 2, constructing an image joint text emotion analysis model based on a modal fusion graph convolution network, wherein the model comprises an image-text feature extraction module, a semantic enhancement graph convolution module and a global fusion module; the image-text feature extraction module comprises an image feature extraction unit and a text feature extraction unit, which are respectively used for extracting image features from the image data and text features from the text data; the semantic enhancement graph convolution module comprises an image semantic enhancement unit, a text semantic enhancement unit and a fusion information semantic enhancement unit, which are respectively used for carrying out semantic enhancement on the image features, the text features and the image-text fusion features; the global fusion module comprises a combination layer, an attention mechanism layer and a fully connected layer, wherein the combination layer is used for combining the semantically enhanced image features, text features and image-text fusion features to obtain the initial global emotion features, the attention mechanism layer is used for capturing graph-oriented attention weights from the initial global emotion features, and the fully connected layer is used for obtaining the final emotion analysis result based on the attention weights;
step 3, designing a loss function for optimizing the model constructed in step 2, and presetting the training hyper-parameters of the model;
step 4, training the model constructed in the step 2 by using a training set, and optimizing and updating model parameters by using an Adam optimizer according to a loss function to obtain a trained model;
and step 5, testing the test set by using the trained model to obtain emotion analysis results, namely emotion tendencies of the user.
In a preferred embodiment of the present invention, in step 1, the paired image and text data are labeled as one of the following three categories according to the emotional tendency of the user: negative, neutral or positive; the labeled dataset is then divided into a training set and a testing set, wherein each category accounts for the same proportion of the total number of samples in the training set.
As a preferable scheme of the present invention, in step 2, the expression of the image joint text emotion analysis model based on the modal fusion graph convolution network is:
$$X_v = F_v(D_v),\qquad X_t = F_t(D_t),$$
$$H_v = G_v(X_v),\qquad H_t = G_t(X_t),\qquad H_f = G_f\bigl([X_v ; X_t]\bigr),$$
$$\hat{y} = \mathrm{FC}\bigl(\Phi([H_v ; H_t ; H_f])\bigr),$$
wherein $X_v$ and $X_t$ are the image features and text features, respectively, $D_v$ and $D_t$ are respectively the image data and text data, $F_v$ is the image feature extraction unit, $F_t$ is the text feature extraction unit, $H_v$, $H_t$ and $H_f$ are respectively the semantically enhanced image features, text features and image-text fusion features, $G_v$ is the image semantic enhancement unit, $G_t$ is the text semantic enhancement unit, $G_f$ is the fusion information semantic enhancement unit, $\Phi$ is the global fusion module, $[\cdot ; \cdot]$ is the splicing (concatenation) operation, $\mathrm{FC}$ is the fully connected layer, and $\hat{y}$ is the final emotion analysis result.
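The composition above can be read as a straightforward pipeline. A minimal PyTorch-style sketch is given below, assuming unbatched node tensors of shape (N, d); the class and argument names (ImageTextSentimentModel, extract_v, fuse, feat_dim) are illustrative assumptions, and the sub-modules are treated as black boxes rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn

class ImageTextSentimentModel(nn.Module):
    """Composes the three modules of the model; sub-modules are passed in as black boxes."""
    def __init__(self, extract_v, extract_t, enhance_v, enhance_t, enhance_f,
                 fuse, feat_dim, num_classes=3):
        super().__init__()
        self.extract_v, self.extract_t = extract_v, extract_t   # F_v, F_t
        self.enhance_v, self.enhance_t = enhance_v, enhance_t   # G_v, G_t
        self.enhance_f, self.fuse = enhance_f, fuse              # G_f, global fusion module
        self.classifier = nn.Linear(feat_dim, num_classes)       # final fully connected layer

    def forward(self, image, text):
        x_v = self.extract_v(image)                          # image node features X_v, shape (N_v, d)
        x_t = self.extract_t(text)                            # text node features X_t, shape (N_t, d)
        h_v = self.enhance_v(x_v)                             # semantically enhanced image features H_v
        h_t = self.enhance_t(x_t)                             # semantically enhanced text features H_t
        h_f = self.enhance_f(torch.cat([x_v, x_t], dim=0))    # H_f from the spliced image-text features
        g = self.fuse(torch.cat([h_v, h_t, h_f], dim=0))      # global emotion feature G
        return self.classifier(g), g                          # class scores and the global feature
```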
In the step 2, the image semantic enhancement unit, the text semantic enhancement unit and the fusion information semantic enhancement unit have the same structure and comprise an edge generation unit and a graph convolution operation unit;
For a feature $h_i^{m}$, where $m \in \{v, t, f\}$ denotes the modality, the feature is first embedded into a new feature space by the linear transformations $W_a$ and $W_b$, and then fed into the edge generation unit, which calculates the similarity between node features so as to capture the connections between them; the expression is:
$$S_{ij}^{m} = \phi\bigl(W_a h_i^{m},\; W_b h_j^{m}\bigr),$$
wherein $W_a$ and $W_b$ are learnable parameters, $h_i^{m}$ and $h_j^{m}$ respectively represent the features of the $i$-th and $j$-th samples under modality $m$, and $\phi(\cdot,\cdot)$ is the similarity calculation function between nodes; an emotion association graph is then constructed according to the obtained similarity coefficients between nodes, and its adjacency matrix is calculated as:
$$A_{ij}^{m} = \frac{\exp\bigl(S_{ij}^{m}\bigr)}{\sum_{k=1}^{N_m}\exp\bigl(S_{ik}^{m}\bigr)} + E_{ij},$$
wherein $N_m$ is the total number of nodes under modality $m$, $h_k^{m}$ represents the feature of the $k$-th sample under modality $m$ (through which $S_{ik}^{m}$ is computed), $E$ is a diagonal identity matrix, $S^{m}$ is the similarity matrix of the graph nodes, and $S_{ij}^{m}$ is an element of the matrix $S^{m}$; finally, the node features with strong emotion expression in the single-modality data are aggregated through the graph convolution operation unit, whose expression is:
$$H^{(l+1)} = \sigma\bigl(A^{m} H^{(l)} W^{(l)}\bigr),$$
wherein $H^{(l+1)}$ and $H^{(l)}$ are respectively the output and input of the $l$-th graph convolution layer, $W^{(l)}$ is the learnable parameter matrix of the $l$-th graph convolution layer, $\sigma$ is the ReLU activation function, and $H^{(1)}$ is the input of the layer-1 graph convolution.
As a preferred embodiment of the present invention, in step 3, the loss function $\mathcal{L}$ includes an emotion classification loss function $\mathcal{L}_{cls}$ and a label-based contrastive learning loss function $\mathcal{L}_{lbcl}$, with the expressions:
$$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{lbcl},\qquad \mathcal{L}_{cls}(\theta) = -\frac{1}{S}\sum_{i=1}^{S} y_i \log \hat{y}_i,$$
$$\mathcal{L}_{lbcl} = -\frac{1}{S}\sum_{i=1}^{S}\frac{1}{\lvert P(i)\rvert}\sum_{j \in P(i)} \log \frac{\exp\bigl(g_i \cdot g_j / \tau\bigr)}{\sum_{k=1, k \neq i}^{S} \exp\bigl(g_i \cdot g_k / \tau\bigr)},$$
wherein $\theta$ denotes the parameters to be optimized by the model, $y_i$ is the true value of the $i$-th sample, $\hat{y}_i$ is the predicted value output by the model for the $i$-th sample, $g_i$ and $g_j$ respectively represent the global emotion fusion features of the $i$-th and $j$-th samples in the same batch, $\tau$ is the contrastive learning coefficient, $S$ is the batch size, and $P(i)$ is the set of all sample indices having the same label as the $i$-th sample; the hyper-parameters of the model include the learning rate $lr$, the number of iterations epoch, the batch size $S$, and the depth and number of layers of the model.
In a preferred embodiment of the present invention, in the step 4, the model parameters are updated by a back propagation algorithm.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the image joint text emotion analysis method based on the modal fusion graph convolution network.
A computer readable storage medium storing a computer program which when executed by a processor implements the steps of the image joint text emotion analysis method based on a modal fusion graph convolution network.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. The invention combines a graph convolution network, an attention mechanism and a multi-modal emotion analysis model: the image features and text features obtained by the image-text feature extraction module are semantically enhanced through the graph convolution network, the global fusion module fuses the semantically enhanced features through the attention mechanism to obtain global information, and the resulting global features are classified through a fully connected layer.
2. The invention improves the accuracy of emotion tendency analysis to a great extent and can accurately realize emotion tendency analysis from image and text data, which helps enterprises analyze customers' attitudes toward related products and also helps social media platforms judge users' preferences from the image-text content they browse.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting the present invention.
As shown in fig. 1, the image joint text emotion analysis method based on the modal fusion graph convolution network comprises the following steps:
s101, acquiring an image containing user emotion information and text data from an Internet social media platform.
In step S101, image and text data published by users on a social media platform are first acquired through the Internet, and the paired image and text data are labeled as one of three cases according to their emotional tendency: negative, neutral or positive; the data of each case are then divided into a training set and a testing set according to a preset proportion.
S201, constructing an image joint text emotion analysis model based on a modal fusion graph convolution network, which comprises an image-text feature extraction module, a semantic enhancement graph convolution module and a global fusion module. The image-text feature extraction module expands the initial image features and initial text features into node vector representations; the semantic enhancement graph convolution module constructs an intra-modal graph and a multi-modal interaction graph of the node features by calculating the emotion similarity between nodes; and the global fusion module calculates attention coefficients for the node vectors and fuses them by weighting, performing dynamic reasoning and aggregation over the intra-modal and cross-modal emotion context features on the graph to obtain a global representation of the image-text emotion.
In step S201, the image joint text emotion analysis model based on the modal fusion graph convolution network is shown in fig. 2, and consists of three modules: an image-text feature extraction module, a semantic enhancement graph convolution module and a global fusion module. In the image-text feature extraction module, image features are extracted by a pretrained ResNet50 and text features are extracted by a pretrained BERT. After the image and text features are obtained, they are spliced, and the image features, the text features and the spliced fusion features are respectively sent to the corresponding graph convolution modules for semantic enhancement; the semantic enhancement graph convolution module is shown in fig. 3. Finally, the semantically enhanced image, text and fusion features are sent to the global fusion module to obtain the global features, which are classified through a fully connected layer to obtain the final prediction result.
As shown in fig. 2, the text data is processed by the BERT model. Given a text $T = \{w_1, w_2, \ldots, w_l\}$, where $l$ is the length of the text $T$ and $w_i$ represents the $i$-th word in $T$, after passing through the BERT model each word is mapped to a text emotion feature vector of dimension $d$, yielding the text emotion features $X_t = \{t_1, t_2, \ldots, t_l\} \in \mathbb{R}^{l \times d}$, where each text emotion feature vector is used as a node in the subsequent semantic enhancement graph convolution module.
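A minimal sketch of how the text nodes could be obtained with a pretrained BERT encoder is shown below, using the Hugging Face transformers API; the checkpoint name, maximum length and the function name text_to_nodes are assumptions, not the patent's own code.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # checkpoint name is an assumption
bert = BertModel.from_pretrained("bert-base-uncased")

def text_to_nodes(text: str, max_len: int = 64) -> torch.Tensor:
    """Map each token of the text to a d-dimensional emotion feature vector (one node per token)."""
    enc = tokenizer(text, truncation=True, max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state.squeeze(0)  # shape (l, d); d = 768 for bert-base
```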
As shown in fig. 2, the image data is processed by the ResNet50 model. For the image $I$ corresponding to a given text $T$, after inputting it into ResNet50, the feature map output by the last convolution layer $F \in \mathbb{R}^{h \times w \times c}$ is taken, where $h \times w$ is the spatial dimension of the feature map and $c$ is the number of channels of the feature map. Since the image features will subsequently be fused with the text features, the dimension of the image feature vectors must be made consistent with that of the text feature vectors, so a fully connected layer is used to change the number of channels of the feature map from $c$ to $d$. Finally, the spatial dimensions of the feature map are flattened into one dimension to obtain the image emotion features $X_v = \{v_1, v_2, \ldots, v_{h \times w}\} \in \mathbb{R}^{(h \times w) \times d}$, where each image emotion feature vector is used as a node in the subsequent semantic enhancement graph convolution module.
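A corresponding sketch for the image branch is given below, assuming a torchvision ResNet50 whose last convolutional feature map is projected to the text dimension d and flattened into nodes; the class name ImageNodeExtractor and the choice d = 768 are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageNodeExtractor(nn.Module):
    """Takes the last convolutional feature map of ResNet50 and turns each spatial cell into a node."""
    def __init__(self, d: int = 768):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop average pooling and fc
        self.proj = nn.Linear(2048, d)                                 # change channel number c -> d

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(image)                      # (B, 2048, h, w) last-layer feature map
        b, c, h, w = feat.shape
        feat = feat.view(b, c, h * w).permute(0, 2, 1)   # flatten the spatial grid: (B, h*w, 2048)
        return self.proj(feat)                           # (B, h*w, d) image emotion node features
```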
As shown in fig. 3, the semantic enhancement graph convolution module includes three sub-modules that respectively process the text features $X_t$, the image features $X_v$ and the image-text fusion features $X_f$, where $X_f = [X_v ; X_t]$. The three sub-modules have identical structure. Taking the feature $h_i^{m}$ ($m \in \{v, t, f\}$) as an example, it is first embedded into a new feature space by the linear transformations $W_a$ and $W_b$, and then fed into the edge generation unit, which calculates the similarity between node features to capture the links between them. The expression is:
$$S_{ij}^{m} = \phi\bigl(W_a h_i^{m},\; W_b h_j^{m}\bigr),$$
wherein $W_a$ and $W_b$ are learnable parameters, $h_i^{m}$ and $h_j^{m}$ represent the features of the $i$-th and $j$-th samples under modality $m$, and $\phi(\cdot,\cdot)$ is the similarity calculation function between nodes. An emotion association graph is then constructed according to the obtained similarity coefficients between nodes, and the adjacency matrix $A^{m}$ is calculated as:
$$A_{ij}^{m} = \frac{\exp\bigl(S_{ij}^{m}\bigr)}{\sum_{k=1}^{N_m}\exp\bigl(S_{ik}^{m}\bigr)} + E_{ij},$$
wherein $N_m$ represents the total number of nodes in modality $m$ and $E$ is a diagonal identity matrix; adding $E$ serves to alleviate the gradient vanishing and degradation problems. The adjacency matrix $A^{m}$ strengthens the information interaction between two nodes with higher emotion semantic similarity and suppresses the mutual influence between irrelevant nodes. Finally, the node features with strong emotion expression in the single-modality data are aggregated through graph convolution, whose expression is:
$$H^{(l+1)} = \sigma\bigl(A^{m} H^{(l)} W^{(l)}\bigr),$$
wherein $W^{(l)}$ is the learnable parameter matrix of the $l$-th graph convolution layer and $\sigma$ is the ReLU activation function; in one embodiment of the invention the number of graph convolution layers $L$ is set to 2 to prevent the over-smoothing caused by too many layers.
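A hedged sketch of one semantic-enhancement sub-module is shown below; the dot-product similarity and the softmax-normalised adjacency follow the reconstruction above and are assumptions where the original formulas are not fully recoverable, as are the names SemanticEnhanceGCN, w_a and w_b.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticEnhanceGCN(nn.Module):
    """One semantic-enhancement sub-module: edge generation followed by L graph convolution layers."""
    def __init__(self, d: int, num_layers: int = 2):
        super().__init__()
        self.w_a = nn.Linear(d, d, bias=False)  # linear transformation W_a
        self.w_b = nn.Linear(d, d, bias=False)  # linear transformation W_b
        self.gcn = nn.ModuleList([nn.Linear(d, d) for _ in range(num_layers)])  # W^(l)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, d) node features of one modality
        s = self.w_a(h) @ self.w_b(h).t()                                 # similarity coefficients S_ij
        a = F.softmax(s, dim=-1) + torch.eye(h.size(0), device=h.device)  # adjacency A = softmax(S) + E
        x = h
        for layer in self.gcn:                                            # H^(l+1) = ReLU(A H^(l) W^(l))
            x = F.relu(a @ layer(x))
        return x
```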
As shown in fig. 2, in the global fusion module, the text emotion representation $H_t$, the image emotion representation $H_v$ and the image-text fusion representation $H_f$ output by the semantic enhancement module are first combined to obtain the initial global emotion feature $H$, with the expression:
$$H = [H_v ; H_t ; H_f],$$
an attention mechanism is then used to capture graph-oriented attention information from the fused node features, and the attention weight $\alpha$ is calculated as:
$$\alpha = \operatorname{softmax}\bigl(W_2 (W_1 H + b_1) + b_2\bigr),$$
wherein $W_1$ and $b_1$ are respectively the weight and bias of the first fully connected layer, and $W_2$ and $b_2$ are respectively the weight and bias of the second fully connected layer;
finally, the attention weights are multiplied with the corresponding emotion feature vectors and summed to obtain the global emotion feature $G$, which is passed through the fully connected layer to obtain the final prediction result $\hat{y}$, with the expressions:
$$G = \sum_{i} \alpha_i h_i,\qquad \hat{y} = W_o G + b_o,$$
wherein $W_o$ and $b_o$ are respectively the weight and bias of the fully connected layer.
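A minimal sketch of the global fusion module follows; the tanh activation between the two fully connected layers and the softmax normalisation of the attention weights are assumptions, as is the class name GlobalFusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalFusion(nn.Module):
    """Node-wise attention over the combined nodes [H_v ; H_t ; H_f] followed by a classifier."""
    def __init__(self, d: int, hidden: int = 128, num_classes: int = 3):
        super().__init__()
        self.fc1 = nn.Linear(d, hidden)        # first fully connected layer  (W1, b1)
        self.fc2 = nn.Linear(hidden, 1)        # second fully connected layer (W2, b2)
        self.out = nn.Linear(d, num_classes)   # final fully connected layer  (Wo, bo)

    def forward(self, h: torch.Tensor):
        # h: (N, d) initial global emotion feature, one row per node
        alpha = F.softmax(self.fc2(torch.tanh(self.fc1(h))), dim=0)  # attention weight per node
        g = (alpha * h).sum(dim=0)             # global emotion feature: weighted sum of node vectors
        return self.out(g), g                  # class scores and the feature used by the contrastive loss
```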
S301, designing a loss function for optimizing the network model, and presetting the training hyper-parameters of the network model.
In step S301, the loss function $\mathcal{L}$ includes an emotion classification loss function $\mathcal{L}_{cls}$ and a label-based contrastive learning loss function $\mathcal{L}_{lbcl}$, with the expressions:
$$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{lbcl},\qquad \mathcal{L}_{cls}(\theta) = -\frac{1}{S}\sum_{i=1}^{S} y_i \log \hat{y}_i,$$
$$\mathcal{L}_{lbcl} = -\frac{1}{S}\sum_{i=1}^{S}\frac{1}{\lvert P(i)\rvert}\sum_{j \in P(i)} \log \frac{\exp\bigl(g_i \cdot g_j / \tau\bigr)}{\sum_{k=1, k \neq i}^{S} \exp\bigl(g_i \cdot g_k / \tau\bigr)},$$
wherein $\theta$ denotes the parameters that need to be optimized, $y_i$ is the true value of the $i$-th sample, $\hat{y}_i$ is the predicted value output by the model for the $i$-th sample, $g_i$ and $g_j$ respectively represent the global emotion fusion features of the $i$-th and $j$-th samples in the same batch, $\tau$ is the contrastive learning coefficient, $S$ is the batch size, and $P(i)$ is the set of all sample indices having the same label as the $i$-th sample. The hyper-parameters of the model include the learning rate $lr$, the number of iterations epoch, the batch size $S$, and the depth and number of layers of the network model.
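A sketch of the combined loss is given below; the label-based contrastive term is implemented in the standard supervised-contrastive form, which matches the definitions above but is an assumption where the exact formula is not recoverable, and the temperature value tau = 0.1 is likewise assumed.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, global_feats, tau: float = 0.1):
    """Emotion classification loss plus label-based contrastive loss (supervised-contrastive form)."""
    ce = F.cross_entropy(logits, labels)                            # cross-entropy classification loss
    z = F.normalize(global_feats, dim=1)                            # (S, d) global emotion fusion features
    sim = z @ z.t() / tau                                           # pairwise similarity g_i . g_j / tau
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))                 # drop k = i from every denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)      # log-softmax over the other samples
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask # P(i): same-label pairs
    log_prob = log_prob.masked_fill(~pos, 0.0)                      # keep only positive-pair terms
    lbcl = -(log_prob.sum(dim=1) / pos.sum(dim=1).clamp(min=1)).mean()
    return ce + lbcl
```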
S401, the training data are fed into the emotion analysis model, and an Adam optimizer is adopted to iteratively optimize and update the model parameters according to the loss function.
Step S4011, initializing the image feature extraction module and the text feature extraction module with pre-trained parameters, together with the other network parameters; selecting $S$ image-text pairs $(I_i, T_i)$ from the training set, feeding them into the network model, and obtaining the corresponding output prediction results $\hat{y}_i$;
Step S4012, updating the remaining network parameters through the back-propagation algorithm with the Adam optimizer, a gradient descent algorithm that uses first-order momentum;
step S4013, sequentially performing operations S4011 and S4012 on the data in the whole training set, and performing epoch=100 iterations in total.
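Steps S4011 to S4013 can be summarised in a short training loop; the sketch below assumes a PyTorch DataLoader yielding (images, texts, labels) batches, and the learning rate and device are placeholder values rather than the patent's settings.

```python
import torch

def train(model, loader, loss_fn, epochs: int = 100, lr: float = 1e-4, device: str = "cuda"):
    """One possible realisation of steps S4011-S4013 (learning rate and device are assumptions)."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # Adam maintains first-order momentum
    for epoch in range(epochs):                                # epoch = 100 passes over the training set
        for images, texts, labels in loader:                   # batches of S image-text pairs
            images, labels = images.to(device), labels.to(device)
            logits, global_feats = model(images, texts)        # forward pass through the model
            loss = loss_fn(logits, labels, global_feats)       # classification + contrastive loss
            optimizer.zero_grad()
            loss.backward()                                    # back-propagation
            optimizer.step()                                   # Adam parameter update
```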
S501, if the emotion analysis model has converged, the trained model can directly realize end-to-end emotion tendency analysis for the user, and the output of the model is the emotion tendency of the user; otherwise, return to S401.
Step S5011, judging whether the classification network model has converged: during the iterative network training process, if the objective function value decreases and gradually stabilizes at a certain value, the network is judged to have converged;
s5012, inputting paired image text data into a converged network model, wherein the output of the model is the corresponding emotion tendency;
step S5013, if the iterative training does not converge, the routine returns to step S401.
Examples
In order to demonstrate the effectiveness of the present invention, comparative experiments and ablation experiments were performed. The datasets and training details are first introduced, then the comparative experimental results of different algorithms on the datasets are provided, and finally the ablation experiments are explained, thereby proving the effectiveness of the semantic enhancement graph convolution and the label-based contrastive learning loss function.
The datasets used in the experiments were the MVSA-Single and MVSA-Multiple datasets, whose samples were collected from the social media website Twitter, with one emotion label for each image-text pair. The samples in MVSA-Single and MVSA-Multiple carry one of three emotion labels: positive, negative or neutral. Each sample in MVSA-Single is annotated by one annotator, and the dataset contains 4511 samples in total; each sample in MVSA-Multiple is annotated by three annotators, and the dataset contains 17024 image-text pairs in total. In the experiments, each dataset is divided into a training set, a validation set and a test set in the ratio 8:1:1.
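A minimal sketch of the 8:1:1 split is shown below; performing it with scikit-learn's train_test_split and stratifying by emotion label are assumptions, since the text only specifies the ratio.

```python
from sklearn.model_selection import train_test_split

def split_dataset(samples, labels, seed: int = 42):
    """8:1:1 train/validation/test split; stratifying by emotion label is an assumption."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, test_size=0.2, stratify=labels, random_state=seed)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```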
By conducting experiments on the test set, final classification accuracies of 74.36% and 72.87% were obtained on the MVSA-Single and MVSA-Multiple datasets, respectively.
The method of the present invention is compared with multiple existing deep-learning-based image-text fusion emotion analysis methods through comparative experiments on the MVSA-Single and MVSA-Multiple datasets. The methods compared with the present invention include MultiSentiNet, HSAN, CoMN, MVAN and MGNN. MultiSentiNet performs image-text fusion emotion analysis by fusing target feature vectors, scene feature vectors and text feature vectors. HSAN uses a cascaded semantic attention network to perform image-text emotion prediction based on image descriptions. CoMN iteratively performs image-text feature interactions using a co-memory network. MVAN introduces a multi-view attention mechanism into the memory network for emotion classification. MGNN uses a graph neural network to mine co-occurrence features among the dataset samples from the data perspective. The results of the comparative experiments are shown in Table 1.
TABLE 1
In order to verify the improvement brought by the semantic enhancement graph convolution module and the label-based contrastive learning loss function to the final classification accuracy of the network, the network without the semantic enhancement graph convolution modules and without the contrastive learning loss is taken as the baseline, and related experiments are carried out on the MVSA-Single and MVSA-Multiple datasets by sequentially adding the semantic enhancement graph convolution modules for the image, text and multi-modal data and the contrastive learning loss function. The experimental results are shown in Table 2, where IG, TG and FG respectively denote the image, text and fusion-feature emotion semantic enhancement graph convolution modules, and LBCL denotes the label-based contrastive learning loss function.
TABLE 2
As can be seen from Table 1, compared with existing image-text fusion emotion analysis methods, the method provided by the invention greatly improves the accuracy of emotion tendency analysis on data from a real social media platform, achieving an innovative improvement.
As can be seen from Table 2, compared with the baseline, which retains only the image-text feature extraction and feature fusion and uses only the cross-entropy loss, adding the semantic enhancement graph convolution modules (IG, TG, FG) and the contrastive learning loss (LBCL) effectively improves the classification accuracy.
Fig. 4 shows the visualized analysis results of the present invention on paired image-text data; it can be seen that the network model of the present invention can accurately analyze image-text data in different scenes.
Based on the same inventive concept, an embodiment of the present application provides a computer device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the image joint text emotion analysis method based on the modal fusion graph convolution network.
Based on the same inventive concept, an embodiment of the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the image joint text emotion analysis method based on the modal fusion graph convolution network.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow in the flowchart, and combinations of flows in the flowchart, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.