CN116756287A - Image question-answering method based on modal joint interaction - Google Patents
- Publication number
- CN116756287A (Application CN202310749393.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- features
- feature
- modal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses an image question-answering method based on modal joint interaction, which comprises the following steps: preprocessing the image and the question to obtain the corresponding image feature vector and question high-level feature vector; constructing an image question-answering network and obtaining intra-modal unit attention features and inter-modal unit attention features; deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of the hidden layers; combining the output image features and output question features of the last hidden layer to obtain the final features; and mapping the final features into a multi-category vector through a linear transformation for answer prediction. The invention realizes bidirectional guidance between image and question features and improves the interaction capability of the model, enhances information sharing across the cross-modal semantic space, strengthens the multi-modal interaction capability of the model, and improves answer classification.
Description
Technical Field
The invention relates to the technical field of image question answering with artificial-intelligence deep learning, and in particular to an image question-answering method based on modal joint interaction.
Background
At present, great improvements in network speed and computing performance allow data to be transmitted over the Internet in many modalities, from the earliest text and image transmission to voice, video and other modalities, a development inseparable from advances in computer software and hardware. Image question answering is a classical cross-modal task: a computer is required to locate key information in a given image and question and infer the answer to the question. Compared with single-modality tasks such as object detection and text question-answering systems, the computer model must understand image and text information at a finer granularity and, at the same time, fuse and interact the two modalities to achieve reasoning capability.
Image question answering requires understanding both the image and the question text. The image itself is stored in a high-dimensional data structure and contains a large amount of visual information such as color and shape, which is intuitive to humans but very challenging for computers.
With the development of computer vision, more and more image feature extraction methods can effectively extract and encode the information in images, so that a computer can understand their content. The question text belongs to natural language and contains a great deal of semantic and grammatical information that a computer cannot understand intuitively. In addition, natural language suffers from ambiguity and vagueness, which further increases the difficulty of machine understanding. Work in natural language processing has shown that word embedding methods enable computers to understand natural language well, which is key to improving machine reasoning capability. However, image and text features remain difficult to fuse effectively because of their semantic gap, it is hard to endow a model with reasoning capability, and the interaction capability of existing models is low; these are still the main problems of the image question-answering task.
Disclosure of Invention
Aiming at the above defects in the prior art, the image question-answering method based on modal joint interaction provided by the invention solves the prior-art problem that answer prediction cannot be performed because image and text features are difficult to fuse.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
the image question-answering method based on the modal joint interaction comprises the following steps:
S1, initializing an object detection network with a convolutional neural network as its core by means of a pre-trained residual network to obtain dynamic features of the input image; padding the dynamic features of the input image to obtain an image feature vector;
S2, performing word vectorization on the question text through a pre-trained global word vector model to obtain a question text feature vector; performing representation processing on the question text feature vector to obtain a question high-level feature vector;
S3, constructing an image question-answering network with intra-modal unit attention and inter-modal interaction attention as its core; inputting the image feature vector and the question high-level feature vector into the image question-answering network to obtain intra-modal unit attention features and inter-modal unit attention features;
S4, deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of different hidden layers;
S5, combining the output image features and output question features of the last hidden layer through feature stacking to obtain the final features; and mapping the final features into a multi-category vector through a linear transformation for answer prediction.
Further, the residual network pre-trained in step S1 adopts the ResNet-101 network structure, and the convolutional neural network adopts the Faster R-CNN network structure; the global word vector model in step S2 adopts the GloVe model.
Further, the specific steps of step S2 are as follows:
S2-1, performing word segmentation and vectorization on the m words in the question text to obtain a question text feature vector;
S2-2, padding the question text feature vector with zeros to obtain a vector representation of dimension M_QUES × EMB_SIZE, where the vector at position t is the vector representation of the question text at time t, M_QUES denotes the number of question words, and EMB_SIZE denotes the embedding dimension of the question;
S2-3, introducing a bidirectional gated recurrent unit, and according to the formulas:
h_t^→ = GRU(x_t, h_{t-1}^→), h_t^← = GRU(x_t, h_{t+1}^←)
obtaining the bidirectional hidden state at time t, namely the forward hidden state h_t^→ and the backward hidden state h_t^←, where GRU(·) denotes the recurrent neural network model, x_t denotes the question word vector at time t, h_{t-1}^→ denotes the forward hidden state at time t-1, and h_{t+1}^← denotes the backward hidden state at time t+1;
S2-4, concatenating the question text feature vectors at all times to obtain the final question high-level feature vector.
Further, the specific steps of step S3 are as follows:
S3-1, initializing the model parameters of the image question-answering network, and inputting the image feature vector and the question high-level feature vector into the image question-answering network;
S3-2, training the model of the image question-answering network: with the preset answer categories as the training target, training the image question-answering network through a back-propagation algorithm and stochastic gradient descent, and adjusting the parameters of the image question-answering network to obtain the trained image question-answering network, where the parameters of the image question-answering network comprise the learnable weight matrices W_n and the bias terms b_n;
S3-3, constructing an intra-modal unit attention module for the image and an intra-modal unit attention module for the question, selecting one of the two modules, and according to the formula:
Q, K, V = trans(X)
obtaining the query vector Q, key vector K and value vector V, where n denotes the number of question words or the number of image objects, EMB_DIM denotes the embedding dimension of the question, X denotes the feature vector of one modality with dimension n × EMB_DIM, and trans(·) denotes the transformation of the feature vector X into multi-head feature vectors; the feature vector X is either the image feature vector or the question high-level feature vector;
S3-4, according to the formula:
S = Q·K^T / √d_k
obtaining the attention score matrix S, where K^T denotes the transpose of the key vector K and d_k denotes the EMB_DIM size of the query vector Q;
S3-5, according to the formula:
A = softmax(S), S ∈ R^(MH×n×n)
obtaining the attention weight matrix A, where softmax(·) denotes the normalized exponential function, R denotes the set of real numbers, and MH denotes the number of attention heads;
S3-6, transforming the multi-head feature vectors back to the same dimension as the original input, and according to the formulas:
O = trans'(A·V)
O' = LayerNorm(O + Dropout(O))
FFN(O') = max(0, O'·W_1 + b_1)·W_2 + b_2
O_I = LayerNorm(O' + Dropout(FFN(O')))
obtaining the intra-modal unit attention matrix O_I, i.e., the intra-modal unit attention features of the modality, where trans'(·) denotes the dimension transformation function, O denotes the initial intra-modal unit attention matrix, O' denotes the intermediate intra-modal unit attention matrix, Dropout(·) denotes random dropout, LayerNorm(·) denotes the layer normalization function, FFN(·) denotes a feed-forward neural network, max(0,·) implements the ReLU activation function, W_1 denotes the learnable weight matrix from the input layer to the hidden layer, W_2 denotes the learnable weight matrix from the hidden layer to the output layer, b_1 denotes the bias term from the input layer to the hidden layer, and b_2 denotes the bias term from the hidden layer to the output layer;
S3-7, repeating steps S3-3 to S3-6 to obtain the intra-modal unit attention matrix O_i of the other modality;
S3-8, constructing the inter-modal interaction attention, taking the features of one modality as the query and the features of the other modality as the keys and values, and according to the formula:
A_1 = softmax(Q_1·K_1^T / √d_1)·V_1
obtaining the attention weight matrix A_1, where Q_1 denotes the query vector of one modal feature, K_1 denotes the key vector of the other modal feature, K_1^T denotes the transpose of the key vector K_1, d_1 denotes the EMB_DIM size of the query vector Q_1, and V_1 denotes the value vector of the other modal feature;
S3-9, according to the formulas:
O_1 = LayerNorm(A_1 + Dropout(A_1))
O_A = LayerNorm(O_1 + Dropout(FFN(O_1)))
obtaining the inter-modal unit attention matrix O_A, i.e., the inter-modal unit attention features, where O_1 denotes the initial inter-modal unit attention matrix.
Further, the loss function employed by the back-propagation algorithm of step S3-2 is a binary cross-entropy loss function.
Further, the specific steps of step S4 are as follows:
S4-1, according to the formula:
obtaining the inter-modal interaction attention feature XO guided by the image features and the inter-modal interaction attention feature YO guided by the question features, where CR(·) denotes the inter-modal interaction attention, X' denotes the intra-modal unit attention features of the image, and Y' denotes the intra-modal unit attention features of the question;
S4-2, according to the formula:
obtaining the inter-modal interaction attention feature X[i] guided by the image features after deep stacking of the i-th hidden layer and the inter-modal interaction attention feature Y[i] guided by the question features after deep stacking of the i-th hidden layer, where i denotes the i-th hidden layer, XO[i-1] denotes the image-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer, and YO[i-1] denotes the question-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer;
S4-3, according to the formula:
obtaining the output image feature x[i] of the i-th hidden layer and the output question feature y[i] of the i-th hidden layer, where X[i-1] denotes the image-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, Y[i-1] denotes the question-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, α_i denotes the trainable image weight variable of the i-th hidden layer, and β_i denotes the trainable question weight variable of the i-th hidden layer.
Further, the formula of the linear transformation mapping in step S5 is as follows:
O_F = proj(concat(x[I], y[I]))
where O_F denotes the answer vector, x[I] denotes the output image feature of the last hidden layer, y[I] denotes the output question feature of the last hidden layer, concat(·) denotes the concatenation operation, and proj(·) denotes the linear transformation mapping function.
The beneficial effects of the invention are as follows:
1. By constructing a model for image-text feature extraction and deep fusion based on modal joint interaction and introducing a modal joint interaction mechanism, the invention realizes bidirectional guidance between image and question features and thereby improves the interaction capability of the model; the residual deep-stacking fusion mechanism enhances information sharing across the cross-modal semantic space.
2. The modal bidirectional guidance mechanism of the invention considers the deep interaction between the two modalities at the same time and adopts joint forward and backward guidance of the intra-modal unit attention features of the two modalities, which strengthens the multi-modal interaction capability of the model and improves answer classification;
3. The residual stacking fusion mechanism of the invention uses deep stacking to further interact the bidirectionally guided features; the design of the residual dynamic mechanism improves the expressive power, avoids the vanishing-gradient problem of deep neural networks during training, and improves the generalization of the model.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the modal bidirectional joint interaction and residual stacked deep fusion of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these specific embodiments; for those of ordinary skill in the art, all inventions that make use of the inventive concept fall within the protection scope of the invention as defined by the appended claims.
In one embodiment of the present invention, as shown in fig. 1, an image question-answering method based on modal joint interaction includes the following steps:
S1, initializing an object detection network with a convolutional neural network as its core by means of a pre-trained residual network to obtain dynamic features of the input image; padding the dynamic features of the input image to obtain an image feature vector;
S2, performing word vectorization on the question text through a pre-trained global word vector model to obtain a question text feature vector; performing representation processing on the question text feature vector to obtain a question high-level feature vector;
S3, constructing an image question-answering network with intra-modal unit attention and inter-modal interaction attention as its core; inputting the image feature vector and the question high-level feature vector into the image question-answering network to obtain intra-modal unit attention features and inter-modal unit attention features;
S4, deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of different hidden layers;
S5, combining the output image features and output question features of the last hidden layer through feature stacking to obtain the final features; and mapping the final features into a multi-category vector through a linear transformation for answer prediction.
The pre-trained residual network in the step S1 adopts a ResNet-101 network structure, and the convolutional neural network adopts a Faster R-CNN network structure; the global word vector model in the step S2 adopts a GloVe model.
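By way of illustration only, the following PyTorch sketch shows the padding operation of step S1 on the assumption that a Faster R-CNN detector with a ResNet-101 backbone has already produced one region feature per detected object; the feature dimension, the maximum object count M_OBJ and the function name pad_image_features are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch (PyTorch), not the patented implementation: it assumes a
# Faster R-CNN detector has already produced one 2048-d region feature per
# detected object, and only shows the padding described in step S1.
import torch

M_OBJ = 100        # assumed maximum number of image objects kept per image
FEAT_DIM = 2048    # assumed dimension of each region feature

def pad_image_features(region_feats: torch.Tensor) -> torch.Tensor:
    """Pad (or truncate) [num_objects, FEAT_DIM] region features to [M_OBJ, FEAT_DIM]."""
    num_objects = region_feats.size(0)
    padded = region_feats.new_zeros(M_OBJ, FEAT_DIM)
    keep = min(num_objects, M_OBJ)
    padded[:keep] = region_feats[:keep]
    return padded

# Example: an image with 36 detected objects becomes a fixed-size feature matrix.
features = pad_image_features(torch.randn(36, FEAT_DIM))   # -> [100, 2048]
```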
The specific steps of step S2 are as follows:
S2-1, performing word segmentation and vectorization on the m words in the question text to obtain a question text feature vector;
S2-2, padding the question text feature vector with zeros to obtain a vector representation of dimension M_QUES × EMB_SIZE, where the vector at position t is the vector representation of the question text at time t, M_QUES denotes the number of question words, and EMB_SIZE denotes the embedding dimension of the question;
S2-3, introducing a bidirectional gated recurrent unit, and according to the formulas:
h_t^→ = GRU(x_t, h_{t-1}^→), h_t^← = GRU(x_t, h_{t+1}^←)
obtaining the bidirectional hidden state at time t, namely the forward hidden state h_t^→ and the backward hidden state h_t^←, where GRU(·) denotes the recurrent neural network model, x_t denotes the question word vector at time t, h_{t-1}^→ denotes the forward hidden state at time t-1, and h_{t+1}^← denotes the backward hidden state at time t+1;
S2-4, concatenating the question text feature vectors at all times to obtain the final question high-level feature vector.
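The following is a minimal PyTorch sketch of the question encoding in step S2, assuming the GloVe vectors have already been loaded into an embedding matrix; the dimensions (M_QUES, EMB_SIZE, HIDDEN), vocabulary size and class name QuestionEncoder are illustrative assumptions.

```python
# Minimal sketch of the question encoding (GloVe embedding + bidirectional GRU).
import torch
import torch.nn as nn

M_QUES, EMB_SIZE, HIDDEN = 14, 300, 512

class QuestionEncoder(nn.Module):
    def __init__(self, glove_weights: torch.Tensor):
        super().__init__()
        # GloVe word vectors as an embedding table (optionally fine-tuned)
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        # Bidirectional gated recurrent unit: forward and backward hidden states
        self.bigru = nn.GRU(EMB_SIZE, HIDDEN, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, M_QUES], zero-padded to the fixed question length
        x = self.embed(token_ids)        # [batch, M_QUES, EMB_SIZE]
        states, _ = self.bigru(x)        # [batch, M_QUES, 2*HIDDEN]
        return states                    # question high-level feature vectors

encoder = QuestionEncoder(torch.randn(20000, EMB_SIZE))   # dummy GloVe table
q_feat = encoder(torch.randint(0, 20000, (8, M_QUES)))    # -> [8, 14, 1024]
```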
As shown in fig. 2, the specific steps of step S3 are as follows:
S3-1, initializing the model parameters of the image question-answering network, and inputting the image feature vector and the question high-level feature vector into the image question-answering network;
S3-2, training the model of the image question-answering network: with the preset answer categories as the training target, training the image question-answering network through a back-propagation algorithm and stochastic gradient descent, and adjusting the parameters of the image question-answering network to obtain the trained image question-answering network, where the parameters of the image question-answering network comprise the learnable weight matrices W_n and the bias terms b_n;
S3-3, constructing an intra-modal unit attention module for the image and an intra-modal unit attention module for the question, selecting one of the two modules, and according to the formula:
Q, K, V = trans(X)
obtaining the query vector Q, key vector K and value vector V, where n denotes the number of question words or the number of image objects, EMB_DIM denotes the embedding dimension of the question, X denotes the feature vector of one modality with dimension n × EMB_DIM, and trans(·) denotes the transformation of the feature vector X into multi-head feature vectors; the feature vector X is either the image feature vector or the question high-level feature vector;
S3-4, according to the formula:
S = Q·K^T / √d_k
obtaining the attention score matrix S, where K^T denotes the transpose of the key vector K and d_k denotes the EMB_DIM size of the query vector Q;
S3-5, according to the formula:
A = softmax(S), S ∈ R^(MH×n×n)
obtaining the attention weight matrix A, where softmax(·) denotes the normalized exponential function, R denotes the set of real numbers, and MH denotes the number of attention heads;
S3-6, transforming the multi-head feature vectors back to the same dimension as the original input, and according to the formulas:
O = trans'(A·V)
O' = LayerNorm(O + Dropout(O))
FFN(O') = max(0, O'·W_1 + b_1)·W_2 + b_2
O_I = LayerNorm(O' + Dropout(FFN(O')))
obtaining the intra-modal unit attention matrix O_I, i.e., the intra-modal unit attention features of the modality, where trans'(·) denotes the dimension transformation function, O denotes the initial intra-modal unit attention matrix, O' denotes the intermediate intra-modal unit attention matrix, Dropout(·) denotes random dropout, LayerNorm(·) denotes the layer normalization function, FFN(·) denotes a feed-forward neural network, max(0,·) implements the ReLU activation function, W_1 denotes the learnable weight matrix from the input layer to the hidden layer, W_2 denotes the learnable weight matrix from the hidden layer to the output layer, b_1 denotes the bias term from the input layer to the hidden layer, and b_2 denotes the bias term from the hidden layer to the output layer;
S3-7, repeating steps S3-3 to S3-6 to obtain the intra-modal unit attention matrix O_i of the other modality;
S3-8, constructing the inter-modal interaction attention, taking the features of one modality as the query and the features of the other modality as the keys and values, and according to the formula:
A_1 = softmax(Q_1·K_1^T / √d_1)·V_1
obtaining the attention weight matrix A_1, where Q_1 denotes the query vector of one modal feature, K_1 denotes the key vector of the other modal feature, K_1^T denotes the transpose of the key vector K_1, d_1 denotes the EMB_DIM size of the query vector Q_1, and V_1 denotes the value vector of the other modal feature;
S3-9, according to the formulas:
O_1 = LayerNorm(A_1 + Dropout(A_1))
O_A = LayerNorm(O_1 + Dropout(FFN(O_1)))
obtaining the inter-modal unit attention matrix O_A, i.e., the inter-modal unit attention features, where O_1 denotes the initial inter-modal unit attention matrix.
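For illustration, the following PyTorch sketch implements one intra-modal unit attention block along the lines of steps S3-3 to S3-6 (multi-head scaled dot-product attention followed by the LayerNorm/Dropout/FFN steps given above); the head count, dropout rate, FFN width and class name are assumptions, and the scores are scaled by the per-head dimension, a common multi-head convention.

```python
# Sketch of one intra-modal unit attention block (steps S3-3 to S3-6).
import math
import torch
import torch.nn as nn

class IntraModalUnitAttention(nn.Module):
    def __init__(self, emb_dim: int = 512, mh: int = 8, dropout: float = 0.1):
        super().__init__()
        self.mh, self.d_head = mh, emb_dim // mh
        self.q_proj = nn.Linear(emb_dim, emb_dim)   # trans(X): Q, K, V projections
        self.k_proj = nn.Linear(emb_dim, emb_dim)
        self.v_proj = nn.Linear(emb_dim, emb_dim)
        self.out_proj = nn.Linear(emb_dim, emb_dim) # trans'(·): back to emb_dim
        self.ffn = nn.Sequential(                   # FFN(O') = max(0, O'W1+b1)W2+b2
            nn.Linear(emb_dim, 4 * emb_dim), nn.ReLU(), nn.Linear(4 * emb_dim, emb_dim))
        self.norm1, self.norm2 = nn.LayerNorm(emb_dim), nn.LayerNorm(emb_dim)
        self.drop = nn.Dropout(dropout)

    def split_heads(self, x):                       # [b, n, d] -> [b, MH, n, d_head]
        b, n, _ = x.shape
        return x.view(b, n, self.mh, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = (self.split_heads(p(x)) for p in (self.q_proj, self.k_proj, self.v_proj))
        s = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # attention scores S
        a = torch.softmax(s, dim=-1)                            # attention weights A
        o = (a @ v).transpose(1, 2).reshape(x.size(0), x.size(1), -1)
        o = self.out_proj(o)                                    # O = trans'(A·V)
        o1 = self.norm1(o + self.drop(o))                       # O'
        return self.norm2(o1 + self.drop(self.ffn(o1)))         # O_I
```

The inter-modal interaction attention of steps S3-8 and S3-9 can be sketched as a variant of the same block in which the query comes from one modality and the keys and values from the other; the class below reuses the layers of IntraModalUnitAttention from the previous sketch and is likewise only an assumed illustration.

```python
# Sketch of the inter-modal interaction attention (steps S3-8 and S3-9).
class InterModalInteractionAttention(IntraModalUnitAttention):
    def forward(self, x_query: torch.Tensor, y_keyval: torch.Tensor) -> torch.Tensor:
        q = self.split_heads(self.q_proj(x_query))            # Q_1 from one modality
        k = self.split_heads(self.k_proj(y_keyval))           # K_1 from the other
        v = self.split_heads(self.v_proj(y_keyval))           # V_1 from the other
        s = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        a1 = torch.softmax(s, dim=-1) @ v                     # attended features A_1
        a1 = self.out_proj(a1.transpose(1, 2).reshape(x_query.size(0), x_query.size(1), -1))
        o1 = self.norm1(a1 + self.drop(a1))                   # O_1
        return self.norm2(o1 + self.drop(self.ffn(o1)))       # O_A
```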
The back propagation algorithm of step S3-2 uses a binary cross entropy loss function.
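A minimal sketch of this training objective, assuming a multi-label answer vocabulary (its size, 3129, is an assumption) and PyTorch's BCEWithLogitsLoss as the binary cross-entropy criterion, with stochastic gradient descent as the optimizer:

```python
# Sketch of the training objective: binary cross-entropy over answer categories.
import torch
import torch.nn as nn

num_answers = 3129                                    # assumed answer-vocabulary size
model_head = nn.Linear(1024, num_answers)             # stand-in for the full network
criterion = nn.BCEWithLogitsLoss()                    # binary cross-entropy on logits
optimizer = torch.optim.SGD(model_head.parameters(), lr=0.01)  # stochastic gradient descent

features = torch.randn(8, 1024)                       # stand-in fused features
targets = torch.zeros(8, num_answers); targets[:, 0] = 1.0  # multi-hot answer labels
loss = criterion(model_head(features), targets)
loss.backward()                                       # back-propagation
optimizer.step()
```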
As shown in fig. 2, the specific steps of step S4 are as follows:
S4-1, according to the formula:
obtaining the inter-modal interaction attention feature XO guided by the image features and the inter-modal interaction attention feature YO guided by the question features, where CR(·) denotes the inter-modal interaction attention, X' denotes the intra-modal unit attention features of the image, and Y' denotes the intra-modal unit attention features of the question;
S4-2, according to the formula:
obtaining the inter-modal interaction attention feature X[i] guided by the image features after deep stacking of the i-th hidden layer and the inter-modal interaction attention feature Y[i] guided by the question features after deep stacking of the i-th hidden layer, where i denotes the i-th hidden layer, XO[i-1] denotes the image-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer, and YO[i-1] denotes the question-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer;
S4-3, according to the formula:
obtaining the output image feature x[i] of the i-th hidden layer and the output question feature y[i] of the i-th hidden layer, where X[i-1] denotes the image-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, Y[i-1] denotes the question-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, α_i denotes the trainable image weight variable of the i-th hidden layer, and β_i denotes the trainable question weight variable of the i-th hidden layer.
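One possible reading of steps S4-1 to S4-3 is sketched below in PyTorch; because the combination formulas themselves are not reproduced in this text, the way the trainable per-layer weights α_i and β_i gate the residual connections, and the choice of which modality supplies the query in each direction, are assumptions. The class reuses InterModalInteractionAttention from the sketch after step S3-9.

```python
# Assumed sketch of modal bidirectional joint interaction with residual stacked fusion.
import torch
import torch.nn as nn

class BidirectionalJointFusion(nn.Module):
    def __init__(self, emb_dim: int = 512, num_layers: int = 4):
        super().__init__()
        self.img_guided = nn.ModuleList(
            [InterModalInteractionAttention(emb_dim) for _ in range(num_layers)])
        self.ques_guided = nn.ModuleList(
            [InterModalInteractionAttention(emb_dim) for _ in range(num_layers)])
        self.alpha = nn.Parameter(torch.ones(num_layers))   # per-layer image weights α_i
        self.beta = nn.Parameter(torch.ones(num_layers))    # per-layer question weights β_i

    def forward(self, x_img: torch.Tensor, y_ques: torch.Tensor):
        x, y = x_img, y_ques                     # intra-modal unit attention features
        for i in range(len(self.alpha)):
            xo = self.img_guided[i](x, y)        # image features as query (assumed direction)
            yo = self.ques_guided[i](y, x)       # question features as query (assumed direction)
            # assumed residual stacking: trainable weights gate the skip connections
            x = xo + self.alpha[i] * x
            y = yo + self.beta[i] * y
        return x, y                              # outputs of the last hidden layer
```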
The formula of the linear transformation mapping in step S5 is as follows:
O_F = proj(concat(x[I], y[I]))
where O_F denotes the answer vector, x[I] denotes the output image feature of the last hidden layer, y[I] denotes the output question feature of the last hidden layer, concat(·) denotes the concatenation operation, and proj(·) denotes the linear transformation mapping function.
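Finally, a small sketch of step S5, concatenating the last hidden layer's image and question features and projecting them onto the answer categories; pooling over the object/word dimension by summation and the class name AnswerHead are assumptions (the text only specifies feature stacking followed by a linear transformation).

```python
# Sketch of the answer prediction head (step S5).
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    def __init__(self, emb_dim: int = 512, num_answers: int = 3129):
        super().__init__()
        self.proj = nn.Linear(2 * emb_dim, num_answers)       # proj(·)

    def forward(self, x_last: torch.Tensor, y_last: torch.Tensor) -> torch.Tensor:
        # concat(x[I], y[I]) after pooling each modality over its object/word axis
        pooled = torch.cat([x_last.sum(dim=1), y_last.sum(dim=1)], dim=-1)
        return self.proj(pooled)                              # multi-category answer logits

head = AnswerHead()
logits = head(torch.randn(8, 100, 512), torch.randn(8, 14, 512))   # -> [8, 3129]
```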
In summary, by constructing a model for image-text feature extraction and deep fusion based on modal joint interaction and introducing a modal joint interaction mechanism, the invention realizes bidirectional guidance between image and question features and thereby improves the interaction capability of the model; the residual deep-stacking fusion mechanism enhances information sharing across the cross-modal semantic space. The modal bidirectional guidance mechanism considers the deep interaction between the two modalities at the same time and adopts joint forward and backward guidance of the intra-modal unit attention features of the two modalities, which strengthens the multi-modal interaction capability of the model and improves answer classification. The residual stacking fusion mechanism uses deep stacking to further interact the bidirectionally guided features; the design of the residual dynamic mechanism improves the expressive power, avoids the vanishing-gradient problem of deep neural networks during training, and improves the generalization of the model.
Claims (7)
1. An image question-answering method based on modal joint interaction, characterized by comprising the following steps:
S1, initializing an object detection network with a convolutional neural network as its core by means of a pre-trained residual network to obtain dynamic features of the input image; padding the dynamic features of the input image to obtain an image feature vector;
S2, performing word vectorization on the question text through a pre-trained global word vector model to obtain a question text feature vector; performing representation processing on the question text feature vector to obtain a question high-level feature vector;
S3, constructing an image question-answering network with intra-modal unit attention and inter-modal interaction attention as its core; inputting the image feature vector and the question high-level feature vector into the image question-answering network to obtain intra-modal unit attention features and inter-modal unit attention features;
S4, deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of different hidden layers;
S5, combining the output image features and output question features of the last hidden layer through feature stacking to obtain the final features; and mapping the final features into a multi-category vector through a linear transformation for answer prediction.
2. The image question-answering method based on modal joint interaction according to claim 1, wherein: the pre-trained residual network in the step S1 adopts a ResNet-101 network structure, and the convolutional neural network adopts a Faster R-CNN network structure; the global word vector model in the step S2 adopts a GloVe model.
3. The image question-answering method based on modal joint interaction according to claim 1, wherein: the specific steps of the step S2 are as follows:
S2-1, performing word segmentation and vectorization on the m words in the question text to obtain a question text feature vector;
S2-2, padding the question text feature vector with zeros to obtain a vector representation of dimension M_QUES × EMB_SIZE, where the vector at position t is the vector representation of the question text at time t, M_QUES denotes the number of question words, and EMB_SIZE denotes the embedding dimension of the question;
S2-3, introducing a bidirectional gated recurrent unit, and according to the formulas:
h_t^→ = GRU(x_t, h_{t-1}^→), h_t^← = GRU(x_t, h_{t+1}^←)
obtaining the bidirectional hidden state at time t, namely the forward hidden state h_t^→ and the backward hidden state h_t^←, where GRU(·) denotes the recurrent neural network model, x_t denotes the question word vector at time t, h_{t-1}^→ denotes the forward hidden state at time t-1, and h_{t+1}^← denotes the backward hidden state at time t+1;
S2-4, concatenating the question text feature vectors at all times to obtain the final question high-level feature vector.
4. The image question-answering method based on modal joint interaction according to claim 1, wherein: the specific steps of the step S3 are as follows:
S3-1, initializing the model parameters of the image question-answering network, and inputting the image feature vector and the question high-level feature vector into the image question-answering network;
S3-2, training the model of the image question-answering network: with the preset answer categories as the training target, training the image question-answering network through a back-propagation algorithm and stochastic gradient descent, and adjusting the parameters of the image question-answering network to obtain the trained image question-answering network, where the parameters of the image question-answering network comprise the learnable weight matrices W_n and the bias terms b_n;
S3-3, constructing an intra-modal unit attention module for the image and an intra-modal unit attention module for the question, selecting one of the two modules, and according to the formula:
Q, K, V = trans(X)
obtaining the query vector Q, key vector K and value vector V, where n denotes the number of question words or the number of image objects, EMB_DIM denotes the embedding dimension of the question, X denotes the feature vector of one modality with dimension n × EMB_DIM, and trans(·) denotes the transformation of the feature vector X into multi-head feature vectors; the feature vector X is either the image feature vector or the question high-level feature vector;
S3-4, according to the formula:
S = Q·K^T / √d_k
obtaining the attention score matrix S, where K^T denotes the transpose of the key vector K and d_k denotes the EMB_DIM size of the query vector Q;
S3-5, according to the formula:
A = softmax(S), S ∈ R^(MH×n×n)
obtaining the attention weight matrix A, where softmax(·) denotes the normalized exponential function, R denotes the set of real numbers, and MH denotes the number of attention heads;
S3-6, transforming the multi-head feature vectors back to the same dimension as the original input, and according to the formulas:
O = trans'(A·V)
O' = LayerNorm(O + Dropout(O))
FFN(O') = max(0, O'·W_1 + b_1)·W_2 + b_2
O_I = LayerNorm(O' + Dropout(FFN(O')))
obtaining the intra-modal unit attention matrix O_I, i.e., the intra-modal unit attention features of the modality, where trans'(·) denotes the dimension transformation function, O denotes the initial intra-modal unit attention matrix, O' denotes the intermediate intra-modal unit attention matrix, Dropout(·) denotes random dropout, LayerNorm(·) denotes the layer normalization function, FFN(·) denotes a feed-forward neural network, max(0,·) implements the ReLU activation function, W_1 denotes the learnable weight matrix from the input layer to the hidden layer, W_2 denotes the learnable weight matrix from the hidden layer to the output layer, b_1 denotes the bias term from the input layer to the hidden layer, and b_2 denotes the bias term from the hidden layer to the output layer;
S3-7, repeating steps S3-3 to S3-6 to obtain the intra-modal unit attention matrix O_i of the other modality;
S3-8, constructing the inter-modal interaction attention, taking the features of one modality as the query and the features of the other modality as the keys and values, and according to the formula:
A_1 = softmax(Q_1·K_1^T / √d_1)·V_1
obtaining the attention weight matrix A_1, where Q_1 denotes the query vector of one modal feature, K_1 denotes the key vector of the other modal feature, K_1^T denotes the transpose of the key vector K_1, d_1 denotes the EMB_DIM size of the query vector Q_1, and V_1 denotes the value vector of the other modal feature;
S3-9, according to the formulas:
O_1 = LayerNorm(A_1 + Dropout(A_1))
O_A = LayerNorm(O_1 + Dropout(FFN(O_1)))
obtaining the inter-modal unit attention matrix O_A, i.e., the inter-modal unit attention features, where O_1 denotes the initial inter-modal unit attention matrix.
5. The image question-answering method based on modal joint interaction according to claim 4, wherein: the back propagation algorithm of step S3-2 uses a binary cross entropy loss function.
6. The image question-answering method based on modal joint interaction according to claim 1, wherein: the specific steps of the step S4 are as follows:
S4-1, according to the formula:
obtaining the inter-modal interaction attention feature XO guided by the image features and the inter-modal interaction attention feature YO guided by the question features, where CR(·) denotes the inter-modal interaction attention, X' denotes the intra-modal unit attention features of the image, and Y' denotes the intra-modal unit attention features of the question;
S4-2, according to the formula:
obtaining the inter-modal interaction attention feature X[i] guided by the image features after deep stacking of the i-th hidden layer and the inter-modal interaction attention feature Y[i] guided by the question features after deep stacking of the i-th hidden layer, where i denotes the i-th hidden layer, XO[i-1] denotes the image-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer, and YO[i-1] denotes the question-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer;
S4-3, according to the formula:
obtaining the output image feature x[i] of the i-th hidden layer and the output question feature y[i] of the i-th hidden layer, where X[i-1] denotes the image-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, Y[i-1] denotes the question-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, α_i denotes the trainable image weight variable of the i-th hidden layer, and β_i denotes the trainable question weight variable of the i-th hidden layer.
7. The image question-answering method based on modal joint interaction according to claim 1, wherein: the formula of the linear transformation mapping in step S5 is as follows:
O_F = proj(concat(x[I], y[I]))
where O_F denotes the answer vector, x[I] denotes the output image feature of the last hidden layer, y[I] denotes the output question feature of the last hidden layer, concat(·) denotes the concatenation operation, and proj(·) denotes the linear transformation mapping function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310749393.5A CN116756287A (en) | 2023-06-21 | 2023-06-21 | Image question-answering method based on modal joint interaction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310749393.5A CN116756287A (en) | 2023-06-21 | 2023-06-21 | Image question-answering method based on modal joint interaction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116756287A true CN116756287A (en) | 2023-09-15 |
Family
ID=87949318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310749393.5A Pending CN116756287A (en) | 2023-06-21 | 2023-06-21 | Image question-answering method based on modal joint interaction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116756287A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117422704A (en) * | 2023-11-23 | 2024-01-19 | 南华大学附属第一医院 | Cancer prediction method, system and equipment based on multi-mode data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |