CN116756287A - Image question-answering method based on modal joint interaction - Google Patents
- Publication number
- CN116756287A (Application CN202310749393.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- features
- feature
- modal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses an image question-answering method based on modal joint interaction, which comprises the following steps: preprocessing the image and the question to obtain the corresponding image feature vector and question high-level feature vector; constructing an image question-answering network and obtaining intra-modal unit attention features and inter-modal unit attention features; deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of the hidden layers; combining the output image features and output question features of the last hidden layer to obtain the final features; and mapping the final features into a multi-category vector through a linear transformation for answer prediction. The invention realizes bidirectional guidance between image and question features and improves the interaction capability of the model, enhances information sharing across the cross-modal semantic space, strengthens the multi-modal interaction capability of the model, and improves answer classification.
Description
Technical Field
The invention relates to the technical field of image question answering with artificial-intelligence deep learning, and in particular to an image question-answering method based on modal joint interaction.
Background
At present, great improvements in network speed and computing performance allow data to be transmitted over the Internet in many modalities, from the earliest text and image transmission to voice, video and other modalities, a development inseparable from advances in computer software and hardware. Image question answering is a classical cross-modal task: a computer is required to locate key information in a given image and question and infer the answer to the question. Compared with single-modality tasks such as object detection and text question-answering systems, the computer model must understand image and text information at a finer granularity and, at the same time, fuse and interact the two modalities to achieve reasoning capability.
Image question answering requires understanding both the image and the question text. The image itself is stored in a high-dimensional data structure and contains a large amount of visual information such as color and shape, which is intuitive to humans but very challenging for computers.
With the development of computer vision, more and more image feature extraction methods can effectively extract and encode the information in images, so that a computer can understand their content. The question text belongs to natural language and contains a great deal of semantic and grammatical information that a computer cannot understand intuitively. In addition, natural language suffers from ambiguity and vagueness, which further increases the difficulty of machine understanding. Work in natural language processing has shown that word embedding methods enable computers to understand natural language well, which is key to improving machine reasoning capability. However, image and text features remain difficult to fuse effectively because of their semantic gap, it is hard to endow a model with reasoning capability, and the interaction capability of existing models is low; these are still the main problems of the image question-answering task.
Disclosure of Invention
Aiming at the above defects in the prior art, the image question-answering method based on modal joint interaction provided by the invention solves the prior-art problem that answer prediction cannot be performed because image and text features are difficult to fuse.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
the image question-answering method based on the modal joint interaction comprises the following steps:
S1, initializing an object detection network with a convolutional neural network as its core by means of a pre-trained residual network to obtain dynamic features of the input image; padding the dynamic features of the input image to obtain an image feature vector;
S2, performing word vectorization on the question text through a pre-trained global word vector model to obtain a question text feature vector; performing representation processing on the question text feature vector to obtain a question high-level feature vector;
S3, constructing an image question-answering network with intra-modal unit attention and inter-modal interaction attention as its core; inputting the image feature vector and the question high-level feature vector into the image question-answering network to obtain intra-modal unit attention features and inter-modal unit attention features;
S4, deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of different hidden layers;
S5, combining the output image features and output question features of the last hidden layer through feature stacking to obtain the final features; and mapping the final features into a multi-category vector through a linear transformation for answer prediction.
Further, the residual network pre-trained in step S1 adopts the ResNet-101 network structure, and the convolutional neural network adopts the Faster R-CNN network structure; the global word vector model in step S2 adopts the GloVe model.
Further, the specific steps of step S2 are as follows:
S2-1, performing word segmentation and vectorization on the m words in the question text to obtain a question text feature vector;
S2-2, padding the question text feature vector with zeros to obtain a vector representation of dimension M_QUES × EMB_SIZE, where the vector at position t is the vector representation of the question text at time t, M_QUES denotes the number of question words, and EMB_SIZE denotes the embedding dimension of the question;
S2-3, introducing a bidirectional gated recurrent unit, and according to the formulas:
h_t^→ = GRU(x_t, h_{t-1}^→), h_t^← = GRU(x_t, h_{t+1}^←)
obtaining the bidirectional hidden state at time t, namely the forward hidden state h_t^→ and the backward hidden state h_t^←, where GRU(·) denotes the recurrent neural network model, x_t denotes the question word vector at time t, h_{t-1}^→ denotes the forward hidden state at time t-1, and h_{t+1}^← denotes the backward hidden state at time t+1;
S2-4, concatenating the question text feature vectors at all times to obtain the final question high-level feature vector.
Further, the specific steps of step S3 are as follows:
S3-1, initializing the model parameters of the image question-answering network, and inputting the image feature vector and the question high-level feature vector into the image question-answering network;
S3-2, training the model of the image question-answering network: with the preset answer categories as the training target, training the image question-answering network through a back-propagation algorithm and stochastic gradient descent, and adjusting the parameters of the image question-answering network to obtain the trained image question-answering network, where the parameters of the image question-answering network comprise the learnable weight matrices W_n and the bias terms b_n;
S3-3, constructing an intra-modal unit attention module for the image and an intra-modal unit attention module for the question, selecting one of the two modules, and according to the formula:
Q, K, V = trans(X)
obtaining the query vector Q, key vector K and value vector V, where n denotes the number of question words or the number of image objects, EMB_DIM denotes the embedding dimension of the question, X denotes the feature vector of one modality with dimension n × EMB_DIM, and trans(·) denotes the transformation of the feature vector X into multi-head feature vectors; the feature vector X is either the image feature vector or the question high-level feature vector;
S3-4, according to the formula:
S = Q·K^T / √d_k
obtaining the attention score matrix S, where K^T denotes the transpose of the key vector K and d_k denotes the EMB_DIM size of the query vector Q;
S3-5, according to the formula:
A = softmax(S), S ∈ R^(MH×n×n)
obtaining the attention weight matrix A, where softmax(·) denotes the normalized exponential function, R denotes the set of real numbers, and MH denotes the number of attention heads;
S3-6, transforming the multi-head feature vectors back to the same dimension as the original input, and according to the formulas:
O = trans'(A·V)
O' = LayerNorm(O + Dropout(O))
FFN(O') = max(0, O'·W_1 + b_1)·W_2 + b_2
O_I = LayerNorm(O' + Dropout(FFN(O')))
obtaining the intra-modal unit attention matrix O_I, i.e., the intra-modal unit attention features of the modality, where trans'(·) denotes the dimension transformation function, O denotes the initial intra-modal unit attention matrix, O' denotes the intermediate intra-modal unit attention matrix, Dropout(·) denotes random dropout, LayerNorm(·) denotes the layer normalization function, FFN(·) denotes a feed-forward neural network, max(0,·) implements the ReLU activation function, W_1 denotes the learnable weight matrix from the input layer to the hidden layer, W_2 denotes the learnable weight matrix from the hidden layer to the output layer, b_1 denotes the bias term from the input layer to the hidden layer, and b_2 denotes the bias term from the hidden layer to the output layer;
S3-7, repeating steps S3-3 to S3-6 to obtain the intra-modal unit attention matrix O_i of the other modality;
S3-8, constructing the inter-modal interaction attention, taking the features of one modality as the query and the features of the other modality as the keys and values, and according to the formula:
A_1 = softmax(Q_1·K_1^T / √d_1)·V_1
obtaining the attention weight matrix A_1, where Q_1 denotes the query vector of one modal feature, K_1 denotes the key vector of the other modal feature, K_1^T denotes the transpose of the key vector K_1, d_1 denotes the EMB_DIM size of the query vector Q_1, and V_1 denotes the value vector of the other modal feature;
S3-9, according to the formulas:
O_1 = LayerNorm(A_1 + Dropout(A_1))
O_A = LayerNorm(O_1 + Dropout(FFN(O_1)))
obtaining the inter-modal unit attention matrix O_A, i.e., the inter-modal unit attention features, where O_1 denotes the initial inter-modal unit attention matrix.
Further, the loss function employed by the back-propagation algorithm of step S3-2 is a binary cross-entropy loss function.
Further, the specific steps of step S4 are as follows:
S4-1, according to the formula:
obtaining the inter-modal interaction attention feature XO guided by the image features and the inter-modal interaction attention feature YO guided by the question features, where CR(·) denotes the inter-modal interaction attention, X' denotes the intra-modal unit attention features of the image, and Y' denotes the intra-modal unit attention features of the question;
S4-2, according to the formula:
obtaining the inter-modal interaction attention feature X[i] guided by the image features after deep stacking of the i-th hidden layer and the inter-modal interaction attention feature Y[i] guided by the question features after deep stacking of the i-th hidden layer, where i denotes the i-th hidden layer, XO[i-1] denotes the image-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer, and YO[i-1] denotes the question-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer;
S4-3, according to the formula:
obtaining the output image feature x[i] of the i-th hidden layer and the output question feature y[i] of the i-th hidden layer, where X[i-1] denotes the image-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, Y[i-1] denotes the question-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, α_i denotes the trainable image weight variable of the i-th hidden layer, and β_i denotes the trainable question weight variable of the i-th hidden layer.
Further, the formula of the linear transformation mapping in step S5 is as follows:
O_F = proj(concat(x[I], y[I]))
where O_F denotes the answer vector, x[I] denotes the output image feature of the last hidden layer, y[I] denotes the output question feature of the last hidden layer, concat(·) denotes the concatenation operation, and proj(·) denotes the linear transformation mapping function.
The beneficial effects of the invention are as follows:
1. By constructing a model for image-text feature extraction and deep fusion based on modal joint interaction and introducing a modal joint interaction mechanism, the invention realizes bidirectional guidance between image and question features and thereby improves the interaction capability of the model; the residual deep-stacking fusion mechanism enhances information sharing across the cross-modal semantic space.
2. The modal bidirectional guidance mechanism of the invention considers the deep interaction between the two modalities at the same time and adopts joint forward and backward guidance of the intra-modal unit attention features of the two modalities, which strengthens the multi-modal interaction capability of the model and improves answer classification;
3. The residual stacking fusion mechanism of the invention uses deep stacking to further interact the bidirectionally guided features; the design of the residual dynamic mechanism improves the expressive power, avoids the vanishing-gradient problem of deep neural networks during training, and improves the generalization of the model.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the modal bidirectional joint interaction and residual stacked deep fusion of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these specific embodiments; for those of ordinary skill in the art, all inventions that make use of the inventive concept fall within the protection scope of the invention as defined by the appended claims.
In one embodiment of the present invention, as shown in fig. 1, an image question-answering method based on modal joint interaction includes the following steps:
S1, initializing an object detection network with a convolutional neural network as its core by means of a pre-trained residual network to obtain dynamic features of the input image; padding the dynamic features of the input image to obtain an image feature vector;
S2, performing word vectorization on the question text through a pre-trained global word vector model to obtain a question text feature vector; performing representation processing on the question text feature vector to obtain a question high-level feature vector;
S3, constructing an image question-answering network with intra-modal unit attention and inter-modal interaction attention as its core; inputting the image feature vector and the question high-level feature vector into the image question-answering network to obtain intra-modal unit attention features and inter-modal unit attention features;
S4, deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of different hidden layers;
S5, combining the output image features and output question features of the last hidden layer through feature stacking to obtain the final features; and mapping the final features into a multi-category vector through a linear transformation for answer prediction.
The pre-trained residual network in the step S1 adopts a ResNet-101 network structure, and the convolutional neural network adopts a Faster R-CNN network structure; the global word vector model in the step S2 adopts a GloVe model.
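By way of illustration only, the following PyTorch sketch shows the padding operation of step S1 on the assumption that a Faster R-CNN detector with a ResNet-101 backbone has already produced one region feature per detected object; the feature dimension, the maximum object count M_OBJ and the function name pad_image_features are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch (PyTorch), not the patented implementation: it assumes a
# Faster R-CNN detector has already produced one 2048-d region feature per
# detected object, and only shows the padding described in step S1.
import torch

M_OBJ = 100        # assumed maximum number of image objects kept per image
FEAT_DIM = 2048    # assumed dimension of each region feature

def pad_image_features(region_feats: torch.Tensor) -> torch.Tensor:
    """Pad (or truncate) [num_objects, FEAT_DIM] region features to [M_OBJ, FEAT_DIM]."""
    num_objects = region_feats.size(0)
    padded = region_feats.new_zeros(M_OBJ, FEAT_DIM)
    keep = min(num_objects, M_OBJ)
    padded[:keep] = region_feats[:keep]
    return padded

# Example: an image with 36 detected objects becomes a fixed-size feature matrix.
features = pad_image_features(torch.randn(36, FEAT_DIM))   # -> [100, 2048]
```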
The specific steps of step S2 are as follows:
S2-1, performing word segmentation and vectorization on the m words in the question text to obtain a question text feature vector;
S2-2, padding the question text feature vector with zeros to obtain a vector representation of dimension M_QUES × EMB_SIZE, where the vector at position t is the vector representation of the question text at time t, M_QUES denotes the number of question words, and EMB_SIZE denotes the embedding dimension of the question;
S2-3, introducing a bidirectional gated recurrent unit, and according to the formulas:
h_t^→ = GRU(x_t, h_{t-1}^→), h_t^← = GRU(x_t, h_{t+1}^←)
obtaining the bidirectional hidden state at time t, namely the forward hidden state h_t^→ and the backward hidden state h_t^←, where GRU(·) denotes the recurrent neural network model, x_t denotes the question word vector at time t, h_{t-1}^→ denotes the forward hidden state at time t-1, and h_{t+1}^← denotes the backward hidden state at time t+1;
S2-4, concatenating the question text feature vectors at all times to obtain the final question high-level feature vector.
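The following is a minimal PyTorch sketch of the question encoding in step S2, assuming the GloVe vectors have already been loaded into an embedding matrix; the dimensions (M_QUES, EMB_SIZE, HIDDEN), vocabulary size and class name QuestionEncoder are illustrative assumptions.

```python
# Minimal sketch of the question encoding (GloVe embedding + bidirectional GRU).
import torch
import torch.nn as nn

M_QUES, EMB_SIZE, HIDDEN = 14, 300, 512

class QuestionEncoder(nn.Module):
    def __init__(self, glove_weights: torch.Tensor):
        super().__init__()
        # GloVe word vectors as an embedding table (optionally fine-tuned)
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        # Bidirectional gated recurrent unit: forward and backward hidden states
        self.bigru = nn.GRU(EMB_SIZE, HIDDEN, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, M_QUES], zero-padded to the fixed question length
        x = self.embed(token_ids)        # [batch, M_QUES, EMB_SIZE]
        states, _ = self.bigru(x)        # [batch, M_QUES, 2*HIDDEN]
        return states                    # question high-level feature vectors

encoder = QuestionEncoder(torch.randn(20000, EMB_SIZE))   # dummy GloVe table
q_feat = encoder(torch.randint(0, 20000, (8, M_QUES)))    # -> [8, 14, 1024]
```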
As shown in fig. 2, the specific steps of step S3 are as follows:
S3-1, initializing the model parameters of the image question-answering network, and inputting the image feature vector and the question high-level feature vector into the image question-answering network;
S3-2, training the model of the image question-answering network: with the preset answer categories as the training target, training the image question-answering network through a back-propagation algorithm and stochastic gradient descent, and adjusting the parameters of the image question-answering network to obtain the trained image question-answering network, where the parameters of the image question-answering network comprise the learnable weight matrices W_n and the bias terms b_n;
S3-3, constructing an intra-modal unit attention module for the image and an intra-modal unit attention module for the question, selecting one of the two modules, and according to the formula:
Q, K, V = trans(X)
obtaining the query vector Q, key vector K and value vector V, where n denotes the number of question words or the number of image objects, EMB_DIM denotes the embedding dimension of the question, X denotes the feature vector of one modality with dimension n × EMB_DIM, and trans(·) denotes the transformation of the feature vector X into multi-head feature vectors; the feature vector X is either the image feature vector or the question high-level feature vector;
S3-4, according to the formula:
S = Q·K^T / √d_k
obtaining the attention score matrix S, where K^T denotes the transpose of the key vector K and d_k denotes the EMB_DIM size of the query vector Q;
S3-5, according to the formula:
A = softmax(S), S ∈ R^(MH×n×n)
obtaining the attention weight matrix A, where softmax(·) denotes the normalized exponential function, R denotes the set of real numbers, and MH denotes the number of attention heads;
S3-6, transforming the multi-head feature vectors back to the same dimension as the original input, and according to the formulas:
O = trans'(A·V)
O' = LayerNorm(O + Dropout(O))
FFN(O') = max(0, O'·W_1 + b_1)·W_2 + b_2
O_I = LayerNorm(O' + Dropout(FFN(O')))
obtaining the intra-modal unit attention matrix O_I, i.e., the intra-modal unit attention features of the modality, where trans'(·) denotes the dimension transformation function, O denotes the initial intra-modal unit attention matrix, O' denotes the intermediate intra-modal unit attention matrix, Dropout(·) denotes random dropout, LayerNorm(·) denotes the layer normalization function, FFN(·) denotes a feed-forward neural network, max(0,·) implements the ReLU activation function, W_1 denotes the learnable weight matrix from the input layer to the hidden layer, W_2 denotes the learnable weight matrix from the hidden layer to the output layer, b_1 denotes the bias term from the input layer to the hidden layer, and b_2 denotes the bias term from the hidden layer to the output layer;
S3-7, repeating steps S3-3 to S3-6 to obtain the intra-modal unit attention matrix O_i of the other modality;
S3-8, constructing the inter-modal interaction attention, taking the features of one modality as the query and the features of the other modality as the keys and values, and according to the formula:
A_1 = softmax(Q_1·K_1^T / √d_1)·V_1
obtaining the attention weight matrix A_1, where Q_1 denotes the query vector of one modal feature, K_1 denotes the key vector of the other modal feature, K_1^T denotes the transpose of the key vector K_1, d_1 denotes the EMB_DIM size of the query vector Q_1, and V_1 denotes the value vector of the other modal feature;
S3-9, according to the formulas:
O_1 = LayerNorm(A_1 + Dropout(A_1))
O_A = LayerNorm(O_1 + Dropout(FFN(O_1)))
obtaining the inter-modal unit attention matrix O_A, i.e., the inter-modal unit attention features, where O_1 denotes the initial inter-modal unit attention matrix.
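For illustration, the following PyTorch sketch implements one intra-modal unit attention block along the lines of steps S3-3 to S3-6 (multi-head scaled dot-product attention followed by the LayerNorm/Dropout/FFN steps given above); the head count, dropout rate, FFN width and class name are assumptions, and the scores are scaled by the per-head dimension, a common multi-head convention.

```python
# Sketch of one intra-modal unit attention block (steps S3-3 to S3-6).
import math
import torch
import torch.nn as nn

class IntraModalUnitAttention(nn.Module):
    def __init__(self, emb_dim: int = 512, mh: int = 8, dropout: float = 0.1):
        super().__init__()
        self.mh, self.d_head = mh, emb_dim // mh
        self.q_proj = nn.Linear(emb_dim, emb_dim)   # trans(X): Q, K, V projections
        self.k_proj = nn.Linear(emb_dim, emb_dim)
        self.v_proj = nn.Linear(emb_dim, emb_dim)
        self.out_proj = nn.Linear(emb_dim, emb_dim) # trans'(·): back to emb_dim
        self.ffn = nn.Sequential(                   # FFN(O') = max(0, O'W1+b1)W2+b2
            nn.Linear(emb_dim, 4 * emb_dim), nn.ReLU(), nn.Linear(4 * emb_dim, emb_dim))
        self.norm1, self.norm2 = nn.LayerNorm(emb_dim), nn.LayerNorm(emb_dim)
        self.drop = nn.Dropout(dropout)

    def split_heads(self, x):                       # [b, n, d] -> [b, MH, n, d_head]
        b, n, _ = x.shape
        return x.view(b, n, self.mh, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = (self.split_heads(p(x)) for p in (self.q_proj, self.k_proj, self.v_proj))
        s = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # attention scores S
        a = torch.softmax(s, dim=-1)                            # attention weights A
        o = (a @ v).transpose(1, 2).reshape(x.size(0), x.size(1), -1)
        o = self.out_proj(o)                                    # O = trans'(A·V)
        o1 = self.norm1(o + self.drop(o))                       # O'
        return self.norm2(o1 + self.drop(self.ffn(o1)))         # O_I
```

The inter-modal interaction attention of steps S3-8 and S3-9 can be sketched as a variant of the same block in which the query comes from one modality and the keys and values from the other; the class below reuses the layers of IntraModalUnitAttention from the previous sketch and is likewise only an assumed illustration.

```python
# Sketch of the inter-modal interaction attention (steps S3-8 and S3-9).
class InterModalInteractionAttention(IntraModalUnitAttention):
    def forward(self, x_query: torch.Tensor, y_keyval: torch.Tensor) -> torch.Tensor:
        q = self.split_heads(self.q_proj(x_query))            # Q_1 from one modality
        k = self.split_heads(self.k_proj(y_keyval))           # K_1 from the other
        v = self.split_heads(self.v_proj(y_keyval))           # V_1 from the other
        s = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        a1 = torch.softmax(s, dim=-1) @ v                     # attended features A_1
        a1 = self.out_proj(a1.transpose(1, 2).reshape(x_query.size(0), x_query.size(1), -1))
        o1 = self.norm1(a1 + self.drop(a1))                   # O_1
        return self.norm2(o1 + self.drop(self.ffn(o1)))       # O_A
```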
The back propagation algorithm of step S3-2 uses a binary cross entropy loss function.
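A minimal sketch of this training objective, assuming a multi-label answer vocabulary (its size, 3129, is an assumption) and PyTorch's BCEWithLogitsLoss as the binary cross-entropy criterion, with stochastic gradient descent as the optimizer:

```python
# Sketch of the training objective: binary cross-entropy over answer categories.
import torch
import torch.nn as nn

num_answers = 3129                                    # assumed answer-vocabulary size
model_head = nn.Linear(1024, num_answers)             # stand-in for the full network
criterion = nn.BCEWithLogitsLoss()                    # binary cross-entropy on logits
optimizer = torch.optim.SGD(model_head.parameters(), lr=0.01)  # stochastic gradient descent

features = torch.randn(8, 1024)                       # stand-in fused features
targets = torch.zeros(8, num_answers); targets[:, 0] = 1.0  # multi-hot answer labels
loss = criterion(model_head(features), targets)
loss.backward()                                       # back-propagation
optimizer.step()
```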
As shown in fig. 2, the specific steps of step S4 are as follows:
S4-1, according to the formula:
obtaining the inter-modal interaction attention feature XO guided by the image features and the inter-modal interaction attention feature YO guided by the question features, where CR(·) denotes the inter-modal interaction attention, X' denotes the intra-modal unit attention features of the image, and Y' denotes the intra-modal unit attention features of the question;
S4-2, according to the formula:
obtaining the inter-modal interaction attention feature X[i] guided by the image features after deep stacking of the i-th hidden layer and the inter-modal interaction attention feature Y[i] guided by the question features after deep stacking of the i-th hidden layer, where i denotes the i-th hidden layer, XO[i-1] denotes the image-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer, and YO[i-1] denotes the question-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer;
S4-3, according to the formula:
obtaining the output image feature x[i] of the i-th hidden layer and the output question feature y[i] of the i-th hidden layer, where X[i-1] denotes the image-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, Y[i-1] denotes the question-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, α_i denotes the trainable image weight variable of the i-th hidden layer, and β_i denotes the trainable question weight variable of the i-th hidden layer.
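One possible reading of steps S4-1 to S4-3 is sketched below in PyTorch; because the combination formulas themselves are not reproduced in this text, the way the trainable per-layer weights α_i and β_i gate the residual connections, and the choice of which modality supplies the query in each direction, are assumptions. The class reuses InterModalInteractionAttention from the sketch after step S3-9.

```python
# Assumed sketch of modal bidirectional joint interaction with residual stacked fusion.
import torch
import torch.nn as nn

class BidirectionalJointFusion(nn.Module):
    def __init__(self, emb_dim: int = 512, num_layers: int = 4):
        super().__init__()
        self.img_guided = nn.ModuleList(
            [InterModalInteractionAttention(emb_dim) for _ in range(num_layers)])
        self.ques_guided = nn.ModuleList(
            [InterModalInteractionAttention(emb_dim) for _ in range(num_layers)])
        self.alpha = nn.Parameter(torch.ones(num_layers))   # per-layer image weights α_i
        self.beta = nn.Parameter(torch.ones(num_layers))    # per-layer question weights β_i

    def forward(self, x_img: torch.Tensor, y_ques: torch.Tensor):
        x, y = x_img, y_ques                     # intra-modal unit attention features
        for i in range(len(self.alpha)):
            xo = self.img_guided[i](x, y)        # image features as query (assumed direction)
            yo = self.ques_guided[i](y, x)       # question features as query (assumed direction)
            # assumed residual stacking: trainable weights gate the skip connections
            x = xo + self.alpha[i] * x
            y = yo + self.beta[i] * y
        return x, y                              # outputs of the last hidden layer
```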
The formula of the linear transformation mapping in step S5 is as follows:
O_F = proj(concat(x[I], y[I]))
where O_F denotes the answer vector, x[I] denotes the output image feature of the last hidden layer, y[I] denotes the output question feature of the last hidden layer, concat(·) denotes the concatenation operation, and proj(·) denotes the linear transformation mapping function.
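Finally, a small sketch of step S5, concatenating the last hidden layer's image and question features and projecting them onto the answer categories; pooling over the object/word dimension by summation and the class name AnswerHead are assumptions (the text only specifies feature stacking followed by a linear transformation).

```python
# Sketch of the answer prediction head (step S5).
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    def __init__(self, emb_dim: int = 512, num_answers: int = 3129):
        super().__init__()
        self.proj = nn.Linear(2 * emb_dim, num_answers)       # proj(·)

    def forward(self, x_last: torch.Tensor, y_last: torch.Tensor) -> torch.Tensor:
        # concat(x[I], y[I]) after pooling each modality over its object/word axis
        pooled = torch.cat([x_last.sum(dim=1), y_last.sum(dim=1)], dim=-1)
        return self.proj(pooled)                              # multi-category answer logits

head = AnswerHead()
logits = head(torch.randn(8, 100, 512), torch.randn(8, 14, 512))   # -> [8, 3129]
```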
In summary, by constructing a model for image-text feature extraction and deep fusion based on modal joint interaction and introducing a modal joint interaction mechanism, the invention realizes bidirectional guidance between image and question features and thereby improves the interaction capability of the model; the residual deep-stacking fusion mechanism enhances information sharing across the cross-modal semantic space. The modal bidirectional guidance mechanism considers the deep interaction between the two modalities at the same time and adopts joint forward and backward guidance of the intra-modal unit attention features of the two modalities, which strengthens the multi-modal interaction capability of the model and improves answer classification. The residual stacking fusion mechanism uses deep stacking to further interact the bidirectionally guided features; the design of the residual dynamic mechanism improves the expressive power, avoids the vanishing-gradient problem of deep neural networks during training, and improves the generalization of the model.
Claims (7)
1. An image question-answering method based on modal joint interaction, characterized by comprising the following steps:
S1, initializing an object detection network with a convolutional neural network as its core by means of a pre-trained residual network to obtain dynamic features of the input image; padding the dynamic features of the input image to obtain an image feature vector;
S2, performing word vectorization on the question text through a pre-trained global word vector model to obtain a question text feature vector; performing representation processing on the question text feature vector to obtain a question high-level feature vector;
S3, constructing an image question-answering network with intra-modal unit attention and inter-modal interaction attention as its core; inputting the image feature vector and the question high-level feature vector into the image question-answering network to obtain intra-modal unit attention features and inter-modal unit attention features;
S4, deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of different hidden layers;
S5, combining the output image features and output question features of the last hidden layer through feature stacking to obtain the final features; and mapping the final features into a multi-category vector through a linear transformation for answer prediction.
2. The image question-answering method based on modal joint interaction according to claim 1, wherein: the pre-trained residual network in the step S1 adopts a ResNet-101 network structure, and the convolutional neural network adopts a Faster R-CNN network structure; the global word vector model in the step S2 adopts a GloVe model.
3. The image question-answering method based on modal joint interaction according to claim 1, wherein: the specific steps of the step S2 are as follows:
S2-1, performing word segmentation and vectorization on the m words in the question text to obtain a question text feature vector;
S2-2, padding the question text feature vector with zeros to obtain a vector representation of dimension M_QUES × EMB_SIZE, where the vector at position t is the vector representation of the question text at time t, M_QUES denotes the number of question words, and EMB_SIZE denotes the embedding dimension of the question;
S2-3, introducing a bidirectional gated recurrent unit, and according to the formulas:
h_t^→ = GRU(x_t, h_{t-1}^→), h_t^← = GRU(x_t, h_{t+1}^←)
obtaining the bidirectional hidden state at time t, namely the forward hidden state h_t^→ and the backward hidden state h_t^←, where GRU(·) denotes the recurrent neural network model, x_t denotes the question word vector at time t, h_{t-1}^→ denotes the forward hidden state at time t-1, and h_{t+1}^← denotes the backward hidden state at time t+1;
S2-4, concatenating the question text feature vectors at all times to obtain the final question high-level feature vector.
4. The image question-answering method based on modal joint interaction according to claim 1, wherein: the specific steps of the step S3 are as follows:
S3-1, initializing the model parameters of the image question-answering network, and inputting the image feature vector and the question high-level feature vector into the image question-answering network;
S3-2, training the model of the image question-answering network: with the preset answer categories as the training target, training the image question-answering network through a back-propagation algorithm and stochastic gradient descent, and adjusting the parameters of the image question-answering network to obtain the trained image question-answering network, where the parameters of the image question-answering network comprise the learnable weight matrices W_n and the bias terms b_n;
S3-3, constructing an intra-modal unit attention module for the image and an intra-modal unit attention module for the question, selecting one of the two modules, and according to the formula:
Q, K, V = trans(X)
obtaining the query vector Q, key vector K and value vector V, where n denotes the number of question words or the number of image objects, EMB_DIM denotes the embedding dimension of the question, X denotes the feature vector of one modality with dimension n × EMB_DIM, and trans(·) denotes the transformation of the feature vector X into multi-head feature vectors; the feature vector X is either the image feature vector or the question high-level feature vector;
S3-4, according to the formula:
S = Q·K^T / √d_k
obtaining the attention score matrix S, where K^T denotes the transpose of the key vector K and d_k denotes the EMB_DIM size of the query vector Q;
S3-5, according to the formula:
A = softmax(S), S ∈ R^(MH×n×n)
obtaining the attention weight matrix A, where softmax(·) denotes the normalized exponential function, R denotes the set of real numbers, and MH denotes the number of attention heads;
S3-6, transforming the multi-head feature vectors back to the same dimension as the original input, and according to the formulas:
O = trans'(A·V)
O' = LayerNorm(O + Dropout(O))
FFN(O') = max(0, O'·W_1 + b_1)·W_2 + b_2
O_I = LayerNorm(O' + Dropout(FFN(O')))
obtaining the intra-modal unit attention matrix O_I, i.e., the intra-modal unit attention features of the modality, where trans'(·) denotes the dimension transformation function, O denotes the initial intra-modal unit attention matrix, O' denotes the intermediate intra-modal unit attention matrix, Dropout(·) denotes random dropout, LayerNorm(·) denotes the layer normalization function, FFN(·) denotes a feed-forward neural network, max(0,·) implements the ReLU activation function, W_1 denotes the learnable weight matrix from the input layer to the hidden layer, W_2 denotes the learnable weight matrix from the hidden layer to the output layer, b_1 denotes the bias term from the input layer to the hidden layer, and b_2 denotes the bias term from the hidden layer to the output layer;
S3-7, repeating steps S3-3 to S3-6 to obtain the intra-modal unit attention matrix O_i of the other modality;
S3-8, constructing the inter-modal interaction attention, taking the features of one modality as the query and the features of the other modality as the keys and values, and according to the formula:
A_1 = softmax(Q_1·K_1^T / √d_1)·V_1
obtaining the attention weight matrix A_1, where Q_1 denotes the query vector of one modal feature, K_1 denotes the key vector of the other modal feature, K_1^T denotes the transpose of the key vector K_1, d_1 denotes the EMB_DIM size of the query vector Q_1, and V_1 denotes the value vector of the other modal feature;
S3-9, according to the formulas:
O_1 = LayerNorm(A_1 + Dropout(A_1))
O_A = LayerNorm(O_1 + Dropout(FFN(O_1)))
obtaining the inter-modal unit attention matrix O_A, i.e., the inter-modal unit attention features, where O_1 denotes the initial inter-modal unit attention matrix.
5. The image question-answering method based on modal joint interaction according to claim 4, wherein: the back propagation algorithm of step S3-2 uses a binary cross entropy loss function.
6. The image question-answering method based on modal joint interaction according to claim 1, wherein: the specific steps of the step S4 are as follows:
S4-1, according to the formula:
obtaining the inter-modal interaction attention feature XO guided by the image features and the inter-modal interaction attention feature YO guided by the question features, where CR(·) denotes the inter-modal interaction attention, X' denotes the intra-modal unit attention features of the image, and Y' denotes the intra-modal unit attention features of the question;
S4-2, according to the formula:
obtaining the inter-modal interaction attention feature X[i] guided by the image features after deep stacking of the i-th hidden layer and the inter-modal interaction attention feature Y[i] guided by the question features after deep stacking of the i-th hidden layer, where i denotes the i-th hidden layer, XO[i-1] denotes the image-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer, and YO[i-1] denotes the question-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer;
S4-3, according to the formula:
obtaining the output image feature x[i] of the i-th hidden layer and the output question feature y[i] of the i-th hidden layer, where X[i-1] denotes the image-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, Y[i-1] denotes the question-feature-guided inter-modal interaction attention feature after deep stacking of the (i-1)-th hidden layer, α_i denotes the trainable image weight variable of the i-th hidden layer, and β_i denotes the trainable question weight variable of the i-th hidden layer.
7. The image question-answering method based on modal joint interaction according to claim 1, wherein: the formula of the linear transformation mapping in step S5 is as follows:
O_F = proj(concat(x[I], y[I]))
where O_F denotes the answer vector, x[I] denotes the output image feature of the last hidden layer, y[I] denotes the output question feature of the last hidden layer, concat(·) denotes the concatenation operation, and proj(·) denotes the linear transformation mapping function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310749393.5A CN116756287A (en) | 2023-06-21 | 2023-06-21 | Image question-answering method based on modal joint interaction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310749393.5A CN116756287A (en) | 2023-06-21 | 2023-06-21 | Image question-answering method based on modal joint interaction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116756287A true CN116756287A (en) | 2023-09-15 |
Family
ID=87949318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310749393.5A Pending CN116756287A (en) | 2023-06-21 | 2023-06-21 | Image question-answering method based on modal joint interaction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116756287A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117422704A (en) * | 2023-11-23 | 2024-01-19 | 南华大学附属第一医院 | Cancer prediction method, system and equipment based on multi-mode data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |