
CN116756287A - Image question-answering method based on modal joint interaction - Google Patents

Image question-answering method based on modal joint interaction

Info

Publication number
CN116756287A
CN116756287A (application CN202310749393.5A)
Authority
CN
China
Prior art keywords
image
attention
features
feature
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310749393.5A
Other languages
Chinese (zh)
Inventor
郑旭
张栗粽
高辉
何岳峰
仲文章
刘立建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202310749393.5A
Publication of CN116756287A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image question-answering method based on modal joint interaction, which comprises the following steps: preprocessing the image and the question to obtain the corresponding image feature vector and high-level question feature vector; constructing an image question-answering network and obtaining intra-modal unit attention features and inter-modal interaction attention features; deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of the hidden layers; combining the output image features and output question features of the last hidden layer to obtain the final features; and mapping the final features into a multi-class vector through a linear transformation to perform answer prediction. The invention realizes bidirectional guidance between image and question features and improves the interaction capability of the model; enhances information sharing in the cross-modal semantic space; strengthens the multi-modal interaction capability of the model; and improves the answer classification effect.

Description

Image question-answering method based on modal joint interaction
Technical Field
The invention relates to the technical field of image question answering with artificial-intelligence deep learning, and in particular to an image question-answering method based on modal joint interaction.
Background
At present, network speed and computing performance have improved greatly, so data is transmitted over the Internet in a growing number of modalities, from the initial text and images to voice, video and other forms, a development that is inseparable from advances in computer software and hardware. Image question answering is a classical cross-modal task: a computer must locate the key information in a given image and question and infer the answer to the question. Compared with single-modal tasks such as object detection and text-only question-answering systems, it requires the model to understand image and text information at a finer granularity and to fuse and interact the two modalities at the same time in order to achieve reasoning ability.
Image question answering requires understanding both the image and the question text. An image is stored in a high-dimensional data structure and contains a large amount of visual information such as color and shape, which is intuitive to humans but very challenging for computers.
With the development of computer vision, more and more image feature extraction methods can effectively extract and encode the information in images, so that a computer can understand the various kinds of information they contain. The question text belongs to natural language, which carries a great deal of semantic and grammatical information and is not intuitively understood by a computer. In addition, natural language suffers from problems such as ambiguity and polysemy, which further increases the difficulty of understanding it. Work in natural language processing has shown that word embedding methods allow a computer to understand natural language well, which is key to improving its reasoning ability. However, image and text features are difficult to fuse effectively because of their semantic differences, it is hard to endow the model with reasoning ability, and the interaction capability of existing models is low; these remain the main problems of the image question-answering task.
Disclosure of Invention
Aiming at the above defects in the prior art, the image question-answering method based on modal joint interaction provided by the invention solves the problem that answer prediction fails because image and text features are difficult to fuse in the prior art.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
the image question-answering method based on the modal joint interaction comprises the following steps:
S1, initializing a target detection network with a convolutional neural network at its core using a pre-trained residual network, and obtaining the dynamic features of the input image; padding the dynamic features of the input image to obtain the image feature vector;
S2, performing word vectorization on the question text with a pre-trained global word vector model to obtain the question text feature vector; performing characterization processing on the question text feature vector to obtain the high-level question feature vector;
S3, constructing an image question-answering network with intra-modal unit attention and inter-modal interaction attention at its core; inputting the image feature vector and the high-level question feature vector into the image question-answering network to obtain the intra-modal unit attention features and inter-modal interaction attention features;
S4, deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of the different hidden layers;
S5, combining the output image features and output question features of the last hidden layer through feature stacking to obtain the final features; and mapping the final features into a multi-class vector through a linear transformation to perform answer prediction.
Further, the pre-trained residual network in step S1 adopts the ResNet-101 network structure, and the convolutional neural network adopts the Faster R-CNN network structure; the global word vector model in step S2 adopts the GloVe model.
Further, the specific steps of step S2 are as follows:
S2-1, performing word segmentation and vectorization on the m words in the question text to obtain the question text feature vector;
S2-2, padding the question text feature vector with 0 to obtain a vector representation of dimension M_QUES × EMB_SIZE, whose row at time t is the vector representation of the question text at time t, where M_QUES denotes the number of question words and EMB_SIZE denotes the embedding dimension of the question;
S2-3, introducing a bidirectional gated recurrent unit, and according to the formulas:
h_t^→ = GRU(q_t, h_(t-1)^→)
h_t^← = GRU(q_t, h_(t+1)^←)
obtaining the bidirectional hidden state at time t, namely the forward hidden state h_t^→ and the reverse hidden state h_t^←, where GRU(·) denotes the recurrent neural network model, q_t denotes the vector representation of the question text at time t, h_(t-1)^→ denotes the forward hidden state at time t-1, and h_(t+1)^← denotes the reverse hidden state at time t+1;
S2-4, splicing the question text feature vectors at all time steps to obtain the final high-level question feature vector.
Further, the specific steps of step S3 are as follows:
S3-1, initializing the model parameters of the image question-answering network, and inputting the image feature vector and the high-level question feature vector into the image question-answering network;
S3-2, training the image question-answering network model: taking the preset answer categories as the training target, training the image question-answering network through a back propagation algorithm and stochastic gradient descent, and adjusting the parameters of the image question-answering network to obtain the trained image question-answering network, where the parameters of the image question-answering network include the learnable weight matrices W_n and bias terms b_n;
S3-3, constructing an intra-modal unit attention module for the image and an intra-modal unit attention module for the question, selecting one of them, and according to the formula:
Q, K, V = trans(X)
obtaining the query vector Q, the key vector K and the value vector V, where n denotes the number of question words or the number of image objects, EMB_DIM denotes the embedding dimension, X denotes the feature vector of one modality with dimension n × EMB_DIM, and trans(·) denotes converting the feature vector X into multi-head feature vectors; the feature vector X is either the image feature vector or the high-level question feature vector;
S3-4, according to the formula:
S = Q·K^T / √d_k
obtaining the attention score matrix S, where K^T denotes the transposed matrix of the key vector K and d_k denotes the EMB_DIM size of the query vector Q;
S3-5, according to the formula:
A = softmax(S), S ∈ R^(MH×n×n)
obtaining the attention weight matrix A, where softmax(·) denotes the normalized exponential function, R denotes the real numbers, and MH denotes the number of attention heads;
S3-6, converting the multi-head feature vectors back to the same dimension as the original input, and according to the formulas:
O = trans'(A·V)
O' = LayerNorm(O + Dropout(O))
FFN(O') = max(0, O'·W_1 + b_1)·W_2 + b_2
O_I = LayerNorm(O' + Dropout(FFN(O')))
obtaining the intra-modal unit attention matrix O_I, i.e. the intra-modal unit attention features of this modality, where trans'(·) denotes the dimension conversion function, O denotes the initial intra-modal unit attention matrix, O' denotes the intermediate intra-modal unit attention matrix, Dropout(·) denotes random deactivation, LayerNorm(·) denotes the layer normalization function, FFN(·) denotes the feed-forward neural network, max(0, ·) implements the ReLU activation function, W_1 denotes the learnable weight matrix from the input layer to the hidden layer, W_2 denotes the learnable weight matrix from the hidden layer to the output layer, b_1 denotes the bias term from the input layer to the hidden layer, and b_2 denotes the bias term from the hidden layer to the output layer;
s3-7, repeating the steps S3-3 to S3-6 to obtain the intra-mode unit attention matrix O of the other mode i
S3-8, constructing the inter-modal interaction attention, taking the features of one modality as the query and the features of the other modality as the keys and values, and according to the formula:
A_1 = softmax(Q_1·K_1^T / √d_1)·V_1
obtaining the attention weight matrix A_1, where Q_1 denotes the query vector of one modal feature, K_1 denotes the key vector of the other modal feature, K_1^T denotes the transposed matrix of the key vector K_1, d_1 denotes the EMB_DIM size of the query vector Q_1, and V_1 denotes the value vector of the other modal feature;
S3-9, according to the formulas:
O_1 = LayerNorm(A_1 + Dropout(A_1))
O_A = LayerNorm(O_1 + Dropout(FFN(O_1)))
obtaining the inter-modal unit attention matrix O_A, i.e. the inter-modal interaction attention features, where O_1 denotes the initial inter-modal unit attention matrix.
Further, the back propagation algorithm of step S3-2 uses a binary cross-entropy loss function.
Further, the specific steps of step S4 are as follows:
S4-1, according to the formula:
obtaining the inter-modal interaction attention feature XO guided by the image features and the inter-modal interaction attention feature YO guided by the question features, where CR(·) denotes the inter-modal interaction attention, X' denotes the intra-modal unit attention features of the image, and Y' denotes the intra-modal unit attention features of the question;
S4-2, according to the formula:
obtaining the inter-modal interaction attention feature X[i] guided by the image features after the deep stacking of the i-th hidden layer and the inter-modal interaction attention feature Y[i] guided by the question features after the deep stacking of the i-th hidden layer, where i denotes the i-th hidden layer, XO[i-1] denotes the image-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer, and YO[i-1] denotes the question-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer;
S4-3, according to the formula:
obtaining the output image feature x[i] of the i-th hidden layer and the output question feature y[i] of the i-th hidden layer, where X[i-1] denotes the image-feature-guided inter-modal interaction attention feature after the deep stacking of the (i-1)-th hidden layer, Y[i-1] denotes the question-feature-guided inter-modal interaction attention feature after the deep stacking of the (i-1)-th hidden layer, α_i denotes the trainable image weight variable of the i-th hidden layer, and β_i denotes the trainable question weight variable of the i-th hidden layer.
The formula of the linear transformation mapping in step S5 is as follows:
O_F = proj(concat(x[I], y[I]))
where O_F denotes the answer, x[I] denotes the output image feature of the last hidden layer, y[I] denotes the output question feature of the last hidden layer, concat(·) denotes the concatenation operator, and proj(·) denotes the linear transformation mapping function.
The beneficial effects of the invention are as follows:
1. By constructing an image-text feature extraction and deep fusion model based on modal joint interaction and introducing the modal joint interaction mechanism, the invention realizes bidirectional guidance between image and question features and improves the interaction capability of the model; by using the residual deep stacking fusion mechanism, it enhances information sharing in the cross-modal semantic space.
2. The modal bidirectional guidance mechanism of the invention takes the deep interaction between the two modalities into account and applies joint forward and reverse guidance to the intra-modal unit attention features of the two modalities, which enhances the multi-modal interaction capability of the model and improves the answer classification effect;
3. The residual stacking fusion mechanism of the invention further interacts the bidirectionally guided features in a deep stacking manner; the design of the residual dynamic mechanism improves the expressive power, avoids the vanishing-gradient problem of deep neural networks during training, and improves the generalization of the model.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the modal bidirectional joint interaction and residual stacked deep fusion of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; for those skilled in the art, any invention that makes use of the inventive concept falls within the protection of the spirit and scope of the invention as defined by the appended claims.
In one embodiment of the present invention, as shown in fig. 1, an image question-answering method based on modal joint interaction includes the following steps:
S1, initializing a target detection network with a convolutional neural network at its core using a pre-trained residual network, and obtaining the dynamic features of the input image; padding the dynamic features of the input image to obtain the image feature vector;
S2, performing word vectorization on the question text with a pre-trained global word vector model to obtain the question text feature vector; performing characterization processing on the question text feature vector to obtain the high-level question feature vector;
S3, constructing an image question-answering network with intra-modal unit attention and inter-modal interaction attention at its core; inputting the image feature vector and the high-level question feature vector into the image question-answering network to obtain the intra-modal unit attention features and inter-modal interaction attention features;
S4, deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of the different hidden layers;
S5, combining the output image features and output question features of the last hidden layer through feature stacking to obtain the final features; and mapping the final features into a multi-class vector through a linear transformation to perform answer prediction.
The pre-trained residual network in the step S1 adopts a ResNet-101 network structure, and the convolutional neural network adopts a Faster R-CNN network structure; the global word vector model in the step S2 adopts a GloVe model.
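To make step S1 concrete, the following minimal PyTorch sketch shows only the padding ("filling") of a variable number of detected-region features into a fixed-size image feature matrix; the region features themselves are assumed to come from a Faster R-CNN detector whose backbone is initialized from a pre-trained ResNet-101, and the sizes MAX_OBJS and FEAT_DIM are illustrative assumptions rather than values given in the patent.

```python
import torch

MAX_OBJS = 100   # assumed upper bound on detected regions per image
FEAT_DIM = 2048  # ResNet-101 region feature size commonly used with Faster R-CNN

def pad_region_features(regions: torch.Tensor) -> torch.Tensor:
    """Zero-pad a (k, FEAT_DIM) matrix of detected-region features to (MAX_OBJS, FEAT_DIM).

    `regions` is assumed to come from a pre-trained Faster R-CNN detector
    initialized from ResNet-101, as described in step S1.
    """
    k = regions.size(0)
    padded = regions.new_zeros(MAX_OBJS, FEAT_DIM)
    padded[: min(k, MAX_OBJS)] = regions[:MAX_OBJS]
    return padded

# Example: 36 detected objects padded up to the fixed length expected by the network.
img_feats = pad_region_features(torch.randn(36, FEAT_DIM))  # -> (100, 2048)
```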
The specific steps of step S2 are as follows:
S2-1, performing word segmentation and vectorization on the m words in the question text to obtain the question text feature vector;
S2-2, padding the question text feature vector with 0 to obtain a vector representation of dimension M_QUES × EMB_SIZE, whose row at time t is the vector representation of the question text at time t, where M_QUES denotes the number of question words and EMB_SIZE denotes the embedding dimension of the question;
S2-3, introducing a bidirectional gated recurrent unit, and according to the formulas:
h_t^→ = GRU(q_t, h_(t-1)^→)
h_t^← = GRU(q_t, h_(t+1)^←)
obtaining the bidirectional hidden state at time t, namely the forward hidden state h_t^→ and the reverse hidden state h_t^←, where GRU(·) denotes the recurrent neural network model, q_t denotes the vector representation of the question text at time t, h_(t-1)^→ denotes the forward hidden state at time t-1, and h_(t+1)^← denotes the reverse hidden state at time t+1;
S2-4, splicing the question text feature vectors at all time steps to obtain the final high-level question feature vector.
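For steps S2-1 to S2-4, a minimal sketch of the question encoder might look as follows, assuming PyTorch, a GloVe-initialized embedding table, and illustrative sizes M_QUES, EMB_SIZE and HIDDEN; the patent does not fix these hyperparameters.

```python
from typing import Optional

import torch
import torch.nn as nn

M_QUES, EMB_SIZE, HIDDEN = 14, 300, 512  # illustrative sizes; EMB_SIZE matches GloVe-300d

class QuestionEncoder(nn.Module):
    """Sketch of steps S2-2 to S2-4: padded word vectors -> bidirectional GRU states."""
    def __init__(self, vocab_size: int, glove_weights: Optional[torch.Tensor] = None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_SIZE, padding_idx=0)
        if glove_weights is not None:          # pre-trained GloVe vectors, if available
            self.embed.weight.data.copy_(glove_weights)
        self.bigru = nn.GRU(EMB_SIZE, HIDDEN, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, M_QUES), zero-padded to the fixed question length
        emb = self.embed(token_ids)            # (batch, M_QUES, EMB_SIZE)
        states, _ = self.bigru(emb)            # forward/backward states concatenated per step
        return states                          # (batch, M_QUES, 2 * HIDDEN)

enc = QuestionEncoder(vocab_size=20000)
q_feats = enc(torch.zeros(2, M_QUES, dtype=torch.long))  # high-level question features
```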
As shown in fig. 2, the specific steps of step S3 are as follows:
S3-1, initializing the model parameters of the image question-answering network, and inputting the image feature vector and the high-level question feature vector into the image question-answering network;
S3-2, training the image question-answering network model: taking the preset answer categories as the training target, training the image question-answering network through a back propagation algorithm and stochastic gradient descent, and adjusting the parameters of the image question-answering network to obtain the trained image question-answering network, where the parameters of the image question-answering network include the learnable weight matrices W_n and bias terms b_n;
S3-3, constructing an intra-modal unit attention module for the image and an intra-modal unit attention module for the question, selecting one of them, and according to the formula:
Q, K, V = trans(X)
obtaining the query vector Q, the key vector K and the value vector V, where n denotes the number of question words or the number of image objects, EMB_DIM denotes the embedding dimension, X denotes the feature vector of one modality with dimension n × EMB_DIM, and trans(·) denotes converting the feature vector X into multi-head feature vectors; the feature vector X is either the image feature vector or the high-level question feature vector;
S3-4, according to the formula:
S = Q·K^T / √d_k
obtaining the attention score matrix S, where K^T denotes the transposed matrix of the key vector K and d_k denotes the EMB_DIM size of the query vector Q;
S3-5, according to the formula:
A = softmax(S), S ∈ R^(MH×n×n)
obtaining the attention weight matrix A, where softmax(·) denotes the normalized exponential function, R denotes the real numbers, and MH denotes the number of attention heads;
S3-6, converting the multi-head feature vectors back to the same dimension as the original input, and according to the formulas:
O = trans'(A·V)
O' = LayerNorm(O + Dropout(O))
FFN(O') = max(0, O'·W_1 + b_1)·W_2 + b_2
O_I = LayerNorm(O' + Dropout(FFN(O')))
obtaining the intra-modal unit attention matrix O_I, i.e. the intra-modal unit attention features of this modality, where trans'(·) denotes the dimension conversion function, O denotes the initial intra-modal unit attention matrix, O' denotes the intermediate intra-modal unit attention matrix, Dropout(·) denotes random deactivation, LayerNorm(·) denotes the layer normalization function, FFN(·) denotes the feed-forward neural network, max(0, ·) implements the ReLU activation function, W_1 denotes the learnable weight matrix from the input layer to the hidden layer, W_2 denotes the learnable weight matrix from the hidden layer to the output layer, b_1 denotes the bias term from the input layer to the hidden layer, and b_2 denotes the bias term from the hidden layer to the output layer;
S3-7, repeating steps S3-3 to S3-6 to obtain the intra-modal unit attention matrix O_I of the other modality;
S3-8, constructing the inter-modal interaction attention, taking the features of one modality as the query and the features of the other modality as the keys and values, and according to the formula:
A_1 = softmax(Q_1·K_1^T / √d_1)·V_1
obtaining the attention weight matrix A_1, where Q_1 denotes the query vector of one modal feature, K_1 denotes the key vector of the other modal feature, K_1^T denotes the transposed matrix of the key vector K_1, d_1 denotes the EMB_DIM size of the query vector Q_1, and V_1 denotes the value vector of the other modal feature;
S3-9, according to the formulas:
O_1 = LayerNorm(A_1 + Dropout(A_1))
O_A = LayerNorm(O_1 + Dropout(FFN(O_1)))
obtaining the inter-modal unit attention matrix O_A, i.e. the inter-modal interaction attention features, where O_1 denotes the initial inter-modal unit attention matrix.
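A compact sketch of one attention unit covering steps S3-3 to S3-9 is given below, assuming PyTorch and illustrative sizes EMB_DIM = 512 and MH = 8. Called with the same input for both arguments it plays the role of the intra-modal unit attention (O_I); called with inputs from two different modalities it plays the role of the inter-modal interaction attention (O_A). The FFN width and dropout rate are assumptions not fixed by the patent.

```python
import math
import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    """Sketch of the unit attention in S3-3 to S3-9 (assumed EMB_DIM=512, MH=8).

    With x == y it acts as the intra-modal unit attention; with x and y from
    different modalities it acts as the inter-modal interaction attention.
    """
    def __init__(self, emb_dim: int = 512, mh: int = 8, ffn_dim: int = 2048, p: float = 0.1):
        super().__init__()
        self.mh, self.dk = mh, emb_dim // mh
        self.q = nn.Linear(emb_dim, emb_dim)
        self.k = nn.Linear(emb_dim, emb_dim)
        self.v = nn.Linear(emb_dim, emb_dim)
        self.norm1, self.norm2 = nn.LayerNorm(emb_dim), nn.LayerNorm(emb_dim)
        self.drop = nn.Dropout(p)
        self.ffn = nn.Sequential(nn.Linear(emb_dim, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, emb_dim))   # FFN(O') = max(0, O'W1+b1)W2+b2

    def _split(self, t: torch.Tensor) -> torch.Tensor:
        # trans(.): (batch, n, emb_dim) -> (batch, MH, n, d_k)
        b, n, _ = t.shape
        return t.view(b, n, self.mh, self.dk).transpose(1, 2)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        Q, K, V = self._split(self.q(x)), self._split(self.k(y)), self._split(self.v(y))
        S = Q @ K.transpose(-2, -1) / math.sqrt(self.dk)        # attention score matrix
        A = torch.softmax(S, dim=-1)                            # attention weight matrix
        O = (A @ V).transpose(1, 2).reshape(x.shape)            # trans'(A·V)
        O = self.norm1(O + self.drop(O))                        # O' = LayerNorm(O + Dropout(O))
        return self.norm2(O + self.drop(self.ffn(O)))           # O_I / O_A

unit = AttentionUnit()
img, ques = torch.randn(2, 100, 512), torch.randn(2, 14, 512)
intra_img = unit(img, img)      # intra-modal unit attention features of the image
cross_img = unit(img, ques)     # image branch attending over the question branch
```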
The back propagation algorithm of step S3-2 uses a binary cross entropy loss function.
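As a usage note, training against the preset answer categories with a binary cross-entropy loss (step S3-2) could be sketched as follows; the soft multi-label answer targets and the candidate-answer count are assumptions commonly used for this kind of classifier, not values stated in the patent.

```python
import torch
import torch.nn as nn

num_answers = 3129                        # assumed size of the candidate-answer set
criterion = nn.BCEWithLogitsLoss()        # binary cross-entropy over the multi-class answer vector

logits = torch.randn(8, num_answers, requires_grad=True)  # stand-in for the network output O_F
targets = torch.zeros(8, num_answers)                     # soft answer scores (assumption)
targets[:, 42] = 1.0
loss = criterion(logits, targets)
loss.backward()
```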
As shown in fig. 2, the specific steps of step S4 are as follows:
S4-1, according to the formula:
obtaining the inter-modal interaction attention feature XO guided by the image features and the inter-modal interaction attention feature YO guided by the question features, where CR(·) denotes the inter-modal interaction attention, X' denotes the intra-modal unit attention features of the image, and Y' denotes the intra-modal unit attention features of the question;
S4-2, according to the formula:
obtaining the inter-modal interaction attention feature X[i] guided by the image features after the deep stacking of the i-th hidden layer and the inter-modal interaction attention feature Y[i] guided by the question features after the deep stacking of the i-th hidden layer, where i denotes the i-th hidden layer, XO[i-1] denotes the image-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer, and YO[i-1] denotes the question-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer;
S4-3, according to the formula:
obtaining the output image feature x[i] of the i-th hidden layer and the output question feature y[i] of the i-th hidden layer, where X[i-1] denotes the image-feature-guided inter-modal interaction attention feature after the deep stacking of the (i-1)-th hidden layer, Y[i-1] denotes the question-feature-guided inter-modal interaction attention feature after the deep stacking of the (i-1)-th hidden layer, α_i denotes the trainable image weight variable of the i-th hidden layer, and β_i denotes the trainable question weight variable of the i-th hidden layer.
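Because the exact stacking equations of S4-1 to S4-3 appear only in the drawings, the following is a hedged sketch of one plausible reading of the modal bidirectional joint interaction with weight-gated residual stacking; it reuses the AttentionUnit sketch above as the inter-modal interaction attention CR, and the per-layer scalars alpha and beta stand in for the trainable weight variables α_i and β_i.

```python
import torch
import torch.nn as nn

class JointInteractionStack(nn.Module):
    """Hedged sketch of S4: bidirectional guidance plus weight-gated residual stacking.

    One plausible reading only: each hidden layer re-applies the inter-modal
    interaction attention CR in both directions and mixes the result with the
    previous layer's features through trainable scalars (alpha_i, beta_i).
    Reuses the AttentionUnit class from the sketch above as CR.
    """
    def __init__(self, num_layers: int = 4, emb_dim: int = 512):
        super().__init__()
        self.cr = nn.ModuleList([AttentionUnit(emb_dim) for _ in range(num_layers)])
        self.alpha = nn.Parameter(torch.ones(num_layers))  # image-branch trainable weights
        self.beta = nn.Parameter(torch.ones(num_layers))   # question-branch trainable weights

    def forward(self, x_img: torch.Tensor, y_ques: torch.Tensor):
        x, y = x_img, y_ques                      # intra-modal unit attention features X', Y'
        for i, cr in enumerate(self.cr):
            x_new = cr(x, y)                      # image branch attends over the question branch
            y_new = cr(y, x)                      # question branch attends over the image branch
            x = self.alpha[i] * x_new + x         # residual stacking, image branch
            y = self.beta[i] * y_new + y          # residual stacking, question branch
        return x, y                               # output features x[I], y[I] of the last layer
```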
The formula of the linear transformation mapping in step S5 is as follows:
O_F = proj(concat(x[I], y[I]))
where O_F denotes the answer, x[I] denotes the output image feature of the last hidden layer, y[I] denotes the output question feature of the last hidden layer, concat(·) denotes the concatenation operator, and proj(·) denotes the linear transformation mapping function.
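A minimal sketch of step S5, again assuming PyTorch: the last hidden layer's image and question features are combined and mapped by a linear transformation into a multi-class answer vector O_F. Mean-pooling over objects and words before concatenation is an assumption; the patent only specifies feature stacking followed by a linear mapping.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Sketch of S5: O_F = proj(concat(x[I], y[I])) mapped to answer classes.

    Mean-pooling before concatenation is an assumption; the patent only
    specifies feature stacking followed by a linear transformation.
    """
    def __init__(self, emb_dim: int = 512, num_answers: int = 3129):
        super().__init__()
        self.proj = nn.Linear(2 * emb_dim, num_answers)

    def forward(self, x_last: torch.Tensor, y_last: torch.Tensor) -> torch.Tensor:
        x_pool = x_last.mean(dim=1)                   # (batch, emb_dim) image summary
        y_pool = y_last.mean(dim=1)                   # (batch, emb_dim) question summary
        fused = torch.cat([x_pool, y_pool], dim=-1)   # concat(x[I], y[I])
        return self.proj(fused)                       # multi-class answer logits O_F

head = AnswerHead()
logits = head(torch.randn(2, 100, 512), torch.randn(2, 14, 512))  # -> (2, 3129)
```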
In summary, by constructing an image-text feature extraction and deep fusion model based on modal joint interaction and introducing the modal joint interaction mechanism, the invention realizes bidirectional guidance between image and question features and improves the interaction capability of the model; by using the residual deep stacking fusion mechanism, it enhances information sharing in the cross-modal semantic space. The modal bidirectional guidance mechanism takes the deep interaction between the two modalities into account and applies joint forward and reverse guidance to the intra-modal unit attention features of the two modalities, which enhances the multi-modal interaction capability of the model and improves the answer classification effect. The residual stacking fusion mechanism further interacts the bidirectionally guided features in a deep stacking manner; the design of the residual dynamic mechanism improves the expressive power, avoids the vanishing-gradient problem of deep neural networks during training, and improves the generalization of the model.

Claims (7)

1. An image question-answering method based on modal joint interaction, characterized by comprising the following steps:
S1, initializing a target detection network with a convolutional neural network at its core using a pre-trained residual network, and obtaining the dynamic features of the input image; padding the dynamic features of the input image to obtain the image feature vector;
S2, performing word vectorization on the question text with a pre-trained global word vector model to obtain the question text feature vector; performing characterization processing on the question text feature vector to obtain the high-level question feature vector;
S3, constructing an image question-answering network with intra-modal unit attention and inter-modal interaction attention at its core; inputting the image feature vector and the high-level question feature vector into the image question-answering network to obtain the intra-modal unit attention features and inter-modal interaction attention features;
S4, deeply fusing the intra-modal unit attention features and the inter-modal interaction attention features through modal bidirectional joint interaction and residual stacked deep fusion to obtain the output image features and output question features of the different hidden layers;
S5, combining the output image features and output question features of the last hidden layer through feature stacking to obtain the final features; and mapping the final features into a multi-class vector through a linear transformation to perform answer prediction.
2. The image question-answering method based on modal joint interaction according to claim 1, wherein: the pre-trained residual network in the step S1 adopts a ResNet-101 network structure, and the convolutional neural network adopts a Faster R-CNN network structure; the global word vector model in the step S2 adopts a GloVe model.
3. The image question-answering method based on modal joint interaction according to claim 1, wherein: the specific steps of the step S2 are as follows:
S2-1, performing word segmentation and vectorization on the m words in the question text to obtain the question text feature vector;
S2-2, padding the question text feature vector with 0 to obtain a vector representation of dimension M_QUES × EMB_SIZE, whose row at time t is the vector representation of the question text at time t, where M_QUES denotes the number of question words and EMB_SIZE denotes the embedding dimension of the question;
S2-3, introducing a bidirectional gated recurrent unit, and according to the formulas:
h_t^→ = GRU(q_t, h_(t-1)^→)
h_t^← = GRU(q_t, h_(t+1)^←)
obtaining the bidirectional hidden state at time t, namely the forward hidden state h_t^→ and the reverse hidden state h_t^←, where GRU(·) denotes the recurrent neural network model, q_t denotes the vector representation of the question text at time t, h_(t-1)^→ denotes the forward hidden state at time t-1, and h_(t+1)^← denotes the reverse hidden state at time t+1;
S2-4, splicing the question text feature vectors at all time steps to obtain the final high-level question feature vector.
4. The image question-answering method based on modal joint interaction according to claim 1, wherein: the specific steps of the step S3 are as follows:
S3-1, initializing the model parameters of the image question-answering network, and inputting the image feature vector and the high-level question feature vector into the image question-answering network;
S3-2, training the image question-answering network model: taking the preset answer categories as the training target, training the image question-answering network through a back propagation algorithm and stochastic gradient descent, and adjusting the parameters of the image question-answering network to obtain the trained image question-answering network, where the parameters of the image question-answering network include the learnable weight matrices W_n and bias terms b_n;
S3-3, constructing an intra-modal unit attention module for the image and an intra-modal unit attention module for the question, selecting one of them, and according to the formula:
Q, K, V = trans(X)
obtaining the query vector Q, the key vector K and the value vector V, where n denotes the number of question words or the number of image objects, EMB_DIM denotes the embedding dimension, X denotes the feature vector of one modality with dimension n × EMB_DIM, and trans(·) denotes converting the feature vector X into multi-head feature vectors; the feature vector X is either the image feature vector or the high-level question feature vector;
S3-4, according to the formula:
S = Q·K^T / √d_k
obtaining the attention score matrix S, where K^T denotes the transposed matrix of the key vector K and d_k denotes the EMB_DIM size of the query vector Q;
S3-5, according to the formula:
A = softmax(S), S ∈ R^(MH×n×n)
obtaining the attention weight matrix A, where softmax(·) denotes the normalized exponential function, R denotes the real numbers, and MH denotes the number of attention heads;
S3-6, converting the multi-head feature vectors back to the same dimension as the original input, and according to the formulas:
O = trans'(A·V)
O' = LayerNorm(O + Dropout(O))
FFN(O') = max(0, O'·W_1 + b_1)·W_2 + b_2
O_I = LayerNorm(O' + Dropout(FFN(O')))
obtaining the intra-modal unit attention matrix O_I, i.e. the intra-modal unit attention features of this modality, where trans'(·) denotes the dimension conversion function, O denotes the initial intra-modal unit attention matrix, O' denotes the intermediate intra-modal unit attention matrix, Dropout(·) denotes random deactivation, LayerNorm(·) denotes the layer normalization function, FFN(·) denotes the feed-forward neural network, max(0, ·) implements the ReLU activation function, W_1 denotes the learnable weight matrix from the input layer to the hidden layer, W_2 denotes the learnable weight matrix from the hidden layer to the output layer, b_1 denotes the bias term from the input layer to the hidden layer, and b_2 denotes the bias term from the hidden layer to the output layer;
S3-7, repeating steps S3-3 to S3-6 to obtain the intra-modal unit attention matrix O_I of the other modality;
S3-8, constructing the inter-modal interaction attention, taking the features of one modality as the query and the features of the other modality as the keys and values, and according to the formula:
A_1 = softmax(Q_1·K_1^T / √d_1)·V_1
obtaining the attention weight matrix A_1, where Q_1 denotes the query vector of one modal feature, K_1 denotes the key vector of the other modal feature, K_1^T denotes the transposed matrix of the key vector K_1, d_1 denotes the EMB_DIM size of the query vector Q_1, and V_1 denotes the value vector of the other modal feature;
S3-9, according to the formulas:
O_1 = LayerNorm(A_1 + Dropout(A_1))
O_A = LayerNorm(O_1 + Dropout(FFN(O_1)))
obtaining the inter-modal unit attention matrix O_A, i.e. the inter-modal interaction attention features, where O_1 denotes the initial inter-modal unit attention matrix.
5. The image question-answering method based on modal joint interaction according to claim 4, wherein: the back propagation algorithm of step S3-2 uses a binary cross entropy loss function.
6. The image question-answering method based on modal joint interaction according to claim 1, wherein: the specific steps of the step S4 are as follows:
S4-1, according to the formula:
obtaining the inter-modal interaction attention feature XO guided by the image features and the inter-modal interaction attention feature YO guided by the question features, where CR(·) denotes the inter-modal interaction attention, X' denotes the intra-modal unit attention features of the image, and Y' denotes the intra-modal unit attention features of the question;
S4-2, according to the formula:
obtaining the inter-modal interaction attention feature X[i] guided by the image features after the deep stacking of the i-th hidden layer and the inter-modal interaction attention feature Y[i] guided by the question features after the deep stacking of the i-th hidden layer, where i denotes the i-th hidden layer, XO[i-1] denotes the image-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer, and YO[i-1] denotes the question-feature-guided inter-modal interaction attention feature of the (i-1)-th hidden layer;
S4-3, according to the formula:
obtaining the output image feature x[i] of the i-th hidden layer and the output question feature y[i] of the i-th hidden layer, where X[i-1] denotes the image-feature-guided inter-modal interaction attention feature after the deep stacking of the (i-1)-th hidden layer, Y[i-1] denotes the question-feature-guided inter-modal interaction attention feature after the deep stacking of the (i-1)-th hidden layer, α_i denotes the trainable image weight variable of the i-th hidden layer, and β_i denotes the trainable question weight variable of the i-th hidden layer.
7. The image question-answering method based on modal joint interaction according to claim 1, wherein: the formula of the linear transformation mapping in the step S5 is as follows:
O_F = proj(concat(x[I], y[I]))
where O_F denotes the answer, x[I] denotes the output image feature of the last hidden layer, y[I] denotes the output question feature of the last hidden layer, concat(·) denotes the concatenation operator, and proj(·) denotes the linear transformation mapping function.
CN202310749393.5A 2023-06-21 2023-06-21 Image question-answering method based on modal joint interaction Pending CN116756287A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310749393.5A CN116756287A (en) 2023-06-21 2023-06-21 Image question-answering method based on modal joint interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310749393.5A CN116756287A (en) 2023-06-21 2023-06-21 Image question-answering method based on modal joint interaction

Publications (1)

Publication Number Publication Date
CN116756287A true CN116756287A (en) 2023-09-15

Family

ID=87949318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310749393.5A Pending CN116756287A (en) 2023-06-21 2023-06-21 Image question-answering method based on modal joint interaction

Country Status (1)

Country Link
CN (1) CN116756287A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422704A (en) * 2023-11-23 2024-01-19 南华大学附属第一医院 Cancer prediction method, system and equipment based on multi-mode data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination