CN110263912B - Image question-answering method based on multi-target association depth reasoning - Google Patents
- Publication number
- CN110263912B (application CN201910398140.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- question
- feature
- vector
- attention
- Prior art date
- Legal status: Active
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; combinations of networks
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Neural networks; learning methods
- G06N5/04 — Computing arrangements using knowledge-based models; inference or reasoning models
Abstract
The invention discloses an image question-answering method based on multi-target association depth reasoning. The method comprises the following steps: 1. performing data preprocessing on the image and on the natural-language question text about the image; 2. reordering the attention weights of the targets with an adaptive attention module (AAM) enhanced by the geometric features of the candidate boxes; 3. building the neural network structure around the AAM model; 4. model training, in which the neural network parameters are trained with the back-propagation algorithm. The invention provides a deep neural network for image question answering, and in particular a method that models image-question text data jointly, reasons over the features of each target in the image, and reorders the attention weights of the targets so as to answer questions more accurately, achieving better results in the field of image question answering.
Description
Technical Field
The invention relates to a deep neural network structure for the image question answering (Visual Question Answering) task, and in particular to a method that models image-question data jointly, explores the interactions between the entity features in an image and the geometric features of their spatial positions, and achieves adaptive attention-weight adjustment by modeling the positional relations between the entities.
Background
Image question answering is an emerging task at the intersection of computer vision and natural language processing. The task is to let a machine automatically answer a question posed about a given image. Image question answering is undoubtedly more complex than image description, another task crossing computer vision and natural language processing, because it requires the machine to understand both the image and the question and to reason its way to the correct result. A sentence such as "What color are her glasses?" contains rich semantic information: to answer it, the machine must locate the region of the woman's eyes in the image and then answer according to the keyword "color". For another question such as "What is the beard made of?", the machine cannot find the beard directly; it must instead estimate where the beard should be from the position of the face, attend to that region, and then answer according to the keyword "made".
With the rapid development of deep learning in recent years, end-to-end modeling with deep Convolutional Neural Networks (CNNs) or deep Recurrent Neural Networks (RNNs) has become the mainstream research direction in computer vision and natural language processing. Introducing the end-to-end modeling idea into image question-answering research, modeling the image end-to-end with a suitable network structure, and having the computer answer automatically from the input question and image is therefore a research question worth exploring in depth.
It has long been recognized in computer vision that contextual information, i.e. the associations between objects, helps improve models. Most methods that exploited this information, however, predate the popularity of deep learning. In the current deep-learning era, little progress has been made in using the relation information between objects, particularly in image question answering, and most methods still attend to entities independently. Because objects in an image vary in two-dimensional position, scale and aspect ratio, an image question-answering model must rely on the interrelations between entities to reason about the question. The positional information of objects, i.e. their geometric features in general, therefore plays a complex and important role in the image question-answering model.
In terms of practical applications, image question-answering algorithms have broad application scenarios. With the rapid development of wearable smart hardware (such as Google Glass and Microsoft HoloLens) and augmented-reality technology, automatic question answering about image content based on visual perception may become an important mode of human-computer interaction in the near future. The technology can help people, especially the visually impaired, perceive and understand the world better.
In conclusion, image question answering based on end-to-end modeling is a direction worthy of intensive research. This work approaches the task from several of its key difficulties, addresses the problems in current methods, and finally forms a complete image question-answering system.
Because image content in natural scenes is complex and its subjects are diverse, and because natural-language descriptions have a high degree of freedom, describing image content faces huge challenges. Specifically, there are two main difficulties:
(1) Feature extraction: a classic and fundamental problem in cross-media representation research. Commonly used methods include image-processing feature extractors such as the Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP) and Haar features. In addition, features extracted by deep-learning models such as ResNet, GoogLeNet and Faster-RCNN have proven highly effective in many fields, such as fine-grained image classification, natural language processing and recommendation systems. Selecting a proper strategy for cross-media feature extraction, improving the expressive power of the features while keeping computation efficient, is therefore a direction worthy of intensive research.
(2) How to reason about the question from the interrelations between entities in the image: the input to an image question-answering algorithm is an image, which may contain multiple target entities, and a question. The algorithm must not only extract the features of each target entity and understand each target of the image correctly, but also infer the relations between the targets from their geometric and visual features. How to make the algorithm automatically learn the relations among all targets of the image and form a more accurate cross-media representation is therefore a difficult problem in image question answering, and a crucial link affecting the performance of the algorithm.
Disclosure of Invention
The invention provides an image question-answering method based on multi-target association depth reasoning. It is a deep neural network architecture for the image question answering (Visual Question Answering) task with two main points: 1. adopting image features with stronger expressive power together with geometric information; 2. reasoning about the relations between the targets in the image from their target features.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step (1), data preprocessing, and feature extraction of image and text data
Firstly, preprocessing an image:
Target entities contained in the image are detected using a Faster-RCNN deep neural network structure, and the visual features V and the geometric features G, which contain the size and coordinate information of each target in the image, are extracted.
Preprocessing the text data:
The sentence lengths of the given question texts are counted, and the maximum length of the question text is set according to the statistics. A question-text vocabulary dictionary is constructed, the words of the question are replaced with their index values in the vocabulary dictionary, and the result is passed through an LSTM, thereby converting the question text into a vector q.
Step (2), attention module enhanced by candidate-box geometric features
The structure is shown in FIG. 1. The inputs are three features: the geometric features G of the candidate box positions, the visual features V, and the attention weight vector m.
First, the attention weight vector m is rank-encoded: it is converted into vectors according to the order of the weights, mapped to a high dimension, and added to the visual features V mapped to the same dimension; the output is processed by Layer Normalization to obtain V_A.
Then the geometric features G are mapped through a linear layer and passed through the activation function ReLU to obtain G_R. V_A and G_R are input into a candidate-box Relation Module for reasoning, yielding O_relation. O_relation is passed through a linear layer and a sigmoid function and multiplied by the original attention weight vector m to obtain the new attention vector m̂.
Step (3) constructing a deep neural network
The structure of the method is shown in FIG. 2. First, the question text is converted into an index-value vector according to the vocabulary dictionary. The vector is mapped to a high dimension and fed into a Long Short-Term Memory network (LSTM); the output vector q is fused with the visual features V obtained from Faster R-CNN by means of a Hadamard product, and an attention module produces the attention weight m of each entity feature. The attention weights m, the visual features V and the geometric features G are input into the Adaptive Attention Module (AAM) enhanced by the geometric features of the candidate boxes, which reasons over the visual features and the geometric features of the candidate-box positions and reorders the attention weights, yielding the new attention vector m̂. The attention vector m̂ is fused with the visual features V by an elementwise product followed by a weighted average, giving the new visual feature v̂. The visual feature v̂ is fused with the question text vector q through a Hadamard product, and a softmax function generates the probabilities that serve as the predicted output of the network.
Step (4), model training
According to the difference between the generated prediction and the ground-truth answer, the model parameters of the neural network in step (3) are trained with the back-propagation algorithm until the whole network model converges.
The step (1) is specifically realized as follows:
1-1. Features of the image i are extracted with the existing deep neural network Faster-RCNN. The extracted features comprise the visual features V and the geometric features G of the k targets contained in the image, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]. The visual vector of a single target is v_i ∈ R^2048, and the geometric feature of a single target is g_i = {x, y, w, h}, where x, y, w and h are the position parameters of the geometric feature, representing respectively the abscissa, ordinate, width and height of the candidate box containing the entity in the image.
1-2. For a given question text, the different words appearing in the question texts of the data set are first counted and recorded in a dictionary. The words of the question are converted into index values according to the word dictionary, thereby converting the question text into an index vector of fixed length. The specific formula is as follows:

Q = {q_{w_1}, q_{w_2}, ..., q_{w_l}} (formula 1)

where q_{w_k} is the index value of the word w_k in the dictionary and l represents the length of the question text.
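As an illustration of formula 1, a minimal Python sketch of the conversion follows; the sample dictionary, the pad/unknown index 0 and the helper name question_to_indices are assumptions for the example, not taken from the patent.

```python
# A minimal sketch of formula 1: question text -> fixed-length index vector.
def question_to_indices(question, word2idx, max_len=16):
    tokens = question.lower().replace("?", "").split()[:max_len]
    idxs = [word2idx.get(w, 0) for w in tokens]   # 0 = unknown/padding
    idxs += [0] * (max_len - len(idxs))           # pad to the fixed length
    return idxs

word2idx = {"what": 1, "color": 2, "is": 3, "her": 4, "glasses": 5}
print(question_to_indices("What color is her glasses?", word2idx))
# [1, 2, 3, 4, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```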
The adaptive attention module deep inference network enhanced by the candidate-box geometric features in step (2) is specifically as follows:
2-1. The input attention weight vector m is processed first. The value-sorted rank pos of each target's attention weight in m = {m_1, m_2, ..., m_k} is encoded into a matrix PE. The specific formula is the sinusoidal positional encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) (formula 2)
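A sketch of this rank encoding follows, assuming the sinusoidal form above; the encoding width d_pe = 64 and the function name rank_positional_encoding are illustrative assumptions.

```python
import torch

def rank_positional_encoding(m, d_pe=64):
    """Encode the value-sorted rank of each attention weight (formula 2)."""
    k = m.size(0)
    # pos[i] = rank of target i when the weights are sorted in descending order
    pos = m.argsort(descending=True).argsort().float()        # (k,)
    i = torch.arange(0, d_pe, 2).float()                      # even dimensions
    angles = pos.unsqueeze(1) / torch.pow(10000.0, i / d_pe)  # (k, d_pe/2)
    pe = torch.zeros(k, d_pe)
    pe[:, 0::2] = torch.sin(angles)   # sin on even indices
    pe[:, 1::2] = torch.cos(angles)   # cos on odd indices
    return pe                         # PE matrix, (k, d_pe)

pe = rank_positional_encoding(torch.rand(100))   # (100, 64)
```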
2-2. The matrix PE and the visual features V are passed through separate linear layers and added, and the output is layer-normalized to obtain V_A. The specific formula is as follows:

V_A = LayerNorm(W_PE·PE^T + W_V·V^T) (formula 3)
2-3. A correlation computation is performed on the geometric features G: the pairwise encoding Ω(G) is passed through a linear layer to obtain G_R. The specific formula is as follows:

G_R = W_G·Ω(G)^T (formula 4)
2-4. V_A and G_R are input into the relation module for reasoning to obtain O_relation; V_R denotes the visual similarity matrix computed inside the relation module from the linearly mapped target features of V_A (formulas 5-6). The specific formula is as follows:

O_relation = softmax(log(G_R) + V_R)·(W_O·V_A + b_O) (formula 7)
2-5. O_relation is passed through the fully connected layer and a sigmoid function and multiplied by the original attention weights m to obtain the new attention vector m̂. The specific formula is as follows:

m̂ = sigmoid(W_F·O_relation + b_F) ∘ m (formula 8)
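The whole module of step (2) can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: formulas 5-6 are taken to be a scaled dot product of linearly mapped features, Ω(G) is taken to be a simple relative-box encoding, and all layer widths and parameter names are illustrative rather than the patent's exact implementation. Here pe is the rank encoding produced by rank_positional_encoding above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttentionModule(nn.Module):
    """Sketch of the AAM: formulas 3-8, with assumed forms for 5-6 and Ω(G)."""
    def __init__(self, d_vis=2048, d=128, d_pe=64):
        super().__init__()
        self.w_pe = nn.Linear(d_pe, d)   # maps PE to the common dimension
        self.w_v = nn.Linear(d_vis, d)   # maps V to the common dimension
        self.norm = nn.LayerNorm(d)
        self.w_g = nn.Linear(4, 1)       # maps pairwise geometry to a scalar
        self.w_q = nn.Linear(d, d)       # assumed query map (formulas 5-6)
        self.w_k = nn.Linear(d, d)       # assumed key map (formulas 5-6)
        self.w_o = nn.Linear(d, d)       # value map of formula 7
        self.gate = nn.Linear(d, 1)      # fully connected layer of formula 8
        self.d = d

    def omega(self, g):
        """Assumed pairwise box encoding Ω(G): relative (dx, dy, dw, dh)."""
        x, y, w, h = g.unbind(-1)                                # g: (k, 4)
        dx = (x.unsqueeze(1) - x.unsqueeze(0)) / w.unsqueeze(0)
        dy = (y.unsqueeze(1) - y.unsqueeze(0)) / h.unsqueeze(0)
        dw = torch.log(w.unsqueeze(1) / w.unsqueeze(0))
        dh = torch.log(h.unsqueeze(1) / h.unsqueeze(0))
        return torch.stack([dx, dy, dw, dh], dim=-1)             # (k, k, 4)

    def forward(self, m, v, g, pe):
        # formula 3: fuse the rank encoding with the visual features
        v_a = self.norm(self.w_pe(pe) + self.w_v(v))                   # (k, d)
        # formula 4: geometric relation weights G_R
        g_r = F.relu(self.w_g(self.omega(g))).squeeze(-1)              # (k, k)
        # assumed formulas 5-6: scaled dot-product visual similarity V_R
        v_r = self.w_q(v_a) @ self.w_k(v_a).t() / self.d ** 0.5        # (k, k)
        # formula 7: geometry-biased attention over the k targets
        w_rel = F.softmax(torch.log(g_r.clamp(min=1e-6)) + v_r, dim=-1)
        o_rel = w_rel @ self.w_o(v_a)                                  # (k, d)
        # formula 8: sigmoid gate rescales (reorders) the original weights
        return torch.sigmoid(self.gate(o_rel)).squeeze(-1) * m         # (k,)
```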
Constructing a deep neural network in the step (3), which comprises the following specific steps:
3-1. The question text vector q and the visual features V are mapped to a common space by the linear transformations of fully connected layers and then fused with a Hadamard product. F_fusion denotes the fused feature in the common space; W_r and W_q denote the corresponding fully-connected-layer parameters that linearly transform the visual features V and the current state information q; the symbol ∘ denotes the Hadamard product of the two matrices; W_m denotes the fully-connected-layer parameters that reduce the dimension of the fused feature and produce the attention weight distribution; m_j denotes the attention weight of the currently computed j-th region in the initial attention weight vector m. The specific formulas are as follows:

F_fusion = (W_r·V^T) ∘ (W_q·q) (formula 9)

m = softmax(W_m·F_fusion + b_m) (formula 10)
3-2. According to step (2), m, V and G are input into the adaptive attention module enhanced by the candidate-box geometric features, which reasons with the features of V and G and reorders m to obtain the new attention vector m̂.
3-3. The new visual feature vector v̂ is obtained by multiplying m̂ elementwise with the features of V and taking the weighted average. The specific formula is as follows:

v̂ = Σ_{j=1}^{k} m̂_j·v_j (formula 11)
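A sketch of the step (3) head around the module above (formulas 9-11), reusing the rank_positional_encoding and AdaptiveAttentionModule sketches; the 1024-dimensional common space and the 9487 candidate answers follow the detailed description below, while the class and layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQANet(nn.Module):
    """Sketch of formulas 9-11 and the answer head around the AAM."""
    def __init__(self, aam, d_vis=2048, d=1024, n_ans=9487):
        super().__init__()
        self.aam = aam                    # AdaptiveAttentionModule sketch above
        self.w_r = nn.Linear(d_vis, d)    # projects V into the common space
        self.w_q = nn.Linear(d, d)        # projects q into the common space
        self.w_m = nn.Linear(d, 1)        # attention scores (formula 10)
        self.w_out = nn.Linear(d, n_ans)  # FC layer before the softmax

    def forward(self, q, v, g):
        v_proj = self.w_r(v)                                   # (k, d)
        f_fusion = v_proj * self.w_q(q).unsqueeze(0)           # formula 9
        m = F.softmax(self.w_m(f_fusion).squeeze(-1), dim=0)   # formula 10
        # rank encoding of m (formula 2); argsort is not differentiable,
        # so it is computed on a detached copy
        pe = rank_positional_encoding(m.detach())
        m_hat = self.aam(m, v, g, pe)                          # step 3-2: m̂
        v_hat = (m_hat.unsqueeze(1) * v_proj).sum(dim=0)       # formula 11: v̂
        return F.softmax(self.w_out(v_hat * q), dim=-1)        # answer probs

net = VQANet(AdaptiveAttentionModule())
probs = net(torch.randn(1024), torch.randn(100, 2048),
            torch.rand(100, 4) + 0.1)                          # (9487,)
```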
the training model in the step (4) is as follows:
the question-answer pairs in the VQA-v2.0 dataset are answered by multiple people, so that the same question may have different correct answers. Previous image question-answering models treated the highest ticket number as the only correct answer and one-hot encoding (one-hot encoding) it. Because the correct answers have a plurality of elements, all answers to the same question are voted, and the weight of the correct answer in all correct answers is determined according to the number of votes. And using a Kullback-Leibler divergence loss function if N represents the length of the answer vocabulary. Presect represents the predicted value distribution, and GT represents the true value. Then the definition is as shown:
the invention has the following beneficial effects:
the invention relates to a method for uniformly modeling image-description data, reasoning on characteristics of each target in an image, and reordering attention mechanisms of each target so as to more accurately describe the image. The invention introduces the implicit geometric characteristics in the image for the first time and structures the image, so that the image and the solid characteristics in the image are subjected to cooperative reasoning, and the accuracy of the visual question-answering model can be effectively improved after the existing visual question-answering technology is combined.
The invention has a small number of parameters and is lightweight and efficient, which facilitates more efficient distributed training and makes it easier to deploy on specific hardware with limited memory.
Drawings
FIG. 1: an adaptive attention module enhanced based on candidate box geometric features;
FIG. 2: the image question-answering neural network architecture with the adaptive attention module enhanced by candidate-box geometric features.
Detailed Description
The detailed parameters of the invention are described below.
The invention provides a deep neural network framework for the image question answering (Visual Question Answering) task.
The data preprocessing and the feature extraction of the image and the text in the step (1) are specifically as follows:
1-1. For feature extraction from the image data, the MS-COCO dataset is used as the training and test data, and visual features are extracted with the existing Faster-RCNN model. Specifically, the image data is input into the Faster-RCNN network, the Faster-RCNN model detects between 10 and 100 targets in the image and frames each target, a 2048-dimensional visual feature is extracted from the image region of each target, and the coordinates and size {x, y, w, h} of each target's box are recorded as the geometric feature of the target, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100].
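As a stand-in sketch of this preprocessing (the patent's own Faster-RCNN variant and its 2048-dimensional region features are not reproduced here), torchvision's off-the-shelf detector can illustrate how the candidate boxes and the geometric features G are obtained; it assumes torchvision >= 0.13.

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = torch.rand(3, 480, 640)        # placeholder image tensor in [0, 1]
with torch.no_grad():
    out = detector([image])[0]         # dict with "boxes", "labels", "scores"

boxes = out["boxes"][:100]             # keep at most 100 detected targets
# Geometric features G = {x, y, w, h} of each candidate box
x, y = boxes[:, 0], boxes[:, 1]
w, h = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
G = torch.stack([x, y, w, h], dim=1)   # (k, 4)
```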
1-2. For the question texts, the different words appearing in the question texts of the data set are counted first, and the 9847 words with a word frequency higher than 5 are recorded in the dictionary.
1-3. Only the first 16 words of each question sentence are kept, and questions shorter than 16 words are padded with null characters. The conversion of the string into numbers is done by replacing each word with its index value in the word dictionary of 1-2, so that each question is converted into a vector of 16 word indices.
In step (2), the Adaptive Attention Module (AAM) enhanced by the candidate-box geometric features learns and associates the target features V and the geometric features G of the image so as to reorder the input original attention information m, specifically as follows:
2-1. The input attention weight vector m is processed first: the value-sorted rank pos of the attention information {m_1, m_2, ..., m_k} of each target in m is encoded, obtaining a matrix PE based on the attention information m.
2-2. PE is mapped to 128 dimensions and added to V mapped to 128 dimensions, and the output is layer-normalized to obtain the matrix V_A of size 100x128.
2-3. A correlation computation is performed on the features G: encoding with formula (2) gives a matrix of dimensions 100x100x64, the last dimension of the matrix is mapped to a single value, and the activation function ReLU then gives the matrix G_R of dimensions 100x100.
2-4. V_A and G_R are input into the association (Relation) module for reasoning: the features of each target in V_A are first mapped to 128 dimensions, and the target features are then dot-multiplied with one another to obtain the 100x100 matrix V_R. The 100x100 matrix computed jointly from V_R and G_R is used to take a weighted average of the targets in V_A, yielding the 100x128 matrix O_relation.
2-5. O_relation is passed through the fully connected layer and a sigmoid and multiplied by the original m to obtain the new 100-dimensional m̂.
Constructing a deep neural network in the step (3), which comprises the following specific steps:
3-1. For the question-text features, the text input is the 16-dimensional index-value vector generated in step (1). A word-embedding technique converts each word index into a corresponding word vector; the word-vector size used is 1024, so each question text becomes a matrix of size 16x1024. The input visual features are zero-padded into a matrix of 100x2048 and mapped by a linear layer into a matrix of 100x1024. The word vector of each time step is then used as the input of an LSTM, a recurrent neural network structure, whose output is set to a 1024-dimensional vector q.
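A sketch of this question encoder, assuming PyTorch and the dimensions above (a 9847-word question vocabulary, 1024-dimensional embeddings and LSTM state); the class name QuestionEncoder is illustrative.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size=9847, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d, padding_idx=0)  # word vectors
        self.lstm = nn.LSTM(d, d, batch_first=True)

    def forward(self, idxs):
        # idxs: (batch, 16) word-index vectors from step (1)
        emb = self.embed(idxs)          # (batch, 16, 1024)
        _, (h_n, _) = self.lstm(emb)    # final hidden state
        return h_n.squeeze(0)           # q: (batch, 1024)

q = QuestionEncoder()(torch.randint(1, 9847, (2, 16)))   # (2, 1024)
```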
3-2. The output vector q of the LSTM is input into the attention module to obtain the preliminary 100-dimensional attention feature m, completing the image attention (Attention) operation.
3-3. According to step (2), m, V and G are input into the Adaptive Attention Module (AAM) enhanced by the candidate-box geometric features, which reasons with the features of V and G and reorders m, obtaining the new 100-dimensional attention feature m̂. This completes the reasoning over the associations between the objects in the image and the reordering of the attention.
3-4. A weighted average of the 100x1024-dimensional features V with the 100-dimensional vector m̂ yields the 1024-dimensional attended visual feature v̂.
3-5. The reordered attended visual feature v̂ is fused with the output vector q of the LSTM and passed in turn through an FC layer (a fully connected neural network operation) and a softmax, finally outputting a 9487-dimensional prediction vector, in which each element represents the probability that the answer indexed by that element is the answer to the given question.
The training model in the step (4) is as follows:
The predicted 9487-dimensional vector generated in step (3) is compared with the correct answer to the question, the difference between the predicted value and the actual correct value is computed with the loss function defined above to form a loss value, and the BP algorithm adjusts the parameter values of the whole network according to this loss value, so that the difference between the predictions generated by the network and the actual values gradually decreases until the network converges.
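A minimal sketch of this training step, assuming the VQANet, QuestionEncoder and vqa_kl_loss sketches above and a hypothetical data loader yielding (question indices, V, G, vote counts); the optimizer settings are illustrative.

```python
import torch

def train_epoch(net, encoder, loader, optimizer):
    net.train()
    for q_idx, v, g, votes in loader:
        q = encoder(q_idx.unsqueeze(0)).squeeze(0)  # question vector q
        pred = net(q, v, g)                         # predicted answer distribution
        loss = vqa_kl_loss(pred, votes)             # formula 12
        optimizer.zero_grad()
        loss.backward()                             # back-propagation (BP)
        optimizer.step()

# optimizer = torch.optim.Adam(
#     list(net.parameters()) + list(encoder.parameters()), lr=1e-4)
```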
Claims (5)
1. An image question-answering method based on multi-target association depth reasoning is characterized by comprising the following steps:
step (1), data preprocessing, and feature extraction of image and text data
Firstly, preprocessing an image:
detecting the target entities contained in the image by using a Faster-RCNN deep neural network structure; extracting the visual features V and the geometric features G containing the size and coordinate information of each target in the image;
preprocessing the text data:
counting the sentence lengths of the given question texts, and setting the maximum length of the question text according to the statistical information; constructing a question-text vocabulary dictionary, replacing the words of the question with their index values in the vocabulary dictionary, and then converting the question text into a vector q through an LSTM;
step (2), attention module enhanced by candidate-box geometric features
Geometric feature G, visual feature V and attention weight vector m for the three input feature candidate box positions;
firstly, rank-encoding the attention weight vector m: converting it into vectors according to the order of the weights, mapping them to a high dimension, adding the visual features V mapped to the same dimension, and layer-normalizing the output to obtain V_A;
then mapping the geometric features G through a linear layer and the activation function ReLU to obtain G_R; inputting V_A and G_R into the candidate-box relation component for reasoning to obtain O_relation; passing O_relation through a linear layer and a sigmoid function and multiplying by the original attention weight vector m to obtain the new attention weight vector m̂;
Step (3) constructing a deep neural network
Firstly, converting a problem text into an index value vector according to a vocabulary dictionary; then the vector is transmitted into a Long Short Term Memory network (LSTM) through high-dimensional mapping, the output vector q and the visual feature V obtained by using fast R-CNN are fused in a Hadamard product (Hadamard product) mode, and an attention weight vector m of each entity feature is obtained through an attention module; inputting the attention weight vector m, the visual feature V and the geometric feature G into an adaptive attention module based on the geometric feature enhancement of the candidate frame, reasoning by using the visual feature and the geometric feature of the position of the candidate frame, reordering the attention weight vector to obtain a new attention weight vectorAttention weight vectorFusing the product with the visual feature V and then carrying out weighted average to obtain new visual featuresCharacterizing visual featuresGenerating probability through a softmax function by fusing the problem text vector q with a Hadamard product, and outputting the probability as an output predicted value of the network;
step (4), model training
training the model parameters of the neural network in the step (3) by using the back-propagation algorithm according to the difference between the generated prediction and the ground-truth answer, until the whole network model converges.
2. The image question-answering method based on multi-target association depth reasoning according to claim 1, characterized in that the step (1) is implemented as follows:
1-1. extracting the features of the image i by using the existing deep neural network Faster-RCNN, wherein the extracted features comprise the visual features V and the geometric features G of the k targets contained in the image, V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]; the visual vector of a single target is v_i ∈ R^2048 and the geometric feature of a single target is g_i = {x, y, w, h}, where x, y, w and h are the position parameters of the geometric feature, representing respectively the abscissa, ordinate, width and height of the candidate box containing the entity in the image;
1-2. for a given question text, firstly counting the different words appearing in the question texts of the data set and recording them in a dictionary; converting the words of the question into index values according to the word dictionary, thereby converting the question text into an index vector of fixed length, the specific formula being:

Q = {q_{w_1}, q_{w_2}, ..., q_{w_l}} (formula 1)

where q_{w_k} is the index value of the word w_k in the dictionary and l represents the length of the question text.
3. The image question-answering method based on multi-target association depth reasoning according to claim 2, wherein the adaptive attention module deep reasoning network enhanced by the candidate-box geometric features in step (2) is specifically as follows:
2-1. firstly processing the input attention weight vector m: encoding the value-sorted rank pos of each target's attention weight in m = {m_1, m_2, ..., m_k} into a matrix PE with the sinusoidal positional encoding, the specific formula being:

PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) (formula 2)
2-2. passing the matrix PE and the visual features V through separate linear layers, adding them, and layer-normalizing the output to obtain V_A, the specific formula being:

V_A = LayerNorm(W_PE·PE^T + W_V·V^T) (formula 3)
2-3. performing a correlation computation on the geometric features G: passing the pairwise encoding Ω(G) through a linear layer to obtain G_R, the specific formula being:

G_R = W_G·Ω(G)^T (formula 4)
2-4. inputting V_A and G_R into the relation module for reasoning to obtain O_relation, where V_R denotes the visual similarity matrix computed inside the relation module from the linearly mapped target features of V_A (formulas 5-6), the specific formula being:

O_relation = softmax(log(G_R) + V_R)·(W_O·V_A + b_O) (formula 7)
2-5. passing O_relation through the fully connected layer and a sigmoid function and multiplying by the original attention weight vector m to obtain the new attention weight vector m̂, the specific formula being:

m̂ = sigmoid(W_F·O_relation + b_F) ∘ m (formula 8)
4. The image question-answering method based on multi-target association depth reasoning according to claim 3, wherein the deep neural network of step (3) is constructed specifically as follows:
3-1. mapping the question text vector q and the visual features V to a common space through the linear transformations of fully connected layers and then fusing them with a Hadamard product, where F_fusion denotes the fused feature in the common space; W_r and W_q denote the corresponding fully-connected-layer parameters that linearly transform the visual features V and the current state information q; the symbol ∘ denotes the Hadamard product of the two matrices; W_m denotes the fully-connected-layer parameters that reduce the dimension of the fused feature and produce the attention weight distribution; m_j denotes the attention weight of the currently computed j-th region in the initial attention weight vector m; the specific formulas being:

F_fusion = (W_r·V^T) ∘ (W_q·q) (formula 9)

m = softmax(W_m·F_fusion + b_m) (formula 10)
3-2. according to step (2), inputting m, V and G into the adaptive attention module enhanced by the candidate-box geometric features, reasoning with the features of V and G, and reordering m to obtain the new attention vector m̂;
3-3. obtaining the new visual feature vector v̂ by multiplying m̂ elementwise with the features of V and taking the weighted average, the specific formula being:

v̂ = Σ_{j=1}^{k} m̂_j·v_j (formula 11)
5. The image question-answering method based on multi-target association depth reasoning according to claim 4, wherein the model training in the step (4) is as follows:

the question-answer pairs in the VQA-v2.0 dataset are answered by multiple people, so the same question may have several different correct answers; previous image question-answering models treated the answer with the highest number of votes as the only correct answer and one-hot encoded it; because the correct answers are plural, all answers to the same question are tallied, and each correct answer is weighted among all correct answers according to its number of votes; a Kullback-Leibler divergence loss function is used, where N denotes the length of the answer vocabulary, Predict the predicted value distribution and GT the ground-truth distribution; the loss is then defined as:

Loss = Σ_{i=1}^{N} GT_i·log(GT_i / Predict_i) (formula 12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910398140.1A CN110263912B (en) | 2019-05-14 | 2019-05-14 | Image question-answering method based on multi-target association depth reasoning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110263912A CN110263912A (en) | 2019-09-20 |
CN110263912B true CN110263912B (en) | 2021-02-26 |
Family
ID=67914695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910398140.1A Active CN110263912B (en) | 2019-05-14 | 2019-05-14 | Image question-answering method based on multi-target association depth reasoning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110263912B (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110879844B (en) * | 2019-10-25 | 2022-10-14 | 北京大学 | Cross-media reasoning method and system based on heterogeneous interactive learning |
CN110889505B (en) * | 2019-11-18 | 2023-05-02 | 北京大学 | Cross-media comprehensive reasoning method and system for image-text sequence matching |
CN111598118B (en) * | 2019-12-10 | 2023-07-07 | 中山大学 | Visual question-answering task implementation method and system |
CN111553372B (en) * | 2020-04-24 | 2023-08-08 | 北京搜狗科技发展有限公司 | Training image recognition network, image recognition searching method and related device |
CN111611367B (en) * | 2020-05-21 | 2023-04-28 | 拾音智能科技有限公司 | Visual question-answering method introducing external knowledge |
CN111737458B (en) * | 2020-05-21 | 2024-05-21 | 深圳赛安特技术服务有限公司 | Attention mechanism-based intention recognition method, device, equipment and storage medium |
CN113837212B (en) * | 2020-06-24 | 2023-09-26 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN111897939B (en) * | 2020-08-12 | 2024-02-02 | 腾讯科技(深圳)有限公司 | Visual dialogue method, training method, device and equipment for visual dialogue model |
CN112016493B (en) * | 2020-09-03 | 2024-08-23 | 科大讯飞股份有限公司 | Image description method, device, electronic equipment and storage medium |
CN112309528B (en) * | 2020-10-27 | 2023-04-07 | 上海交通大学 | Medical image report generation method based on visual question-answering method |
CN113010712B (en) * | 2021-03-04 | 2022-12-02 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113094484A (en) * | 2021-04-07 | 2021-07-09 | 西北工业大学 | Text visual question-answering implementation method based on heterogeneous graph neural network |
CN113326933B (en) * | 2021-05-08 | 2022-08-09 | 清华大学 | Attention mechanism-based object operation instruction following learning method and device |
CN113761153B (en) * | 2021-05-19 | 2023-10-24 | 腾讯科技(深圳)有限公司 | Picture-based question-answering processing method and device, readable medium and electronic equipment |
CN113220859B (en) * | 2021-06-01 | 2024-05-10 | 平安科技(深圳)有限公司 | Question answering method and device based on image, computer equipment and storage medium |
CN113392253B (en) * | 2021-06-28 | 2023-09-29 | 北京百度网讯科技有限公司 | Visual question-answering model training and visual question-answering method, device, equipment and medium |
CN113515615A (en) * | 2021-07-09 | 2021-10-19 | 天津大学 | Visual question-answering method based on capsule self-guide cooperative attention mechanism |
CN113792703B (en) * | 2021-09-29 | 2024-02-02 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention depth modular network |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114564958B (en) * | 2022-01-11 | 2023-08-04 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and medium |
CN117274616B (en) * | 2023-09-26 | 2024-03-29 | 南京信息工程大学 | Multi-feature fusion deep learning service QoS prediction system and prediction method |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160342895A1 (en) * | 2015-05-21 | 2016-11-24 | Baidu Usa Llc | Multilingual image question answering |
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN107608943A (en) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | Merge visual attention and the image method for generating captions and system of semantic notice |
CN107766447A (en) * | 2017-09-25 | 2018-03-06 | 浙江大学 | It is a kind of to solve the method for video question and answer using multilayer notice network mechanism |
CN108108771A (en) * | 2018-01-03 | 2018-06-01 | 华南理工大学 | Image answering method based on multiple dimensioned deep learning |
CN108228703A (en) * | 2017-10-31 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image answering method, device, system and storage medium |
CN108829677A (en) * | 2018-06-05 | 2018-11-16 | 大连理工大学 | A kind of image header automatic generation method based on multi-modal attention |
CN109472024A (en) * | 2018-10-25 | 2019-03-15 | 安徽工业大学 | A kind of file classification method based on bidirectional circulating attention neural network |
CN109712108A (en) * | 2018-11-05 | 2019-05-03 | 杭州电子科技大学 | It is a kind of that vision positioning method is directed to based on various distinctive candidate frame generation network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10366166B2 (en) * | 2017-09-07 | 2019-07-30 | Baidu Usa Llc | Deep compositional frameworks for human-like language acquisition in virtual environments |
CN109829049B (en) * | 2019-01-28 | 2021-06-01 | 杭州一知智能科技有限公司 | Method for solving video question-answering task by using knowledge base progressive space-time attention network |
Non-Patent Citations (4)

Title |
---|
Kan Chen et al., "ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering", arXiv:1511.05960v2, 2016-04-03 *
Ashish Vaswani et al., "Attention Is All You Need", arXiv:1706.03762v5, 2017-12-06 *
Li Qing, "Research on Image Question Answering Based on Deep Neural Network and Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology, No. 1, 2019-01-15 *
Yu Jun et al., "Research on Visual Question Answering Techniques", Journal of Computer Research and Development, Vol. 55, No. 9, 2018-12-31 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110263912B (en) | Image question-answering method based on multi-target association depth reasoning | |
CN110490946B (en) | Text image generation method based on cross-modal similarity and antagonism network generation | |
CN110334705B (en) | Language identification method of scene text image combining global and local information | |
CN107480206B (en) | Multi-mode low-rank bilinear pooling-based image content question-answering method | |
CN112100346B (en) | Visual question-answering method based on fusion of fine-grained image features and external knowledge | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN113297364B (en) | Natural language understanding method and device in dialogue-oriented system | |
CN110134946A (en) | A kind of machine reading understanding method for complex data | |
CN112733866A (en) | Network construction method for improving text description correctness of controllable image | |
CN110516530A (en) | A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature | |
CN112527993B (en) | Cross-media hierarchical deep video question-answer reasoning framework | |
CN115222998B (en) | Image classification method | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN113761153A (en) | Question and answer processing method and device based on picture, readable medium and electronic equipment | |
CN114970517A (en) | Visual question and answer oriented method based on multi-modal interaction context perception | |
CN115331075A (en) | Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph | |
CN115563327A (en) | Zero sample cross-modal retrieval method based on Transformer network selective distillation | |
CN114241191A (en) | Cross-modal self-attention-based non-candidate-box expression understanding method | |
CN116580440A (en) | Lightweight lip language identification method based on visual transducer | |
CN114638408A (en) | Pedestrian trajectory prediction method based on spatiotemporal information | |
CN117671666A (en) | Target identification method based on self-adaptive graph convolution neural network | |
CN113837290A (en) | Unsupervised unpaired image translation method based on attention generator network | |
CN115984485A (en) | High-fidelity three-dimensional face model generation method based on natural text description | |
Jiang et al. | Cross-level reinforced attention network for person re-identification | |
CN117972138B (en) | Training method and device for pre-training model and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||