
CN109284414B - Cross-modal content retrieval method and system based on semantic preservation - Google Patents

Cross-modal content retrieval method and system based on semantic preservation

Info

Publication number
CN109284414B
Authority
CN
China
Prior art keywords
sample
retrieval
node
mode
modality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811156579.5A
Other languages
Chinese (zh)
Other versions
CN109284414A (en)
Inventor
王树徽
吴益灵
黄庆明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201811156579.5A priority Critical patent/CN109284414B/en
Publication of CN109284414A publication Critical patent/CN109284414A/en
Application granted granted Critical
Publication of CN109284414B publication Critical patent/CN109284414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a cross-modal content retrieval method based on semantic preservation, which comprises the following steps: respectively constructing a first feature graph and a second feature graph with the feature vectors of the first modality samples and the second modality samples as nodes; extracting the label vectors of all samples as nodes to construct a semantic graph; acquiring the neighbor nodes of each node; respectively constructing a first mapping function and a second mapping function for mapping the first modality samples and the second modality samples into implicit representations; learning the mapping functions so as to approximately maximize the likelihood of the occurrence of the neighbor nodes of each node and to enable each implicit representation to reconstruct the label information of the corresponding node; mapping the retrieval sample into a retrieval implicit representation with the first mapping function, and mapping each second modality sample into a target implicit representation with the second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all second modality samples corresponding to distances smaller than a retrieval threshold as the retrieval result.

Description

Cross-modal content retrieval method and system based on semantic preservation
Technical Field
The invention relates to a cross-modal retrieval technology in the multimedia field, in particular to a cross-modal content retrieval technology.
Background
With the development of multimedia technology, data of various modalities are widely present on the internet, and cross-modal retrieval has become one of the important research topics in the multimedia field. In a traditional single-modality retrieval system, the query samples and the retrieval results are limited to a single modality, which cannot meet the growing requirements of users. A cross-modal retrieval system differs in that the query sample and the retrieval results belong to different modalities; for example, image, video, or audio samples can be used as query samples to retrieve text content. Cross-modal retrieval technology thus provides a more convenient retrieval mode: users can conveniently acquire the information they need across multiple modalities, improving the user experience. Because the query sample and the retrieval results belong to different modalities, how to compare the semantic similarity of samples in different modalities is a problem worthy of research.
Due to the heterogeneity of different modalities, the key to cross-modal retrieval is how to correlate them. Currently, most cross-modal retrieval algorithms map samples of different modalities into a low-dimensional implicit space. According to the type of implicit representation learned, these algorithms can be divided into real-valued-representation and binary-representation cross-modal retrieval methods; according to the information they use, they can be divided into unsupervised and supervised methods. Unsupervised methods use only the co-occurrence information of samples in different modalities, while supervised methods also use the label information carried by the samples. In general, the more information a cross-modal retrieval algorithm uses, the better it works.
The label information can serve as high-level semantic information to guide the establishment of relationships between samples of different modalities: although samples of different modalities have different feature spaces, they share the same label space. In existing methods, label information is used as another modality, either to compute similarity-related image-text pairs or as the representation of the implicit space. Existing methods use label information in a simple way, considering only the correlation between modalities and ignoring the correlation within modalities, even though intra-modality correlation information is very important. Within the same modality, samples with similar semantics should have similar implicit representations, and across modalities, samples with similar semantics should also have similar implicit representations; this consistency ensures that all semantically similar samples obtain similar implicit representations. We therefore consider creating a semantic graph containing all samples to provide high-level semantic constraints, adding two feature graphs containing the samples of the respective modalities to provide manifold constraints, and reconstructing the label information to provide global semantic constraints. In addition, when the number of nodes is M, the conventional graph-based approach requires creating a graph of complexity O(M²), and solving it requires a complex eigenvalue decomposition process, so learning the graph structure calls for a more efficient algorithm.
Disclosure of Invention
In order to solve the above problems, the invention discloses a cross-modal content retrieval method and system based on semantic preservation, the method comprising the following steps: constructing a retrieval set from the first modality samples, and constructing a target set from the second modality samples; extracting the feature vectors of the first modality samples as nodes to construct a first feature graph; extracting the feature vectors of the second modality samples as nodes to construct a second feature graph; extracting the label vectors of the label information of all samples in the retrieval set and the target set as nodes to construct a semantic graph; acquiring the neighbor nodes of each node; constructing a first mapping function for mapping the first modality samples to implicit representations and a second mapping function for mapping the second modality samples to implicit representations; learning the first mapping function and the second mapping function so as to approximately maximize the likelihood of the occurrence of the neighbor nodes of each node and to enable each implicit representation to reconstruct the label information of the corresponding node; taking a certain first modality sample as a retrieval sample, mapping the retrieval sample to a retrieval implicit representation through the first mapping function, and mapping each second modality sample to a target implicit representation through the second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all second modality samples whose corresponding distances are smaller than a retrieval threshold as the retrieval result of the retrieval sample.
The invention relates to a cross-modal content retrieval method, wherein neighbor sampling and negative sampling are adopted to learn the first mapping function and the second mapping function: a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes, nodes connected to the sampled node are sampled from the multinomial distribution as neighbor nodes, and nodes not connected to the sampled node are selected under a uniform distribution as negative nodes.
The invention relates to a cross-modal content retrieval method, wherein the distance is the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ between the retrieval implicit representation and the target implicit representation, or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$, where $x_i$ is the retrieval implicit representation and $x_j$ is the target implicit representation.
The invention relates to a cross-modal content retrieval method, wherein the modalities of a first modality sample comprise a visual modality, an auditory modality and a text modality, and the modalities of a second modality sample comprise a visual modality, an auditory modality and a text modality.
The invention relates to a cross-modal content retrieval method, wherein if the modality of the first modality sample and/or the second modality sample is a visual modality, the feature vector of that sample is a scale-invariant feature transform (SIFT) feature, a convolutional neural network feature of the visual modality, or a histogram of oriented gradients (HOG) feature; and if the modality of the first modality sample and/or the second modality sample is a text modality, the feature vector of that sample is a term frequency-inverse document frequency (TF-IDF) feature or a deep convolutional/recurrent neural network feature of the text modality.
The invention also discloses a cross-modal content retrieval system based on semantic preservation, which comprises:
the sample set construction module is used for constructing a retrieval set by using the first modality samples and constructing a target set by using the second modality samples;
the feature graph construction module is used for constructing a first feature graph, a second feature graph, and a semantic graph, and for obtaining the neighbor nodes of each node; wherein the feature vector of the first modality sample is extracted as a node to construct the first feature graph, the feature vector of the second modality sample is extracted as a node to construct the second feature graph, and the label vectors of the label information of all samples in the retrieval set and the target set are extracted as nodes to construct the semantic graph;
the mapping function learning module is used for constructing a mapping function and learning the mapping function; wherein a first mapping function for mapping the first modality sample to an implicit representation and a second mapping function for mapping the second modality sample to an implicit representation are constructed; learning the first mapping function and the second mapping function to approximately maximize the likelihood of occurrence of the neighbor node of each node, and enabling each implicit expression to reconstruct the corresponding label information of the corresponding node;
the sample retrieval module is used for acquiring a retrieval result; wherein, a certain first mode sample is used as a retrieval sample, the retrieval sample is mapped into a retrieval implicit expression by the first mapping function, and each second mode sample is mapped into a target implicit expression by the second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all the second modal samples corresponding to the distances smaller than a retrieval threshold value as retrieval results of the retrieval samples.
The invention relates to a cross-modal content retrieval system, wherein a mapping function learning module comprises:
the neighbor sampling module is used for learning the first mapping function and the second mapping function by adopting neighbor sampling; wherein a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes, and the nodes connected to the sampled node that are sampled from this multinomial distribution are its neighbor nodes;
a negative sampling module for learning the first mapping function and the second mapping function by adopting negative sampling; wherein nodes not connected to the sampled node, selected under a uniform distribution, are negative nodes.
In the cross-modal content retrieval system of the present invention, in the sample retrieval module, the distance is the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ between the retrieval implicit representation and the target implicit representation, or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$, where $x_i$ is the retrieval implicit representation and $x_j$ is the target implicit representation.
The invention relates to a cross-modal content retrieval system, wherein the modalities of the first modality sample comprise a visual modality, an auditory modality and a text modality, and the modalities of the second modality sample comprise a visual modality, an auditory modality and a text modality.
In the cross-modal content retrieval system provided by the invention, if the modality of the first modality sample and/or the second modality sample is a visual modality, the feature vector of that sample is a scale-invariant feature transform (SIFT) feature, a convolutional neural network feature of the visual modality, or a histogram of oriented gradients (HOG) feature; and if the modality of the first modality sample and/or the second modality sample is a text modality, the feature vector of that sample is a term frequency-inverse document frequency (TF-IDF) feature or a deep convolutional/recurrent neural network feature of the text modality.
Drawings
FIG. 1 is a flowchart of a cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the feature graphs and the semantic graph of the cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention.
Fig. 3 is a mapping function diagram of a cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of a cross-modal content retrieval system based on semantic preservation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the cross-modal content retrieval method and system based on semantic preservation according to the present invention are further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a cross-modal retrieval method based on semantic preservation that involves multiple modalities. For convenience of description, the embodiment of the present invention involves only two modalities, text and image, but it should be understood that the cross-modal content retrieval method according to the present invention can be widely applied to modalities such as text, vision, and hearing, as well as to multi-modal data such as video, and is not limited to the above modalities. The cross-modal retrieval method is roughly divided into three steps: first, original features are extracted from each sample by feature extraction; then a mapping function is learned to map each sample from its original features to an implicit representation; finally, the distances between the implicit representation of the retrieval sample and the implicit representations of the samples in the target set are calculated, the samples are sorted by distance, and the target-set samples whose distance to the retrieval sample is smaller than a threshold are selected as the retrieval result.
FIG. 1 is a flowchart of a cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention. As shown in fig. 1, in the embodiment of the present invention, the cross-modal search method based on semantic preservation specifically includes:
step S1, a search set and a target set are constructed, wherein the samples of the search set all have a first modality, which is called a first modality sample, the samples of the target set all have a second modality, which is called a second modality sample, the modalities of the first modality sample and the second modality sample include a visual modality, an auditory modality, a text modality, and the like, and may also be a multi-modality including a visual modality and an auditory modality, such as a video modality, and the like, which is not limited herein; the first mode sample and the second mode sample have different modes, in the embodiment of the present invention, the first mode is an image mode, and the second mode is a text mode;
step S2, extracting the feature vectors of all the first modality samples as nodes to construct a first feature graph; extracting the feature vectors of all the second modality samples as nodes to construct a second feature graph; extracting the label information of the semantic labels of all the first modality samples and second modality samples as label vectors, and constructing a semantic graph with each label vector as a node; in the embodiment of the present invention, where the first modality sample is an image sample and the second modality sample is a text sample, the feature vectors of the image samples and the text samples are extracted first; the feature vector of an image sample may be a SIFT (scale-invariant feature transform) feature, a CNN (convolutional neural network) feature of the visual modality, or a HOG (histogram of oriented gradients) feature, and the feature vector of a text sample may be a TF-IDF (term frequency-inverse document frequency) feature or a CNN/RNN (deep convolutional/recurrent neural network) feature of the text modality, which is not limited herein, as sketched below;
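For concreteness, a minimal sketch of this feature-extraction step follows, using scikit-learn's TfidfVectorizer for the text modality; the toy corpus, the placeholder image features, and all variable names are illustrative assumptions rather than part of the patent.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Text-modality features: TF-IDF vectors computed from the raw documents
# (toy corpus, an assumption for illustration).
docs = ["a dog runs on the grass", "a cat sleeps on the sofa"]
tfidf = TfidfVectorizer()
text_feats = tfidf.fit_transform(docs).toarray()   # shape: (n_texts, vocab_size)

# Image-modality features: a random placeholder here; in practice SIFT/HOG
# descriptors or activations of a pretrained CNN would be extracted instead.
image_feats = np.random.randn(2, 4096)
```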
Fig. 2 is a schematic diagram of the feature graphs and the semantic graph of the cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention. As shown in Fig. 2, three graphs are created from the first modality samples and the second modality samples: a semantic graph Gs, a first feature graph (the image feature graph Gt), and a second feature graph (the text feature graph Gi); each label vector extracted from the semantic labels of the text samples and image samples is a node in the semantic graph Gs;
step S3, acquiring the neighbor nodes of each node through the three graphs, namely the semantic graph, the first feature graph, and the second feature graph; because the semantic graph Gs contains the semantic labels of both the text samples and the image samples, it captures both inter-modality and intra-modality semantic information. The connections between the nodes of the semantic graph are established by taking the label information of the image samples and text samples as label vectors, using either of the following two methods:
the first method is that if and only if the label vectors of two nodes in the semantic graph have at least one value with the same dimension being not 0, an edge is established between the two nodes, the vector similarity is calculated according to the label vectors as the weight of the edge between the nodes, and the cosine similarity can be used
Figure BDA0001819064100000061
Or using exponential similarity
Figure BDA0001819064100000062
Where z isi、zjLabel vectors for nodes i and j, respectively, and sigma is a width coefficient;
the second method uses the existing knowledge graph to establish the connection of each node of the semantic graph, for example, find the corresponding concept of the label of the image sample and the text sample in the word network (WordNet), and uses the similarity of the entity in the knowledge graph, such as the shortest path, etc., as the weight of the edge between the nodes in the semantic graph; for the case of multiple labels, the similarity between all labels needs to be averaged to be used as the weight of the edges in the semantic graph Gs; in the first feature map (image feature map), for any two nodes, the feature vector of the image is used for calculating the distance, if one node is a k neighbor node of the other node, the two nodes have connection, and the weight of the edge is 1; in the second feature map (text feature map), for any two nodes, the distance is calculated by using the feature vector of the text, if one node is a k neighbor node of the other node, the two nodes have connection, and the weight of the edge is 1;
step S4, constructing a first mapping function for mapping the first modality samples into implicit representations, and constructing a second mapping function for mapping the second modality samples into implicit representations; learning the first mapping function and the second mapping function to approximately maximize the likelihood of the occurrence of the neighbor nodes of each node, and enabling each implicit representation to reconstruct the label information of the corresponding node. The implicit representation of an image sample $v_i$ is denoted $f_v(v_i)$, that of a text sample $t_i$ is denoted $f_t(t_i)$, and the two are collectively denoted $f(n_i)$, where $n_i$ is an image sample or a text sample; in order to preserve the local structures of the semantic graph, the first feature graph, and the second feature graph, the probability of the occurrence of the neighbor nodes of each node is maximized on each graph respectively;
for node niOne set of sampled neighbor samples P (n)i) To maximize the probability
Figure BDA0001819064100000063
Where V is the set of all nodes in the semantic graph, the first feature graph and the second feature graph, P (n)i) Representation node niThe samples corresponding to the neighbor nodes of (1) are neighbor samples, and T represents the transposition of the vector;
when the number of nodes is large, sampling negative samples will result in the aforementioned probability Pr (P (n)i)|ni) Relaxation to minimize losses
Figure BDA0001819064100000071
N(ni) Representation node niA negative sample of (d); the neighbour samples being obtained by neighbour sampling, i.e. from each neighbour node to node niThe weights of the edges of (a) establish a polynomial distribution from which the neighbor nodes are sampled; obtaining negative samples by negative sampling, i.e. selecting n according to uniform distributioniNodes without connections are the most negative examples; in the three graphs, similar neighbor sampling and negative sampling are adopted to ensure local structure, and G in the above formula can be a semantic graph GsText feature graph GiOr image feature map GtI.e. to the image sample viCan obtain
Figure BDA0001819064100000072
For text sample tiCan obtain
Figure BDA0001819064100000073
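The neighbor sampling and negative sampling just described might be implemented as in the sketch below, where `W` is assumed to be the weighted adjacency matrix of one of the three graphs; the function names and sample counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(i, W, num_neighbors=5):
    """Neighbor sampling: draw neighbors of node i from the multinomial
    distribution defined by the weights of the edges incident to i."""
    weights = W[i]
    if weights.sum() == 0:
        return np.array([], dtype=int)   # isolated node: nothing to sample
    probs = weights / weights.sum()
    return rng.choice(len(weights), size=num_neighbors, replace=True, p=probs)

def sample_negatives(i, W, num_negatives=5):
    """Negative sampling: draw nodes not connected to i under a uniform
    distribution."""
    candidates = np.flatnonzero(W[i] == 0)
    candidates = candidates[candidates != i]
    if candidates.size == 0:
        return np.array([], dtype=int)
    return rng.choice(candidates, size=num_negatives, replace=True)
```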
In addition, a global semantic preservation condition is introduced: the semantic label information should be recoverable from the implicit representation obtained by the mapping. Let $g(\cdot)$ be a function from the implicit representation to the semantic label; the loss for global semantic preservation is

$$L_{label}(n_i) = \big\lVert g\big(f(n_i)\big) - y_{n_i} \big\rVert^2,$$

where $y_{n_i}$ is the semantic label of node $n_i$;
in general, for an image sample viThe optimized loss is:
Figure BDA0001819064100000076
whereinAlpha and beta are equilibrium coefficients; similarly, for text sample tiThe optimized loss is:
Figure BDA0001819064100000077
in order to model the non-linear relationship between the original features and the implicit representation, the present invention employs the structure of a neural network. Fig. 3 is a mapping function diagram of a cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention. As shown in FIG. 3, fv(·)、ftMapping text and images to a unified implicit representation space, and then mapping from g to a semantic tag space, the form of the network can be different for different specific application scenarios, such as fv(·)、ftThe number of layers (·), g (·) may be increased or decreased; finally, optimizing a loss function by using a random gradient descent method and an error back propagation algorithm, and learning a mapping function;
step S5, calculating the implicit representation of each sample according to the learned mapping functions; for a given first modality sample (the retrieval sample) in the retrieval set, calculating the distance between its implicit representation and the implicit representation of each second modality sample in the target set, where the distance can be the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$ (the invention is not limited thereto), where $x_i$ and $x_j$ denote the implicit representations of the first modality sample and of a second modality sample, respectively; sorting all the obtained distances from small to large, and selecting the first N second modality samples in the sorted order as the retrieval result of the retrieval sample according to a preset retrieval threshold N.
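This retrieval step might look as follows in NumPy, with `f_query` and `f_target` standing in for the learned mapping functions; the function signature and the toy usage are assumptions for illustration.

```python
import numpy as np

def retrieve(query_feat, target_feats, f_query, f_target, n=10, metric="cosine"):
    """Map the query and all target samples into the implicit space with the
    learned mapping functions, then return the indices of the n targets with
    the smallest distance to the query."""
    q = f_query(query_feat)                             # retrieval implicit representation
    T = np.stack([f_target(t) for t in target_feats])   # target implicit representations
    if metric == "cosine":
        dists = 1.0 - (T @ q) / (np.linalg.norm(T, axis=1) * np.linalg.norm(q) + 1e-12)
    else:                                               # squared Euclidean distance
        dists = np.sum((T - q) ** 2, axis=1)
    return np.argsort(dists)[:n]

# Toy usage with identity mappings (assumption):
idx = retrieve(np.random.randn(8), np.random.randn(100, 8),
               f_query=lambda x: x, f_target=lambda t: t, n=5)
```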
The invention also discloses a cross-modal content retrieval system based on semantic preservation. Fig. 4 is a schematic diagram of the cross-modal content retrieval system based on semantic preservation according to an embodiment of the present invention. As shown in Fig. 4, the cross-modal content retrieval system of the present invention comprises a sample set construction module, a feature graph construction module, a mapping function learning module, and a sample retrieval module. The sample set construction module is used for constructing a retrieval set and a target set, where the samples of the retrieval set have a first modality and are called first modality samples, and the samples of the target set have a second modality and are called second modality samples. The feature graph construction module is used for extracting the feature vectors of all first modality samples as nodes to construct a first feature graph, extracting the feature vectors of all second modality samples as nodes to construct a second feature graph, extracting the label vectors of the label information of all first modality samples and second modality samples as nodes to construct a semantic graph, and acquiring the neighbor nodes of each node. The mapping function learning module is used for constructing a first mapping function that maps a first modality sample to an implicit representation and a second mapping function that maps a second modality sample to an implicit representation, and for learning the two mapping functions so as to approximately maximize the likelihood of the occurrence of the neighbor nodes of each node while enabling each implicit representation to reconstruct the label information of the corresponding node. The sample retrieval module is used for obtaining the retrieval result: a certain first modality sample is taken as the retrieval sample and mapped to a retrieval implicit representation through the first mapping function, each second modality sample in the target set is mapped to a target implicit representation through the second mapping function, the distances between the retrieval implicit representation and each target implicit representation are obtained and sorted from small to large, and the first N second modality samples in the sorted order are selected as the retrieval result of the retrieval sample according to a preset retrieval threshold N.

Claims (10)

1. A cross-modal content retrieval method based on semantic preservation is characterized by comprising the following steps:
constructing a retrieval set by using the first mode sample, and constructing a target set by using the second mode sample;
extracting the feature vector of the first modality sample as a node to construct a first feature graph; extracting the feature vector of the second modality sample as a node to construct a second feature graph; extracting the label vectors of the label information of all samples in the retrieval set and the target set as nodes to construct a semantic graph; acquiring a neighbor node of each node;
constructing a first mapping function for mapping the first modality sample to an implicit representation and a second mapping function for mapping the second modality sample to an implicit representation; learning the first mapping function and the second mapping function to approximately maximize the likelihood of occurrence of the neighbor node of each node and enable each implicit expression to reconstruct the corresponding label information of the corresponding node;
taking a certain first modal sample as a retrieval sample, mapping the retrieval sample into a retrieval implicit expression through a first mapping function, and mapping each second modal sample into a target implicit expression through a second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all the second modal samples corresponding to the distances smaller than a retrieval threshold value as retrieval results of the retrieval samples.
2. The cross-modal content retrieval method of claim 1, wherein the first mapping function and the second mapping function are learned using neighbor sampling and negative sampling: a multinomial distribution is established based on the weights of the edges from a sampled node to its adjacent nodes, nodes connected to the sampled node are sampled from the multinomial distribution as neighbor nodes, and nodes not connected to the sampled node are selected under a uniform distribution as negative nodes.
3. The method of claim 1, wherein the distance is the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ between the retrieval implicit representation and the target implicit representation, or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$, where $x_i$ is the retrieval implicit representation and $x_j$ is the target implicit representation.
4. A cross-modal content retrieval method according to claim 1, wherein the modalities of the first modality sample include a visual modality, an auditory modality, and a text modality, and the modalities of the second modality sample include a visual modality, an auditory modality, and a text modality.
5. The method according to claim 4, wherein if the modality of the first modality sample or the second modality sample is a visual modality, the feature vector of that sample is a scale-invariant feature transform (SIFT) feature, a convolutional neural network feature of the visual modality, or a histogram of oriented gradients (HOG) feature; and if the modality of the first modality sample or the second modality sample is a text modality, the feature vector of that sample is a term frequency-inverse document frequency (TF-IDF) feature or a deep convolutional/recurrent neural network feature of the text modality.
6. A cross-modal content retrieval system based on semantic preservation, comprising:
the sample set construction module is used for constructing a retrieval set by using the first modality samples and constructing a target set by using the second modality samples;
the feature graph construction module is used for constructing a first feature graph, a second feature graph, and a semantic graph, and for obtaining the neighbor nodes of each node; wherein the feature vector of the first modality sample is extracted as a node to construct the first feature graph, the feature vector of the second modality sample is extracted as a node to construct the second feature graph, and the label vectors of the label information of all samples in the retrieval set and the target set are extracted as nodes to construct the semantic graph;
the mapping function learning module is used for constructing a mapping function and learning the mapping function; wherein a first mapping function for mapping the first modality sample to an implicit representation and a second mapping function for mapping the second modality sample to an implicit representation are constructed; learning the first mapping function and the second mapping function to approximately maximize the likelihood of occurrence of the neighbor node of each node and enable each implicit expression to reconstruct the corresponding label information of the corresponding node;
the sample retrieval module is used for acquiring a retrieval result; wherein, a certain first mode sample is used as a retrieval sample, the retrieval sample is mapped into a retrieval implicit expression by the first mapping function, and each second mode sample is mapped into a target implicit expression by the second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all the second modal samples corresponding to the distances smaller than a retrieval threshold value as retrieval results of the retrieval samples.
7. The cross-modal content retrieval system of claim 6, wherein the mapping function learning module comprises:
the neighbor sampling module is used for learning the first mapping function and the second mapping function by adopting neighbor sampling; wherein a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes, and the nodes connected to the sampled node that are sampled from this multinomial distribution are its neighbor nodes;
the negative sampling module is used for approximately learning the first mapping function and the second mapping function by adopting negative sampling; wherein nodes not connected to the sampled node, selected under a uniform distribution, are negative nodes.
8. The cross-modal content retrieval system of claim 6, wherein in the sample retrieval module the distance is the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ between the retrieval implicit representation and the target implicit representation, or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$, where $x_i$ is the retrieval implicit representation and $x_j$ is the target implicit representation.
9. A cross-modal content retrieval system according to claim 6, wherein the modalities of the first modality sample comprise a visual modality, an auditory modality, and a text modality, and the modalities of the second modality sample comprise a visual modality, an auditory modality, and a text modality.
10. The cross-modal content retrieval system of claim 9, wherein if the modality of the first modality sample or the second modality sample is a visual modality, the feature vector of that sample is a scale-invariant feature transform (SIFT) feature, a convolutional neural network feature of the visual modality, or a histogram of oriented gradients (HOG) feature; and if the modality of the first modality sample or the second modality sample is a text modality, the feature vector of that sample is a term frequency-inverse document frequency (TF-IDF) feature or a deep convolutional/recurrent neural network feature of the text modality.
CN201811156579.5A 2018-09-30 2018-09-30 Cross-modal content retrieval method and system based on semantic preservation Active CN109284414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811156579.5A CN109284414B (en) 2018-09-30 2018-09-30 Cross-modal content retrieval method and system based on semantic preservation

Publications (2)

Publication Number Publication Date
CN109284414A CN109284414A (en) 2019-01-29
CN109284414B true CN109284414B (en) 2020-12-04

Family

ID=65182054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811156579.5A Active CN109284414B (en) 2018-09-30 2018-09-30 Cross-modal content retrieval method and system based on semantic preservation

Country Status (1)

Country Link
CN (1) CN109284414B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886326B (en) * 2019-01-31 2022-01-04 深圳市商汤科技有限公司 Cross-modal information retrieval method and device and storage medium
CN110222560B (en) * 2019-04-25 2022-12-23 西北大学 Text person searching method embedded with similarity loss function
CN111813967B (en) * 2020-07-14 2024-01-30 中国科学技术信息研究所 Retrieval method, retrieval device, computer equipment and storage medium
CN112100410A (en) * 2020-08-13 2020-12-18 中国科学院计算技术研究所 Cross-modal retrieval method and system based on semantic condition association learning
CN114996511B (en) * 2022-04-22 2024-08-20 北京爱奇艺科技有限公司 Training method and device for cross-modal video retrieval model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611015B (en) * 2015-10-27 2020-08-28 北京百度网讯科技有限公司 Label processing method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7089543B2 (en) * 2001-07-13 2006-08-08 Sony Corporation Use of formal logic specification in construction of semantic descriptions
WO2010120941A3 (en) * 2009-04-15 2011-01-20 Evri Inc. Automatic mapping of a location identifier pattern of an object to a semantic type using object metadata
CN103049526A (en) * 2012-12-20 2013-04-17 中国科学院自动化研究所 Cross-media retrieval method based on double space learning
CN105205096A (en) * 2015-08-18 2015-12-30 天津中科智能识别产业技术研究院有限公司 Text modal and image modal crossing type data retrieval method
CN107273517A (en) * 2017-06-21 2017-10-20 复旦大学 Picture and text cross-module state search method based on the embedded study of figure
CN107330100A (en) * 2017-07-06 2017-11-07 北京大学深圳研究生院 Combine the two-way search method of image text of embedded space based on multi views
CN107633263A (en) * 2017-08-30 2018-01-26 清华大学 Network embedding grammar based on side

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Online Asymmetric Similarity Learning for Cross-Modal Retrieval";Yiling,Wu等;《2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)》;20171109;第3984-3993页 *
"异质媒体分析技术研究进展";王树徽等;《集成技术》;20150331;第4卷(第2期);第8-19页 *

Also Published As

Publication number Publication date
CN109284414A (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN109284414B (en) Cross-modal content retrieval method and system based on semantic preservation
US11270225B1 (en) Methods and apparatus for asynchronous and interactive machine learning using word embedding within text-based documents and multimodal documents
CN112182245B (en) Knowledge graph embedded model training method and system and electronic equipment
US20190325342A1 (en) Embedding multimodal content in a common non-euclidean geometric space
Liu et al. Image annotation via graph learning
US7890512B2 (en) Automatic image annotation using semantic distance learning
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN112215837B (en) Multi-attribute image semantic analysis method and device
WO2021139247A1 (en) Construction method, apparatus and device for medical domain knowledge map, and storage medium
Zhang et al. Social image tagging using graph-based reinforcement on multi-type interrelated objects
US20230237093A1 (en) Video recommender system by knowledge based multi-modal graph neural networks
CN111160564A (en) Chinese knowledge graph representation learning method based on feature tensor
Syed et al. Selecting priors for latent Dirichlet allocation
Wang et al. Image tag refinement by regularized latent Dirichlet allocation
Su et al. Semi-supervised knowledge distillation for cross-modal hashing
CN115689672A (en) Chat type commodity shopping guide method and device, equipment and medium thereof
Zhou et al. Rank2vec: learning node embeddings with local structure and global ranking
Ning et al. Integration of image feature and word relevance: Toward automatic image annotation in cyber-physical-social systems
CN111506832B (en) Heterogeneous object completion method based on block matrix completion
CN111831847B (en) Similar picture set recommendation method and system
CN116861923B (en) Implicit relation mining method, system, computer and storage medium based on multi-view unsupervised graph contrast learning
Tang et al. A Cross-Domain Multimodal Supervised Latent Topic Model for Item Tagging and Cold-Start Recommendation
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
CN112199531B (en) Cross-modal retrieval method and device based on hash algorithm and neighborhood graph
Sun et al. Enabling 5G: sentimental image dominant graph topic model for cross-modality topic detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant