
CN109284414B - Cross-modal content retrieval method and system based on semantic preservation - Google Patents

Cross-modal content retrieval method and system based on semantic preservation

Info

Publication number
CN109284414B
Authority
CN
China
Prior art keywords
sample
retrieval
node
mode
modality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811156579.5A
Other languages
Chinese (zh)
Other versions
CN109284414A (en)
Inventor
王树徽
吴益灵
黄庆明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201811156579.5A priority Critical patent/CN109284414B/en
Publication of CN109284414A publication Critical patent/CN109284414A/en
Application granted granted Critical
Publication of CN109284414B publication Critical patent/CN109284414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a cross-modal content retrieval method based on semantic preservation, which comprises the following steps: respectively constructing a first feature graph and a second feature graph with the feature vectors of the first modality samples and the second modality samples as nodes; extracting the label vectors of all samples as nodes to construct a semantic graph; acquiring the neighbor nodes of each node; respectively constructing a first mapping function and a second mapping function for mapping the first modality samples and the second modality samples into implicit representations; learning the mapping functions so as to approximately maximize the likelihood of the occurrence of the neighbor nodes of each node and to enable each implicit representation to reconstruct the label information of the corresponding node; mapping the retrieval sample into a retrieval implicit representation with the first mapping function, and mapping each second modality sample into a target implicit representation with the second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all second modality samples corresponding to distances smaller than a retrieval threshold as the retrieval result.

Description

Cross-modal content retrieval method and system based on semantic preservation
Technical Field
The invention relates to a cross-modal retrieval technology in the multimedia field, in particular to a cross-modal content retrieval technology.
Background
With the development of multimedia technology, data of various modalities are widely present on the internet, and cross-modal retrieval has become one of the important research topics in the multimedia field. In a traditional single-modality retrieval system, the query samples and the retrieval results are limited to a single modality, which cannot meet the growing requirements of users. A cross-modal retrieval system differs in that the query sample and the retrieval results belong to different modalities; for example, image, video, or audio samples can be used as query samples to retrieve text content. Cross-modal retrieval technology thus provides a more convenient retrieval mode: users can conveniently acquire the information they need across multiple modalities, improving the user experience. Because the query sample and the retrieval results belong to different modalities, how to compare the semantic similarity of samples in different modalities is a problem worthy of research.
Due to the heterogeneity of different modalities, the key to cross-modal retrieval is how to correlate them. Currently, most cross-modal retrieval algorithms map samples of different modalities into a low-dimensional implicit space. According to the type of implicit representation learned, these algorithms can be divided into real-valued-representation and binary-representation cross-modal retrieval methods; according to the information they use, they can be divided into unsupervised and supervised methods. Unsupervised methods use only the co-occurrence information of samples in different modalities, while supervised methods also use the label information carried by the samples. In general, the more information a cross-modal retrieval algorithm uses, the better it works.
The label information can serve as high-level semantic information to guide the establishment of relationships between samples of different modalities: although samples of different modalities have different feature spaces, they share the same label space. In existing methods, label information is used as another modality, either to compute similarity-related image-text pairs or as the representation of the implicit space. Existing methods use label information in a simple way, considering only the correlation between modalities and ignoring the correlation within modalities, even though intra-modality correlation information is very important. Within the same modality, samples with similar semantics should have similar implicit representations, and across modalities, samples with similar semantics should also have similar implicit representations; this consistency ensures that all semantically similar samples obtain similar implicit representations. We therefore consider creating a semantic graph containing all samples to provide high-level semantic constraints, adding two feature graphs containing the samples of the respective modalities to provide manifold constraints, and reconstructing the label information to provide global semantic constraints. In addition, when the number of nodes is M, the conventional graph-based approach requires creating a graph of complexity O(M²), and solving it requires a complex eigenvalue decomposition process, so learning the graph structure calls for a more efficient algorithm.
Disclosure of Invention
In order to solve the above problems, the invention discloses a cross-modal content retrieval method and system based on semantic preservation, the method comprising the following steps: constructing a retrieval set from the first modality samples, and constructing a target set from the second modality samples; extracting the feature vectors of the first modality samples as nodes to construct a first feature graph; extracting the feature vectors of the second modality samples as nodes to construct a second feature graph; extracting the label vectors of the label information of all samples in the retrieval set and the target set as nodes to construct a semantic graph; acquiring the neighbor nodes of each node; constructing a first mapping function for mapping the first modality samples to implicit representations and a second mapping function for mapping the second modality samples to implicit representations; learning the first mapping function and the second mapping function so as to approximately maximize the likelihood of the occurrence of the neighbor nodes of each node and to enable each implicit representation to reconstruct the label information of the corresponding node; taking a certain first modality sample as a retrieval sample, mapping the retrieval sample to a retrieval implicit representation through the first mapping function, and mapping each second modality sample to a target implicit representation through the second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all second modality samples whose corresponding distances are smaller than a retrieval threshold as the retrieval result of the retrieval sample.
The invention relates to a cross-modal content retrieval method, wherein neighbor sampling and negative sampling are adopted to learn the first mapping function and the second mapping function: a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes, nodes connected to the sampled node are sampled from the multinomial distribution as neighbor nodes, and nodes not connected to the sampled node are selected under a uniform distribution as negative nodes.
The invention relates to a cross-modal content retrieval method, wherein the distance is the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ between the retrieval implicit representation and the target implicit representation, or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$, where $x_i$ is the retrieval implicit representation and $x_j$ is the target implicit representation.
The invention relates to a cross-modal content retrieval method, wherein the modalities of a first modality sample comprise a visual modality, an auditory modality and a text modality, and the modalities of a second modality sample comprise a visual modality, an auditory modality and a text modality.
The invention relates to a cross-modal content retrieval method, wherein if the modality of the first modality sample and/or the second modality sample is a visual modality, the feature vector of that sample is a scale-invariant feature transform (SIFT) feature, a convolutional neural network feature of the visual modality, or a histogram of oriented gradients (HOG) feature; and if the modality of the first modality sample and/or the second modality sample is a text modality, the feature vector of that sample is a term frequency-inverse document frequency (TF-IDF) feature or a deep convolutional/recurrent neural network feature of the text modality.
The invention also discloses a cross-modal content retrieval system based on semantic preservation, which comprises:
the sample set construction module is used for constructing a retrieval set by using the first modality samples and constructing a target set by using the second modality samples;
the feature graph construction module is used for constructing a first feature graph, a second feature graph, and a semantic graph, and for obtaining the neighbor nodes of each node; wherein the feature vector of the first modality sample is extracted as a node to construct the first feature graph, the feature vector of the second modality sample is extracted as a node to construct the second feature graph, and the label vectors of the label information of all samples in the retrieval set and the target set are extracted as nodes to construct the semantic graph;
the mapping function learning module is used for constructing a mapping function and learning the mapping function; wherein a first mapping function for mapping the first modality sample to an implicit representation and a second mapping function for mapping the second modality sample to an implicit representation are constructed; learning the first mapping function and the second mapping function to approximately maximize the likelihood of occurrence of the neighbor node of each node, and enabling each implicit expression to reconstruct the corresponding label information of the corresponding node;
the sample retrieval module is used for acquiring a retrieval result; wherein, a certain first mode sample is used as a retrieval sample, the retrieval sample is mapped into a retrieval implicit expression by the first mapping function, and each second mode sample is mapped into a target implicit expression by the second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all the second modal samples corresponding to the distances smaller than a retrieval threshold value as retrieval results of the retrieval samples.
The invention relates to a cross-modal content retrieval system, wherein a mapping function learning module comprises:
the neighbor sampling module is used for learning the first mapping function and the second mapping function by adopting neighbor sampling; wherein a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes, and the nodes connected to the sampled node that are sampled from this multinomial distribution are its neighbor nodes;
a negative sampling module for learning the first mapping function and the second mapping function by adopting negative sampling; wherein nodes not connected to the sampled node, selected under a uniform distribution, are negative nodes.
In the cross-modal content retrieval system of the present invention, in the sample retrieval module, the distance is the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ between the retrieval implicit representation and the target implicit representation, or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$, where $x_i$ is the retrieval implicit representation and $x_j$ is the target implicit representation.
The invention relates to a cross-modal content retrieval system, wherein the modalities of the first modality sample comprise a visual modality, an auditory modality and a text modality, and the modalities of the second modality sample comprise a visual modality, an auditory modality and a text modality.
In the cross-modal content retrieval system provided by the invention, if the modality of the first modality sample and/or the second modality sample is a visual modality, the feature vector of that sample is a scale-invariant feature transform (SIFT) feature, a convolutional neural network feature of the visual modality, or a histogram of oriented gradients (HOG) feature; and if the modality of the first modality sample and/or the second modality sample is a text modality, the feature vector of that sample is a term frequency-inverse document frequency (TF-IDF) feature or a deep convolutional/recurrent neural network feature of the text modality.
Drawings
FIG. 1 is a flowchart of a cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the feature graphs and the semantic graph of the cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention.
Fig. 3 is a mapping function diagram of a cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of a cross-modal content retrieval system based on semantic preservation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the cross-modal content retrieval method and system based on semantic preservation according to the present invention are further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a cross-modal retrieval method based on semantic preservation that involves multiple modalities. For convenience of description, the embodiment of the present invention involves only two modalities, text and image, but it should be understood that the cross-modal content retrieval method according to the present invention can be widely applied to modalities such as text, vision, and hearing, as well as to multi-modal data such as video, and is not limited to the above modalities. The cross-modal retrieval method is roughly divided into three steps: first, original features are extracted from each sample by feature extraction; then a mapping function is learned to map each sample from its original features to an implicit representation; finally, the distances between the implicit representation of the retrieval sample and the implicit representations of the samples in the target set are calculated, the samples are sorted by distance, and the target-set samples whose distance to the retrieval sample is smaller than a threshold are selected as the retrieval result.
FIG. 1 is a flowchart of a cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention. As shown in fig. 1, in the embodiment of the present invention, the cross-modal search method based on semantic preservation specifically includes:
step S1, a search set and a target set are constructed, wherein the samples of the search set all have a first modality, which is called a first modality sample, the samples of the target set all have a second modality, which is called a second modality sample, the modalities of the first modality sample and the second modality sample include a visual modality, an auditory modality, a text modality, and the like, and may also be a multi-modality including a visual modality and an auditory modality, such as a video modality, and the like, which is not limited herein; the first mode sample and the second mode sample have different modes, in the embodiment of the present invention, the first mode is an image mode, and the second mode is a text mode;
step S2, extracting the feature vectors of all the first modality samples as nodes to construct a first feature graph; extracting the feature vectors of all the second modality samples as nodes to construct a second feature graph; extracting the label information of the semantic labels of all the first modality samples and second modality samples as label vectors, and constructing a semantic graph with each label vector as a node; in the embodiment of the present invention, where the first modality sample is an image sample and the second modality sample is a text sample, the feature vectors of the image samples and the text samples are extracted first; the feature vector of an image sample may be a SIFT (scale-invariant feature transform) feature, a CNN (convolutional neural network) feature of the visual modality, or a HOG (histogram of oriented gradients) feature, and the feature vector of a text sample may be a TF-IDF (term frequency-inverse document frequency) feature or a CNN/RNN (deep convolutional/recurrent neural network) feature of the text modality, which is not limited herein, as sketched below;
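For concreteness, a minimal sketch of this feature-extraction step follows, using scikit-learn's TfidfVectorizer for the text modality; the toy corpus, the placeholder image features, and all variable names are illustrative assumptions rather than part of the patent.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Text-modality features: TF-IDF vectors computed from the raw documents
# (toy corpus, an assumption for illustration).
docs = ["a dog runs on the grass", "a cat sleeps on the sofa"]
tfidf = TfidfVectorizer()
text_feats = tfidf.fit_transform(docs).toarray()   # shape: (n_texts, vocab_size)

# Image-modality features: a random placeholder here; in practice SIFT/HOG
# descriptors or activations of a pretrained CNN would be extracted instead.
image_feats = np.random.randn(2, 4096)
```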
Fig. 2 is a schematic diagram of the feature graphs and the semantic graph of the cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention. As shown in Fig. 2, three graphs are created from the first modality samples and the second modality samples: a semantic graph Gs, a first feature graph (the image feature graph Gt), and a second feature graph (the text feature graph Gi); each label vector extracted from the semantic labels of the text samples and image samples is a node in the semantic graph Gs;
step S3, acquiring the neighbor nodes of each node through the three graphs, namely the semantic graph, the first feature graph, and the second feature graph; because the semantic graph Gs contains the semantic labels of both the text samples and the image samples, it captures both inter-modality and intra-modality semantic information. The connections between the nodes of the semantic graph are established by taking the label information of the image samples and text samples as label vectors, using either of the following two methods:
the first method is that if and only if the label vectors of two nodes in the semantic graph have at least one value with the same dimension being not 0, an edge is established between the two nodes, the vector similarity is calculated according to the label vectors as the weight of the edge between the nodes, and the cosine similarity can be used
Figure BDA0001819064100000061
Or using exponential similarity
Figure BDA0001819064100000062
Where z isi、zjLabel vectors for nodes i and j, respectively, and sigma is a width coefficient;
the second method uses the existing knowledge graph to establish the connection of each node of the semantic graph, for example, find the corresponding concept of the label of the image sample and the text sample in the word network (WordNet), and uses the similarity of the entity in the knowledge graph, such as the shortest path, etc., as the weight of the edge between the nodes in the semantic graph; for the case of multiple labels, the similarity between all labels needs to be averaged to be used as the weight of the edges in the semantic graph Gs; in the first feature map (image feature map), for any two nodes, the feature vector of the image is used for calculating the distance, if one node is a k neighbor node of the other node, the two nodes have connection, and the weight of the edge is 1; in the second feature map (text feature map), for any two nodes, the distance is calculated by using the feature vector of the text, if one node is a k neighbor node of the other node, the two nodes have connection, and the weight of the edge is 1;
step S4, constructing a first mapping function for mapping the first modality samples into implicit representations, and constructing a second mapping function for mapping the second modality samples into implicit representations; learning the first mapping function and the second mapping function to approximately maximize the likelihood of the occurrence of the neighbor nodes of each node, and enabling each implicit representation to reconstruct the label information of the corresponding node. The implicit representation of an image sample $v_i$ is denoted $f_v(v_i)$, that of a text sample $t_i$ is denoted $f_t(t_i)$, and the two are collectively denoted $f(n_i)$, where $n_i$ is an image sample or a text sample; in order to preserve the local structures of the semantic graph, the first feature graph, and the second feature graph, the probability of the occurrence of the neighbor nodes of each node is maximized on each graph respectively;
for node niOne set of sampled neighbor samples P (n)i) To maximize the probability
Figure BDA0001819064100000063
Where V is the set of all nodes in the semantic graph, the first feature graph and the second feature graph, P (n)i) Representation node niThe samples corresponding to the neighbor nodes of (1) are neighbor samples, and T represents the transposition of the vector;
when the number of nodes is large, sampling negative samples will result in the aforementioned probability Pr (P (n)i)|ni) Relaxation to minimize losses
Figure BDA0001819064100000071
N(ni) Representation node niA negative sample of (d); the neighbour samples being obtained by neighbour sampling, i.e. from each neighbour node to node niThe weights of the edges of (a) establish a polynomial distribution from which the neighbor nodes are sampled; obtaining negative samples by negative sampling, i.e. selecting n according to uniform distributioniNodes without connections are the most negative examples; in the three graphs, similar neighbor sampling and negative sampling are adopted to ensure local structure, and G in the above formula can be a semantic graph GsText feature graph GiOr image feature map GtI.e. to the image sample viCan obtain
Figure BDA0001819064100000072
For text sample tiCan obtain
Figure BDA0001819064100000073
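The neighbor sampling and negative sampling just described might be implemented as in the sketch below, where `W` is assumed to be the weighted adjacency matrix of one of the three graphs; the function names and sample counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(i, W, num_neighbors=5):
    """Neighbor sampling: draw neighbors of node i from the multinomial
    distribution defined by the weights of the edges incident to i."""
    weights = W[i]
    if weights.sum() == 0:
        return np.array([], dtype=int)   # isolated node: nothing to sample
    probs = weights / weights.sum()
    return rng.choice(len(weights), size=num_neighbors, replace=True, p=probs)

def sample_negatives(i, W, num_negatives=5):
    """Negative sampling: draw nodes not connected to i under a uniform
    distribution."""
    candidates = np.flatnonzero(W[i] == 0)
    candidates = candidates[candidates != i]
    if candidates.size == 0:
        return np.array([], dtype=int)
    return rng.choice(candidates, size=num_negatives, replace=True)
```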
In addition, a global semantic preservation condition is introduced: the semantic label information should be recoverable from the implicit representation obtained by the mapping. Let $g(\cdot)$ be a function from the implicit representation to the semantic label; the loss for global semantic preservation is

$$L_{label}(n_i) = \big\lVert g\big(f(n_i)\big) - y_{n_i} \big\rVert^2,$$

where $y_{n_i}$ is the semantic label of node $n_i$;
in general, for an image sample viThe optimized loss is:
Figure BDA0001819064100000076
whereinAlpha and beta are equilibrium coefficients; similarly, for text sample tiThe optimized loss is:
Figure BDA0001819064100000077
in order to model the non-linear relationship between the original features and the implicit representation, the present invention employs the structure of a neural network. Fig. 3 is a mapping function diagram of a cross-modal content retrieval method based on semantic preservation according to an embodiment of the present invention. As shown in FIG. 3, fv(·)、ftMapping text and images to a unified implicit representation space, and then mapping from g to a semantic tag space, the form of the network can be different for different specific application scenarios, such as fv(·)、ftThe number of layers (·), g (·) may be increased or decreased; finally, optimizing a loss function by using a random gradient descent method and an error back propagation algorithm, and learning a mapping function;
step S5, calculating the implicit representation of each sample according to the learned mapping functions; for a given first modality sample (the retrieval sample) in the retrieval set, calculating the distance between its implicit representation and the implicit representation of each second modality sample in the target set, where the distance can be the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$ (the invention is not limited thereto), where $x_i$ and $x_j$ denote the implicit representations of the first modality sample and of a second modality sample, respectively; sorting all the obtained distances from small to large, and selecting the first N second modality samples in the sorted order as the retrieval result of the retrieval sample according to a preset retrieval threshold N.
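This retrieval step might look as follows in NumPy, with `f_query` and `f_target` standing in for the learned mapping functions; the function signature and the toy usage are assumptions for illustration.

```python
import numpy as np

def retrieve(query_feat, target_feats, f_query, f_target, n=10, metric="cosine"):
    """Map the query and all target samples into the implicit space with the
    learned mapping functions, then return the indices of the n targets with
    the smallest distance to the query."""
    q = f_query(query_feat)                             # retrieval implicit representation
    T = np.stack([f_target(t) for t in target_feats])   # target implicit representations
    if metric == "cosine":
        dists = 1.0 - (T @ q) / (np.linalg.norm(T, axis=1) * np.linalg.norm(q) + 1e-12)
    else:                                               # squared Euclidean distance
        dists = np.sum((T - q) ** 2, axis=1)
    return np.argsort(dists)[:n]

# Toy usage with identity mappings (assumption):
idx = retrieve(np.random.randn(8), np.random.randn(100, 8),
               f_query=lambda x: x, f_target=lambda t: t, n=5)
```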
The invention also discloses a cross-modal content retrieval system based on semantic preservation. Fig. 4 is a schematic diagram of the cross-modal content retrieval system based on semantic preservation according to an embodiment of the present invention. As shown in Fig. 4, the cross-modal content retrieval system of the present invention comprises a sample set construction module, a feature graph construction module, a mapping function learning module, and a sample retrieval module. The sample set construction module is used for constructing a retrieval set and a target set, where the samples of the retrieval set have a first modality and are called first modality samples, and the samples of the target set have a second modality and are called second modality samples. The feature graph construction module is used for extracting the feature vectors of all first modality samples as nodes to construct a first feature graph, extracting the feature vectors of all second modality samples as nodes to construct a second feature graph, extracting the label vectors of the label information of all first modality samples and second modality samples as nodes to construct a semantic graph, and acquiring the neighbor nodes of each node. The mapping function learning module is used for constructing a first mapping function that maps a first modality sample to an implicit representation and a second mapping function that maps a second modality sample to an implicit representation, and for learning the two mapping functions so as to approximately maximize the likelihood of the occurrence of the neighbor nodes of each node while enabling each implicit representation to reconstruct the label information of the corresponding node. The sample retrieval module is used for obtaining the retrieval result: a certain first modality sample is taken as the retrieval sample and mapped to a retrieval implicit representation through the first mapping function, each second modality sample in the target set is mapped to a target implicit representation through the second mapping function, the distances between the retrieval implicit representation and each target implicit representation are obtained and sorted from small to large, and the first N second modality samples in the sorted order are selected as the retrieval result of the retrieval sample according to a preset retrieval threshold N.

Claims (10)

1. A cross-modal content retrieval method based on semantic preservation is characterized by comprising the following steps:
constructing a retrieval set by using the first mode sample, and constructing a target set by using the second mode sample;
extracting the feature vector of the first modality sample as a node to construct a first feature graph; extracting the feature vector of the second modality sample as a node to construct a second feature graph; extracting the label vectors of the label information of all samples in the retrieval set and the target set as nodes to construct a semantic graph; acquiring a neighbor node of each node;
constructing a first mapping function for mapping the first modality sample to an implicit representation and a second mapping function for mapping the second modality sample to an implicit representation; learning the first mapping function and the second mapping function to approximately maximize the likelihood of occurrence of the neighbor node of each node and enable each implicit expression to reconstruct the corresponding label information of the corresponding node;
taking a certain first modal sample as a retrieval sample, mapping the retrieval sample into a retrieval implicit expression through a first mapping function, and mapping each second modal sample into a target implicit expression through a second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all the second modal samples corresponding to the distances smaller than a retrieval threshold value as retrieval results of the retrieval samples.
2. The cross-modal content retrieval method of claim 1, wherein the first mapping function and the second mapping function are learned using neighbor sampling and negative sampling: a multinomial distribution is established based on the weights of the edges from a sampled node to its adjacent nodes, nodes connected to the sampled node are sampled from the multinomial distribution as neighbor nodes, and nodes not connected to the sampled node are selected under a uniform distribution as negative nodes.
3. The method of claim 1, wherein the distance is the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ between the retrieval implicit representation and the target implicit representation, or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$, where $x_i$ is the retrieval implicit representation and $x_j$ is the target implicit representation.
4. A cross-modal content retrieval method according to claim 1, wherein the modalities of the first modality sample include a visual modality, an auditory modality, and a text modality, and the modalities of the second modality sample include a visual modality, an auditory modality, and a text modality.
5. The method according to claim 4, wherein if the modality of the first modality sample or the second modality sample is a visual modality, the feature vector of that sample is a scale-invariant feature transform (SIFT) feature, a convolutional neural network feature of the visual modality, or a histogram of oriented gradients (HOG) feature; and if the modality of the first modality sample or the second modality sample is a text modality, the feature vector of that sample is a term frequency-inverse document frequency (TF-IDF) feature or a deep convolutional/recurrent neural network feature of the text modality.
6. A cross-modal content retrieval system based on semantic preservation, comprising:
the sample set construction module is used for constructing a retrieval set by using the first modality samples and constructing a target set by using the second modality samples;
the feature graph construction module is used for constructing a first feature graph, a second feature graph, and a semantic graph, and for obtaining the neighbor nodes of each node; wherein the feature vector of the first modality sample is extracted as a node to construct the first feature graph, the feature vector of the second modality sample is extracted as a node to construct the second feature graph, and the label vectors of the label information of all samples in the retrieval set and the target set are extracted as nodes to construct the semantic graph;
the mapping function learning module is used for constructing a mapping function and learning the mapping function; wherein a first mapping function for mapping the first modality sample to an implicit representation and a second mapping function for mapping the second modality sample to an implicit representation are constructed; learning the first mapping function and the second mapping function to approximately maximize the likelihood of occurrence of the neighbor node of each node and enable each implicit expression to reconstruct the corresponding label information of the corresponding node;
the sample retrieval module is used for acquiring a retrieval result; wherein, a certain first mode sample is used as a retrieval sample, the retrieval sample is mapped into a retrieval implicit expression by the first mapping function, and each second mode sample is mapped into a target implicit expression by the second mapping function; and acquiring the distance between the retrieval implicit representation and each target implicit representation, and taking all the second modal samples corresponding to the distances smaller than a retrieval threshold value as retrieval results of the retrieval samples.
7. The cross-modal content retrieval system of claim 6, wherein the mapping function learning module comprises:
the neighbor sampling module is used for learning the first mapping function and the second mapping function by adopting neighbor sampling; wherein a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes, and the nodes connected to the sampled node that are sampled from this multinomial distribution are its neighbor nodes;
the negative sampling module is used for approximately learning the first mapping function and the second mapping function by adopting negative sampling; wherein nodes not connected to the sampled node, selected under a uniform distribution, are negative nodes.
8. The cross-modal content retrieval system of claim 6, wherein in the sample retrieval module the distance is the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert^2$ between the retrieval implicit representation and the target implicit representation, or the cosine distance $d(x_i, x_j) = 1 - \frac{x_i^{\top} x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$, where $x_i$ is the retrieval implicit representation and $x_j$ is the target implicit representation.
9. A cross-modal content retrieval system according to claim 6, wherein the modalities of the first modality sample comprise a visual modality, an auditory modality, and a text modality, and the modalities of the second modality sample comprise a visual modality, an auditory modality, and a text modality.
10. The cross-modal content retrieval system of claim 9, wherein if the modality of the first modality sample or the second modality sample is a visual modality, the feature vector of that sample is a scale-invariant feature transform (SIFT) feature, a convolutional neural network feature of the visual modality, or a histogram of oriented gradients (HOG) feature; and if the modality of the first modality sample or the second modality sample is a text modality, the feature vector of that sample is a term frequency-inverse document frequency (TF-IDF) feature or a deep convolutional/recurrent neural network feature of the text modality.
CN201811156579.5A 2018-09-30 2018-09-30 Cross-modal content retrieval method and system based on semantic preservation Active CN109284414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811156579.5A CN109284414B (en) 2018-09-30 2018-09-30 Cross-modal content retrieval method and system based on semantic preservation

Publications (2)

Publication Number Publication Date
CN109284414A CN109284414A (en) 2019-01-29
CN109284414B true CN109284414B (en) 2020-12-04

Family

ID=65182054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811156579.5A Active CN109284414B (en) 2018-09-30 2018-09-30 Cross-modal content retrieval method and system based on semantic preservation

Country Status (1)

Country Link
CN (1) CN109284414B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886326B (en) * 2019-01-31 2022-01-04 深圳市商汤科技有限公司 Cross-modal information retrieval method and device and storage medium
CN110222560B (en) * 2019-04-25 2022-12-23 西北大学 Text person searching method embedded with similarity loss function
CN111813967B (en) * 2020-07-14 2024-01-30 中国科学技术信息研究所 Retrieval method, retrieval device, computer equipment and storage medium
CN112100410A (en) * 2020-08-13 2020-12-18 中国科学院计算技术研究所 Cross-modal retrieval method and system based on semantic condition association learning
CN114996511B (en) * 2022-04-22 2024-08-20 北京爱奇艺科技有限公司 Training method and device for cross-modal video retrieval model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611015B (en) * 2015-10-27 2020-08-28 北京百度网讯科技有限公司 Label processing method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7089543B2 (en) * 2001-07-13 2006-08-08 Sony Corporation Use of formal logic specification in construction of semantic descriptions
WO2010120941A3 (en) * 2009-04-15 2011-01-20 Evri Inc. Automatic mapping of a location identifier pattern of an object to a semantic type using object metadata
CN103049526A (en) * 2012-12-20 2013-04-17 中国科学院自动化研究所 Cross-media retrieval method based on double space learning
CN105205096A (en) * 2015-08-18 2015-12-30 天津中科智能识别产业技术研究院有限公司 Text modal and image modal crossing type data retrieval method
CN107273517A (en) * 2017-06-21 2017-10-20 复旦大学 Picture and text cross-module state search method based on the embedded study of figure
CN107330100A (en) * 2017-07-06 2017-11-07 北京大学深圳研究生院 Combine the two-way search method of image text of embedded space based on multi views
CN107633263A (en) * 2017-08-30 2018-01-26 清华大学 Network embedding grammar based on side

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Online Asymmetric Similarity Learning for Cross-Modal Retrieval";Yiling,Wu等;《2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)》;20171109;第3984-3993页 *
"异质媒体分析技术研究进展";王树徽等;《集成技术》;20150331;第4卷(第2期);第8-19页 *

Also Published As

Publication number Publication date
CN109284414A (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN109284414B (en) Cross-modal content retrieval method and system based on semantic preservation
US11270225B1 (en) Methods and apparatus for asynchronous and interactive machine learning using word embedding within text-based documents and multimodal documents
CN112182245B (en) Knowledge graph embedded model training method and system and electronic equipment
US20190325342A1 (en) Embedding multimodal content in a common non-euclidean geometric space
Liu et al. Image annotation via graph learning
US7890512B2 (en) Automatic image annotation using semantic distance learning
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN112215837B (en) Multi-attribute image semantic analysis method and device
WO2021139247A1 (en) Construction method, apparatus and device for medical domain knowledge map, and storage medium
Zhang et al. Social image tagging using graph-based reinforcement on multi-type interrelated objects
US20230237093A1 (en) Video recommender system by knowledge based multi-modal graph neural networks
CN111160564A (en) Chinese knowledge graph representation learning method based on feature tensor
Syed et al. Selecting priors for latent Dirichlet allocation
Wang et al. Image tag refinement by regularized latent Dirichlet allocation
Su et al. Semi-supervised knowledge distillation for cross-modal hashing
CN115689672A (en) Chat type commodity shopping guide method and device, equipment and medium thereof
Zhou et al. Rank2vec: learning node embeddings with local structure and global ranking
Ning et al. Integration of image feature and word relevance: Toward automatic image annotation in cyber-physical-social systems
CN111506832B (en) Heterogeneous object completion method based on block matrix completion
CN111831847B (en) Similar picture set recommendation method and system
CN116861923B (en) Implicit relation mining method, system, computer and storage medium based on multi-view unsupervised graph contrast learning
Tang et al. A Cross-Domain Multimodal Supervised Latent Topic Model for Item Tagging and Cold-Start Recommendation
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
CN112199531B (en) Cross-modal retrieval method and device based on hash algorithm and neighborhood graph
Sun et al. Enabling 5G: sentimental image dominant graph topic model for cross-modality topic detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant