CN114943017B

CN114943017B - Cross-modal retrieval method based on similarity zero sample hash

Info

Publication number: CN114943017B
Application number: CN202210696434.4A
Authority: CN
Inventors: 舒振球; 永凯玲; 余正涛; 高盛祥; 毛存礼
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2022-06-20
Filing date: 2022-06-20
Publication date: 2024-06-18
Anticipated expiration: 2042-06-20
Also published as: CN114943017A

Abstract

The invention discloses a cross-modal retrieval method based on similarity zero sample hash. A new zero sample hash framework is provided to fully mine supervised semantic information, which combines intra-modal similarity, inter-modal similarity, semantic tags and class attributes to guide the learning of zero sample hash codes. In this framework intra-modality similarity and inter-modality similarity are considered at the same time. Intra-modality similarity is represented by manifold structure and feature similarity of multi-modality data, and inter-modality similarity is represented by semantic correlation between modalities. In addition, the semantic tags and class attributes are embedded into the hash codes, and more discriminative hash codes are learned for each instance. However, due to the embedding of class attributes, the relationship between the visible class and the invisible class can be captured well in the hash code, so that the knowledge of the attributes can be transferred from the visible class into the invisible class. The invention realizes higher-precision retrieval of zero sample cross-modal data.

Description

Cross-modal retrieval method based on similarity zero sample hash

Technical Field

The invention relates to a cross-modal retrieval method based on similarity zero sample hash, and belongs to the field of cross-modal hash retrieval.

Background

Most existing cross-modal hash retrieval methods are studied in a visible class dataset. However, with the explosive growth of multimedia data, a large number of new concepts (invisible classes) emerge. Retraining an existing cross-modal hash model by collecting new conceptual data is not feasible because it would consume a significant amount of time and space. It is therefore necessary to propose a cross-modal hash model in which the training data does not contain new concepts, but which can still handle the new concepts. However, zero sample learning can identify data categories that have never been seen. That is, the trained classifier is not only able to identify existing data categories in the training set, but also to distinguish data from unseen categories. This makes zero sample learning a research hotspot for invisible class retrieval tasks.

Over the past few years, zero sample learning has been widely used in single-mode retrieval tasks. Some researchers implement potential semantic transfers by projecting tags into a word embedding space. Some researchers have proposed a zero sample hash based on an asymmetric ratio similarity matrix to improve knowledge transfer capability from visible to invisible classes. Other researchers have proposed a zero-sample learning model for multi-label image retrieval that predicts data labels of invisible classes using an example concept consistency ranking algorithm. However, the above work is a study on a single-mode search task, and the study on an invisible cross-mode search task is still relatively poor. Under the big data age that new concepts are continuously emerging, the existing cross-mode retrieval method also has the following problems: (1) The existing method only considers visible class data, and ignores invisible class data. Therefore, such a model is not suitable for cross-modal data retrieval in the big data age. (2) Most methods do not use class attribute information in hash code learning, which is detrimental to the transfer of knowledge from visible classes to invisible classes. (3) The existing few zero sample cross-modal retrieval methods fail to train models by using intra-modal similarity, inter-modal similarity, class labels and class attributes at the same time.

Disclosure of Invention

In view of the challenges presented above, the present invention provides a cross-modal retrieval method based on similarity zero sample hashing. The invention is used for solving the problem of cross-mode retrieval containing invisible class data by fusing intra-mode similarity, inter-mode similarity, tag information and class attribute.

In order to achieve the purpose of the invention, the cross-modal retrieval method based on similarity zero sample hash has the technical scheme that: the invention provides a new zero sample hash framework for fully mining and supervising semantic information, which combines intra-mode similarity, inter-mode similarity, semantic tags and class attributes to guide the learning process of zero sample hash codes. In this framework intra-modality similarity and inter-modality similarity are considered at the same time. Intra-modal similarity represents features and semantic similarity among samples in a mode, and inter-modal similarity represents semantic correlation among modes. In addition, the semantic tags and class attributes are embedded into the hash codes, and more discriminative hash codes are learned for each instance. However, due to the embedding of class attributes, the relationship between the visible class and the invisible class can be captured well in the hash code, so that the supervision knowledge can be transferred from the visible class into the invisible class. The invention comprises the following steps:

step1, acquiring a cross-modal dataset, and extracting cross-modal dataset characteristics and class attribute vectors;

Step2, processing a cross-modal dataset: processing the existing cross-modal data set into a cross-modal zero-sample data set; the original data set is divided into a training set and a query set, 20% of the classes of the original data set are randomly selected as invisible classes, and the rest are visible classes. For a zero sample cross-modal retrieval scene, the invention takes a sample pair corresponding to an invisible class in an original query set as a new query set; taking a sample pair corresponding to the visible class in the original training set as a new training set; the search set consists of an original training set;

Step3, learning an objective function: the intra-modal similarity, inter-modal similarity, semantic tags, class attributes, hash codes and hash functions are fused to learn the same framework, so that an objective function is obtained, and hash codes with more discriminant are learned;

Step4, performing iterative updating of the objective function: the variable matrix in the objective function obtained in the last step is updated through iteration until the objective function converges or reaches the maximum iteration number, and a hash function and a hash code of a training set are obtained;

Step5, performing zero sample cross-modal retrieval: and inputting a query sample, and obtaining a hash code of the query sample according to the hash function obtained in Step 4. The hash codes of the query samples are substituted into the retrieval set to query, and because the query is performed in a binary space, the result of the query is obtained by calculating the Hamming distance between the query samples and each sample in the retrieval set. The sample corresponding to the minimum Hamming distance in the search set is the query result obtained for us.

Further, the cross-modality retrieval data set includes a plurality of sample pairs, each sample pair including: text, images, and corresponding semantic tags.

Further, in Step1, extracting image features through a VGG-16 model; extracting text features through a word bag model; extracting class attributes, extracting a corresponding word vector for each class name by a Glove method, and forming a class attribute matrix.

Further, in Step2, in order to ensure generalization capability of the model, each time the data enters the model for training, a random selection method should be used to process and divide the data set. Through multiple training, an average value is taken as the final result.

The intra-modal similarity in Step3 is divided into feature similarity and semantic similarity, wherein the feature similarity is calculated through Euclidean similarity, and the semantic similarity is measured through Jaccard similarity.

Further, the inter-modality similarity in Step3 refers to the semantic similarity between each instance of different modalities, and the semantic similarity is measured by the tag semantic information.

Further, the objective function obtained in Step3 includes two parts, i.e., hash code learning and hash function, where hash code learning refers to learning hash codes by combining intra-modal similarity, inter-modal similarity, semantic tags and class attributes; the learning of the hash function refers to learning the hash function by minimizing the least square regression problem, and the semantic relation between the hash code and the hash function is enhanced by putting the hash code learning and the hash function learning into the same model for learning, so that high-precision zero sample cross-modal retrieval is realized.

Further, the iterative update of the objective function in Step4 is updated by using the objective function obtained in Step4 as an original function. It is clear that the objective function is not optimal and that it needs to be optimized. Since the objective function is a non-convex problem, when other variables are fixed and a matrix variable is updated, the function at this time is a convex problem, so that the update of the objective function is facilitated. In the invention, the matrix variable is updated by adopting the alternate iterative algorithm until the objective function converges or the maximum iterative times are reached, and finally the optimal hash code and the hash function are obtained.

Further, in Step3, establishing intra-modality similarity, inter-modality similarity and hash code connection through a kernel-based supervised hash (KSH) optimization model, and enhancing semantic information in the hash code in a manner that the similarity is embedded in the hash code; establishing a relation between the semantic tag and the class attribute and the hash code in a tag reconstruction mode, embedding the tag into the hash code, and enhancing semantic information contained in the hash code; embedding class attributes in the hash code is realized by transferring attribute knowledge in the visible class to the invisible class so as to realize the retrieval of the invisible class.

Further, in Step4, since the master model is a non-convex problem, it is a difficult problem to directly optimize it. However, when the rest variables are fixed and only one variable is optimized, the new problem converted by the original model is a convex problem, and the optimization solution can be directly carried out. Similarly, each variable is optimized in this way until convergence or the maximum number of iterations is reached, resulting in an optimal result.

(1) The existing method only considers visible class data, and ignores invisible class data. Therefore, such a model is not suitable for cross-modal data retrieval in the big data age. (2) Most methods do not use class attribute information in hash code learning, which is detrimental to the transfer of knowledge from visible classes to invisible classes. (3) The existing few zero sample cross-modal retrieval methods fail to train models by using intra-modal similarity, inter-modal similarity, class labels and class attributes at the same time.

The beneficial effects of the invention are as follows:

The invention provides a cross-modal retrieval method based on similarity zero sample hash. The method overcomes the defect that the existing majority of cross-modal retrieval methods cannot solve the limitation of zero sample data. The hash codes are learned by simultaneously using intra-mode similarity, inter-mode similarity and class attributes, so that the relationship between the visible class and the invisible class can be well captured, and the monitoring knowledge is transferred from the visible class to the invisible class. Furthermore, to take into account the supervised tag information, the present invention improves accuracy by embedding the tag information into the attribute space. Thus, more discriminant hash codes can be generated from the model proposed by the present invention. In addition, the invention provides a discrete optimization scheme to solve the proposed model, thereby effectively avoiding quantization errors.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention.

FIG. 1 is a flow chart of a method according to an embodiment of the invention.

FIG. 2 is a flow chart of iterative updating of the SAZH model of the present invention.

Detailed Description

The following description is exemplary and is intended to further illustrate the embodiments of the present invention with reference to the accompanying drawings.

Example 1

FIG. 1 is a flow chart of a cross-modal retrieval method based on similarity zero sample hashing.

In this example, referring to fig. 1, the method of the present invention specifically includes the following processes:

1. And acquiring a cross-modal dataset, and extracting cross-modal dataset characteristics and class attribute vectors. In this example, the data set used includes both image and text modalities, as well as a label corresponding to one of them. In Step1, extracting the class attribute, extracting a corresponding word vector for each class name by using Glove method to form a class attribute matrix.

2. Processing of cross-modality datasets. Because the problem to be solved by the invention is zero sample cross-modal retrieval, the acquired cross-modal dataset cannot be directly used. The data set should be processed according to the application scenario of the zero sample so as to conform to the application scenario of the zero sample cross-modal retrieval. The specific treatment method comprises the following steps:

the original data set is divided into a training set and a query set, 20% of the classes of the original data set are randomly selected as invisible classes, and the rest are visible classes. For a zero sample cross-modal retrieval scene, the invention takes a sample pair corresponding to an invisible class in an original query set as a new query set; taking a sample pair corresponding to the visible class in the original training set as a new training set; the search set consists of the original training set.

In the present invention, a given set of multimodal data is: wherein/> Is a multi-modal data point,/>For the feature vector corresponding to the ith instance of the image modality,/>For the feature vector corresponding to the ith instance of the text mode, l _i is the common label vector corresponding to the ith instance of the two modes, and n is the total instance number of the data set.

The invention is used for processing and dividing the data setTo represent multimodal data for the training set, where n _s is the number of training samples.

3. The intra-modal similarity, inter-modal similarity, tag information, class attributes, hash codes and hash functions are fused to learn the same framework, so that an objective function is obtained, and hash codes with more discriminant are learned. The learning model of each module will be described in detail below:

3.1, modal similarity learning

Intra-modal similarity is divided into feature similarity, which is calculated by Euclidean similarity, and semantic similarity, which is measured by Jaccard similarity. The Euclidean distance is simple to calculate and reflects the distance between the two vectors, and the method adopts the Euclidean distance as the characteristic similarity measurement. First of all,And/>The Euclidean distance between the two is: /(I)Then/>And/>The similarity is: /(I)Wherein/>And/>The ith and jth samples, respectively, representing the tth modality, and t= {1,2} representation are illustrated in two modalities in the present invention.

Furthermore, we measure semantic similarity with Jaccard similarity as follows:

Wherein, Corresponding to the number of labels assigned to the ith instance in the nth modality. The label of an instance depends on its features, and the semantic similarity is positively correlated with the feature similarity of the respective instance. Therefore, we can combine the feature similarity with the semantic similarity between the data, and can get the following learning model:

Wherein, Is the overall similarity within the modality,/>Representing the feature similarity between two samples,/>Semantic similarity is measured by the Jaccard similarity method.

3.2 Inter-modality similarity learning

Inter-modality similarity refers to semantic similarity among instances among different modalities, and the semantic similarity is measured through label semantic information;

specifically, in the present invention, the inter-modality similarity is calculated by a class label matrix. Is provided with For the corresponding tag matrix, where L _ij =1 means that X _i* belongs to class j, otherwise 0. Furthermore, the inter-modality similarity matrix, expressed asCan be constructed from a tag matrix if/>X _i* and X _j* are described as similar. Otherwise, X _i* and X _j* are dissimilar. Where c represents the number of categories.

3.3 Hash function learning

The hash function in the present invention is learned by minimizing the following least squares regression problem:

Where β is a non-negative parameter, B ₁ and B ₂ correspond to the hash codes of the two modalities of image and text, respectively, and W ₁ and W ₂ correspond to the projection matrices of the two modalities of image and text, respectively.

3.4, Similarity preserving learning

The similarity preserving learning model provided by the invention comprehensively considers intra-mode similarity and inter-mode similarity by combining an optimization model based on Kernel Supervised Hash (KSH). The model expression is as follows:

Wherein S ¹¹ and S ²² are intra-mode similarity matrices of two modes of the image and the text, respectively, and S ¹² is an inter-mode similarity matrix of the two modes of the image and the text.

3.5 Class Attribute and Label embedding

The method embeds the tag information into the hash code, which is helpful for generating the optimized binary code by fully utilizing the tag information, and has stronger robustness when processing data in a large scale. Thus, an optimized hash code is obtained by optimizing the following model:

Where α is a non-negative parameter, C ₁ and C ₂ represent projection matrices that project the two modality hash codes of the image and text, respectively, into the tag.

In addition, class attribute information is added into the proposed model, so that not only can the generation of a hash code with more discriminant be facilitated, but also the transfer of attribute knowledge from visible class to invisible class is mainly realized, and the zero sample cross-modal retrieval problem is solved. The class attribute information is embedded by embedding a class attribute matrix corresponding to each class name into the projection matrix of formula (5). Thus, the tag information and the class attribute information can be simultaneously embedded into the learning of the hash code. Updating the formula (5) to:

Wherein V ₁ and V ₂ represent a transformation projection matrix for projecting two modality hash codes of an image and a text into a tag and adding class attribute information respectively.

3.6 Objective function

The objective function of the invention is obtained by combining the steps as follows:

Wherein, Regularization terms representing the model, the purpose of which is to prevent overfitting; gamma is a parameter that controls the regularization term. X ⁽¹⁾ and X ⁽²⁾ are feature matrices of two modes of an image and a text respectively; y is a label matrix; a is a class attribute matrix; s ¹¹ and S ²² are intra-mode similarity matrices of two modes of an image and a text respectively, and S ¹² is an inter-mode similarity matrix of the two modes of the image and the text; w ₁、W₂、V₁、V₂ is a projection matrix; alpha and beta are non-negative parameters.

4. Performing iterative updating of the objective function: and iteratively updating the objective function obtained in the last step until the objective function converges or the maximum iteration number is reached, so as to obtain a hash function and a hash code of the training set.

The function (7) is not optimal and it needs to be updated iteratively. Obviously, the overall objective function is a non-convex optimization problem. We therefore propose an efficient iterative algorithm to solve this problem.

Specifically, referring to fig. 2, the optimization procedure for equation (7) is as follows:

B ₁ -step: the variable W ₁,W₂,V₁,V₂,B₂ is fixed, so for B ₁, equation (7) can be reduced to:

By setting the partial derivative of B ₁ to zero, a closed-loop solution for B ₁ can be derived. The following are provided:

B ₂ -step: similar to the update procedure of B ₁, a closed solution of B ₂ is obtained. The following are provided:

v ₁ -step: the variable W ₁,W₂,V₂,B₁,B₂ is fixed, so for V ₁, equation (7) can be reduced to:

By setting the partial derivative of V ₁ to zero, we can derive the following formula:

we define B₁₁＝AA^T，/>However, equation (12) can be rewritten as:

A₁₁V₁+V₁B₁₁＝C₁₁ (13)

equation (13) is a Sylvester equation that can be solved using the Sylvester function in MATLAB.

V ₂ -step: similarly, with respect to V ₂, we have:

A₂₂V₂+V₂B₂₂＝C₂₂ (14)

Wherein, B₂₂＝AA^T，/>

W ₁ -step: similarly, with respect to W ₁, we have:

A₃₃W₁+W₁B₃₃＝C₃₃ (15)

Wherein, B₃₃＝B₁ ^TB₁，/>

W ₂ -step: similarly, with respect to W ₂, we have:

A₄₄W₂+W₂B₄₄＝C₄₄ (16)

Wherein, B₄₄＝B₂ ^TB₂，/>

And (3) optimizing the formula (7) through the steps until the function converges or the maximum iteration number is reached, and stopping iteration.

5. Inquiring, and performing zero sample cross-modal retrieval: firstly, obtaining a hash code corresponding to a search set, inputting a query sample, and obtaining the hash code of the query sample according to the hash function obtained in the last step. And substituting the hash codes of the query samples into the retrieval set to query. The specific implementation steps are as follows:

the feature matrix corresponding to the query sample of a given image and text is And/>The projection matrices W ₁ and W ₂ obtained in the previous step are combined. By the formula/>And/>And obtaining the hash code corresponding to the query sample. In this embodiment, we do two main search tasks: image query text and text query images.

Because the query task of the invention is carried out in a binary space, the result of the query is obtained by calculating the Hamming distance between the query sample and each sample in the search set. The sample corresponding to the minimum Hamming distance in the search set is the query result obtained by the user.

In order to illustrate the effects of the present invention, the following describes the technical solution of the present invention through specific embodiments:

1. Simulation conditions

The invention uses Matlab software to perform experimental simulation. Experiments were conducted on a cross-modality dataset Wiki (containing both image and text modalities), the conducted experiments including two query tasks: (1) Text query image (Text 2 Img), (2) image query Text (Img 2 Text). The parameters in the experiment were set to α=1e-2, β=1e5, γ=1e-4.

2. Emulation content

The method is compared with the existing non-zero sample cross-mode hash retrieval method and the zero sample cross-mode hash retrieval method, and the non-zero sample cross-mode hash retrieval method is adopted as the comparison method: (1) Collaborative Matrix Factorization Hash (CMFH), (2) Joint and Individual Matrix Factorization Hash (JIMFH), (3) Discrete Robust Matrix Factorization Hash (DRMFH), (4) Asymmetric Supervised Consistent and Specific Hash (ASCSH), (5) tagged consistent matrix factorization hash (LCMFH); the zero sample single-mode hash retrieval method comprises the following steps: (1) Zero sample hashing (TSK) based on supervised knowledge transfer, (2) attribute hashing Algorithm (AH) for zero sample image retrieval; the zero sample cross-modal hash retrieval method comprises the following steps: (1) Cross-modal property hashing (CMAH), (2) orthogonal hashing algorithm (CHOP) for zero sample cross-modal retrieval. And for the zero sample single-mode hash retrieval method, the hash codes of the two modes of the image and the text are obtained through a single-mode model respectively, and then the following query task is carried out.

3. Simulation results

The comparison method and the experimental result of the method under the data set Wiki are respectively given in the simulation experiment. In order to meet the zero sample cross-modal retrieval scenario, 20% of classes in the random data set Wiki are selected as invisible classes. The total of 8 classes are contained in the data set Wiki, and according to the experimental setting, the embodiment randomly selects two classes as invisible classes from the classes, and the processing mode of the rest data sets is the same as that of the invention.

In this simulation, a widely used index was used to measure the performance of the SASH method and other comparative methods proposed by the present invention. I.e. the average value of the average accuracy (mAP). Given a query and a list of search results, the average Accuracy (AP) is defined as:

Where N is the number of related instances in the search set, P (r) is defined as the accuracy of the r-th search instance, and if the r-th search instance is the true neighbor of the query, δ (r) =1; otherwise δ (r) =0. All queried APs are then averaged to obtain the mAP. The evaluation rule is that the larger the mAP value is, the better the performance is.

The hash code lengths from the simulation experiments are 8 bits, 12 bits, 16 bits and 32 bits, and the corresponding mAP values of the SAZH method and other comparison methods proposed by the present invention are shown in Table 1.

Table 1 mAP values on Text query image (Text 2 Img) task for all methods on Wiki dataset

Table 2 mAP values on image query (Img 2 Text) task for all methods on Wiki dataset

As can be seen from tables 1 and 2, the mAP values in two query tasks of the SAZH method proposed by the present invention in the zero-sample cross-modal retrieval scenario of the Wiki dataset are higher than those of other comparative methods. Further proves the superiority of SAZH method in zero sample cross-modal search.

The foregoing examples merely illustrate specific embodiments of the invention, which are described in greater detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention.

Claims

1. A cross-modal retrieval method based on similarity zero sample hash is characterized in that: the method comprises the following specific steps:

step1, acquiring a cross-modal dataset, and extracting features of the cross-modal dataset and class attributes;

step2, processing a cross-modal dataset: processing the existing cross-modal data set into a cross-modal zero-sample data set;

Step4, performing iterative updating of the objective function: the variable matrix in the objective function obtained by Step3 is updated iteratively until the objective function converges or reaches the maximum iteration number, so that a hash function and a hash code of a training set are obtained;

Step5: zero sample cross-modal retrieval: firstly, obtaining hash codes corresponding to a search set, then obtaining hash codes of a query set through a hash function obtained in Step4, putting the hash codes into the search set for query, obtaining a query result through calculating Hamming distances between the query set and each sample in the search set, and obtaining a final query result by the smallest Hamming distance;

In Step1, extracting class attributes, wherein a Glove method is adopted to extract a corresponding word vector for each class name to form a class attribute matrix;

The objective function obtained in Step3 comprises two parts of hash code learning and a hash function, wherein the hash code learning refers to learning the hash code by combining intra-mode similarity, inter-mode similarity, semantic tags and class attributes; learning of the hash function means that the hash function is learned by minimizing the least square regression problem, and the hash code learning and the hash function learning are put into the same model for learning, so that semantic relation between the hash code and the hash function is enhanced, and high-precision zero sample cross-modal retrieval is realized;

the objective function in Step3 is:

Wherein, Regularization terms representing the model to prevent overfitting; gamma is a parameter for controlling regularization term, and X ⁽¹⁾ and X ⁽²⁾ are feature matrices of two modes of image and text respectively; y is a label matrix; a is a class attribute matrix; s ¹¹ and S ²² are intra-mode similarity matrices of two modes of an image and a text respectively, and S ¹² is an inter-mode similarity matrix of the two modes of the image and the text; w ₁、W₂、V₁、V₂ is a projection matrix; alpha and beta are non-negative parameters, n _s is the number of training samples, and B ₁ and B ₂ correspond to hash codes of two modes of an image and a text respectively.

2. The cross-modal retrieval method based on similarity zero sample hashing according to claim 1, wherein: the specific method of Step2 is as follows: the original data set is divided into a training set and a query set firstly, then 20% of the classes in all classes of the original data set are randomly selected as invisible classes, and the rest classes are visible classes; for a zero sample cross-modal retrieval scene, taking a sample pair corresponding to an invisible class in an original query set as a new query set; taking a sample pair corresponding to the visible class in the original training set as a new training set; the search set consists of the original training set.

3. The cross-modal retrieval method based on similarity zero sample hashing according to claim 1, wherein: the intra-modal similarity in Step3 is divided into feature similarity and semantic similarity, wherein the feature similarity is calculated through Euclidean similarity, and the semantic similarity is measured through Jaccard similarity.

4. The cross-modal retrieval method based on similarity zero sample hashing according to claim 1, wherein: the inter-modality similarity in Step3 refers to the semantic similarity between each instance of different modalities, and the semantic similarity is measured through label semantic information.

5. The cross-modal retrieval method based on similarity zero sample hashing according to claim 1, wherein: the iterative updating of the objective function in Step4 is performed by taking the objective function obtained in Step4 as an original function, and obviously the objective function is not optimal, and the objective function needs to be optimized, because the objective function is a non-convex problem, when other variables are fixed and a matrix variable is updated, the function is a convex problem, so that the updating of the objective function is facilitated; and updating the matrix variables by adopting the alternate iterative algorithm until the objective function converges or the maximum iterative times are reached, and finally obtaining the optimal hash code and the hash function.