
CN113486804B - Object identification method, device, equipment and storage medium - Google Patents

Object identification method, device, equipment and storage medium Download PDF

Info

Publication number
CN113486804B
CN113486804B
Authority
CN
China
Prior art keywords
recognition model
sample
target
training sample
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110769197.5A
Other languages
Chinese (zh)
Other versions
CN113486804A (en)
Inventor
奚昌凤
吴子扬
沙文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202110769197.5A
Publication of CN113486804A
Application granted
Publication of CN113486804B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an object identification method, device, equipment and storage medium. The object identification method comprises: acquiring an image to be identified in a target heterogeneous scene; and identifying an object to be identified in the image to be identified based on one of a plurality of pre-established recognition models. The plurality of recognition models are obtained by training with a training sample set in the target heterogeneous scene, and the parameters of each recognition model are updated according to its corresponding prediction loss. The prediction loss corresponding to each recognition model is determined according to a target classification result of the training sample set; the target classification result of the training sample set is obtained by fusing the classification results of the training sample set on the respective recognition models; and the classification result of the training sample set on one recognition model is determined according to the unique feature vectors extracted from the feature representation vectors determined by that recognition model for each sample in the training sample set. The object recognition method provided by the application has a good recognition effect on images in the target heterogeneous scene.

Description

Object identification method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to an object identification method, an object identification device, and a storage medium.
Background
Current object recognition technology mainly focuses on homogeneous scenes; that is, current object recognition schemes are mainly recognition schemes for images of the same modality. When such a scheme is applied to recognize images of different modalities in a heterogeneous scene, the differences between the modalities in the heterogeneous scene cause the recognition effect of the scheme to be poor.
For example, current face recognition schemes focus on the recognition of face images in homogeneous scenes (i.e., face images of the same modality) and do not address the differences between modalities in heterogeneous scenes. As a result, when a current face recognition scheme is used to recognize face images in a heterogeneous scene, the recognition effect is significantly reduced compared with homogeneous scenes.
Disclosure of Invention
In view of this, the present application provides an object recognition method, device, equipment and storage medium, which are used to solve the problem that current object recognition schemes have a poor recognition effect on images in heterogeneous scenes. The technical scheme is as follows:
an object recognition method, comprising:
Acquiring an image to be identified in a target heterogeneous scene;
identifying an object to be identified in the image to be identified based on one of a plurality of pre-established identification models;
the method comprises the steps that a plurality of recognition models are obtained by training a training sample set in a target heterogeneous scene and target classification results of the training sample set, the target classification results of the training sample set are obtained by fusing classification results of the training sample set on the plurality of recognition models respectively, the classification results of the training sample set on one recognition model are determined according to unique feature vectors extracted from feature expression vectors determined by the recognition model for each sample in the training sample set, and the unique feature vectors are feature vectors capable of uniquely characterizing objects in the corresponding samples.
Optionally, the identifying the object to be identified in the image to be identified based on one of a plurality of pre-established identification models includes:
identifying the image to be identified based on the optimal identification model in the plurality of identification models;
determining an optimal recognition model from the plurality of recognition models, comprising:
inputting a test sample pair in the target heterogeneous scene into each recognition model to obtain the two feature representation vectors determined by each recognition model for the test sample pair, wherein a test sample pair consists of images of two different modalities of the same object;
calculating the similarity of two feature expression vectors determined by each recognition model aiming at the test sample pair so as to obtain the similarity corresponding to each recognition model;
and determining an optimal recognition model from the plurality of recognition models according to the similarity corresponding to each recognition model.
Optionally, the process of establishing the plurality of identification models includes:
determining feature representation vectors corresponding to samples in the training sample set respectively based on each recognition model to obtain a feature representation vector set corresponding to each recognition model;
extracting unique feature vectors from each feature representation vector in the feature representation vector set corresponding to each recognition model to obtain a unique feature vector set corresponding to each recognition model;
classifying samples in the training sample set according to the unique feature vector set corresponding to each recognition model to obtain a classification result of the training sample set on each recognition model;
Fusing the classification results of the training sample set on each recognition model respectively, wherein the fusion result is used as a target classification result of the training sample set;
and determining the prediction loss of each recognition model by taking the target classification result of the training sample set as a basis, and carrying out parameter updating on the corresponding recognition model according to the determined prediction loss.
Optionally, the fusing the classification results of the training sample set on each recognition model includes:
for each of a plurality of sample pairs consisting of samples in the training sample set:
if the two samples in the sample pair belong to the same class in the classification results of the training sample set on every recognition model, determining that the two samples in the sample pair belong to the same class; otherwise, determining that the two samples in the sample pair belong to different classes.
Optionally, for a target recognition model of a predicted loss to be determined, determining the predicted loss of the target recognition model based on a target classification result of the training sample set includes:
obtaining triples corresponding to all samples in the training sample set respectively, wherein the triples corresponding to each sample in the training sample set are constructed according to the target classification result of the training sample set, and the triples corresponding to one sample comprise the sample, the samples belonging to the same class with the sample and the samples belonging to different classes with the sample;
And determining a first prediction loss of the target recognition model according to the triples respectively corresponding to the samples in the training sample set and the unique feature vector set corresponding to the target recognition model.
Optionally, the determining the first prediction loss of the target recognition model according to the triples corresponding to each sample in the training sample set and the unique feature vector set corresponding to the target recognition model includes:
for each sample in the training sample set: calculating a distance between two unique feature vectors which are positioned in a unique feature vector set corresponding to the target recognition model and correspond to a positive example pair corresponding to the sample as a first distance corresponding to the sample on the target recognition model, and calculating a distance between two unique feature vectors which are positioned in the unique feature vector set corresponding to the target recognition model and correspond to a negative example pair corresponding to the sample as a second distance corresponding to the sample on the target recognition model, wherein the positive example pair corresponding to the sample consists of the sample and a sample belonging to the same class in a triplet corresponding to the sample, and the negative example pair corresponding to the sample consists of the sample and a sample belonging to different classes in a triplet corresponding to the sample;
And determining a first prediction loss of the target recognition model according to a first distance, a second distance and a distance threshold value corresponding to each sample in the training sample set on the target recognition model.
Optionally, the process of determining the distance threshold corresponding to a sample on the target recognition model includes:
each recognition model except the target recognition model is used as a non-target recognition model, and the method comprises the following steps:
calculating the distance between the two unique feature vectors that are located in the unique feature vector set corresponding to the non-target recognition model and correspond to the positive example pair corresponding to the sample, as a first distance corresponding to the sample on the non-target recognition model, and calculating the distance between the two unique feature vectors that are located in the unique feature vector set corresponding to the non-target recognition model and correspond to the negative example pair corresponding to the sample, as a second distance corresponding to the sample on the non-target recognition model;
subtracting the first distance corresponding to the sample on the non-target recognition model from the second distance corresponding to the sample on the non-target recognition model, and using the calculated difference as the distance difference corresponding to the sample on the non-target recognition model;
And calculating the average value of the distance differences corresponding to the sample on each non-target recognition model, and determining the distance threshold value corresponding to the sample on the target recognition model according to the calculated average value.
Optionally, the determining the prediction loss of the target recognition model based on the target classification result of the training sample set further includes:
determining class centers of various samples in the training sample set according to the target classification result of the training sample set and the unique feature vector corresponding to the target recognition model;
and determining a second prediction loss of the target recognition model according to the unique feature vector set corresponding to the target recognition model and class centers of various samples in the training sample set.
Optionally, the determining the second prediction loss of the target recognition model according to the unique feature vector set corresponding to the target recognition model and class centers of various samples in the training sample set includes:
for each sample in the training sample set: determining the prediction loss of the target recognition model on the sample according to the unique feature vector corresponding to the sample, the class center of the class to which the sample belongs and the class center of the class different from the class to which the sample belongs;
And determining a second prediction loss of the target recognition model according to the prediction loss of the target recognition model on each sample in the training sample set.
Optionally, the determining the prediction loss of the target recognition model on the sample according to the unique feature vector corresponding to the sample, the class center of the class to which the sample belongs, and the class center of the class different from the class to which the sample belongs includes:
reconstructing an image according to a class center of a class to which the sample belongs, wherein the reconstructed image is used as a first image, and reconstructing an image according to a class center of a class different from the class to which the sample belongs, and the reconstructed image is used as a second image;
acquiring unique feature vectors according to the first image and the second image respectively;
and determining the prediction loss of the target recognition model on the sample according to the unique feature vector corresponding to the sample and the unique feature vectors acquired respectively according to the first image and the second image.
Optionally, the image to be identified is a face image to be identified in a heterogeneous face identification scene;
the plurality of recognition models are a plurality of face recognition models; the training sample set in the target heterogeneous scene is a training face image set in the heterogeneous face recognition scene; the unique feature vector extracted from the feature representation vector determined for each sample in the training sample set by each recognition model is an identity feature vector.
An object recognition apparatus comprising: an image acquisition module and an image recognition module;
the image acquisition module is used for acquiring an image to be identified in the target heterogeneous scene;
the image recognition module is used for recognizing an object to be recognized in the image to be recognized based on one of a plurality of recognition models which are established in advance;
the method comprises the steps that a plurality of recognition models are obtained by training a training sample set in a target heterogeneous scene, each recognition model carries out parameter updating according to corresponding prediction loss, the prediction loss corresponding to each recognition model is determined according to a target classification result of the training sample set, the target classification result of the training sample set is obtained by fusing classification results of the training sample set on the plurality of recognition models respectively, the classification result of the training sample set on one recognition model is determined according to a unique feature vector extracted from feature expression vectors determined by the recognition model for each sample in the training sample set, and the unique feature vector is a feature vector capable of uniquely characterizing an object in the corresponding sample.
An object recognition apparatus comprising: a memory and a processor;
The memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the object recognition method described in any one of the above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the object recognition method of any of the above.
According to the object recognition method, device, equipment and storage medium provided by the application, after the image to be recognized in the target heterogeneous scene is acquired, the object to be recognized in the image to be recognized can be recognized based on one of a plurality of pre-established recognition models. The plurality of recognition models are obtained by training with a training sample set in the target heterogeneous scene and the target classification result of the training sample set; the target classification result of the training sample set is obtained by fusing the classification results of the training sample set on the respective recognition models; and the classification result of the training sample set on each recognition model is determined according to the unique feature vectors extracted from the feature representation vectors determined by that recognition model for each sample in the training sample set. The target classification result of the training sample set is therefore a relatively accurate classification result. Training the plurality of recognition models based on the target classification result of the training sample set in the target heterogeneous scene thus yields a plurality of recognition models that are applicable to the target heterogeneous scene and perform well, and the image to be recognized in the target heterogeneous scene can be recognized based on one of the recognition models obtained by training.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an object recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of establishing a plurality of recognition models according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a training process of multiple recognition models according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an object recognition device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an object recognition device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In view of the fact that current object recognition schemes have a poor recognition effect on images in heterogeneous scenes, the inventor tried to propose a scheme with a good recognition effect on images in heterogeneous scenes and researched it; the initial idea was as follows:
the method comprises the steps of obtaining unlabeled samples in a target heterogeneous scene, classifying the unlabeled samples in the target heterogeneous scene based on a clustering algorithm or a classification model to obtain pseudo labels of the samples in the target heterogeneous scene, training a recognition model based on the samples in the target heterogeneous scene and the pseudo labels of the samples in the target heterogeneous scene, and recognizing an image to be recognized in the target heterogeneous scene by using the recognition model obtained through training.
Through research, the inventor found that this idea has a defect: the limitations of the clustering algorithm or classification model itself result in noise in the generated pseudo labels, which biases the final model training. In view of this defect, the inventor continued the research and finally proposed an object identification scheme with a good effect. The basic idea of the scheme is as follows:
Train a plurality of recognition models with a training sample set (comprising a plurality of unlabeled images) in the target heterogeneous scene: first, classify the samples in the training sample set based on the plurality of recognition models to obtain the classification result of the training sample set on each recognition model; then fuse these classification results, and use the fusion result so that, through mutual learning among the plurality of recognition models, a plurality of recognition models with better performance are finally trained; after training ends, use one of the trained recognition models (for example, the optimal recognition model) to recognize the image to be recognized in the target heterogeneous scene.
The object recognition method provided by the application is suitable for any heterogeneous scene that requires object recognition (such as a heterogeneous face recognition scene), and can be applied to an electronic device with data processing capability; the electronic device may be a server on the network side or a terminal on the user side, such as a PC (personal computer), a notebook computer, a tablet computer (PAD), and the like. The object recognition method provided in the present application is described by the following embodiments.
First embodiment
Referring to fig. 1, a flowchart of an object recognition method provided in an embodiment of the present application is shown, where the method may include:
step S101: and acquiring an image to be identified in the target heterogeneous scene.
The target heterogeneous scene in this embodiment may be a heterogeneous face recognition scene, in which case the image to be recognized is a face image; of course, the embodiment is not limited thereto: the target heterogeneous scene may also be, for example, a heterogeneous pedestrian re-identification scene or a heterogeneous license plate recognition scene. It should be noted that in this embodiment the target heterogeneous scene is a scene in which images of multiple different modalities need to be recognized, and images of different modalities refer to images of different visual domains. For example, the images in a heterogeneous pedestrian re-identification scene include images of the visible-light domain and images of the infrared domain; the visible-light domain and the infrared domain are different visual domains, so images of the visible-light domain and images of the infrared domain are images of different modalities.
Step S102: and identifying the object to be identified in the image to be identified based on one of a plurality of pre-established identification models.
The method comprises the steps that a plurality of recognition models are obtained by training a training sample set in a target heterogeneous scene and target classification results of the training sample set, the target classification results of the training sample set are obtained by fusing classification results of the training sample set on the plurality of recognition models respectively, the classification results of the training sample set on one recognition model are determined according to unique feature vectors extracted from feature expression vectors determined by the recognition model for each sample in the training sample set, and the unique feature vectors are feature vectors capable of uniquely characterizing objects in corresponding samples.
If the target heterogeneous scene is a heterogeneous face recognition scene, the plurality of recognition models are a plurality of face recognition models, the training sample set is a training face image set composed of a plurality of unlabeled face images, and the unique feature vector extracted from the feature representation vector determined by the recognition model for each sample in the training sample set is an identity feature vector.
It should be noted that the initial recognition models (i.e., the models before training with the training sample set in the target heterogeneous scene and the target classification result of the training sample set) are complex recognition models obtained by training on training samples from multiple scenes, and have high recognition accuracy and good generalization capability.
In one possible implementation, the object to be identified in the image to be identified may be identified based on any one of a plurality of pre-established identification models, and in another possible implementation, the object to be identified in the image to be identified may be identified based on an optimal identification model of the plurality of pre-established identification models.
Wherein, determining the optimal recognition model from the plurality of recognition models comprises: testing the performance of each recognition model with test sample pairs in the target heterogeneous scene, and determining the recognition model with the best performance as the optimal recognition model. Specifically, a test sample pair in the target heterogeneous scene is input into each recognition model to obtain the two feature representation vectors determined by each recognition model for the test sample pair; the similarity of the two feature representation vectors determined by each recognition model for the test sample pair is calculated, to obtain the similarity corresponding to each recognition model; and the optimal recognition model is determined from the plurality of recognition models according to the similarity corresponding to each recognition model.
Optionally, the number of test sample pairs may be one or more. If there is one test sample pair, each recognition model corresponds to one similarity, and the recognition model with the maximum similarity is finally determined to be the optimal recognition model. If there are multiple test sample pairs, say N, each recognition model corresponds to N similarities. In one possible implementation, the similarities corresponding to each recognition model are averaged to obtain a similarity mean per recognition model, and the recognition model with the maximum similarity mean is finally determined to be the optimal recognition model. In another possible implementation, for each test sample pair, the recognition model with the maximum similarity on that pair is determined to be optimal on that pair, and the recognition model determined to be optimal the most times is finally determined to be the optimal recognition model. For example, with 3 recognition models and 5 test sample pairs, if recognition model 1 is determined to be optimal on the 1st, 3rd and 5th test sample pairs while each of the other models is optimal on fewer pairs, then recognition model 1, having been determined optimal the most times, is finally determined to be the optimal recognition model.
It should be noted that a test sample pair may be two images of different modalities (i.e., different visual domains) of the same object; for example, a test sample pair may be a visible-light-domain face image and an infrared-domain face image of the same object.
If the similarity measure used is a distance measure (where a smaller value indicates greater similarity), the "maximum similarity", "maximum similarity mean", and the like above need to be replaced with "minimum similarity", "minimum similarity mean", and the like.
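A minimal sketch of the selection procedure above, assuming each recognition model is a callable that maps an image to its feature representation vector and that cosine similarity is the similarity measure (the embodiment fixes neither choice; all function names here are illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_optimal_model(models, test_pairs):
    """Voting variant: pick the model that is best on the most test pairs.

    models:     list of callables, each mapping an image array to a feature vector
    test_pairs: list of (img_a, img_b) pairs -- two modalities of the same object
    """
    wins = np.zeros(len(models), dtype=int)
    for img_a, img_b in test_pairs:
        sims = [cosine_similarity(m(img_a), m(img_b)) for m in models]
        wins[int(np.argmax(sims))] += 1          # best model on this pair
    return models[int(np.argmax(wins))]          # chosen most often

def select_optimal_model_mean(models, test_pairs):
    """Averaging variant: pick the model with the highest mean similarity."""
    means = [np.mean([cosine_similarity(m(a), m(b)) for a, b in test_pairs])
             for m in models]
    return models[int(np.argmax(means))]
```

If a distance measure were used instead of cosine similarity, the two `argmax` calls would become `argmin`, matching the caveat above.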
According to the object recognition method provided above, after the image to be recognized in the target heterogeneous scene is acquired, the object to be recognized in it can be recognized based on one of a plurality of pre-established recognition models. The plurality of recognition models are obtained by training with a training sample set in the target heterogeneous scene and the target classification result of the training sample set; the target classification result is obtained by fusing the classification results of the training sample set on the respective recognition models; and the classification result of the training sample set on each recognition model is determined according to the unique feature vectors extracted from the feature representation vectors determined by that recognition model for each sample in the training sample set. This means that the target classification result of the training sample set is a relatively accurate classification result, so recognition models that are applicable to the target heterogeneous scene and perform well can be obtained, and the image to be recognized in the target heterogeneous scene can be recognized well based on one of them.
Second embodiment
As is clear from the above embodiments, the image to be identified is identified based on one of a plurality of identification models established in advance, and this embodiment focuses on the process of establishing a plurality of identification models.
Referring to fig. 2, a flow chart for creating a plurality of recognition models is shown, which may include:
step S201: and acquiring a preset training sample from the training sample total set to form a training sample set.
The training sample total set comprises a plurality of unlabeled images in the target heterogeneous scene.
Step S202: and classifying samples in the training sample set based on the recognition models respectively to obtain classification results of the training sample set on the recognition models respectively.
Since the samples in the training sample set are classified in the same manner based on each recognition model, this embodiment takes one model M_k (the k-th of the K recognition models) as an example to describe the process of classifying the samples in the training sample set based on M_k.
The process of classifying the samples in the training sample set based on the recognition model M_k includes:
step S2021: based on recognition model M k Determining characteristic expression vectors corresponding to samples in the training sample set respectively, and forming an identification model M by the determined characteristic expression vectors k The corresponding features represent a set of vectors.
For the ith sample x in the training sample set i Will x i Input recognition model M k Can obtain x i Corresponding feature representation vector f i ,x i Corresponding feature representation vector f i Is x i Feature representation vectors, e.g. x, of the object to be identified i For the face image, x is i Corresponding feature representation vector f i Is x i Features of the face represent vectors.
Step S2022: from the recognition model M k Extracting unique feature vectors from each feature expression vector in the corresponding feature expression vector set, and forming a recognition model M by the extracted unique feature vectors k A corresponding set of unique feature vectors.
Specifically, for the ith sample x in the training sample set i The unique feature vector f can be extracted as follows i
f i =α k *fea i (1)
Wherein alpha is k Is with fea i Amplitude vectors of equal dimensions with all values at [0,1]Between them. It should be noted that each recognition model has a corresponding α, α k To identify the model M k And carrying out random initialization assignment on alpha when training is started on each identification model according to the corresponding amplitude vector, and obtaining the alpha value through gradient back transmission learning.
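A minimal PyTorch sketch of formula (1). The embodiment only requires α_k to lie in [0, 1], to be randomly initialized, and to be learned by gradient back-propagation; keeping it in range with a sigmoid over an unconstrained parameter is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class UniqueFeatureExtractor(nn.Module):
    """Per-model amplitude vector alpha_k applied element-wise: f = alpha_k * fea."""

    def __init__(self, feature_dim: int):
        super().__init__()
        # Unconstrained parameter, randomly initialized at the start of training.
        self.raw_alpha = nn.Parameter(torch.randn(feature_dim))

    def forward(self, fea: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.raw_alpha)   # values in (0, 1)
        return alpha * fea                      # element-wise product, formula (1)
```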
Step S2023: classify the samples in the training sample set according to the unique feature vector set corresponding to M_k, to obtain the classification result of the training sample set on M_k.
Specifically, this process includes: obtaining all possible sample pairs in the training sample set; judging, according to the unique feature vector set corresponding to M_k, whether the two samples in each sample pair belong to the same class, to obtain a judgment result for each sample pair; and obtaining the classification result of the training sample set on M_k according to the judgment results of all sample pairs.
For any sample pair (x_i, x_j), whether the two samples belong to the same class can be determined as follows: compute the similarity score_ij of the unique feature vector f_i corresponding to x_i and the unique feature vector f_j corresponding to x_j, both located in the unique feature vector set corresponding to M_k:

score_ij = similarity(f_i, f_j)    (2)

After score_ij is obtained, whether x_i and x_j belong to the same class is determined based on a set similarity threshold thre: if score_ij is greater than or equal to thre, x_i and x_j are determined to belong to the same class; if score_ij is less than thre, x_i and x_j are determined to belong to different classes, namely:

σ(class(x_i) = class(x_j))_k = 1 if score_ij ≥ thre, and 0 otherwise    (3)

The classification result of the training sample set on each recognition model is obtained in the above manner.
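A sketch of steps S2021–S2023 on one recognition model, assuming cosine similarity for formula (2); the threshold value `thre=0.5` is illustrative only:

```python
import numpy as np

def classify_on_model(unique_features, thre=0.5):
    """Pairwise same-class judgments on one recognition model.

    unique_features: (N, D) array, the unique feature vector of each sample.
    Returns an (N, N) 0/1 matrix: entry (i, j) is 1 iff score_ij >= thre.
    """
    f = unique_features / np.linalg.norm(unique_features, axis=1, keepdims=True)
    scores = f @ f.T                      # score_ij for every sample pair, formula (2)
    return (scores >= thre).astype(int)   # per-model judgment, formula (3)
```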
Step S203: and fusing the classification results of the training sample set on each recognition model respectively, wherein the fusion result is used as a target classification result of the training sample set.
Considering that the classification result of the training sample set on one recognition model may be inaccurate, in order to obtain a relatively accurate classification result, in this embodiment, the classification results of the training sample set on each recognition model are fused, and the fused classification result is used as the final classification result of the training sample set, that is, the target classification result. Referring to fig. 3, a process of obtaining classification results of the training sample set on the K recognition models, respectively, and fusing the K classification results to obtain a target classification result of the training sample set is shown.
Specifically, the process of fusing the classification results of the training sample set on each recognition model may include: for each of the sample pairs composed of samples in the training sample set (the training samples may be combined in pairs to obtain all possible sample pairs), if the two samples in the sample pair belong to the same class in the classification results of the training sample set on every recognition model, determining that the two samples belong to the same class; otherwise, determining that they belong to different classes. This fusion manner may be represented by the following formula:

σ(class(x_i) = class(x_j)) = ∏_{k=1}^{K} σ(class(x_i) = class(x_j))_k    (4)

where K denotes the number of recognition models and, accordingly, K classification results are obtained via step S202. Formula (4) means: if the product of σ(class(x_i) = class(x_j))_1 through σ(class(x_i) = class(x_j))_K is 1, x_i and x_j are finally determined to belong to the same class; if the product is 0, x_i and x_j are finally determined not to belong to the same class.
Note that, for a sample pair (x_i, x_j), two cases may occur. Case one: x_i and x_j belong to the same class in the classification result of the training sample set on every recognition model; when this occurs, x_i and x_j indeed belong to the same class. Case two: x_i and x_j belong to the same class in the classification results on some recognition models but to different classes in the classification results on the others; when this occurs, x_i and x_j may belong to different classes. It should be understood that if x_i and x_j belong to different classes but are grouped into one class, the resulting class contains samples that do not belong to it, and subsequent training based on such a classification result inevitably harms the training effect of the models. Therefore, the application proposes that when case two occurs, x_i and x_j are assigned to different classes. It should be noted that when case two occurs, it cannot be concluded that x_i and x_j definitely belong to different classes, only that they may; if they actually belong to the same class, the fusion strategy of the application separates them. Nevertheless, although under this fusion strategy a sample may happen not to be grouped into the class it belongs to (for example, it may be placed in a class of its own), the strategy guarantees that each finally obtained class contains no wrongly grouped samples, and a sample not grouped into its own class will eventually be grouped correctly through the continued learning and training of the models.
σ(class(x_i) = class(x_j))_k indicates whether x_i and x_j belong to the same class in the classification result of the training sample set on the k-th model, and obeys the following distribution:

σ(class(x_i) = class(x_j))_k = 1 if x_i and x_j belong to the same class in the classification result on the k-th model, and 0 otherwise    (5)

For example, suppose there are 3 recognition models; 3 classification results, namely classification result 1, classification result 2 and classification result 3, are obtained via step S202. For any sample pair (x_i, x_j): if x_i and x_j belong to the same class in classification result 1 (σ(class(x_i) = class(x_j))_1 = 1), in classification result 2 (σ(class(x_i) = class(x_j))_2 = 1), and in classification result 3 (σ(class(x_i) = class(x_j))_3 = 1), i.e., the product in formula (4) is 1, then x_i and x_j are finally determined to belong to the same class. If x_i and x_j belong to the same class in classification results 1 and 3 but to different classes in classification result 2 (σ(class(x_i) = class(x_j))_2 = 0), i.e., the product is 0, then x_i and x_j are finally determined to belong to different classes.
It should be noted that not every sample finds a same-class sample; a sample for which no same-class sample is found is placed in a class of its own.
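A sketch of the fusion in formula (4), taking the per-model pairwise matrices from the previous sketch. Turning the fused pairwise judgments into class labels (with unmatched samples as singleton classes) is done here by connected components, which is an assumption of this sketch — the embodiment specifies only the pairwise rule:

```python
import numpy as np

def fuse_classifications(same_class_per_model):
    """Formula (4): two samples are finally judged same-class only if every
    one of the K models judged them same-class (product of 0/1 indicators)."""
    fused = np.ones_like(same_class_per_model[0])
    for mat in same_class_per_model:
        fused *= mat
    return fused

def to_pseudo_labels(fused):
    """Group samples via the fused pairwise matrix; a sample with no
    same-class partner becomes a singleton class."""
    n = len(fused)
    labels = -np.ones(n, dtype=int)
    cur = 0
    for i in range(n):
        if labels[i] == -1:
            stack, labels[i] = [i], cur
            while stack:                       # flood-fill one class
                u = stack.pop()
                for v in np.flatnonzero(fused[u]):
                    if labels[v] == -1:
                        labels[v] = cur
                        stack.append(v)
            cur += 1
    return labels
```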
Step S204: and determining the prediction loss of each recognition model by taking the target classification result of the training sample set as a basis, and updating the parameters of the corresponding recognition model according to the determined prediction loss.
Iterative training is performed in the manner shown in steps S201 to S204 until a training end condition is met; the recognition models obtained after training are the established recognition models.
Third embodiment
The present embodiment focuses on the procedure of "determining the prediction loss of each recognition model based on the target classification result of the training sample set" in step S204 in the above embodiment.
Since the prediction loss corresponding to each recognition model is determined in the same manner based on the target classification result of the training sample set, this embodiment takes the recognition model M_k as an example to describe the process of determining the prediction loss corresponding to M_k based on the target classification result of the training sample set.
The process of determining the prediction loss corresponding to M_k based on the target classification result of the training sample set may include:
Step a1: acquire the triplet corresponding to each sample in the training sample set.
The triplet corresponding to each sample in the training sample set is constructed according to the target classification result of the training sample set. The triplet corresponding to a sample comprises the sample itself, one sample randomly selected from the samples in the training sample set that belong to the same class as the sample, and one sample randomly selected from the samples in the training sample set that belong to a different class from the sample.
Step a2: determine the first prediction loss of the recognition model M_k according to the triplets respectively corresponding to the samples in the training sample set and the unique feature vector set corresponding to M_k.
Specifically, this process includes:
Step a21: for each sample in the training sample set, perform:
Step a21-1a: calculate the distance between the two unique feature vectors that are located in the unique feature vector set corresponding to M_k and correspond to the positive example pair corresponding to the sample, as the first distance corresponding to the sample on the recognition model.
The positive example pair corresponding to the sample consists of the sample and the sample in its triplet that belongs to the same class.
Step a21-1b: calculate the distance between the two unique feature vectors that are located in the unique feature vector set corresponding to M_k and correspond to the negative example pair corresponding to the sample, as the second distance corresponding to the sample on the recognition model.
The negative example pair corresponding to the sample consists of the sample and the sample in its triplet that belongs to a different class.
For example, suppose the triplet corresponding to a sample x_a is (x_a, x_p, x_n), where x_p is one sample randomly selected from the samples in the training sample set that belong to the same class as x_a, and x_n is one sample randomly selected from the samples that belong to a different class from x_a. The positive example pair corresponding to x_a is then (x_a, x_p), and the negative example pair corresponding to x_a is (x_a, x_n). For x_a, step a21-1a calculates the distance between the two unique feature vectors corresponding to the positive example pair (x_a, x_p), and step a21-1b calculates the distance between the two unique feature vectors corresponding to the negative example pair (x_a, x_n).
Through step a21, the first distance and the second distance corresponding to each sample in the training sample set on the recognition model M_k can be obtained.
Step a22: determine the first prediction loss of the recognition model M_k according to the first distance, the second distance and the distance threshold respectively corresponding to each sample in the training sample set on M_k.
The first prediction loss of M_k may be the distance metric loss in fig. 3. Specifically, it may be calculated as:

L1_k = (1/B) · Σ_{i=1}^{B} max( d_k(x_i^a, x_i^p) − d_k(x_i^a, x_i^n) + m, 0 )    (6)

where B denotes the total number of samples in the training sample set; (x_i^a, x_i^p, x_i^n) denotes the triplet corresponding to the i-th sample x_i^a, x_i^p denoting a sample in the training sample set belonging to the same class as x_i^a and x_i^n a sample belonging to a different class; (x_i^a, x_i^p) denotes the positive example pair corresponding to x_i^a and (x_i^a, x_i^n) the negative example pair; d_k(x_i^a, x_i^p) denotes the distance, within the unique feature vector set corresponding to M_k, between the unique feature vector corresponding to x_i^a and that corresponding to x_i^p (the first distance), and d_k(x_i^a, x_i^n) the distance between the unique feature vector corresponding to x_i^a and that corresponding to x_i^n (the second distance); and m is the distance threshold, a hyper-parameter whose purpose is to force d_k(x_i^a, x_i^n) to be larger than d_k(x_i^a, x_i^p) by m.
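A PyTorch sketch of formula (6) with a fixed distance threshold, assuming Euclidean distance between unique feature vectors and averaging over the B triplets; `m=0.3` is only an illustrative value:

```python
import torch

def distance_metric_loss(f_a, f_p, f_n, m=0.3):
    """First prediction loss of one model (formula (6)), fixed-threshold variant.

    f_a, f_p, f_n: (B, D) unique feature vectors of the anchor samples, their
    same-class samples, and their different-class samples.
    """
    d_pos = torch.norm(f_a - f_p, dim=1)            # first distances
    d_neg = torch.norm(f_a - f_n, dim=1)            # second distances
    return torch.clamp(d_pos - d_neg + m, min=0).mean()
```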
In one possible implementation, m may be set to a fixed constant (i.e., the distance thresholds corresponding to all samples in the training sample set on the recognition model M_k are the same). If m is set to a fixed constant, its specific value is generally determined by repeated tuning. It should be noted that when m is a fixed constant, the same constraint is imposed on the positive and negative example pairs corresponding to all samples in the training sample set throughout training.
In implementing the scheme, the inventor found that the positive and negative example pairs corresponding to different samples differ, and training them all under the same distance constraint is not reasonable. In view of this, the application provides another, preferred implementation:
when training the recognition model M_k, an adaptive distance threshold may be calculated for the positive and negative example pair corresponding to each sample in the training sample set, and the calculated distance threshold is used to constrain the corresponding positive and negative example pair. For example, a distance threshold m_1 is calculated for the positive and negative example pair corresponding to the first sample x_1 in the training sample set, and during training m_1 is used to constrain that pair; a distance threshold m_2 is calculated for the positive and negative example pair corresponding to the second sample x_2, and during training m_2 is used to constrain that pair; and so on.
Specifically, the process of calculating the distance threshold for the positive example pair (x_i^a, x_i^p) and the negative example pair (x_i^a, x_i^n) corresponding to a sample x_i^a (i.e., calculating the distance threshold corresponding to x_i^a on M_k) includes:
Step b1: treat each recognition model other than M_k (i.e., M_1, ..., M_{k−1}, M_{k+1}, ..., M_K) as a non-target recognition model, and perform:
Step b11: calculate the distance between the two unique feature vectors that are located in the unique feature vector set corresponding to the non-target recognition model and correspond to the positive example pair (x_i^a, x_i^p), as the first distance corresponding to x_i^a on the non-target recognition model; and calculate the distance between the two unique feature vectors that are located in that set and correspond to the negative example pair (x_i^a, x_i^n), as the second distance corresponding to x_i^a on the non-target recognition model.
Step b12: subtract the first distance corresponding to x_i^a on the non-target recognition model from the second distance corresponding to x_i^a on that model, and use the calculated difference as the distance difference corresponding to x_i^a on the non-target recognition model.
Through the above steps, the distance differences corresponding to x_i^a on each recognition model other than M_k can be obtained.
Step b2: calculate the mean of the distance differences corresponding to x_i^a on the recognition models other than M_k, as the distance difference mean corresponding to x_i^a on M_k.
Assuming there are K recognition models in total, if the first distance corresponding to the sample x_i^a on the t-th (t ≠ k) recognition model is denoted d_t^p and the second distance is denoted d_t^n, the distance difference mean d_k corresponding to x_i^a on M_k can be calculated as:

d_k = (1/(K−1)) · Σ_{t=1, t≠k}^{K} (d_t^n − d_t^p)    (7)
the sample isIn recognition model M k Upper corresponding distance mean d k Larger, explaining negative example pairThe distance between the two pairs is larger than that of the positive example pair +.>Distance between, which means sample +.>The corresponding positive and negative cases are more distinguishable, otherwise, if the sample is +>In recognition model M k Upper corresponding distance mean d k Smaller, explaining the negative example pair->The distance between the two pairs is close to or smaller than the normal pair +.>Distance between, which means sample +.>The corresponding positive and negative example pairs are not easily distinguished.
Step b3: determine the distance threshold corresponding to x_i^a on M_k according to the distance difference mean corresponding to x_i^a on M_k.
After the distance difference mean d_k corresponding to x_i^a on M_k is obtained, the distance threshold m_k corresponding to x_i^a on M_k can be determined from d_k. Specifically, when d_k is large, the positive and negative example pairs corresponding to x_i^a are easier to distinguish, so m_k should be relatively large; conversely, when d_k is small, the positive and negative example pairs are not easily distinguished, so m_k should be relatively small. Thus m_k is a function of d_k, which can be expressed as:

m_k = F(d_k, m')    (8)

where m' is a constant.
In view of this, the first prediction loss of the recognition model M_k may be expressed as:

L1_k = (1/B) · Σ_{i=1}^{B} max( d_k(x_i^a, x_i^p) − d_k(x_i^a, x_i^n) + m_k, 0 )    (9)

Considering that the distance constraint between the positive and negative example pairs should gradually strengthen while the recognition model is trained, the distance threshold can be set to a small value early in training and gradually increased as the recognition model becomes better trained. This is achieved through the function F(d_k, m'), which is accordingly designed as:

F(d_k, m') = (e/E) · m' · d_k    (10)

where E denotes the total number of iterations in the whole training process and e denotes the current iteration number.
Substituting formula (10) into formula (9) gives:

L1_k = (1/B) · Σ_{i=1}^{B} max( d_k(x_i^a, x_i^p) − d_k(x_i^a, x_i^n) + (e/E) · m' · d_k, 0 )    (11)

Updating the parameters of the recognition model M_k based on the first prediction loss L1_k enables M_k to represent input images well, so that images of different classes have good distinguishability.
Preferably, in order to further enhance the recognition model M_k so that images of different classes have better distinguishability, the process of determining the prediction loss corresponding to M_k based on the target classification result of the training sample set may further include:
step c1: according to the target classification result of the training sample set and the recognition model M k And determining class centers of various samples in the training sample set according to the corresponding unique feature vectors.
Specifically, for each class of samples in the training sample set, the mean of the unique feature vectors corresponding to the samples of that class may be calculated in the unique feature vector set corresponding to recognition model M_k, and the calculated mean serves as the class center of that class of samples.
Assuming that the training sample set includes P classes of samples, the class center of the p-th class can be calculated by the following formula:

c_p = \frac{1}{|X_p|} \sum_{x_q \in X_p} fea_q^{(k)}    (12)
wherein X_p represents the set of p-th class samples in the training sample set, |X_p| represents the total number of samples in X_p, x_q represents the q-th sample in X_p, and fea_q^{(k)} represents the unique feature vector corresponding to x_q in the unique feature vector set corresponding to recognition model M_k; the summation runs over all samples of the p-th class.
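A minimal sketch of formula (12) (illustrative only; class_centers is an invented name, and the pseudo-labels come from the fused target classification result):

    import numpy as np

    def class_centers(unique_features, pseudo_labels, num_classes):
        # Formula (12): the class center c_p is the mean of the unique feature
        # vectors of the samples assigned to class p.
        centers = np.zeros((num_classes, unique_features.shape[1]))
        for p in range(num_classes):
            members = unique_features[pseudo_labels == p]
            if len(members):
                centers[p] = members.mean(axis=0)
        return centers

    # Toy usage: 5 samples with 4-dimensional unique feature vectors, P = 2.
    fea = np.random.randn(5, 4)
    lab = np.array([0, 1, 0, 1, 1])
    print(class_centers(fea, lab, num_classes=2))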
Step c2: determine the second prediction loss of recognition model M_k according to the unique feature vector set corresponding to M_k and the class centers of the classes of samples in the training sample set.
Specifically, the process of determining the second prediction loss of recognition model M_k according to its corresponding unique feature vector set and the class centers of the classes of samples in the training sample set includes:

Step c21: for each sample in the training sample set, determine the prediction loss of recognition model M_k on the sample according to the unique feature vector corresponding to the sample, the class center of the class to which the sample belongs, and the class centers of the classes different from the class to which the sample belongs, thereby obtaining the prediction loss of recognition model M_k on each sample in the training sample set.
The process of determining the prediction loss of recognition model M_k on a sample according to the unique feature vector corresponding to the sample, the class center of the class to which the sample belongs, and the class centers of the other classes is as follows: reconstruct an image from the class center of the class to which the sample belongs, the reconstructed image serving as the first image, and reconstruct an image from the class center of a class different from the class to which the sample belongs, the reconstructed image serving as the second image; obtain a unique feature vector from the first image and a unique feature vector from the second image; and determine the prediction loss of recognition model M_k on the sample according to the unique feature vector corresponding to the sample and the unique feature vectors obtained from the first image and the second image, respectively.
Step c22: determine the second prediction loss of recognition model M_k according to the prediction losses of M_k on the respective samples in the training sample set.
The second prediction loss of recognition model M_k is essentially a class-center reconstruction loss, which may be determined, for example, in the following margin form:

\mathcal{L}_2^{(k)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{P-1} \sum_{p=1}^{P-1} \max\left( s\big(fea_i,\, fea(R(\bar{c}_p^{\,i}))\big) - s\big(fea_i,\, fea(R(c_i))\big) + m_1,\ 0 \right)    (13)
wherein c_i represents the class center corresponding to sample x_i, \bar{c}_p^{\,i} represents the p-th class center among the class centers of the P-1 classes to which x_i does not belong, R(c_i) represents the image reconstructed by the reconstruction network from c_i and, similarly, R(\bar{c}_p^{\,i}) represents the image reconstructed from \bar{c}_p^{\,i}; s(fea_i, fea(R(c_i))) represents the similarity between the unique feature vector corresponding to x_i and the unique feature vector obtained from the image reconstructed from c_i, s(fea_i, fea(R(\bar{c}_p^{\,i}))) represents the similarity between the unique feature vector corresponding to x_i and the unique feature vector obtained from the image reconstructed from \bar{c}_p^{\,i}, and m_1 is a hyperparameter to be tuned.
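The class-center reconstruction loss can be sketched as follows, with the reconstruction network and the unique-feature extractor collapsed into a single stand-in function recon_feature (everything here is an illustrative assumption, not the patent's implementation):

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def center_reconstruction_loss(fea, labels, centers, recon_feature, m1=0.2):
        # Hinge reading of formula (13): a sample's unique feature vector should
        # be more similar to the feature of the image reconstructed from its own
        # class center than to features reconstructed from the other class
        # centers, by at least the margin m1.
        losses = []
        for x, y in zip(fea, labels):
            own = cosine(x, recon_feature(centers[y]))
            for p in range(len(centers)):
                if p != y:
                    other = cosine(x, recon_feature(centers[p]))
                    losses.append(max(other - own + m1, 0.0))
        return float(np.mean(losses))

    # Toy usage; recon_feature is stubbed as the identity for illustration.
    fea = np.random.randn(6, 8)
    lab = np.array([0, 0, 1, 1, 2, 2])
    centers = np.stack([fea[lab == p].mean(axis=0) for p in range(3)])
    print(center_reconstruction_loss(fea, lab, centers, recon_feature=lambda c: c))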
To prevent the classification results of the recognition models from differing too much during training, which would cause the fused target classification result to deviate severely, and at the same time to prevent the recognition models from becoming so similar that fusing their classification results loses its value, the present embodiment provides the following strategy:
According to the unique feature vector set corresponding to recognition model M_k, the unique feature vector sets corresponding to the other recognition models, and the binarized mask vectors corresponding to the other recognition models, a model-difference characterization value corresponding to M_k is determined and used as the third prediction loss of recognition model M_k (the mutual loss in fig. 3), wherein the binarized mask vectors are trainable parameters.
The third prediction loss of recognition model M_k is essentially a model mutual loss, which may be calculated, for example, as:

\mathcal{L}_3^{(k)} = \frac{1}{K-1} \sum_{t=1, t \neq k}^{K} \frac{1}{N} \sum_{i=1}^{N} \left\| \big( fea_i^{(k)} - fea_i^{(t)} \big) \odot mask_t \right\|_2^2    (14)

wherein mask_t represents the binarized mask vector of the t-th recognition model. During the whole training process, mask_t is regenerated randomly at every iteration, and its dimension equals the dimension of the unique feature vectors obtained from recognition model M_t. The binarized mask vector mask_t is multiplied element-wise with the unique feature vectors obtained from recognition model M_t, so that part of each unique feature vector is set to 0; preferably, the proportion of zeroed positions in a unique feature vector does not exceed 50%. It should be noted that a binarized mask vector consists of 0s and 1s; multiplying it element-wise with a unique feature vector sets the positions corresponding to the 0s in the mask to 0 (which is equivalent to masking off part of the unique feature vector). As a result, the models learn only part of one another's features (i.e., the unmasked part), which maintains the differentiation between the different models.
Training recognition model M_k with the third prediction loss makes the unique feature vectors of M_k approximate those of the other recognition models while, at the same time, preserving the differentiation between M_k and the other recognition models.
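A sketch of the mask mechanism and the mutual loss (illustrative; the squared-difference form and the names random_mask and mutual_loss are assumptions consistent with the description above):

    import numpy as np

    rng = np.random.default_rng(0)

    def random_mask(dim, zero_ratio=0.5):
        # Binarized mask vector of 0s and 1s, regenerated at every iteration,
        # with the proportion of zeros kept at no more than 50%.
        mask = np.ones(dim)
        mask[rng.choice(dim, size=int(dim * zero_ratio), replace=False)] = 0.0
        return mask

    def mutual_loss(fea_k, other_feas, other_masks):
        # Formula (14) sketch: align M_k's unique feature vectors with those of
        # every other model, but only on the positions the mask leaves visible.
        terms = [np.mean(np.sum(((fea_k - fea_t) * mask_t) ** 2, axis=1))
                 for fea_t, mask_t in zip(other_feas, other_masks)]
        return float(np.mean(terms))

    # Toy usage: K = 3 models, N = 4 samples, 10-dimensional unique features.
    feas = [rng.normal(size=(4, 10)) for _ in range(3)]
    masks = [random_mask(10) for _ in range(2)]  # masks of the other two models
    print(mutual_loss(feas[0], feas[1:], masks))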
Optionally, to further improve recognition model M_k, the present embodiment may also determine a fourth prediction loss of recognition model M_k according to the modal information of the training samples. Specifically:

Extract a modal feature vector from each feature representation vector in the feature representation vector set corresponding to recognition model M_k, the extracted modal feature vectors forming the modal feature vector set corresponding to M_k; then determine the fourth prediction loss of recognition model M_k according to this modal feature vector set.
The modal feature vector mod_i is extracted from the feature representation vector f_i as follows:

mod_i = (1 - \alpha_k) * f_i    (15)

wherein \alpha_k is the \alpha_k in formula (1); when the unique feature vector fea_i is extracted from the feature representation vector f_i, the modal feature vector mod_i can be extracted at the same time.
In the present embodiment, the fourth prediction loss of recognition model M_k may be determined from the modal feature vector set corresponding to M_k, for example with the following cross-entropy form:

\mathcal{L}_4^{(k)} = -\frac{1}{N} \sum_{i=1}^{N} y_i^{\top} \log S(mod_i)    (16)
wherein y_i represents the label vector corresponding to sample x_i, and S(mod_i) represents the classification probability vector obtained by performing a classification operation on the modal feature vector corresponding to x_i. The above formula constrains feature vectors of the same modality to be classified into the same class.
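A sketch of formulas (15)-(16), assuming the gating split of formula (1) (fea_i = \alpha_k * f_i) and a linear classifier followed by softmax for S(·); all names and the toy labels are illustrative:

    import numpy as np

    def split_features(f, alpha_k):
        # fea_i = alpha_k * f_i (identity part, formula (1), assumed) and
        # mod_i = (1 - alpha_k) * f_i (modality part, formula (15)).
        return alpha_k * f, (1.0 - alpha_k) * f

    def modal_loss(mod, y, W):
        # Cross-entropy sketch of formula (16): classify the modal feature
        # vectors and penalize deviation from the labels y.
        logits = mod @ W
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        return float(-np.mean(np.log(probs[np.arange(len(y)), y] + 1e-8)))

    # Toy usage: 4 samples, 8-dimensional feature representation vectors.
    f = np.random.randn(4, 8)
    alpha_k = np.random.rand(8)    # gating vector of recognition model M_k
    fea, mod = split_features(f, alpha_k)
    W = np.random.randn(8, 2)      # classifier weights realizing S(.)
    y = np.array([0, 1, 0, 1])     # labels (e.g., modality: photo vs. sketch)
    print(modal_loss(mod, y, W))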
After the four prediction losses \mathcal{L}_1^{(k)}, \mathcal{L}_2^{(k)}, \mathcal{L}_3^{(k)} and \mathcal{L}_4^{(k)} are obtained, they can be fused, and the parameters of recognition model M_k are updated based on the fusion result. Optionally, the fusion may be performed, for example, with the following formula:

\mathcal{L}^{(k)} = \mathcal{L}_1^{(k)} + \lambda_1 \mathcal{L}_2^{(k)} + \lambda_2 \mathcal{L}_3^{(k)} + \lambda_3 \mathcal{L}_4^{(k)}    (17)
wherein \lambda_1, \lambda_2 and \lambda_3 are three hyperparameters to be tuned. Since the models differ greatly from one another at the initial stage of training, \lambda_2 can be set relatively large at first and then gradually reduced to 0 as the models continuously approach one another. During the whole training process, the other K-1 recognition models are trained in the same way as the k-th recognition model.
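The fusion and the \lambda_2 schedule can be sketched as follows (the linear decay is an assumption; the description only states that \lambda_2 starts large and is gradually reduced to 0):

    def fused_loss(l1, l2, l3, l4, lam1, lam2, lam3):
        # Formula (17): weighted fusion of the four prediction losses of M_k.
        return l1 + lam1 * l2 + lam2 * l3 + lam3 * l4

    def lambda2(e, E, lam2_init=1.0):
        # Mutual-loss weight: large early on (the K models still differ a lot),
        # decaying to 0 as the models gradually approach one another.
        return lam2_init * (1.0 - e / E)

    # Toy usage at iteration e = 30 of E = 100.
    print(fused_loss(0.8, 0.4, 0.2, 0.5, lam1=0.5, lam2=lambda2(30, 100), lam3=0.1))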
As can be seen from the above, the whole training process of the recognition models never requires concrete class labels of the samples to guide training, i.e., the training samples need no class annotation, so the scheme is well suited to unsupervised recognition tasks in heterogeneous scenes.
Fourth embodiment
The embodiment of the application further provides an object recognition device. The object recognition device described below and the object recognition method described above may be referred to in correspondence with each other.
Referring to fig. 4, which shows a schematic structural diagram of the object recognition device provided in an embodiment of the present application, the device may include: an image acquisition module 401 and an image recognition module 402.
The image acquisition module 401 is configured to acquire an image to be identified in the target heterogeneous scene.
The image recognition module 402 is configured to recognize an object to be recognized in the image to be recognized based on one of a plurality of recognition models that are established in advance.
Wherein the plurality of recognition models are obtained by training with a training sample set in the target heterogeneous scene, and each recognition model updates its parameters according to its corresponding prediction loss. The prediction loss corresponding to each recognition model is determined based on the target classification result of the training sample set, and the target classification result is obtained by fusing the classification results of the training sample set on the respective recognition models. The classification result of the training sample set on a recognition model is determined according to the unique feature vectors extracted from the feature representation vectors that the recognition model determines for the samples in the training sample set, a unique feature vector being a feature vector capable of uniquely characterizing the object in the corresponding sample.
Optionally, the image recognition module 402 is specifically configured to recognize the image to be recognized based on an optimal recognition model of the multiple recognition models.
Optionally, the object identifying apparatus provided in the embodiment of the present application may further include: and an optimal recognition model determining module.
The optimal recognition model determining module is configured to: input a test sample pair in the target heterogeneous scene into each recognition model to obtain the two feature representation vectors that each recognition model determines for the test sample pair; calculate, for each recognition model, the similarity of the two feature representation vectors it determines for the test sample pair, thereby obtaining the similarity corresponding to each recognition model; and determine the optimal recognition model from the plurality of recognition models according to the similarities corresponding to the respective recognition models. Wherein the test sample pair consists of images of two different modalities of the same object.
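This selection logic can be sketched as follows (illustrative only: the "models" are random stand-in feature extractors, cosine similarity is an assumed similarity measure, and best_model is an invented name):

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def best_model(models, pair):
        # For a test sample pair (two modalities of the same object), the model
        # whose two feature representation vectors are most similar wins.
        img_a, img_b = pair
        sims = [cosine(m(img_a), m(img_b)) for m in models]
        return int(np.argmax(sims)), sims

    # Toy usage: three "models" realized as random linear projections.
    rng = np.random.default_rng(1)
    models = [lambda x, P=rng.normal(size=(16, 8)): x @ P for _ in range(3)]
    pair = (rng.normal(size=16), rng.normal(size=16))
    idx, sims = best_model(models, pair)
    print(idx, sims)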
Optionally, the object identifying apparatus provided in the embodiment of the present application may further include: and a model building module.
The model construction module is configured to: determine, based on each recognition model, the feature representation vectors corresponding to the samples in the training sample set, obtaining the feature representation vector set corresponding to each recognition model; extract a unique feature vector from each feature representation vector in the feature representation vector set corresponding to each recognition model, obtaining the unique feature vector set corresponding to each recognition model; classify the samples in the training sample set according to the unique feature vector set corresponding to each recognition model, obtaining the classification result of the training sample set on each recognition model; fuse the classification results of the training sample set on the respective recognition models, the fusion result serving as the target classification result of the training sample set; and determine the prediction loss of each recognition model based on the target classification result, updating the parameters of the corresponding recognition model according to the determined prediction loss.
Optionally, when the model building module fuses the classification results of the training sample set on each recognition model, the model building module is specifically configured to:
for each of a plurality of sample pairs consisting of samples in the training sample set: if, in each of the classification results of the training sample set on the respective recognition models, the two samples in the sample pair belong to the same class, determine that the two samples in the sample pair belong to the same class; otherwise, determine that the two samples in the sample pair belong to different classes.
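A minimal sketch of this unanimous-agreement fusion (illustrative; fuse_results is an invented name, and a union-find pass is one convenient way to turn the pairwise rule into a partition):

    import numpy as np
    from itertools import combinations

    def fuse_results(results):
        # results[t][i] is the class id of sample i under recognition model M_t.
        # Two samples end up in the same target class only when every model
        # puts them together; unanimous pairs are merged (transitively).
        n = len(results[0])
        parent = list(range(n))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i, j in combinations(range(n), 2):
            if all(r[i] == r[j] for r in results):
                parent[find(i)] = find(j)
        roots = sorted({find(i) for i in range(n)})
        relabel = {r: c for c, r in enumerate(roots)}
        return [relabel[find(i)] for i in range(n)]

    # Toy usage: 3 models, 5 samples; only samples 0 and 1 always agree.
    res = [np.array([0, 0, 1, 1, 2]),
           np.array([1, 1, 0, 0, 2]),
           np.array([0, 0, 2, 1, 1])]
    print(fuse_results(res))  # -> [0, 0, 1, 2, 3]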
Optionally, for a target recognition model whose prediction loss is to be determined, the model construction module determines the prediction loss of the target recognition model based on the target classification result of the training sample set, and is specifically configured to:
obtaining triples corresponding to all samples in the training sample set respectively, wherein the triples corresponding to each sample in the training sample set are constructed according to the target classification result of the training sample set, and the triples corresponding to one sample comprise the sample, the samples belonging to the same class with the sample and the samples belonging to different classes with the sample; and determining a first prediction loss of the target recognition model according to the triples respectively corresponding to the samples in the training sample set and the unique feature vector set corresponding to the target recognition model.
Optionally, when determining the first prediction loss of the target recognition model according to the triplet corresponding to each sample in the training sample set and the unique feature vector set corresponding to the target recognition model, the model construction module is specifically configured to:
for each sample in the training sample set: calculating a distance between two unique feature vectors which are positioned in a unique feature vector set corresponding to the target recognition model and correspond to a positive example pair corresponding to the sample as a first distance corresponding to the sample on the target recognition model, and calculating a distance between two unique feature vectors which are positioned in the unique feature vector set corresponding to the target recognition model and correspond to a negative example pair corresponding to the sample as a second distance corresponding to the sample on the target recognition model, wherein the positive example pair corresponding to the sample consists of the sample and a sample belonging to the same class in a triplet corresponding to the sample, and the negative example pair corresponding to the sample consists of the sample and a sample belonging to different classes in a triplet corresponding to the sample; and determining a first prediction loss of the target recognition model according to a first distance, a second distance and a distance threshold value corresponding to each sample in the training sample set on the target recognition model.
Optionally, the model building module is further configured to determine a distance threshold corresponding to each sample in the training sample set on the target recognition model respectively.
The model construction module is specifically configured to, when determining a distance threshold corresponding to a sample on the target recognition model:
each recognition model except the target recognition model is used as a non-target recognition model, and the method comprises the following steps:
calculating the distance between the two unique feature vectors, in the unique feature vector set corresponding to the non-target recognition model, that correspond to the positive example pair corresponding to the sample, as the first distance corresponding to the sample on the non-target recognition model, and calculating the distance between the two unique feature vectors, in the unique feature vector set corresponding to the non-target recognition model, that correspond to the negative example pair corresponding to the sample, as the second distance corresponding to the sample on the non-target recognition model; subtracting the first distance corresponding to the sample on the non-target recognition model from the second distance corresponding to the sample on the non-target recognition model, the calculated difference serving as the distance difference corresponding to the sample on the non-target recognition model; calculating the mean of the distance differences corresponding to the sample on the respective non-target recognition models, the mean serving as the distance-difference mean corresponding to the sample on the target recognition model; and determining the distance threshold corresponding to the sample on the target recognition model according to the distance-difference mean corresponding to the sample on the target recognition model.
Optionally, when determining the predicted loss of the target recognition model based on the target classification result of the training sample set, the model construction module is further configured to:
determining class centers of various samples in the training sample set according to the target classification result of the training sample set and the unique feature vector corresponding to the target recognition model; and determining a second prediction loss of the target recognition model according to the unique feature vector set corresponding to the target recognition model and class centers of various samples in the training sample set.
Optionally, the model construction module determines a second prediction loss of the target recognition model according to the unique feature vector set corresponding to the target recognition model and class centers of various samples in the training sample set, and is specifically configured to:
for each sample in the training sample set: determining the prediction loss of the target recognition model on the sample according to the unique feature vector corresponding to the sample, the class center of the class to which the sample belongs and the class center of the class different from the class to which the sample belongs; and determining a second prediction loss of the target recognition model according to the prediction loss of the target recognition model on each sample in the training sample set.
Optionally, when determining the prediction loss of the target recognition model on the sample according to the unique feature vector corresponding to the sample, the class center of the class to which the sample belongs, and the class center of the class different from the class to which the sample belongs, the model construction module is specifically configured to:
reconstructing an image according to a class center of a class to which the sample belongs, wherein the reconstructed image is used as a first image, and reconstructing an image according to a class center of a class different from the class to which the sample belongs, and the reconstructed image is used as a second image;
acquiring unique feature vectors according to the first image and the second image respectively;
and determining the prediction loss of the target recognition model on the sample according to the unique feature vector corresponding to the sample and the unique feature vectors acquired respectively according to the first image and the second image.
Optionally, the model building module is further configured to:
for each recognition model, determining a model-difference characterization value as the third prediction loss of the recognition model according to the unique feature vector set corresponding to the recognition model, the unique feature vector sets corresponding to the other recognition models, and the binarized mask vectors corresponding to the other recognition models, wherein the binarized mask vectors are trainable parameters used to set part of the elements of the unique feature vectors obtained from the corresponding recognition models to 0.
Optionally, the model building module is further configured to:
for each recognition model, extracting a modal feature vector from each feature expression vector in a feature expression vector set corresponding to the recognition model, and forming a modal feature vector set corresponding to the recognition model by the extracted modal feature vectors; and determining a fourth prediction loss of the recognition model according to the modal feature vector set corresponding to the recognition model.
Optionally, the image to be identified is a face image to be identified in a heterogeneous face identification scene;
the plurality of recognition models are a plurality of face recognition models; the training sample set in the target heterogeneous scene is a training face image set in the heterogeneous face recognition scene; the unique feature vector extracted from the feature representation vector determined for each sample in the training sample set by each recognition model is an identity feature vector.
With the above object recognition device, after the image to be recognized in the target heterogeneous scene is acquired, the object to be recognized in that image can be recognized based on one of the plurality of pre-established recognition models. Because the plurality of recognition models are trained with a training sample set in the target heterogeneous scene under the guidance of the target classification result of the training sample set, because that target classification result fuses the classification results of the training sample set on all the recognition models, and because each of those classification results is determined from the unique feature vectors extracted from the feature representation vectors that the corresponding recognition model determines for the samples, the target classification result is a relatively accurate classification result, and the recognition models trained with it can recognize objects in the target heterogeneous scene well.
Fifth embodiment
An embodiment of the present application further provides an object recognition device, referring to fig. 5, which shows a schematic structural diagram of the object recognition device, where the object recognition device may include: at least one processor 501, at least one communication interface 502, at least one memory 503, and at least one communication bus 504;
in the embodiment of the present application, the number of the processor 501, the communication interface 502, the memory 503, and the communication bus 504 is at least one, and the processor 501, the communication interface 502, and the memory 503 complete communication with each other through the communication bus 504;
the processor 501 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention;
the memory 503 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory), etc., such as at least one magnetic disk memory;
wherein the memory stores a program, and the processor may invoke the program stored in the memory, the program being configured to:
acquiring an image to be identified in a target heterogeneous scene;
identifying an object to be identified in the image to be identified based on one of a plurality of pre-established identification models;
The method comprises the steps that a plurality of recognition models are obtained by training a training sample set in a target heterogeneous scene and target classification results of the training sample set, the target classification results of the training sample set are obtained by fusing classification results of the training sample set on the plurality of recognition models respectively, the classification results of the training sample set on one recognition model are determined according to unique feature vectors extracted from feature expression vectors determined by the recognition model for each sample in the training sample set, and the unique feature vectors are feature vectors capable of uniquely characterizing objects in the corresponding samples.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Sixth embodiment
The embodiment of the application also provides a readable storage medium, which can store a program suitable for being executed by a processor, the program being configured to:
acquiring an image to be identified in a target heterogeneous scene;
identifying an object to be identified in the image to be identified based on one of a plurality of pre-established identification models;
the method comprises the steps that a plurality of recognition models are obtained by training a training sample set in a target heterogeneous scene and target classification results of the training sample set, the target classification results of the training sample set are obtained by fusing classification results of the training sample set on the plurality of recognition models respectively, the classification results of the training sample set on one recognition model are determined according to unique feature vectors extracted from feature expression vectors determined by the recognition model for each sample in the training sample set, and the unique feature vectors are feature vectors capable of uniquely characterizing objects in the corresponding samples.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should be further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the other embodiments; for identical or similar parts between the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. An object recognition method, comprising:
acquiring an image to be identified in a target heterogeneous scene;
identifying an object to be identified in the image to be identified based on one of a plurality of pre-established identification models;
the method comprises the steps that a plurality of recognition models are obtained by training a training sample set in a target heterogeneous scene and target classification results of the training sample set, the target classification results of the training sample set are obtained by fusing classification results of the training sample set on the plurality of recognition models respectively, the classification results of the training sample set on one recognition model are determined according to unique feature vectors extracted from feature expression vectors determined by the recognition model for each sample in the training sample set, and the unique feature vectors are feature vectors capable of uniquely characterizing objects in the corresponding samples;
Wherein each recognition model includes a first predicted loss, a second predicted loss, and a third predicted loss, and training each recognition model includes training each recognition model based on the first predicted loss, the second predicted loss, and the third predicted loss;
wherein, for any recognition model, a model-difference characterization value corresponding to the recognition model is determined as the third prediction loss of the recognition model according to the unique feature vector set corresponding to the recognition model, the unique feature vector sets corresponding to the other recognition models, and the binarized mask vectors corresponding to the other recognition models.
2. The method according to claim 1, wherein the identifying the object to be identified in the image to be identified based on one of a plurality of identification models established in advance includes:
identifying the image to be identified based on the optimal identification model in the plurality of identification models;
determining an optimal recognition model from the plurality of recognition models, comprising:
inputting a test sample pair in the target heterogeneous scene into each recognition model to obtain two feature representation vectors determined by each recognition model for the test sample pair, wherein the test sample pair consists of images of two different modalities of the same object;
Calculating the similarity of two feature expression vectors determined by each recognition model aiming at the test sample pair so as to obtain the similarity corresponding to each recognition model;
and determining an optimal recognition model from the plurality of recognition models according to the similarity corresponding to each recognition model.
3. The method of claim 1, wherein the process of building the plurality of recognition models comprises:
determining feature representation vectors corresponding to samples in the training sample set respectively based on each recognition model to obtain a feature representation vector set corresponding to each recognition model;
extracting unique feature vectors from each feature representation vector in the feature representation vector set corresponding to each recognition model to obtain a unique feature vector set corresponding to each recognition model;
classifying samples in the training sample set according to the unique feature vector set corresponding to each recognition model to obtain a classification result of the training sample set on each recognition model;
fusing the classification results of the training sample set on each recognition model respectively, wherein the fusion result is used as a target classification result of the training sample set;
And determining the prediction loss of each recognition model by taking the target classification result of the training sample set as a basis, and carrying out parameter updating on the corresponding recognition model according to the determined prediction loss.
4. The method of claim 3, wherein fusing the classification results of the training sample set on each recognition model, respectively, comprises:
for each of a plurality of sample pairs consisting of samples in the training sample set:
and if the two samples in the sample pair belong to the same class in each of the classification results of the training sample set on the respective recognition models, determining that the two samples in the sample pair belong to the same class, otherwise, determining that the two samples in the sample pair belong to different classes.
5. The object recognition method according to claim 3, wherein determining the predicted loss of the target recognition model based on the target classification result of the training sample set for the target recognition model of the predicted loss to be determined comprises:
obtaining triples corresponding to all samples in the training sample set respectively, wherein the triples corresponding to each sample in the training sample set are constructed according to the target classification result of the training sample set, and the triples corresponding to one sample comprise the sample, the samples belonging to the same class with the sample and the samples belonging to different classes with the sample;
And determining a first prediction loss of the target recognition model according to the triples respectively corresponding to the samples in the training sample set and the unique feature vector set corresponding to the target recognition model.
6. The method according to claim 5, wherein determining the first prediction loss of the target recognition model according to the triplets respectively corresponding to the samples in the training sample set and the unique feature vector set corresponding to the target recognition model includes:
for each sample in the training sample set: calculating a distance between two unique feature vectors which are positioned in a unique feature vector set corresponding to the target recognition model and correspond to a positive example pair corresponding to the sample as a first distance corresponding to the sample on the target recognition model, and calculating a distance between two unique feature vectors which are positioned in the unique feature vector set corresponding to the target recognition model and correspond to a negative example pair corresponding to the sample as a second distance corresponding to the sample on the target recognition model, wherein the positive example pair corresponding to the sample consists of the sample and a sample belonging to the same class in a triplet corresponding to the sample, and the negative example pair corresponding to the sample consists of the sample and a sample belonging to different classes in a triplet corresponding to the sample;
And determining a first prediction loss of the target recognition model according to a first distance, a second distance and a distance threshold value corresponding to each sample in the training sample set on the target recognition model.
7. The method of claim 6, wherein determining a distance threshold for a sample on the object recognition model comprises:
each recognition model except the target recognition model is used as a non-target recognition model, and the method comprises the following steps:
calculating a distance between two unique feature vectors corresponding to the non-target recognition model and corresponding to a positive example pair corresponding to the sample, wherein the distance is used as a first distance corresponding to the sample on the non-target recognition model, the distance between two unique feature vectors corresponding to the non-target recognition model and corresponding to a negative example pair corresponding to the sample and corresponding to the non-target recognition model is calculated as a second distance corresponding to the sample on the non-target recognition model;
subtracting the first distance corresponding to the sample on the non-target recognition model from the second distance corresponding to the sample on the non-target recognition model, and taking the calculated difference as the distance difference corresponding to the sample on the non-target recognition model;
And calculating the average value of the distance differences corresponding to the sample on each non-target recognition model, and determining the distance threshold value corresponding to the sample on the target recognition model according to the calculated average value.
8. The method according to claim 5, wherein determining the predicted loss of the object recognition model based on the object classification result of the training sample set further comprises:
determining class centers of various samples in the training sample set according to the target classification result of the training sample set and the unique feature vector corresponding to the target recognition model;
and determining a second prediction loss of the target recognition model according to the unique feature vector set corresponding to the target recognition model and class centers of various samples in the training sample set.
9. The method according to claim 8, wherein determining the second prediction loss of the target recognition model according to the unique feature vector set corresponding to the target recognition model and the class center of each type of sample in the training sample set includes:
for each sample in the training sample set: determining the prediction loss of the target recognition model on the sample according to the unique feature vector corresponding to the sample, the class center of the class to which the sample belongs and the class center of the class different from the class to which the sample belongs;
And determining a second prediction loss of the target recognition model according to the prediction loss of the target recognition model on each sample in the training sample set.
10. The method according to claim 9, wherein determining the predicted loss of the object recognition model on the sample according to the unique feature vector corresponding to the sample, the class center of the class to which the sample belongs, and the class center of the class different from the class to which the sample belongs, comprises:
reconstructing an image according to a class center of a class to which the sample belongs, wherein the reconstructed image is used as a first image, and reconstructing an image according to a class center of a class different from the class to which the sample belongs, and the reconstructed image is used as a second image;
acquiring unique feature vectors according to the first image and the second image respectively;
and determining the prediction loss of the target recognition model on the sample according to the unique feature vector corresponding to the sample and the unique feature vectors acquired respectively according to the first image and the second image.
11. The object recognition method according to any one of claims 1 to 10, wherein the image to be recognized is a face image to be recognized in a heterogeneous face recognition scene;
The plurality of recognition models are a plurality of face recognition models; the training sample set in the target heterogeneous scene is a training face image set in the heterogeneous face recognition scene; the unique feature vector extracted from the feature representation vector determined for each sample in the training sample set by each recognition model is an identity feature vector.
12. An object recognition apparatus, comprising: an image acquisition module and an image recognition module;
the image acquisition module is used for acquiring an image to be identified in the target heterogeneous scene;
the image recognition module is used for recognizing an object to be recognized in the image to be recognized based on one of a plurality of recognition models which are established in advance;
the method comprises the steps that a plurality of recognition models are obtained by training a training sample set in a target heterogeneous scene, each recognition model carries out parameter updating according to corresponding prediction loss, the prediction loss corresponding to each recognition model is determined according to a target classification result of the training sample set, the target classification result of the training sample set is obtained by respectively fusing classification results of the training sample set on the plurality of recognition models, the classification result of the training sample set on one recognition model is determined according to a unique feature vector extracted from feature expression vectors determined by the recognition model for each sample in the training sample set, and the unique feature vector is a feature vector capable of uniquely representing an object in the corresponding sample;
Wherein each recognition model includes a first predicted loss, a second predicted loss, and a third predicted loss, and training each recognition model includes training each recognition model based on the first predicted loss, the second predicted loss, and the third predicted loss;
wherein, for any recognition model, a model-difference characterization value corresponding to the recognition model is determined as the third prediction loss of the recognition model according to the unique feature vector set corresponding to the recognition model, the unique feature vector sets corresponding to the other recognition models, and the binarized mask vectors corresponding to the other recognition models.
13. An object recognition apparatus, characterized by comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the object recognition method according to any one of claims 1 to 11.
14. A readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the object recognition method according to any one of claims 1-11.
CN202110769197.5A 2021-07-07 2021-07-07 Object identification method, device, equipment and storage medium Active CN113486804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110769197.5A CN113486804B (en) 2021-07-07 2021-07-07 Object identification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113486804A CN113486804A (en) 2021-10-08
CN113486804B true CN113486804B (en) 2024-02-20

Family

ID=77941785

Country Status (1)

Country Link
CN (1) CN113486804B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113971761A (en) * 2021-11-05 2022-01-25 南昌黑鲨科技有限公司 Multi-input scene recognition method, terminal device and readable storage medium


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945280A (en) * 2012-11-15 2013-02-27 翟云 Unbalanced data distribution-based multi-heterogeneous base classifier fusion classification method
CN106326874A (en) * 2016-08-30 2017-01-11 天津中科智能识别产业技术研究院有限公司 Method and device for recognizing iris in human eye images
CN108009528A (en) * 2017-12-26 2018-05-08 广州广电运通金融电子股份有限公司 Face authentication method, device, computer equipment and storage medium based on Triplet Loss
WO2019128367A1 (en) * 2017-12-26 2019-07-04 广州广电运通金融电子股份有限公司 Face verification method and apparatus based on triplet loss, and computer device and storage medium
CN109272044A (en) * 2018-09-19 2019-01-25 郑州云海信息技术有限公司 A kind of image similarity determines method, apparatus, equipment and storage medium
CN111444334A (en) * 2019-01-16 2020-07-24 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN110781856A (en) * 2019-11-04 2020-02-11 浙江大华技术股份有限公司 Heterogeneous face recognition model training method, face recognition method and related device
CN111241992A (en) * 2020-01-08 2020-06-05 科大讯飞股份有限公司 Face recognition model construction method, recognition method, device, equipment and storage medium
CN112215296A (en) * 2020-10-21 2021-01-12 红相股份有限公司 Infrared image identification method based on transfer learning and storage medium
CN112329619A (en) * 2020-11-04 2021-02-05 济南博观智能科技有限公司 Face recognition method and device, electronic equipment and readable storage medium
CN112446317A (en) * 2020-11-23 2021-03-05 四川大学 Heterogeneous face recognition method and device based on feature decoupling
CN112488218A (en) * 2020-12-04 2021-03-12 北京金山云网络技术有限公司 Image classification method, and training method and device of image classification model
CN112990432A (en) * 2021-03-04 2021-06-18 北京金山云网络技术有限公司 Target recognition model training method and device and electronic equipment
CN112990312A (en) * 2021-03-15 2021-06-18 平安科技(深圳)有限公司 Model training method, image recognition method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-model classification method in heterogeneous image databases; Rostom Kachouri et al.; Pattern Recognition; Vol. 43, No. 12; pp. 4077-4088 *
Kernel sparse face recognition method based on sample fusion; Shen Xuehua; Zhan Yongzhao; Cheng Xianyi; Ding Weiping; Journal of Nanjing Normal University (Natural Science Edition), No. 04; pp. 37-43 *

Also Published As

Publication number Publication date
CN113486804A (en) 2021-10-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant