CN111507213A

CN111507213A - Image recognition method, image recognition device, storage medium and electronic equipment

Info

Publication number: CN111507213A
Application number: CN202010260389.9A
Authority: CN
Inventors: 赖申其; 柴振华
Original assignee: Beijing Sankuai Online Technology Co Ltd
Current assignee: Beijing Sankuai Online Technology Co Ltd
Priority date: 2020-04-03
Filing date: 2020-04-03
Publication date: 2020-08-07

Abstract

The present disclosure relates to an image recognition method, apparatus, storage medium, and electronic device, the method comprising: performing overlapping segmentation on each image in the acquired image sequence to obtain a plurality of segmented images corresponding to the image; processing the feature information of each segmented image through an attention mechanism, and performing maximum pooling on the processed feature matrix to obtain the feature information of the segmented image; and inputting each piece of feature information into an image recognition model to obtain a recognition result output by the image recognition model, wherein the recognition result is used for representing images with the same object in the image sequence.

Description

Image recognition method, image recognition device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of image recognition technologies, and in particular, to an image recognition method, an image recognition device, a storage medium, and an electronic device.

Background

Pedestrian re-identification (Person re-identification), also known as pedestrian re-identification, is a technique that uses computer vision techniques to determine whether a particular pedestrian is present in an image or video sequence. For example, a surveillance pedestrian image may be given, which compensates for the visual limitation of a fixed camera by retrieving the pedestrian image across the device. Based on the characteristics, the pedestrian re-identification technology is widely applied to the fields of intelligent video monitoring, intelligent security and the like.

In the related art, the image of the pedestrian needs to be subjected to feature extraction during pedestrian identification, but the adopted feature extraction method has the problems of low accuracy and more redundant information, and the accuracy and the efficiency of the pedestrian identification are influenced.

Disclosure of Invention

An object of the present disclosure is to provide an image recognition method, an image recognition apparatus, a storage medium, and an electronic device, so as to solve the above related technical problems.

In order to achieve the above object, in a first aspect of the embodiments of the present disclosure, there is provided an image recognition method, including:

performing overlapping segmentation on each image in the acquired image sequence to obtain a plurality of segmented images corresponding to the image;

processing the feature information of each segmented image through an attention mechanism, and performing maximum pooling on the processed feature matrix to obtain the feature information of the segmented image;

and inputting each piece of feature information into an image recognition model to obtain a recognition result output by the image recognition model, wherein the recognition result is used for representing images with the same object in the image sequence.

Optionally, the processing the feature information of the segmented image through an attention mechanism, and performing maximum pooling on the processed feature matrix to obtain the feature information of the segmented image includes:

remolding the first characteristic matrix of the segmented image to obtain a second characteristic matrix, and transposing and remolding the first characteristic matrix to obtain a third characteristic matrix;

generating a fourth feature matrix according to the similarity matrix between the second feature matrix and the third feature matrix and the second feature matrix;

superposing the fourth feature matrix to the first feature matrix to obtain a fifth feature matrix;

and performing maximum pooling on the fifth feature matrix to obtain feature information of the segmented image, wherein the feature information comprises the fifth feature matrix subjected to maximum pooling.

Optionally, the image recognition model is trained by:

for each batch of image sample sequences, determining a difficulty tuple of the batch of image sample sequences, wherein each difficulty tuple comprises a positive sample and a plurality of negative samples;

updating parameter values of the image recognition model according to the loss function values of each of the difficulty tuples.

Optionally, the difficulty tuples include difficulty triples and difficulty quadruples, and the image sample sequence includes multiple types of samples, where an image including the same object in the image sample sequence is a type of sample, and accordingly, for each batch of image sample sequences, determining the difficulty tuples of the batch of image sample sequences includes:

selecting all sample class combinations from various types of image samples in the image sample sequence, wherein the sample class combinations comprise a first sample class, a second sample class and a third sample class;

for each sample class combination, sequentially taking a first sample class, a second sample class and a third sample class in the sample class combination as a target positive sample class, taking other sample classes, which are not the target positive sample class, in the sample class combination as negative sample classes, and obtaining a loss function value of each sample class combination through a loss function;

the sample class with the largest loss function value is combined as the difficulty tuple of the batch of image sample sequences.

Optionally, the selecting all sample class combinations from the classes of image samples in the image sample sequence includes:

for each batch of image sample sequence, determining a first similarity between each image in the batch of image sample sequence;

selecting a first sample class from various image samples in the image sample sequence, and determining second similarity between the first sample class and other samples in the image sample sequence according to the first similarity of each picture included in the first sample class;

determining the second class of samples and the third class of samples from classes of samples for which the second similarity is below a threshold.

Optionally, the obtaining a loss function value of each sample class combination by a loss function includes:

determining a sample combination corresponding to the target sample image by taking each sample image in the sample class combination as a target sample image, wherein the sample combination comprises the corresponding target sample image, a positive sample image with the largest vector distance to the target sample image, a first negative sample image with the smallest vector distance to the target sample image, and a second negative sample image with the smallest vector distance to any sample image in a sample class to which the first negative sample image belongs, the target sample image and the positive sample image belong to the same sample class, and the first negative sample image, the second negative sample image and the target sample image respectively belong to different sample classes;

determining a loss function value of the sample class combination according to the loss function value of each sample combination;

said updating parameter values of said image recognition model according to the loss function values of each said difficulty tuple, comprising:

and updating the parameter value of the image identification model according to the maximum loss function value in the loss function values of each sample class combination of the batch of image sample sequences.

determining a target sample combination with the largest vector distance between a target sample image and the positive sample image aiming at a plurality of sample combinations of the target sample image belonging to the same sample class;

determining a loss function value of the sample class combination according to the loss function value of each target sample combination;

said updating parameter values of said image recognition model according to the loss function values of each said difficulty tuple, comprising: and updating the parameter value of the image identification model according to the maximum loss function value in the loss function values of each sample class combination of the batch of image sample sequences.

Optionally, the determining a loss function value of the sample class combination according to the loss function value of each target sample combination includes:

the loss function value L of the sample class combination is calculated by the following formula_hard：

Wherein N is the number of the target sample combinations corresponding to the sample type combinations, A is the target sample image, A^′And B is the positive sample image, B is the first negative sample image, C is the second negative sample image, batch is the image sample sequence, and m is a distance threshold.

Optionally, the plurality of image sample sequences of the image recognition model are obtained by:

and for each batch of initial image sample sequence, shielding part or all image samples in the initial image sample sequence to obtain the image sample sequence, wherein the number of the samples of the image sample sequence is more than that of the initial image sample sequence.

Optionally, the blocking part or all of the image samples in the initial image sample sequence includes:

for pictures in various image samples in the initial image sample sequence, shielding the same coordinate area;

and for the pictures in the same type of image sample in the initial image sample sequence, blocking different coordinate areas.

In a second aspect of the embodiments of the present disclosure, there is provided an image recognition apparatus including:

the segmentation module is used for performing overlapping segmentation on each image in the acquired image sequence to obtain a plurality of segmented images corresponding to the image;

the processing module is used for processing the characteristic information of the segmented images through an attention mechanism and performing maximum pooling on the processed characteristic matrix to obtain the characteristic information of the segmented images;

and the input module is used for inputting each piece of feature information into an image recognition model to obtain a recognition result output by the image recognition model, and the recognition result is used for representing the images with the same object in the image sequence.

Optionally, the processing module includes:

the execution submodule is used for remolding the first characteristic matrix of the segmented image to obtain a second characteristic matrix, and transposing and remolding the first characteristic matrix to obtain a third characteristic matrix;

the generating submodule is used for generating a fourth feature matrix according to the similarity matrix between the second feature matrix and the third feature matrix and the second feature matrix;

the superposition submodule is used for superposing the fourth feature matrix to the first feature matrix to obtain a fifth feature matrix;

and the pooling submodule is used for performing maximum pooling on the fifth feature matrix to obtain feature information of the segmented image, wherein the feature information comprises the fifth feature matrix subjected to maximum pooling.

Optionally, the apparatus further comprises:

a training module, configured to train to obtain the image recognition model, where the training module includes:

a determining module, configured to determine, for each batch of image sample sequences, a difficulty tuple of the batch of image sample sequences, where each difficulty tuple includes a positive sample and a plurality of negative samples;

and the updating module is used for updating the parameter value of the image identification model according to the loss function value of each difficulty tuple.

Optionally, the difficulty tuples include difficulty triples and difficulty quadruples, the image sample sequence includes multiple types of samples, where images including the same object in the image sample sequence are one type of samples, and accordingly the determining module includes:

the selecting submodule is used for selecting all sample class combinations from various image samples in the image sample sequence, and the sample class combinations comprise a first sample class, a second sample class and a third sample class;

the first calculation sub-module is used for sequentially taking a first sample class, a second sample class and a third sample class in each sample class combination as a target positive sample class, taking other sample classes which are not the target positive sample class in the sample class combinations as negative sample classes, and solving a loss function value of each sample class combination through a loss function;

a determining submodule for combining the sample classes with the largest loss function values as the difficult tuples of the batch of image sample sequences.

Optionally, the selecting sub-module includes:

the first determining subunit is used for determining a first similarity between each image in each batch of image sample sequence aiming at each batch of image sample sequence;

the first selection subunit is used for selecting a first sample class from various image samples in the image sample sequence;

a second determining subunit, configured to determine, according to the first similarity of each picture included in the first sample class, second similarities of the first sample class and samples of other classes in the batch of image sample sequences;

a third determining subunit, configured to determine the second sample class and the third sample class from the sample classes with the second similarity lower than a threshold.

Optionally, the first computation submodule includes:

a fourth determining subunit, configured to determine, with each sample image in the sample class combination as a target sample image, a sample combination corresponding to the target sample image, where the sample combination includes the corresponding target sample image, a positive sample image having a largest vector distance from the target sample image, a first negative sample image having a smallest vector distance from the target sample image, and a second negative sample image having a smallest vector distance from any sample image in a sample class to which the first negative sample image belongs, where the target sample image and the positive sample image belong to the same sample class, and the first negative sample image, the second negative sample image, and the target sample image belong to different sample classes, respectively;

a fifth determining subunit, configured to determine a loss function value of the sample class combination according to the loss function value of each sample combination;

the updating module is used for updating the parameter value of the image identification model according to the maximum loss function value in the loss function values of each sample class combination of the batch of image sample sequences.

Optionally, the first computation submodule includes:

a sixth determining subunit, configured to determine, with each sample image in the sample class combination as a target sample image, a sample combination corresponding to the target sample image, where the sample combination includes the corresponding target sample image, a positive sample image having a largest vector distance from the target sample image, a first negative sample image having a smallest vector distance from the target sample image, and a second negative sample image having a smallest vector distance from any sample image in a sample class to which the first negative sample image belongs, where the target sample image and the positive sample image belong to the same sample class, and the first negative sample image, the second negative sample image, and the target sample image belong to different sample classes, respectively;

a seventh determining subunit, configured to determine, for a plurality of sample combinations in which a target sample image belongs to the same sample class, a target sample combination in which a vector distance between the target sample image and the positive sample image is the largest;

an eighth determining subunit, configured to determine a loss function value of the sample class combination according to the loss function value of each of the target sample combinations;

Optionally, the eighth determining subunit is configured to:

Wherein N is the number of the target sample combinations corresponding to the sample type combinations, a is the target sample image, a' is the positive sample image, B is the first negative sample image, C is the second negative sample image, batch is the image sample sequence, and m is a distance threshold.

Optionally, the apparatus further includes an execution module for acquiring a plurality of image sample sequences of the image recognition model, and the execution module includes:

and the execution submodule is used for shielding part or all of the image samples in each batch of initial image sample sequence to obtain the image sample sequence, and the number of the samples of the image sample sequence is more than that of the initial image sample sequence.

Optionally, the execution submodule is configured to:

In a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the method of any one of the above first aspects.

In a fourth aspect of the embodiments of the present disclosure, an electronic device is provided, including:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to implement the steps of the method of any of the first aspects above.

According to the technical scheme, each image in the image sequence is subjected to overlapping segmentation, so that the semantic information of the image can be better reserved, and more accurate image characteristic information can be extracted. In addition, the technical scheme also introduces an attention mechanism, so that a feature extraction process based on the attention mechanism and pooling is realized, image redundancy caused by overlapping segmentation can be reduced in such a way, more accurate image feature information can be extracted finally, and the accuracy and efficiency of image identification are improved.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

fig. 1 is a flowchart illustrating an image recognition method according to an exemplary embodiment of the present disclosure.

FIG. 2 is a schematic diagram illustrating an image segmentation, according to an exemplary embodiment of the present disclosure.

Fig. 3 is a schematic diagram illustrating an image feature extraction process according to an exemplary embodiment of the disclosure.

FIG. 4 is a flow chart illustrating training of an image recognition model according to an exemplary embodiment of the present disclosure.

FIG. 5 is a flow chart illustrating the selection of a difficult tuple according to an exemplary embodiment of the present disclosure.

Fig. 6 is a block diagram of an image recognition apparatus according to an exemplary embodiment of the present disclosure.

Fig. 7 is a schematic diagram of an electronic device shown in an exemplary embodiment of the present disclosure.

Detailed Description

The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.

Before introducing the image recognition method, the image recognition device, the storage medium, and the electronic device provided by the present disclosure, an application scenario of each embodiment provided by the present disclosure is first introduced, and each embodiment of the present disclosure may be applied to an image recognition occasion, for example, comparing similarities between images or searching for a specified target image from an image sequence.

Taking pedestrian recognition as an example, in some scenarios, the moving track of the target object needs to be searched according to the known picture information of the target object. It should be understood that a target object may appear in the shooting areas of a plurality of cameras, and therefore, when finding the moving track of the target object, it is necessary to determine whether the target object exists in the image sequence shot by each camera according to the picture information of the target object. For example, the identification and search of the image information of the target object may be achieved by a metric learning derived model. The model learning process can adopt a triple loss function, namely three pictures are selected from a data set, wherein the three pictures comprise two pictures of one person, and the two pictures are the same person, and the other picture is the other person. By adopting the triple loss function, the forced training of the model can be realized. Applicants have discovered that such a training process is inefficient because the selected sample combination may not be the most difficult sample combination in the dataset due to limitations in the construction of the triplet (i.e., the sample combination is not the most favorable sample combination for training the model). In addition, the features of the image of the pedestrian can be extracted during pedestrian identification, but the feature extraction method adopted in the related technology has the problems of low accuracy and more redundant information, and the identification accuracy and efficiency of the model are influenced.

To this end, the present disclosure provides an image recognition method, referring to a flowchart of the image recognition method shown in fig. 1:

in step S11, each of the acquired images in the image sequence is subjected to overlap segmentation, so that a plurality of segmented images corresponding to the image are obtained.

Wherein the image sequence may comprise an image of a target object to be searched and a target image sequence possibly comprising an image of the target object. For example, the target image sequence may be, for example, video data acquired by a camera, and in specific implementation, an image of a target object and an image in the target image sequence that may include the image of the target object may be segmented, and then feature extraction may be performed on the obtained segmented image.

For example, referring to a schematic diagram of image segmentation shown in fig. 2, for a segmentation method of an image P40, in some scenes, the image P40 may be segmented into pictures P411 to P414 by non-overlapping segmentation (i.e., there is no repeated content between segmented images of the same image), and feature vectors of the picture P40 are finally obtained by respectively performing feature extraction on the pictures P411 to P414. It is noted that such a segmentation may cause semantic information of the image to be segmented (for example, the backpack shown in fig. 2 is segmented in different pictures), and thus is not favorable for accurate feature extraction. Therefore, in some scenarios, the image P40 may be overlap-sliced (i.e., there may be duplicates between sliced images of the same image), and as shown in fig. 2, the image P40 may be sliced into images P421 to P424. It should be understood that in the overlapped segmentation image, semantic information of pictures can be better preserved, so that feature information can be respectively extracted from the pictures P421 to P424 on the basis, and finally the feature information of the image P40 is determined according to the extracted feature information.

In step S12, for each of the segmented images, feature information of the segmented image is processed by an attention mechanism, and the processed feature matrix is processed in a maximum pooling manner, so as to obtain feature information of the segmented image.

Taking the image P423 cut out in fig. 2 as an example for illustration, in a specific implementation, a target region in the image P423 may be determined by an attention mechanism, where the target region is a region (for example, a backpack region in the image P423) with a high degree of correlation with an object to be identified in the image P423. In this way, a feature matrix can be extracted based on the target region, and the feature information of the segmented image is finally obtained by performing maximum pooling on the feature matrix. Of course, in some embodiments, the feature matrix may also be extracted based on the target region and a region other than the target region, which is not limited by the present disclosure.

In step S13, each feature information is input into an image recognition model, and a recognition result output by the image recognition model is obtained, where the recognition result is used to characterize images with the same object in the image sequence.

Wherein the image sequence may comprise an image of a target object to be searched and a target image sequence possibly comprising an image of the target object. For example, the target image sequence may be video data acquired through a camera, and in particular, in implementation, the image of the target object and the image in the target image sequence that may include the image of the target object may be segmented and feature extracted, and the image of the target object is compared with each frame image or a designated frame image in the video data through the image recognition model, so as to determine whether the image of the target object is included in the video data.

In a possible implementation, the step S12 includes:

For example, for each of the segmented images, a first feature matrix of the image may be extracted. Referring to fig. 3, which is a schematic diagram illustrating an image feature extraction process, the first feature matrix may include three dimensions (i.e., C × H × W) of a channel, a height, and a width. Further, the first feature matrix may be changed to obtain a second feature matrix and a third feature matrix. It should be understood that based on the operation rule of the matrix, the first feature matrix may be recombined (reshape) to obtain the second feature matrix of C × HW, and similarly, the first feature matrix may be recombined and transposed (reshape & transpose) to obtain the third feature matrix of HW × C. Thus, as shown in fig. 3, a target similarity matrix C × C may be obtained by performing similarity calculation on the second feature matrix and the third feature matrix, where the target similarity matrix is used to characterize similarity relationships between channels of the segmented image. In an embodiment, the target similarity matrix may be further normalized by a softmax function to obtain a similarity matrix, and each part of the similarity matrix may be used to describe a probability of a target object that needs to be focused in the segmented image.

In this way, a fourth feature matrix may be generated from the similarity matrix and the second feature matrix. It should be understood that each portion of the similarity matrix may be used to describe the probability of a target object of interest in the segmented image. Therefore, after the similarity matrix is multiplied by the second feature matrix, the part with lower probability (namely the part which does not need to be concerned) can be screened, and the problem of information redundancy caused by overlap segmentation can be solved.

In addition, the fourth feature matrix may be reshaped, converted into a three-dimensional matrix of C × H × W, and superimposed on the first feature matrix, thereby obtaining a fifth feature matrix. In this way, the maximum pooling (globalmaxpool) of the fifth feature matrix may be performed to finally obtain the target feature matrix C1 x 1 corresponding to the segmented image.

Optionally, feature information of a target image to which the segmented image belongs may be determined according to a target feature matrix of each segmented image, and the feature information is used as an input of the image recognition model, so that a training basis is provided for the image recognition model.

According to the technical scheme, the target image is subjected to overlapping segmentation, so that the semantic information of the image can be better reserved, and more accurate image characteristic information can be extracted. In addition, the technical scheme also introduces an attention mechanism, so that a pooling process based on the attention mechanism is realized, and further, image redundancy caused by overlapping segmentation can be reduced in such a way, and more accurate image feature information can be extracted finally.

Optionally, referring to a training flowchart of an image recognition model shown in fig. 4, the image recognition model may be obtained by training as follows:

in step S41, for each batch of image sample sequences, a difficulty tuple of the batch of image sample sequences is determined.

For example, X image classes may be included in the training set, and each of the classes may refer to a set of pictures including the same target object. In a specific implementation, P categories may be selected from the X categories, and each batch of images may include P × K image samples, taking the example that each category includes K image samples.

The difficulty tuples correspond to the sequence of image samples, which may include a positive sample class and a plurality of negative sample classes. For example, the difficulty tuple may include two negative sample classes, and then, in particular, a sample class combination may be selected for P classes in the image sample sequence. It should be understood that the ease of combination of sample classes can be described in terms of the similarity between sample classes. For example, in some embodiments, the similarity between different images may be determined by calculating a distance value between the different images, thereby determining the similarity between sample classes that include the images. When the distance value is larger, the similarity of the two images is lower, otherwise, the similarity of the two images is higher, and when the distance value between the pictures of different types in the sample type combination is smaller and/or the distance between the pictures in the same type of sample combination is larger (namely, the sample type combination is considered to be more difficult), the sample combination can generate more beneficial effect on the training process of the image recognition model. Thus, the difficulty tuple may be determined by calculating the loss function value for each of the sample class combinations, and then determining the difficulty tuple according to the loss function value for each of the sample class combinations (e.g., the sample class combination with the largest loss function value may be used as the difficulty tuple).

In this way, in step S42, for each batch of the sequence of image samples, the parameter values of the image recognition model may be updated according to the loss function values of the difficulty tuples determined from the sequence of image samples.

According to the technical scheme, when the image recognition model is trained, the difficult tuples corresponding to each batch of image sample sequence can be selected. That is to say, for each input batch of image sample sequences, the method may not be limited to the limitation between tuples, but may find the most difficult sample combination, that is, the sample combination with the best training effect, from the image sample sequence, and update the parameter value of the image recognition model according to the loss function value of the most difficult sample combination, thereby achieving the effects of improving the training efficiency and increasing the model recognition accuracy.

For the training process of the image recognition model, in a possible implementation, the difficulty tuples include difficulty triples and difficulty quadruples, and the image sample sequence includes multiple types of samples, where the images in the image sample sequence that include the same object are one type of samples, and accordingly, with reference to the schematic diagram of the selection process of one difficulty tuple shown in fig. 5, the step S41 includes:

s411, all sample class combinations are selected from various types of image samples in the image sample sequence, wherein the sample class combinations comprise a first sample class, a second sample class and a third sample class. Taking the example that each batch of image sample sequence includes P classes, in specific implementation, three classes can be taken as a group to randomly extract from the P classes, and finally all sample class combinations are obtained.

And S412, aiming at each sample class combination, sequentially taking a first sample class, a second sample class and a third sample class in the sample class combination as target positive sample classes, taking other sample classes which are not the target positive sample classes in the sample class combination as negative sample classes, and obtaining a loss function value of each sample class combination through a loss function.

Illustratively, the sample class combination may include a sample class A, B, C. In calculating the loss function value of the sample class combination, the sample class a may be taken as a target positive sample class, and the sample class B and the sample class C may be taken as negative sample classes. An anchor point picture and a positive sample picture are selected from the target positive sample class a, and a first negative sample picture and a second negative sample picture are respectively selected from the sample class B and the sample class C (regarding the anchor point picture, the positive sample picture and the negative sample picture, corresponding explanations already exist in the related art, and the details of the disclosure are not repeated here). In this way, an objective loss function value may be calculated from the anchor picture, the positive sample picture, the first negative sample picture, and the second negative sample picture. Alternatively, the sample class B and the sample class C may be set as target positive sample classes, corresponding target loss function values may be sequentially calculated, and finally, a loss function value corresponding to the sample class combination may be determined from each of the target loss function values (for example, an average value of each of the target loss function values may be set as a loss function value of the sample class combination).

S413, the sample class with the largest loss function value is combined as the difficulty tuple of the batch of image sample sequences.

That is to say, the above technical solution can determine the sample class combination having the largest loss function value in the batch of image sample sequences by combining the types of images in the image sample sequences and calculating the loss function value of each sample class combination. In this way, the sample class combination with the largest loss function value can be used as the difficulty tuple of the batch of image sample sequences, and the parameter value in the image recognition model is updated according to the loss function value of the difficulty tuple, so that the training effect of the image recognition model can be improved.

In a possible embodiment, the selecting all sample class combinations from the classes of image samples in the image sample sequence includes:

for each sequence of image samples, a first similarity between each image in the sequence of image samples is determined. Taking the example that the image sample sequence includes 16 categories, each category includes 4 pictures, in specific implementation, similarity calculation may be performed between every two of the 64 pictures, so as to obtain a first similarity between each picture.

In addition, a first sample class may be selected from various types of image samples in the image sample sequence, and a second similarity between the first sample class and other types of samples in the image sample sequence may be determined according to the first similarity of each picture included in the first sample class. For example, when determining the second similarity of the sample class a and the sample class B, a sum of the first similarities of each picture included in the sample class a and each picture included in the sample class B may be calculated, and the sum is used as the second similarity. In an embodiment, the sum may be averaged, and the average may be used as the second similarity.

Thus, after obtaining the second similarity between the first sample class and the other samples in the batch of image sample sequences, the second sample class and the third sample class may be determined from the sample class with the second similarity lower than the threshold value. The threshold value can be set according to the requirements of the application scene. In some embodiments, the second similarity may also be sorted, and the sample class at the end of the second similarity sorting is taken as the second sample class and the third sample class.

The above-mentioned technical solution also considers the similarity (i.e. difficulty degree) between the sample classes when selecting the sample class combination, so that it can only generate one of the most difficult sample class combinations for each sample class. Compared with a mode of randomly combining sample classes in the image sample sequence, the technical scheme can reduce the generated sample class combination, thereby realizing the effect of searching and mining difficult samples and reducing the calculation amount.

and taking each sample image in the sample class combination as a target sample image, and determining a sample combination corresponding to the target sample image, wherein the sample combination comprises the corresponding target sample image, a positive sample image with the largest vector distance to the target sample image, a first negative sample image with the smallest vector distance to the target sample image, and a second negative sample image with the smallest vector distance to any sample image in a sample class to which the first negative sample image belongs, the target sample image and the positive sample image belong to the same sample class, and the first negative sample image, the second negative sample image and the target sample image respectively belong to different sample classes.

Taking the sample combination comprising A, B, C sample classes, each sample class comprising 4 pictures as an example, the sample combination may comprise 1-12 pictures, wherein 1-4 belong to sample class a, 5-8 belong to sample class B, and 9-12 belong to sample class C. In specific implementation, the pictures 1 to 12 can be sequentially used as target sample images to select sample combinations. Taking picture 1 as a target sample image as an example, a positive sample image may be selected from other pictures 2-4 of sample class a to which picture 1 belongs, where the selection may be based on the largest vector distance between the picture and the target sample image. Similarly, a first negative sample image can be selected from the images 5-8 included in the sample class B, wherein the first negative sample image can be selected according to the fact that the vector distance between the picture and the target sample image is the minimum. Furthermore, a second negative sample image may be selected from the sample class C, according to which the smallest vector distance to any sample image in the sample class B may be selected.

It is worth mentioning that for each of the sample combinations, the loss value calculation may also be performed for the sample combination. Wherein a triplet loss function value composed of the target sample image, the positive sample image, and the first negative sample image is calculated when a distance of the first negative sample image from the target sample image is less than a minimum of vector distances of images included in sample class B and sample class C. When the distance of the first negative sample image from the target sample image is greater than the minimum value of the vector distances of the images included in the sample class B and sample class C, a quadruple loss function value composed of the target sample image, the positive sample image, the second negative sample image, and a target negative sample image of the sample class B having the smallest vector distance from the second negative sample image is calculated.

Thus, after obtaining the loss function values (i.e., 12 loss function values) for each of the sample combinations, the loss function value for the sample-class combination can be determined according to the loss function value for each of the sample combinations. For example, the loss function values of each of the sample combinations may be averaged, and the average may be used as the loss function value of the sample class combination.

In this case, the updating the parameter values of the image recognition model according to the loss function values of each of the difficulty tuples includes:

In the above scheme, when calculating the loss function value of a sample combination, it is possible to switch between the triplet loss function and the quadruple loss function according to the difficulty of the samples (corresponding to the distances of the samples, the sample combination is considered to be more difficult when the distances between pictures belonging to the same type of samples are larger and/or the distances between pictures belonging to different types of samples are smaller). In other words, the above technical solution can find a more difficult sample class combination from all negative samples without being limited to the tuple. It should be understood that based on the characteristics of the multi-element group loss function, when the sample class combination is more difficult, a better model training effect can be achieved, and therefore the technical scheme can mine the difficult sample class combination, so that the training effect of the model is improved.

It should be noted that, in the above embodiment, when the sample class a is used as the target positive sample class, 4 sample combinations respectively using the pictures 1-4 in the sample class a as the target images (anchor images) can be determined. Applicants found that the 4 combinations of samples could also be screened to determine the more difficult combinations of samples.

Therefore, in a possible implementation, before determining the loss function value of the sample class combination according to the loss function value of each sample combination, a target sample combination with the largest vector distance between the target sample image and the positive sample image may also be determined for a plurality of sample combinations of which the target sample image belongs to the same sample class.

In this way, after obtaining the loss function value of each of the target sample combinations (i.e. 3 loss function values, only one loss function value is generated when each type of sample is used as a target positive sample type), the loss function value of each of the sample type combinations can be determined according to the loss function value of each of the target sample combinations. For example, the loss function values of each of the sample combinations may be averaged, and the average may be used as the loss function value of the sample class combination.

In this case, the updating the parameter values of the image recognition model according to the loss function values of each of the difficulty tuples includes: and updating the parameter value of the image identification model according to the maximum loss function value in the loss function values of each sample class combination of the batch of image sample sequences.

In another possible embodiment, the determining the loss function value of the sample class combination according to the loss function value of each target sample combination includes:

Wherein N is the number of the target sample combinations corresponding to the sample type combinations, A is the target sample image, A^′Is the positive sample image, B is the first negative sample image, C is the second negative sample image, batch is the image sample sequence, m is a distance thresholdFor example, 0.5, 1, etc. may be set.

The technical scheme can be used for searching more difficult sample class combinations from all negative samples without limitation of tuples. It should be understood that based on the characteristics of the multi-element group loss function, when the sample class combination is more difficult, a better model training effect can be achieved, and therefore the technical scheme can mine the difficult sample class combination, so that the training effect of the model is improved.

In addition, the applicant finds that the data set of the image sample sequence has the problems that sample images are difficult to obtain, the number of the sample images is small, and the like during implementation, so that the training effect of the model is influenced, and the identification accuracy of the image identification model is finally influenced.

Therefore, in an embodiment, the plurality of image sample sequences of the image recognition model are obtained by:

For example, all or a portion of the pictures in the data set may be randomly erased. For example, different sizes (e.g., mean distribution 0 to 40%), proportions (e.g., aspect ratio 66.7% to 133.3%), and location areas (e.g., full-map random erase) may be randomly erased for each of the pictures. Therefore, the number of the image sample sequences can be increased by randomly erasing the initial image sample sequence, and the problem of insufficient training sets in the related technology is solved.

In a possible embodiment, the blocking some or all of the image samples in the initial image sample sequence includes:

The shielding manner for each region has been described in the previous embodiment, and is not described herein. Furthermore, it should be understood that when two pictures belonging to the same type of sample are erased in different coordinate areas, more differences can be generated between the pictures in the same type of sample, namely, the similarity is reduced, and the vector distance value is increased. When the same area of two pictures belonging to different sample classes is erased, the erased part can become the similar point of the two pictures, thereby increasing the similarity between the pictures of different sample classes and reducing the vector distance. With reference to the description about the difficulty level of the sample, the shielding manner adopted in this embodiment can reduce the distance between the sample classes and increase the distance between the pictures in the same sample class, so that the overall difficulty of the sample can be increased, which is beneficial to selecting a more difficult sample class combination, and updating the parameter value in the image recognition model according to the loss function value of the sample class combination, thereby finally improving the training effect.

The present disclosure also provides an image recognition apparatus, referring to a schematic diagram of an image recognition apparatus 600 shown in fig. 6, the apparatus 600 comprising:

a segmentation module 601, configured to perform overlapping segmentation on each image in the obtained image sequence to obtain multiple segmented images corresponding to the image;

a processing module 602, configured to process, by using an attention mechanism, feature information of each segmented image, and perform maximum pooling on the processed feature matrix to obtain feature information of the segmented image;

an input module 603, configured to input each piece of feature information into an image recognition model, to obtain a recognition result output by the image recognition model, where the recognition result is used to represent images with the same object in the image sequence.

Optionally, the processing module 602 includes:

Optionally, the apparatus 600 further includes:

According to the technical scheme, when the image recognition model is trained, the difficult tuples corresponding to each batch of image sample sequence can be selected. That is to say, for each input batch of image sample sequences, the most difficult sample combination, that is, the sample combination with the best training effect, can be found from the image sample sequence, and the parameter value of the image recognition model is updated according to the loss function value of the most difficult sample combination, so that the effects of improving the training efficiency and increasing the model recognition accuracy can be achieved.

Optionally, the selecting sub-module includes:

Optionally, the first computation submodule includes:

Optionally, the eighth determining subunit is configured to:

Optionally, the execution submodule is configured to:

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

It should be noted that, for convenience and brevity of description, the embodiments described in the specification all belong to preferred embodiments, and the related parts are not necessarily essential to the present invention, for example, the first determining sub-module and the second determining sub-module may be independent devices or may be the same device when being implemented specifically, and the disclosure is not limited thereto.

The present disclosure also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the image recognition method described in any of the above embodiments.

The present disclosure also provides an electronic device, comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to implement the steps of the image recognition method in any of the above embodiments.

Fig. 7 is a block diagram illustrating an electronic device 700 in accordance with an example embodiment. For example, the electronic device 700 may be provided as a server. Referring to fig. 7, an electronic device 700 includes a processor 722, which may be one or more in number, and a memory 732 for storing computer programs that are executable by the processor 722. The computer programs stored in memory 732 may include one or more modules that each correspond to a set of instructions. Further, the processor 722 may be configured to execute the computer program to perform the image recognition method described above.

Additionally, the electronic device 700 may further include a power component 726 and a communication component 750, the power component 726 may be configured to perform power management of the electronic device 700, and the communication component 750 may be configured to enable communication, e.g., wired or wireless communication, of the electronic device 700. in addition, the electronic device 700 may further include an input/output (I/O) interface 758. the electronic device 700 may be operable based on an operating system stored in the memory 732, e.g., Windows Server, Mac XTOSM, UnixTM, &lTtTtTtranslation = L "& &ggTtL/T &gTtTtTtInuxTM, and so on.

In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the image recognition method described above is also provided. For example, the computer readable storage medium may be the memory 732 described above including program instructions that are executable by the processor 722 of the electronic device 700 to perform the image recognition method described above.

In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the image recognition method described above when executed by the programmable apparatus.

The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.

It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.

In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims

1. An image recognition method, characterized in that the method comprises:

2. The method of claim 1, wherein the processing the feature information of the segmented image through an attention mechanism and performing maximum pooling on the processed feature matrix to obtain the feature information of the segmented image comprises:

3. The method of claim 1, wherein the image recognition model is trained by:

4. The method of claim 3, wherein the difficulty tuples comprise difficulty triples and difficulty quadruples, and wherein the sequence of image samples comprises a plurality of classes of samples, wherein the images in the sequence of image samples comprising the same object are a class of samples, and wherein correspondingly, for each batch of the sequence of image samples, determining the difficulty tuples of the batch of the sequence of image samples comprises:

5. The method of claim 4, wherein the selecting all sample class combinations from the classes of image samples in the sequence of image samples comprises:

6. The method of claim 4, wherein said obtaining a loss function value for each of said sample class combinations by a loss function comprises:

7. The method of claim 4, wherein said obtaining a loss function value for each of said sample class combinations by a loss function comprises:

8. The method of claim 7, wherein determining the loss function value for the sample class combination from the loss function value for each of the target sample combinations comprises:

9. The method according to any one of claims 3 to 8, wherein the plurality of image sample sequences of the image recognition model are obtained by:

10. The method of claim 9, wherein the blocking some or all of the image samples in the initial sequence of image samples comprises:

11. An image recognition apparatus, comprising:

12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.

13. An electronic device, comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 10.