CN112257758A - Fine-grained image recognition method, convolutional neural network and training method thereof - Google Patents
Fine-grained image recognition method, convolutional neural network and training method thereof
- Publication number
- CN112257758A (application number CN202011033047.XA)
- Authority
- CN
- China
- Prior art keywords
- feature
- candidate
- global
- candidate region
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/24 — Pattern recognition; Analysing; Classification techniques
- G06F18/2148 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/084 — Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
Abstract
The application relates to a fine-grained image recognition method, a convolutional neural network and a training method thereof, computer equipment and a computer readable storage medium, wherein an original image is obtained; extracting global features of an original image by adopting a convolution network; determining a plurality of candidate regions in an original image, and determining a characteristic value of each candidate region according to the global characteristics; sequencing the plurality of characteristic values to obtain a first sequencing result, and determining at least one candidate region corresponding to the maximum N characteristic values in the first sequencing result; extracting the characteristics of the at least one candidate region to obtain at least one local region characteristic, wherein each candidate region corresponds to one local region characteristic; cascading the global feature and the at least one local region feature to obtain a cascading feature; the original images are classified according to the cascade characteristics to obtain classification results, the problems of complexity and high labeling cost of a fine-grained image identification method are solved, and the fine-grained image identification method is simplified.
Description
Technical Field
The present application relates to the field of computer vision technology and deep learning technology, and in particular, to a fine-grained image recognition method, a convolutional neural network, a training method for a convolutional neural network, a computer device, and a computer-readable storage medium.
Background
The goal of fine-grained image recognition is to classify object sub-classes at a fine-grained level, and it is very challenging because the differences between sub-classes are extremely subtle. Compared with traditional image classification, fine-grained image recognition presents the following differences and difficulties:
(1) the inter-class differences are very slight, for example between different sub-classes of birds or different sub-classes of cars, and the differences between objects of different sub-classes are mainly reflected in local details; (2) for fine-grained images, the intra-class differences are large, that is, images of the same sub-class vary considerably in form, posture, color, background, and the like. Therefore, how to detect discriminative object parts and how to better extract fine-grained features have become difficult problems to be solved in the current fine-grained recognition field.
In view of the above problems, the related art proposes the following methods:
related art A: a prior patent (publication number CN111144490A) provides a fine-grained recognition method based on a rotation knowledge distillation strategy: a convolutional feature map is obtained through convolutional neural network training; the convolutional feature maps are clustered to obtain channel indication vectors; a channel group module is pre-trained according to the channel indication vectors to generate an attention mask and obtain local images; and finally, the local images and the global image are trained through the rotation knowledge distillation strategy. The disadvantage of this method is that the model is highly complex and, because training is carried out in stages, it is difficult to put into practical application.
Related art B: training a semantic segmentation network (FCN) to extract part information, and after training, cutting a part semantic graph predicted by FCN to obtain image blocks of different parts. After the image blocks at the position level are obtained through the FCN, feature extraction is respectively carried out on the global image and the position image through training a plurality of sub-networks. The disadvantages of this method are: the part information is extracted by training a semantic segmentation network FCN, and training is performed by means of the part marking information, so that high labor marking cost is generated.
At present, no effective solution is provided for the problems of complex fine-grained image identification method and high labeling cost in the related technology.
Disclosure of Invention
The embodiment of the application provides a fine-grained image identification method, a convolutional neural network training method, computer equipment and a computer readable storage medium, and aims to at least solve the problems that the fine-grained image identification method in the related art is complex and the labeling cost is high.
In a first aspect, an embodiment of the present application provides a fine-grained image recognition method, which is applied to fine-grained image recognition, and includes: acquiring an original image; extracting the global features of the original image by adopting a convolution network; determining a plurality of candidate regions in the original image, and determining a feature value of each candidate region according to the global features; sequencing the characteristic values to obtain a first sequencing result, and determining at least one candidate region corresponding to the maximum N characteristic values in the first sequencing result; extracting the feature of the at least one candidate region to obtain at least one local region feature, wherein each candidate region corresponds to one local region feature; cascading the global feature and at least one local region feature to obtain a cascading feature; and classifying the original image according to the cascade characteristic to obtain a classification result.
In some embodiments, determining a plurality of candidate regions in the original image, and determining a feature value of each of the candidate regions according to the global feature comprises: transforming the scale of the global feature to obtain a plurality of feature maps with different scales, wherein each feature map corresponds to one candidate region; and determining a characteristic value corresponding to each candidate region according to the plurality of characteristic graphs.
In some embodiments, transforming the scale of the global feature to obtain a plurality of feature maps of different scales includes: sequentially inputting the global features into a plurality of 3 x 3 convolutional layers connected in series for down-sampling to respectively obtain an output result corresponding to each convolutional layer; and inputting the output result of each convolution layer into the 1 x 1 convolution layer for up-sampling to respectively obtain a plurality of characteristic maps with different scales.
In some embodiments, sorting the plurality of feature values to obtain a first sorting result, and determining at least one candidate region corresponding to the largest N feature values in the first sorting result includes: determining a plurality of candidate regions corresponding to the largest M characteristic values in the first sequencing result; extracting features corresponding to the candidate regions in the original image to obtain a plurality of candidate region features; generating a plurality of confidence levels corresponding to a plurality of the candidate region features; sequencing the confidence degrees to obtain a second sequencing result; judging whether the first sequencing result and the second sequencing result meet a preset rule or not; and under the condition that the first sequencing result and the second sequencing result meet a preset rule, determining at least one candidate region corresponding to the maximum N characteristic values in the first sequencing result.
In some embodiments, before the step of sorting the plurality of feature values to obtain a first sorting result and determining at least one candidate region corresponding to the largest N feature values in the first sorting result, the method further includes: determining at least one candidate region corresponding to the largest L characteristic values in the first sequencing result as a reference region; determining a candidate region which is overlapped with the current reference region in the plurality of candidate regions and a corresponding overlapping rate; and deleting the candidate area corresponding to the candidate area with the overlapping rate larger than a preset threshold value.
In some embodiments, before extracting the feature of the at least one candidate region to obtain at least one local region feature, the method further includes: cutting the at least one candidate area from the original image; and preprocessing the at least one candidate region to obtain a candidate region with a preset size.
In a second aspect, an embodiment of the present application provides a convolutional neural network, including: the device comprises an acquisition module, a global feature extraction module, a candidate region extraction module and a classification module; the input end of the acquisition module is connected with the input end of the global feature extraction module, the candidate region extraction module is connected with the global feature extraction module in a closed loop manner, and the output end of the global feature extraction module is connected with the output end of the classification module; the acquisition module is used for acquiring an original image; the global feature extraction module is used for extracting features of the original image, wherein the features comprise global features and local region features; the candidate region extraction module is configured to determine a plurality of candidate regions in the original image, determine a feature value of each candidate region according to the global feature, rank the plurality of feature values to obtain a first ranking result, and determine at least one candidate region corresponding to the largest N feature values in the first ranking result, where each candidate region corresponds to a local region feature; the classification module is used for cascading the global feature and at least one local region feature to obtain a cascading feature, and classifying the original image according to the cascading feature to obtain a classification result.
In a third aspect, an embodiment of the present application provides a training method for a convolutional neural network, including: acquiring an original image; extracting the global features of the original image by adopting a global feature extraction module; inputting the global features into a first classifier for prediction to obtain global loss; determining a plurality of candidate regions in the original image by adopting a candidate region extraction module, and determining a characteristic value of each candidate region according to the global characteristic; sorting the plurality of characteristic values to obtain a first sorting result and a sorting loss, and determining at least one candidate region corresponding to the maximum N characteristic values in the first sorting result; extracting the characteristics of the at least one candidate region by adopting the global characteristic extraction module to obtain at least one local region characteristic, wherein each candidate region corresponds to one local region characteristic; inputting the at least one local region characteristic into a second classifier for prediction to obtain local region loss; cascading the global feature and at least one local region feature to obtain a cascading feature; inputting the cascade characteristics into a third classifier for prediction to obtain cascade loss; adjusting parameters for training the convolutional neural network according to the global loss, the local region loss, the cascade loss and the sequencing loss respectively.
In a fourth aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and running on the processor, where the processor implements the fine-grained image recognition method according to the first aspect when executing the computer program.
In a fifth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, wherein in some embodiments the program, when executed by a processor, implements the fine-grained image recognition method according to the first aspect.
Compared with the related art, the fine-grained image identification method, the convolutional neural network, the training method of the convolutional neural network, the computer device and the computer readable storage medium provided by the embodiment of the application acquire the original image; extracting global features of an original image by adopting a convolution network; determining a plurality of candidate regions in an original image, and determining a characteristic value of each candidate region according to the global characteristics; sequencing the plurality of characteristic values to obtain a first sequencing result, and determining at least one candidate region corresponding to the maximum N characteristic values in the first sequencing result; extracting the characteristics of the at least one candidate region to obtain at least one local region characteristic, wherein each candidate region corresponds to one local region characteristic; cascading the global feature and the at least one local region feature to obtain a cascading feature; the original images are classified according to the cascade characteristics to obtain classification results, the problems of complexity and high labeling cost of fine-grained image identification methods in the related technology are solved, and the fine-grained image identification method is simplified under the condition that the accuracy of the identification results is not reduced.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a fine-grained image recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a convolutional neural network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a feature extraction network based on a channel attention mechanism according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a candidate region extraction module according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the Navigating-net according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a convolutional neural network in accordance with a preferred embodiment of the present application;
FIG. 7 is a flow diagram of a convolutional neural network recognition image according to an embodiment of the present application;
FIG. 8 is a flow chart of a method of training a convolutional neural network according to an embodiment of the present application;
fig. 9 is a hardware configuration diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any creative effort belong to the protection scope of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The various techniques described in this application may be widely applied to various image recognition scenarios, such as intelligent retail merchandise recognition, species refinement recognition, and vehicle type recognition in intelligent transportation.
This embodiment provides a fine-grained image recognition method, which is applied to fine-grained image recognition. Fig. 1 is a flowchart of a fine-grained image recognition method according to an embodiment of the present application, and as shown in fig. 1, the flowchart includes the following steps:
step S101, an original image is acquired.
And step S102, extracting the global features of the original image by adopting a convolution network. The global feature is obtained by extracting the global information of the original image by the convolution network.
Step S103, a plurality of candidate areas are determined in the original image, and the characteristic value of each candidate area is determined according to the global characteristics. The candidate regions may be randomly selected regions in the original image, each region has a different size and aspect ratio, and the feature of each candidate region may be obtained according to the global feature, that is, the feature value of each candidate region may be determined.
Step S104, sequencing the plurality of characteristic values to obtain a first sequencing result, and determining at least one candidate region corresponding to the maximum N characteristic values in the first sequencing result. For example, the plurality of feature values are sorted in a descending order to obtain a first sorting result, the larger the feature value of the candidate region is, the more distinctive the candidate region is, the larger the contribution degree to the image recognition result is, and by determining at least one candidate region corresponding to the largest N feature values in the first sorting result, the candidate region with higher distinctive property can be determined. One or more candidate regions with higher discriminativity may be determined, and the scoring criterion for the feature value may be adjusted according to the image recognition accuracy.
Step S105, extracting the feature of the at least one candidate region to obtain at least one local region feature, where each candidate region corresponds to one local region feature. For example, feature extraction may be performed on at least one candidate region using the same convolutional network as the global feature extraction, and the at least one candidate region is obtained from the original image.
And S106, cascading the global feature and the at least one local region feature to obtain a cascading feature. The cascade connection is to combine the global characteristic channel and at least one local area characteristic channel, increase the number of channels describing the original image, and keep the information amount of each channel unchanged.
And S107, classifying the original images according to the cascade characteristics to obtain a classification result. The original image is classified by adopting a cascade classifier, and the cascade classifier can be obtained by combining a plurality of weak classifiers.
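As an illustration of steps S101 to S107 above, the following sketch shows how such a pipeline might be wired together in PyTorch; the names backbone, navigator, fc_concat and candidate_boxes are hypothetical placeholders for the feature extraction network, the candidate-region scoring network, the cascade classifier and the pre-laid candidate boxes, and are not taken from the patent.

```python
import torch
import torch.nn.functional as F

def recognize(image, backbone, navigator, fc_concat, candidate_boxes, top_n=4, crop_size=224):
    # S102: global features from the shared convolutional backbone
    global_map = backbone(image.unsqueeze(0))               # (1, C, H, W)
    global_feat = global_map.mean(dim=(2, 3))               # global average pooling -> (1, C)

    # S103/S104: score every candidate region and keep the top N
    scores = navigator(global_map).flatten()                # one score per candidate box
    top_idx = scores.topk(top_n).indices

    # S105: re-extract features from the cropped, resized candidate regions
    local_feats = []
    for i in top_idx.tolist():
        x0, y0, x1, y1 = (int(v) for v in candidate_boxes[i])
        crop = image[:, y0:y1, x0:x1].unsqueeze(0)
        crop = F.interpolate(crop, size=(crop_size, crop_size), mode="bilinear", align_corners=False)
        local_feats.append(backbone(crop).mean(dim=(2, 3)))

    # S106/S107: concatenate global and local features channel-wise and classify
    concat = torch.cat([global_feat] + local_feats, dim=1)
    return fc_concat(concat).argmax(dim=1)
```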
Compared with the related technology A, the embodiment can directly carry out end-to-end training, and the application is more convenient.
Compared with the related art B, the embodiment locates the discriminative area in a weak supervision mode, namely, only class labels are needed in the training process, and the labeling cost is reduced.
Through the steps, the problems of complexity and high labeling cost of the fine-grained image identification method in the related technology are solved, and the fine-grained image identification method is simplified under the condition that the accuracy of the identification result is not reduced.
In addition, according to the embodiment, the original image is identified by combining the global features and the local region features, and the accuracy of the identification result is improved under the condition that the complexity of the identification method of the fine-grained image is not increased.
The following embodiment describes determining a feature value of each candidate region.
In some embodiments, determining a plurality of candidate regions in the original image, the determining a feature value for each candidate region based on the global features comprises: transforming the scale of the global feature to obtain a plurality of feature maps with different scales, wherein each feature map corresponds to a candidate region; from the plurality of feature maps, a feature value corresponding to each candidate region is determined.
In this embodiment, among a plurality of feature maps with different scales, the larger the scale, the larger the receptive field, that is, the larger the area of the original image onto which a pixel of the feature map is mapped. By generating a plurality of feature maps with different scales, each feature map can be mapped one-to-one to a plurality of candidate regions, so that the feature value of each candidate region can be determined. The feature value can be used as the "score" of the candidate region: the larger the feature value of a candidate region, the more discriminative the candidate region is, and at least one candidate region with higher discriminability is determined by this scoring criterion.
In some embodiments, transforming the scale of the global feature to obtain a plurality of feature maps of different scales includes: sequentially inputting the global features into a plurality of 3 x 3 convolutional layers connected in series for down-sampling to respectively obtain an output result corresponding to each convolutional layer; and inputting the output result of each convolution layer into the 1 x 1 convolution layer for up-sampling to respectively obtain a plurality of characteristic graphs with different scales.
The following example describes determining at least one candidate region with higher discrimination.
In some embodiments, the ranking the plurality of feature values to obtain a first ranking result, and the determining at least one candidate region corresponding to the largest N feature values in the first ranking result includes: determining a plurality of candidate regions corresponding to the largest M characteristic values in the first sequencing result; extracting features corresponding to a plurality of candidate regions in an original image to obtain a plurality of candidate region features; generating a plurality of confidence levels corresponding to the plurality of candidate region features; sequencing the confidence degrees to obtain a second sequencing result; judging whether the first sequencing result and the second sequencing result meet a preset rule or not; and under the condition that the first sequencing result and the second sequencing result meet the preset rule, determining at least one candidate region corresponding to the maximum N characteristic values in the first sequencing result.
The confidence coefficient represents the prediction probability of a candidate region on a real category, and the higher the confidence coefficient of a certain candidate region is, the higher the contribution degree of the region to the recognition of the original image is; m and N may be the same or different; the preset rule represents that the order of the feature values in the first ranking result and the confidence degrees in the second ranking result is the same.
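A minimal sketch of this consistency check, assuming the feature values and confidences are plain Python sequences indexed by candidate region (the function and variable names are illustrative):

```python
def rankings_consistent(feature_values, confidences):
    """Return True when the descending order by feature value (first ranking)
    matches the descending order by confidence (second ranking)."""
    by_value = sorted(range(len(feature_values)), key=lambda i: feature_values[i], reverse=True)
    by_confidence = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    return by_value == by_confidence
```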
In the embodiment, through a self-supervision mode, the part with higher identifiability can be well positioned only by using the category label, namely, fine-grained features are better extracted, and better classification is realized.
Since the candidate regions are regions randomly selected in the original image at the initial stage of setting the candidate regions, there may be a case where the candidate regions overlap with each other. In some embodiments, before the step of sorting the plurality of feature values to obtain a first sorting result and determining at least one candidate region corresponding to the largest N feature values in the first sorting result, the method further includes: determining at least one candidate region corresponding to the largest L characteristic values in the first sequencing result as a reference region; determining a candidate region which is overlapped with the current reference region in the plurality of candidate regions and a corresponding overlapping rate; and deleting the candidate areas corresponding to the candidate areas with the overlapping rates larger than the preset threshold.
L and N may be the same or different. In this embodiment, overlapping candidate regions may be deleted by Non-Maximum Suppression (NMS), which includes: selecting, from the candidate regions corresponding to the largest L feature values, the candidate region with the highest score (feature value) as the reference region; traversing the remaining candidate regions, determining the candidate regions that overlap the current reference region and the corresponding overlap rate (Intersection over Union, abbreviated as IoU), and deleting the candidate regions whose overlap rate is larger than a preset threshold; and then selecting the next candidate region from the largest L candidate regions as the reference region and repeating the process. In this way, some candidate regions with a high overlap rate can be deleted, reducing redundant computation. A sketch of this procedure is given below.
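The sketch below illustrates such an NMS pass in plain Python; the box format (x0, y0, x1, y1) and the IoU threshold of 0.5 are assumptions made only for the example.

```python
def iou(a, b):
    # boxes given as (x0, y0, x1, y1)
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        ref = order.pop(0)                 # highest-scoring remaining box becomes the reference
        kept.append(ref)
        # drop every remaining candidate that overlaps the reference too heavily
        order = [i for i in order if iou(boxes[ref], boxes[i]) <= iou_threshold]
    return kept                            # indices of the surviving candidate regions
```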
In some embodiments, before extracting the feature of the at least one candidate region using a convolutional network to obtain at least one local region feature, the method further includes: cutting the at least one candidate area from the original image; and preprocessing the at least one candidate region to obtain a candidate region with a preset size.
The at least one candidate region may be cropped from the original image and up-sampled to a predetermined size. Feature extraction is then performed on the at least one candidate region using the same convolutional network as is used for global feature extraction.
The fine-grained image recognition method provided by the application combines a non-local operation with a multi-region attention mechanism (Non-Local combined with Multi-Region Attention mechanism, abbreviated as NLMA). Table 1 shows the experimental results of comparing the NLMA of the embodiment of the application with various algorithms on public fine-grained image libraries.
TABLE 1 experimental results of comparison of NLMA on public fine-grained image library with multiple algorithms
In Table 1, Method denotes the image recognition method, Base Model denotes the basic model adopted by the method, VGGNet denotes the deep convolutional neural network developed by the Visual Geometry Group of Oxford University together with researchers at Google DeepMind, VGG-19 denotes the deep convolutional neural network from the Oxford Visual Geometry Group, ResNet-50 denotes a residual neural network, Top-1 Accuracy denotes the accuracy with which the top-ranked class matches the ground truth, and CUB-200-2011, Stanford Cars and FGVC-Aircraft denote three public fine-grained image libraries. Among them, the Bilinear-CNN algorithm is described in "Bilinear CNN Models for Fine-grained Visual Recognition" by Lin et al. (ICCV, 2015, pp. 1449-1457); the MA-CNN algorithm is described in "Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition" by Zheng et al. (ICCV, 2017, pp. 5219-5227); the DCL algorithm is described in "Destruction and Construction Learning for Fine-Grained Image Recognition" by Chen et al. (CVPR, 2019, pp. 5157-5166); the Navigator algorithm is described in "Learning to Navigate for Fine-Grained Classification" by Yang et al. (ECCV, 2018, pp. 420-435).
The experimental results show that the NLMA provided by this embodiment achieves a higher recognition rate on the three fine-grained image libraries. Compared with the Navigator algorithm, the NLMA of this embodiment improves on the Navigator recognition rate by 0.6%, 0.4% and 0.6%, respectively; compared with the Navigator + Non-local algorithm, the NLMA of this embodiment improves on its recognition rate by 0.2%, 0.2% and 0.3%, respectively. The experimental results show that the improvement of this embodiment is effective.
With reference to the fine-grained image recognition method of the foregoing embodiment, this embodiment further provides a convolutional neural network, fig. 2 is a schematic structural diagram of the convolutional neural network according to the embodiment of the present application, and as shown in fig. 2, the convolutional neural network includes: the system comprises an acquisition module 201, a global feature extraction module 202, a candidate region extraction module 203 and a classification module 204; the input end of the acquisition module 201 is connected to the input end of the global feature extraction module 202, the candidate region extraction module 203 is connected with the global feature extraction module 202 in a closed loop manner, and the output end of the global feature extraction module 202 is connected to the output end of the classification module 204; the obtaining module 201 is configured to obtain an original image; the global feature extraction module 202 is configured to extract features of an original image, where the features include global features and local area features; the candidate region extraction module 203 is configured to determine a plurality of candidate regions in an original image, determine a feature value of each candidate region according to a global feature, rank the plurality of feature values to obtain a first ranking result, and determine at least one candidate region corresponding to the largest N feature values in the first ranking result, where each candidate region corresponds to a local region feature; the classification module 204 is configured to cascade the global feature and the at least one local region feature to obtain a cascade feature, and classify the original image according to the cascade feature to obtain a classification result.
Referring to fig. 2, in some embodiments, the convolutional neural network further comprises a pooling module 205, a first classification unit 206, and a second classification unit 207; wherein, the input end of the pooling module 205 is connected with the output end of the global feature extraction module 202, and the output end is respectively connected with the input ends of the classification module 204, the first classification unit 206 and the second classification unit 207; the pooling module 205 is configured to compress sizes of the global features and the local region features in a spatial dimension, the first classification unit 206 is configured to predict the global features and generate global loss, the second classification unit 207 is configured to predict the local region features and generate local region loss, and the classification module 204 includes a plurality of sequentially cascaded classifiers configured to predict the cascaded features and generate cascaded loss.
The following embodiments will describe the functions or construction methods of the global feature extraction module and the candidate region extraction module, respectively.
(I) Global feature extraction Module
The global feature extraction module is provided with a non-local unit and a feature extraction network based on a channel attention mechanism.
(I-1) non-local Unit
The non-local unit captures the dependency relationship between different positions by adopting the principle of non-local mean filtering operation, can directly calculate the relation between any two positions through the non-local unit, has larger sensing range and is not influenced by distance; the performance can be effectively improved by using less non-local operations; the non-local operation does not change the size of the input features and can therefore be flexibly embedded into individual convolution modules. Therefore, the global information perception capability of the model can be enhanced by introducing a non-local module into the feature extraction network. One example of a non-local operation definitional formula is as follows:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$
where i is the index of the output position (i.e. the position of the response value to be calculated), j is the index of all possible positions in the feature space, i and j have a range of (0, W × H), W is the width of the input feature map, H is the height of the input feature map, x is the input feature, y is the output feature, x and y are the same size, a binary function f is used to give a dependency of the two positions i and j, the result is a scalar, a univariate function g is used to give a feature representation of the input feature at position j, and C (x) is a normalization parameter.
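For illustration, a non-local block implementing the formula above could be sketched in PyTorch as follows; the embedded-Gaussian choice of f (softmax over dot products) and the channel-reduction factor are assumptions, not details fixed by the text.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, 1)   # embeds x_i
        self.phi = nn.Conv2d(channels, inter, 1)     # embeds x_j
        self.g = nn.Conv2d(channels, inter, 1)       # unary function g
        self.out = nn.Conv2d(inter, channels, 1)     # restores the channel count

    def forward(self, x):
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        phi = self.phi(x).flatten(2)                       # (B, C', HW)
        g = self.g(x).flatten(2).transpose(1, 2)           # (B, HW, C')
        attn = torch.softmax(theta @ phi, dim=-1)          # f(x_i, x_j) normalised by C(x)
        y = (attn @ g).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                             # residual connection keeps the input size
```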
(I-2) feature extraction network based on channel attention mechanism
The present embodiment introduces a channel attention mechanism (from the Convolutional Block Attention Module, CBAM) into the reference feature extraction network to generate a channel attention weight map that enhances attention to important feature channels, so as to improve the feature representation capability of the model.
Fig. 3 is a schematic diagram of a feature extraction network based on a channel attention mechanism according to an embodiment of the present application, and as shown in fig. 3, a feature extraction network based on a residual neural network (ResNet-50) is taken as an example, and CBAM is added to each basic residual module (Res-block) to form CBAM-Res-block, where GMP is global maximum pooling, and GAP is global mean pooling.
The processing of an image by the channel-attention-based feature extraction network includes the following steps. Using GMP and GAP respectively, the input feature map $F \in \mathbb{R}^{H \times W \times C}$ is compressed in the spatial dimension to obtain two different spatial semantic feature maps $F^{c}_{max} \in \mathbb{R}^{1 \times 1 \times C}$ and $F^{c}_{avg} \in \mathbb{R}^{1 \times 1 \times C}$, where W is the width of the input feature map, H is its height, and C is its number of channels. The two feature maps are respectively input into a shared convolution module composed of two fully connected layers (FC1, FC2) and an activation layer (ReLU). The two output feature maps are added element by element, and a channel attention weight map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ is obtained through an activation layer (Sigmoid). Each channel of the Res-block output is then multiplied by $M_c$. The formula involved in the above process is as follows:

$$M_c(F) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big)$$

where $\sigma$ denotes the sigmoid activation function, $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the weights of the shared fully connected layers, r denotes the compression ratio, and the branches $F^{c}_{avg}$ and $F^{c}_{max}$ share the parameters $W_0$ and $W_1$.
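A minimal PyTorch sketch of this channel-attention branch is shown below; the reduction ratio r = 16 and the module name are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.shared_mlp = nn.Sequential(            # W0 then W1, shared by both branches
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        avg = self.shared_mlp(x.mean(dim=(2, 3)))   # GAP branch
        mx = self.shared_mlp(x.amax(dim=(2, 3)))    # GMP branch
        weights = torch.sigmoid(avg + mx)           # channel attention weight map M_c
        return x * weights[:, :, None, None]        # reweight every channel of the input
```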
(II) candidate region extraction Module
Fig. 4 is a schematic diagram of a candidate region extraction module according to an embodiment of the present application. As shown in fig. 4, the candidate region extraction module is provided with a navigating network (Navigating-net) and a teaching network (Teaching-net) for helping the global feature extraction module locate the local regions, and only the category label information of the original image is required in the process. In fig. 4, Input denotes the original image, Feature denotes the extracted feature, Feature Extractor denotes the feature extraction network, Score denotes the candidate-region score, and Confidence denotes the prediction confidence.
(Ⅱ-1)Navigating-net
The network uses the anchor mechanism to lay out, in advance, candidate frames with different sizes and aspect ratios at different positions in the original image. Each candidate frame represents a region in the original image, so a plurality of candidate regions can be generated in the original image: {R1, R2, …, Rq}. By "scoring" each candidate region using the Navigating-net, the top N most discriminative regions can be selected among the candidate boxes. Here, an anchor is the point in the original pixel space onto which the center of the current sliding-window position on the feature map is mapped.
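The following sketch shows one possible way of pre-laying such candidate boxes on a regular grid of anchor points; the image size, grid stride, box sizes and aspect ratios are assumed values used only for illustration.

```python
def generate_candidate_boxes(image_size=448, stride=32, sizes=(96, 192), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for cy in range(stride // 2, image_size, stride):      # anchor points on a regular grid
        for cx in range(stride // 2, image_size, stride):
            for s in sizes:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)  # width/height for aspect ratio r
                    boxes.append((max(0, cx - w / 2), max(0, cy - h / 2),
                                  min(image_size, cx + w / 2), min(image_size, cy + h / 2)))
    return boxes   # the candidate regions {R1, R2, ..., Rq}
```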
Fig. 5 is a schematic diagram of the Navigating-net according to an embodiment of the present application. As shown in fig. 5, the output feature of the last convolutional layer of the feature extraction network is used as the input feature of the Navigating-net; three 3 × 3 convolutional layers (Conv) are used to progressively reduce the size of the feature, and the output of each 3 × 3 convolutional layer is then input to a 1 × 1 convolutional layer to obtain 3 feature maps with different scales. Each feature value on these feature maps corresponds to a candidate frame and represents the Navigating-net's score for that candidate frame, and a preliminary score list is obtained from the scores of the candidate frames: {I'(R1), I'(R2), …, I'(Rq)}. For example, a feature map with a size of 2048 × 14 × 14 is input, and feature maps with sizes of 6 × 14 × 14, 6 × 7 × 7, and 9 × 4 × 4 are output, respectively.
The network structure of the Navigating-net is pyramid-shaped: the deeper the feature, the larger its receptive field. Therefore, by generating a plurality of feature maps of different sizes, each feature value can be associated one-to-one with a previously pre-laid candidate box, and the feature value is taken as the "score" of that candidate box. Because the pre-laid candidate boxes may overlap one another heavily, some candidate frames with a large overlap rate may be deleted by NMS to obtain a final score list {I(R1), I(R2), …, I(Rq)}. The first N candidate regions with the highest scores are then selected from the remaining candidate frames, up-sampled to a specified size, and input into the feature extraction network. A sketch of a possible Navigating-net structure is given below.
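A rough PyTorch sketch of such a scale transform is given below, with strides chosen so that the three output maps reproduce the 14 × 14, 7 × 7 and 4 × 4 sizes mentioned above; the intermediate channel width of 128 and the module name are assumptions.

```python
import torch
import torch.nn as nn

class NavigatorScaleTransform(nn.Module):
    def __init__(self, in_channels=2048, mid_channels=128, anchors_per_scale=(6, 6, 9)):
        super().__init__()
        self.down1 = nn.Conv2d(in_channels, mid_channels, 3, stride=1, padding=1)
        self.down2 = nn.Conv2d(mid_channels, mid_channels, 3, stride=2, padding=1)
        self.down3 = nn.Conv2d(mid_channels, mid_channels, 3, stride=2, padding=1)
        self.score1 = nn.Conv2d(mid_channels, anchors_per_scale[0], 1)   # 1x1 conv -> per-anchor scores
        self.score2 = nn.Conv2d(mid_channels, anchors_per_scale[1], 1)
        self.score3 = nn.Conv2d(mid_channels, anchors_per_scale[2], 1)

    def forward(self, x):                      # x: (B, 2048, 14, 14)
        d1 = torch.relu(self.down1(x))         # (B, 128, 14, 14)
        d2 = torch.relu(self.down2(d1))        # (B, 128, 7, 7)
        d3 = torch.relu(self.down3(d2))        # (B, 128, 4, 4)
        return self.score1(d1), self.score2(d2), self.score3(d3)   # (6,14,14), (6,7,7), (9,4,4)
```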
(Ⅱ-2)Teaching-net
The Teaching-net is provided with a classifier, which can be implemented, for example, as a fully connected layer. The score list {I(R1), I(R2), …, I(Rq)} generated by the Navigating-net is sorted and the first N candidate frames with the largest scores are selected; the candidate regions corresponding to these N candidate frames are cropped from the original image, up-sampled to a specified size, and input to the feature extraction network for feature extraction; the extracted features are then input into the Teaching-net for classification. The Teaching-net generates a confidence for each of the N candidate regions, namely the prediction probability on the real category, to obtain a confidence list {C(R1), C(R2), …, C(Rq)}.
The higher the confidence of a candidate region, the more important the region is for identifying the image, so the Teaching-net uses its output confidence as a guide signal to guide the learning of the Navigating-net. Specifically, the following rank loss is used to encourage the Navigating-net to produce scores whose order agrees with the confidences produced by the Teaching-net, i.e., the higher a region's confidence is ranked, the higher its score should be ranked. An example of a definitional formula for the rank loss is as follows:

$$L_{rank} = \sum_{(i,s):\, C(R_i) < C(R_s)} f\big(I(R_s) - I(R_i)\big)$$

where $f(x) = \max\{1-x, 0\}$; this hinge function is used to drive I and C into the same order. After the network converges, the Navigating-net can recommend the most discriminative regions to help the fine-grained image recognition method of the above embodiment achieve better classification.
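A hedged PyTorch sketch of this ranking loss, summing the hinge penalty over every pair of regions whose confidence order it should enforce (an O(n²) loop is used for clarity; names are illustrative):

```python
import torch

def rank_loss(scores, confidences):
    """scores: Navigating-net scores I(R); confidences: Teaching-net confidences C(R).
    Both are 1-D tensors over the same candidate regions."""
    loss = scores.new_zeros(())
    n = scores.numel()
    for i in range(n):
        for s in range(n):
            if confidences[i] < confidences[s]:
                # encourage I(R_s) to exceed I(R_i) by a margin of 1
                loss = loss + torch.clamp(1.0 - (scores[s] - scores[i]), min=0.0)
    return loss
```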
A preferred embodiment of the convolutional neural network will be described below.
Fig. 6 is a schematic diagram of a convolutional neural network according to a preferred embodiment of the present application, and as shown in fig. 6, the convolutional neural network includes a global feature extraction module, a candidate region extraction module, a GAP layer, a cascade layer, FC1, FC2, and FC3, and FC1, FC2, and FC3 are all classifiers, where FC2 is a cascade classifier and FC3 is a fully connected layer.
The convolutional neural network shown in fig. 6 is applied to the fine-grained image recognition method of the above embodiment and involves the following process: first, the original picture is input into a feature extraction network based on the channel attention mechanism and containing a non-local module to extract global features, and a global loss (raw loss) is calculated through the first classifier (FC1). Secondly, the extracted global features are input into the candidate region extraction module so as to recommend the N most discriminative local regions. Then, each recommended local region is cropped (crop) from the original image, resized to a predetermined size, and input again into the feature extraction network to extract the local region features, and the local region loss (part loss) is calculated through the fully connected layer (FC3). Finally, the global features and the local region features are cascaded as the final feature representation, and the result is input into the cascade classifier (FC2) to obtain the final class prediction result and the cascade loss.
In conjunction with fig. 6, fig. 7 shows a flowchart of identifying an image by a convolutional neural network according to an embodiment of the present application, and as shown in fig. 7, the flowchart includes the following steps:
in step S701, an image is input.
In step S702, feature extraction is performed based on the channel attention mechanism and the non-local operation feature extraction network.
In step S703, the candidate region extraction module performs region recommendation.
Step S704, recommend N regions, and upsample to a specified size.
In step S705, feature extraction is performed again on the recommended region.
Step S706, the original image feature and the area feature are fused.
And step S707, inputting the result to the classifier to obtain a final result.
With reference to the convolutional neural network of the above embodiment, this embodiment further provides a training method of the convolutional neural network, fig. 8 is a flowchart of the training method of the convolutional neural network according to the embodiment of the present application, and as shown in fig. 8, the flowchart includes the following steps:
in step S801, an original image is acquired.
Step S802, a global feature extraction module is adopted to extract global features of the original image.
Step S803, the global features are input to the first classifier for prediction, so as to obtain global loss.
Step S804, a plurality of candidate regions are determined in the original image by adopting a candidate region extraction module, and the characteristic value of each candidate region is determined according to the global characteristic; and sequencing the plurality of characteristic values to obtain a first sequencing result and sequencing loss, and determining at least one candidate region corresponding to the maximum N characteristic values in the first sequencing result.
Step S805, a global feature extraction module is adopted to extract features of the at least one candidate region to obtain at least one local region feature, wherein each candidate region corresponds to one local region feature.
Step S806, inputting the at least one local area feature into the second classifier for prediction, so as to obtain a local area loss.
Step S807, concatenating the global feature and the at least one local region feature to obtain a concatenated feature.
And step S808, inputting the cascade characteristics into a third classifier for prediction to obtain cascade loss.
And step S809, adjusting parameters of the convolutional neural network according to the global loss, the local region loss, the cascade loss and the sequencing loss.
Obtaining the total loss of the convolutional neural network according to the global loss, the local region loss, the cascade loss and the sequencing loss, wherein an example of a definitional expression of the total loss is as follows:
$$L_{total} = \alpha L_{concat} + \beta L_{raw} + \gamma L_{part} + \delta L_{rank}$$

where $L_{total}$ represents the total loss, $L_{concat}$ represents the cascade loss, $L_{raw}$ represents the global loss, $L_{part}$ represents the local region loss, $L_{rank}$ represents the ranking loss, and α, β, γ and δ respectively represent the weight of the corresponding loss. The total loss is back-propagated, the gradients of the parameters participating in the training of the convolutional neural network are calculated, and the parameters are updated using the Stochastic Gradient Descent (SGD) algorithm to train the convolutional neural network. In some embodiments, α, β, γ, and δ may each take the value of 1.
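A minimal sketch of applying this total loss in one SGD update step; the weights of 1 follow the example above, and the function and variable names are placeholders rather than the patent's identifiers.

```python
import torch

def apply_total_loss(optimizer, l_concat, l_raw, l_part, l_rank,
                     alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    # weighted sum of the four losses; weights of 1 follow the example in the text
    total = alpha * l_concat + beta * l_raw + gamma * l_part + delta * l_rank
    optimizer.zero_grad()
    total.backward()        # back-propagate the total loss through the whole network
    optimizer.step()        # SGD parameter update
    return total.item()

# Example optimizer setup (assumed hyper-parameters):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```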
In addition, the fine-grained image recognition method of the embodiment of the present application described in conjunction with fig. 1 may be implemented by a computer device. Fig. 9 is a hardware configuration diagram of a computer device according to an embodiment of the present application.
The computer device may comprise a processor 901 and a memory 902 in which computer program instructions are stored.
Specifically, the processor 901 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 902 may be used to store or cache various data files for processing and/or communication purposes, as well as possibly computer program instructions for execution by the processor 901.
The processor 901 realizes any of the fine-grained image recognition methods in the above embodiments by reading and executing computer program instructions stored in the memory 902.
In some of these embodiments, the computer device may also include a communication interface 903 and bus 900. As shown in fig. 9, the processor 901, the memory 902, and the communication interface 903 are connected via a bus 900 to complete communication therebetween.
The communication interface 903 is used to implement communication among the modules, apparatuses, units and/or devices in the embodiments of the present application. The communication interface 903 may also enable data communication with other components, such as external devices, image/data acquisition devices, a database, external storage, an image/data processing workstation, and the like.
The computer device may execute the fine-grained image recognition method in the embodiment of the present application based on the acquired original image, thereby implementing the fine-grained image recognition method described with reference to fig. 1.
In addition, in combination with the fine-grained image recognition method in the foregoing embodiments, an embodiment of the present application provides a computer-readable storage medium having computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the fine-grained image recognition methods of the above embodiments.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination is described; nevertheless, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and although their description is specific and detailed, it should not be construed as limiting the scope of the application. It should be noted that a person skilled in the art may make several variations and modifications without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A fine-grained image recognition method is characterized by comprising the following steps:
acquiring an original image;
extracting the global features of the original image by adopting a convolution network;
determining a plurality of candidate regions in the original image, and determining a feature value of each candidate region according to the global features;
sequencing the feature values to obtain a first sequencing result, and determining at least one candidate region corresponding to the N largest feature values in the first sequencing result;
extracting the feature of the at least one candidate region to obtain at least one local region feature, wherein each candidate region corresponds to one local region feature;
cascading the global feature and at least one local region feature to obtain a cascading feature;
and classifying the original image according to the cascade characteristic to obtain a classification result.
2. The fine-grained image recognition method according to claim 1, wherein determining a plurality of candidate regions in the original image and determining a feature value of each candidate region according to the global features comprises:
transforming the scale of the global feature to obtain a plurality of feature maps with different scales, wherein each feature map corresponds to one candidate region;
and determining a feature value corresponding to each candidate region according to the plurality of feature maps.
3. The fine-grained image recognition method according to claim 2, wherein transforming the scale of the global feature to obtain a plurality of feature maps of different scales comprises:
sequentially inputting the global features into a plurality of 3 x 3 convolutional layers connected in series for down-sampling to respectively obtain an output result corresponding to each convolutional layer;
and inputting the output result of each convolutional layer into a 1 x 1 convolutional layer for up-sampling to respectively obtain a plurality of feature maps with different scales.
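Purely as a non-limiting illustration of the construction recited in claim 3, a PyTorch sketch follows; the channel widths, the number of scales and the use of stride 2 to realize the down-sampling of the 3 x 3 convolutional layers are assumptions, not details taken from this application.

```python
import torch.nn as nn

class MultiScaleFeatures(nn.Module):
    """Serial 3x3 convolutions produce progressively down-sampled maps; a 1x1
    convolution is applied to each output to yield one feature map per scale."""

    def __init__(self, in_channels=2048, mid_channels=128, out_channels=128, num_scales=3):
        super().__init__()
        self.down_convs = nn.ModuleList()
        channels = in_channels
        for _ in range(num_scales):
            # stride 2 is assumed to realize the down-sampling
            self.down_convs.append(
                nn.Conv2d(channels, mid_channels, kernel_size=3, stride=2, padding=1))
            channels = mid_channels
        self.lateral_convs = nn.ModuleList(
            nn.Conv2d(mid_channels, out_channels, kernel_size=1) for _ in range(num_scales))

    def forward(self, global_feature):
        feature_maps = []
        x = global_feature
        for down, lateral in zip(self.down_convs, self.lateral_convs):
            x = down(x)                      # 3x3 convolution, smaller scale each step
            feature_maps.append(lateral(x))  # 1x1 convolution on each scale's output
        return feature_maps                  # one feature map per candidate-region scale
```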
4. The fine-grained image recognition method according to claim 1, wherein sequencing the feature values to obtain a first sequencing result, and determining at least one candidate region corresponding to the N largest feature values in the first sequencing result comprises:
determining a plurality of candidate regions corresponding to the M largest feature values in the first sequencing result;
extracting features corresponding to the plurality of candidate regions from the original image to obtain a plurality of candidate region features;
generating a plurality of confidences corresponding to the plurality of candidate region features;
sequencing the plurality of confidences to obtain a second sequencing result;
judging whether the first sequencing result and the second sequencing result satisfy a preset rule;
and determining, in a case that the first sequencing result and the second sequencing result satisfy the preset rule, the at least one candidate region corresponding to the N largest feature values in the first sequencing result.
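A non-limiting sketch of the two-stage ranking check recited in claim 4 follows; the concrete preset rule used here (the two rankings must order the top-M regions identically) and the values of M and N are assumptions for illustration only.

```python
def select_top_regions(feature_values, confidences, regions, M=6, N=4):
    # First sequencing result: indices of the M largest feature values.
    first = sorted(range(len(regions)), key=lambda i: feature_values[i], reverse=True)[:M]
    # Second sequencing result: the same M regions re-ranked by confidence.
    second = sorted(first, key=lambda i: confidences[i], reverse=True)
    # Assumed preset rule: the two sequencing results must agree.
    if first == second:
        return [regions[i] for i in first[:N]]
    return None  # during training, a sequencing loss would penalize the mismatch
```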
5. The fine-grained image recognition method according to claim 1, wherein, before determining the at least one candidate region corresponding to the N largest feature values in the first sequencing result, the method further comprises:
determining at least one candidate region corresponding to the L largest feature values in the first sequencing result as a reference region;
determining, among the plurality of candidate regions, the candidate regions overlapping the current reference region and their corresponding overlapping rates;
and deleting each candidate region whose overlapping rate is greater than a preset threshold.
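A non-limiting sketch of the overlap-based deletion recited in claim 5 follows; treating the overlapping rate as the intersection-over-union of bounding boxes, and the values of L and the preset threshold, are assumptions for illustration only.

```python
def iou(a, b):
    """Overlapping rate of two boxes (x1, y1, x2, y2); IoU is assumed here."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def remove_overlapping(regions, feature_values, L=3, threshold=0.5):
    """Keep the top-L regions as references and delete every other candidate
    region whose overlap with a reference exceeds the preset threshold."""
    order = sorted(range(len(regions)), key=lambda i: feature_values[i], reverse=True)
    references = order[:L]
    kept = [i for i in order
            if i in references
            or all(iou(regions[i], regions[r]) <= threshold for r in references)]
    return [regions[i] for i in kept]
```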
6. The fine-grained image recognition method according to claim 1, wherein, before extracting the feature of the at least one candidate region to obtain at least one local region feature, the method further comprises:
cropping the at least one candidate region from the original image;
and preprocessing the at least one candidate region to obtain a candidate region with a preset size.
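A non-limiting sketch of the cropping and preprocessing recited in claim 6 follows; the 224 x 224 preset size and the bilinear resampling are assumptions for illustration only.

```python
from PIL import Image

def crop_and_resize(original_image: Image.Image, regions, size=(224, 224)):
    """Crop each selected candidate region from the original image and
    preprocess it to a preset size."""
    patches = []
    for (x1, y1, x2, y2) in regions:
        patch = original_image.crop((x1, y1, x2, y2))        # cut the candidate region
        patches.append(patch.resize(size, Image.BILINEAR))   # preset-size preprocessing
    return patches
```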
7. A convolutional neural network, comprising: an acquisition module, a global feature extraction module, a candidate region extraction module and a classification module; wherein the output end of the acquisition module is connected with the input end of the global feature extraction module, the candidate region extraction module is connected with the global feature extraction module in a closed loop manner, and the output end of the global feature extraction module is connected with the input end of the classification module; wherein,
the acquisition module is used for acquiring an original image;
the global feature extraction module is used for extracting features of the original image, wherein the features comprise global features and local region features;
the candidate region extraction module is configured to determine a plurality of candidate regions in the original image, determine a feature value of each candidate region according to the global feature, rank the plurality of feature values to obtain a first ranking result, and determine at least one candidate region corresponding to the largest N feature values in the first ranking result, where each candidate region corresponds to a local region feature;
the classification module is used for cascading the global feature and at least one local region feature to obtain a cascading feature, and classifying the original image according to the cascading feature to obtain a classification result.
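A non-limiting sketch of the data flow through the modules recited in claim 7 follows; the backbone, the candidate region extractor and the single linear classifier are placeholders, and only the wiring (global feature, top-N candidate regions, local region features, cascade feature, classification result) follows the claim.

```python
import torch
import torch.nn as nn

class FineGrainedNet(nn.Module):
    """Schematic wiring only: acquisition -> global feature extraction ->
    candidate region extraction -> local feature extraction (closed loop) ->
    concatenation -> classification."""

    def __init__(self, backbone, region_extractor, feat_dim, n_regions, num_classes):
        super().__init__()
        self.backbone = backbone                  # global feature extraction module
        self.region_extractor = region_extractor  # candidate region extraction module
        self.classifier = nn.Linear(feat_dim * (1 + n_regions), num_classes)

    def forward(self, image):
        global_feat = self.backbone(image)                   # global feature, shape [B, feat_dim]
        regions = self.region_extractor(image, global_feat)  # top-N cropped candidate regions
        local_feats = [self.backbone(r) for r in regions]    # closed-loop reuse of the backbone
        cascade = torch.cat([global_feat] + local_feats, dim=1)  # cascade feature
        return self.classifier(cascade)                       # classification result
```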
8. A method of training a convolutional neural network, comprising:
acquiring an original image;
extracting the global features of the original image by adopting a global feature extraction module;
inputting the global features into a first classifier for prediction to obtain global loss;
determining a plurality of candidate regions in the original image by adopting a candidate region extraction module, and determining a feature value of each candidate region according to the global features; sequencing the feature values to obtain a first sequencing result and a sequencing loss, and determining at least one candidate region corresponding to the N largest feature values in the first sequencing result;
extracting the characteristics of the at least one candidate region by adopting the global characteristic extraction module to obtain at least one local region characteristic, wherein each candidate region corresponds to one local region characteristic;
inputting the at least one local region characteristic into a second classifier for prediction to obtain local region loss;
cascading the global feature and at least one local region feature to obtain a cascading feature;
inputting the cascade characteristics into a third classifier for prediction to obtain cascade loss;
adjusting parameters of the convolutional neural network according to the global loss, the local region loss, the cascade loss, and the sequencing loss.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the fine-grained image recognition method of any one of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the fine-grained image recognition method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011033047.XA CN112257758A (en) | 2020-09-27 | 2020-09-27 | Fine-grained image recognition method, convolutional neural network and training method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112257758A (en) | 2021-01-22
Family
ID=74233269
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011033047.XA Pending CN112257758A (en) | 2020-09-27 | 2020-09-27 | Fine-grained image recognition method, convolutional neural network and training method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112257758A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019105106A1 (en) * | 2017-11-30 | 2019-06-06 | 腾讯科技(深圳)有限公司 | Image categorizing method, personalized recommendation method, a computer device, and a storage medium |
CN110619369A (en) * | 2019-09-23 | 2019-12-27 | 常熟理工学院 | Fine-grained image classification method based on feature pyramid and global average pooling |
CN110689091A (en) * | 2019-10-18 | 2020-01-14 | 中国科学技术大学 | Weak supervision fine-grained object classification method |
CN111680698A (en) * | 2020-04-21 | 2020-09-18 | 北京三快在线科技有限公司 | Image recognition method and device and training method and device of image recognition model |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112967730A (en) * | 2021-01-29 | 2021-06-15 | 北京达佳互联信息技术有限公司 | Voice signal processing method and device, electronic equipment and storage medium |
CN112785609B (en) * | 2021-02-07 | 2022-06-03 | 重庆邮电大学 | CBCT tooth segmentation method based on deep learning |
CN112785609A (en) * | 2021-02-07 | 2021-05-11 | 重庆邮电大学 | CBCT tooth segmentation method based on deep learning |
CN113705293A (en) * | 2021-02-26 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Image scene recognition method, device, equipment and readable storage medium |
CN113705293B (en) * | 2021-02-26 | 2024-11-08 | 腾讯科技(深圳)有限公司 | Image scene recognition method, device, equipment and readable storage medium |
CN112905832A (en) * | 2021-05-07 | 2021-06-04 | 广东众聚人工智能科技有限公司 | Complex background fine-grained image retrieval system and method |
CN113723407A (en) * | 2021-11-01 | 2021-11-30 | 深圳思谋信息科技有限公司 | Image classification and identification method and device, computer equipment and storage medium |
CN114004963B (en) * | 2021-12-31 | 2022-03-29 | 深圳比特微电子科技有限公司 | Target class identification method and device and readable storage medium |
CN114004963A (en) * | 2021-12-31 | 2022-02-01 | 深圳比特微电子科技有限公司 | Target class identification method and device and readable storage medium |
CN114639096A (en) * | 2022-04-22 | 2022-06-17 | 深圳市星桐科技有限公司 | Text recognition method and device, electronic equipment and storage medium |
CN116245855A (en) * | 2023-03-15 | 2023-06-09 | 云南大学 | Crop variety identification method, device, equipment and storage medium |
CN116245855B (en) * | 2023-03-15 | 2023-09-01 | 云南大学 | Crop variety identification method, device, equipment and storage medium |
CN118521835A (en) * | 2024-07-22 | 2024-08-20 | 广州亚信技术有限公司 | Fine-grained image classification method and device |
CN118521835B (en) * | 2024-07-22 | 2024-11-08 | 广州亚信技术有限公司 | Fine-grained image classification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112257758A (en) | Fine-grained image recognition method, convolutional neural network and training method thereof | |
CN112232232B (en) | Target detection method | |
CN110837836B (en) | Semi-supervised semantic segmentation method based on maximized confidence | |
US10824916B2 (en) | Weakly supervised learning for classifying images | |
CN109598231B (en) | Video watermark identification method, device, equipment and storage medium | |
US10635949B2 (en) | Latent embeddings for word images and their semantics | |
CN113920370A (en) | Model training method, target detection method, device, equipment and storage medium | |
CN110826609B (en) | Double-current feature fusion image identification method based on reinforcement learning | |
US10733483B2 (en) | Method and system for classification of data | |
Laopracha et al. | A novel feature selection in vehicle detection through the selection of dominant patterns of histograms of oriented gradients (DPHOG) | |
Wu et al. | Vehicle re-identification in still images: Application of semi-supervised learning and re-ranking | |
US11250299B2 (en) | Learning representations of generalized cross-modal entailment tasks | |
CN113255787B (en) | Small sample target detection method and system based on semantic features and metric learning | |
CN111325237A (en) | Image identification method based on attention interaction mechanism | |
CN117011737A (en) | Video classification method and device, electronic equipment and storage medium | |
CN111860823A (en) | Neural network training method, neural network training device, neural network image processing method, neural network image processing device, neural network image processing equipment and storage medium | |
CN113408546B (en) | Single-sample target detection method based on mutual global context attention mechanism | |
CN115063831A (en) | High-performance pedestrian retrieval and re-identification method and device | |
CN114565752A (en) | Image weak supervision target detection method based on class-agnostic foreground mining | |
CN112132150A (en) | Text string identification method and device and electronic equipment | |
CN118447340B (en) | Method and equipment for carrying out spatial modeling on image class relation based on prototype network | |
CN114049634B (en) | Image recognition method and device, computer equipment and storage medium | |
CN116912290B (en) | Memory-enhanced method for detecting small moving targets of difficult and easy videos | |
WO2021218234A1 (en) | Method and device for image search, electronic device, and non-transitory computer-readable storage medium | |
CN114708419B (en) | Zero sample visual positioning method, device and equipment based on multi-mode information interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210122 |