
CN110738247B - Fine-grained image classification method based on selective sparse sampling - Google Patents

Fine-grained image classification method based on selective sparse sampling

Info

Publication number: CN110738247B
Authority: CN (China)
Prior art keywords: class, image, response, classification, peak
Legal status: Active (granted)
Application number: CN201910942790.8A
Other languages: Chinese (zh)
Other versions: CN110738247A
Inventors: Jiao Jianbin (焦建彬), Ding Yao (丁瑶), Ye Qixiang (叶齐祥), Han Zhenjun (韩振军), Wan Fang (万方)
Current assignee: University of Chinese Academy of Sciences
Original assignee: University of Chinese Academy of Sciences
Application filed by University of Chinese Academy of Sciences; priority to CN201910942790.8A.

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a fine-grained image classification method based on selective sparse sampling. The method first locates important parts by extracting class response maps from the image with a classification network, covering the parts that are decisive for classification as comprehensively as possible. The learned groups of key parts are then locally magnified by sparse resampling. Finally, features are extracted from the locally magnified images and combined with the features of the original image, and a classifier determines the image category. By exploiting the fact that class peak responses correspond to visual cues, the method locates key parts quickly, which is faster and more effective than locating parts with detection boxes. Because the key parts are locally magnified by sparse resampling, image details are enhanced while background information is retained, avoiding information loss. The method therefore has good practicality and extensibility and is of significance for fine-grained image classification tasks.

Description

Fine-grained image classification method based on selective sparse sampling
Technical Field
The invention relates to the field of computer vision and image processing, in particular to a fine-grained image classification method based on selective sparse sampling.
Background
Fine-grained image classification is one of the important problems in computer vision, with significant application value in fields such as animal and plant protection and medical image analysis. Conventional fine-grained image classification models often require that the position of each object, and even of each part on the object, be accurately annotated in the image. Although such methods can rely on large amounts of annotation to learn discriminative information, they place very high demands on the collection and preparation of datasets. Accurately annotating every target in an image dataset is time-consuming and labor-intensive, especially as the dataset grows large, which greatly limits the application of such algorithms to large-scale fine-grained image datasets.
To reduce manual annotation and supervision during modeling, fine-grained image classification frameworks based only on image-level class labels have been proposed. Such a framework requires only that the target in the image be labeled with its class; no other annotation, such as bounding boxes, is needed. This labeling scheme greatly reduces the annotation workload and also makes it possible to collect large-scale datasets directly from massive internet image resources. However, because precise part position information is absent, training fine-grained classification algorithms from image labels alone introduces considerable randomness into part localization, which affects the stability and precision of the algorithm and places higher demands on its ability to learn fine features.
Existing fine-grained image classification methods fall into three main categories:
1. Feature-learning methods, typified by bilinear models built on a classification network.
2. Fine feature learning models based on locating discriminative parts. These mostly use weakly supervised object detection to locate the discriminative parts, crop the parts from the original image according to the localization result, extract their features, and complete feature learning in combination with the original-image features.
3. Attention-mechanism methods, which first locate the most discriminative part through iterative learning and then fuse the intermediate outputs of the iteration, i.e., the part features at different scales.
These methods have been refined step by step and achieve state-of-the-art performance.
However, these methods have drawbacks. The first class of methods is general-purpose and is not optimized for the small inter-class differences characteristic of fine-grained classification tasks. In the second class, the part localization process driven by image labels is complex and time-consuming; the number of parts must be specified manually, so the method does not adapt to image content; and because parts are extracted by cropping, a large amount of useful information can be lost when localization is inaccurate. The third class relies on iterative learning, which easily accumulates errors. These deficiencies limit the robustness and generalization of the learned models.
Disclosure of Invention
To overcome these problems, the inventors conducted intensive research and devised a fine-grained image classification method based on selective sparse sampling. The rich semantic information of the classification network's response maps (class activation maps) is used to locate discriminative parts, improving the efficiency and flexibility of the model; the discriminative parts are then learned at a larger scale through local magnification, avoiding information loss. Experiments show that the method improves both the speed and the precision of fine part localization and outperforms the best existing methods (such as NTS-CNN), whereupon the invention was completed.
The invention aims to provide the following technical solution:
The invention provides a fine-grained image classification method based on selective sparse sampling, which comprises a process of training a classification model for target classification, wherein the training process of the classification model comprises the following steps:
step 1: key part localization: inputting the image into a classification network, outputting the class response map corresponding to the image, and extracting the class peak responses on the class response map;
step 2: class peak response grouping: grouping the class peak responses obtained in step 1 by response strength into a discriminative attention group and a complementary attention group, wherein each class peak response generates an attention map and the two groups of class peak responses generate two groups of attention maps;
step 3: resampling: aggregating the attention maps within each group to generate two saliency maps, resampling the image under the guidance of the saliency maps to locally magnify the corresponding key parts, and obtaining two resampled images;
step 4: completing feature fusion and classification model construction: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
The fine-grained image classification method based on selective sparse sampling provided by the invention has the following beneficial effects:
(1) The method learns from image-level class labels, without strongly annotated data (target bounding boxes or part annotations) for the scene concerned; the rich semantics of class peak responses on the class response map are used to locate key parts quickly, which markedly improves feasibility and practicality;
(2) The method groups class peak responses by response value, which prevents strong responses from dominating the learning process, allows the parts corresponding to weak responses to be learned, and improves the robustness of the features;
(3) The method locally magnifies the key parts by image resampling, so that important details in the image are enhanced while background information is retained, avoiding information loss;
(4) The method combines the features of the resampled images with the features of the original image, fusing local and global image features. In addition, the resampled images and the original image share one feature extraction network, which realizes a special form of data augmentation and helps improve the generalization of the model;
(5) In the method, the class response map is updated as the network trains and image features are learned, and the updated class response map in turn guides the generation of new resampled images. Discriminative part localization and feature learning therefore reinforce each other, in a special closed-loop iterative learning mode.
Drawings
FIG. 1 is a schematic diagram of the model structure of the selective-sparse-sampling-based fine-grained image classification method;
FIG. 2 shows examples and the distribution of the number of class peak responses located by the model;
FIG. 3 shows a schematic of selective sparse sampling during model training;
FIG. 4 shows the target classification results of the proposed method;
FIG. 5 shows the target localization results of the proposed method.
Detailed Description
The invention is explained in further detail below with reference to the drawings. The features and advantages of the present invention will become more apparent from the description.
As shown in FIG. 1, the present invention provides a fine-grained image classification method based on selective sparse sampling, which comprises a process of training a classification model for target classification, wherein the training process of the classification model comprises the following steps:
step 1: key part localization: inputting the image into a classification network, outputting the class response map corresponding to the image, and extracting the class peak responses on the class response map;
wherein a class peak response corresponds to a key part in the image, a key part being a region that is discriminative for classification;
the class peak response is preferably a local maximum on the class response map;
step 2: class peak response grouping: grouping the class peak responses obtained in step 1 by response strength into a discriminative attention group and a complementary attention group, wherein each class peak response generates an attention map and the two groups of class peak responses generate two groups of attention maps;
step 3: resampling: aggregating the attention maps within each group to generate two saliency maps, resampling the image under the guidance of the saliency maps to locally magnify the corresponding key parts, and obtaining two resampled images;
step 4: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
In step 1 of the invention, the key parts are localized. The key part localization algorithm builds on the rich semantic information of the class response map and aggregates the features of the parts corresponding to the class peak response points on it, aiming to find the parts, and their positions, that are important for the classification decision. Compared with part localization in weakly supervised object detection frameworks, the method omits the search and screening of important parts, so classification is more efficient.
In a preferred embodiment of the invention, in step 1, the image carries only an image-level label; neither the whole target nor the positions of parts need to be annotated, and the rich semantics of class peak responses on the class response map are used to locate the key parts quickly.
In a preferred embodiment of the invention, step 1 comprises the following sub-steps:
step 1.1: inputting the image into a classification network, and calculating a class response graph;
the classification network is preferably a convolutional neural network, and may be selected from any one of AlexNet, ResNet, VGGNet, google lenet, and the like.
Define a fine-grained image classification dataset: C denotes the number of classes and N the number of samples, with the training set containing N_train samples and the test set containing N_test samples. Given an image I in the training set, input it into the classification network and extract the feature maps S ∈ R^{D×H×W} output by the deepest convolutional layer. After a global average pooling layer, S is fed into a fully connected layer FC to obtain the network's prediction score for each class, s ∈ R^C. Denote the weights of the fully connected layer FC by W_fc ∈ R^{D×C}; each class c and each feature map S_d correspond to the entry w_d^c of W_fc. The class response map of class c in the class response map set M is then computed as in equation (1):

M_c = Σ_{d=1}^{D} w_d^c · S_d    (1)

This formula relates the features learned by the network to the image classes and helps to understand intuitively which regions support the class decision.
Step 1.2: calculating a prediction probability set P of each category of the images by the network by using the classification result s obtained in the step 1.1, sequencing the prediction probabilities in a descending mode, and selecting the prediction probabilities of the top five
Figure BDA0002223379210000063
Calculating entropy as shown in formula (2):
Figure BDA0002223379210000064
this entropy measures the confidence of the current classification network prediction. As can be seen from the classification results of a plurality of classification networks on a plurality of data sets such as CUB _200_2011, Stanford Cars, FGVC-Aircraft and the like, the classification performance of the first five classification networks outputting prediction probabilities can reach 99.9%, namely, one of the first five classifications of network prediction is correct, so that the prediction probabilities of the first five classification networks are necessary and sufficient to select.
Step 1.3: calculating a class response map for extracting the class peak response according to equation (3) in consideration of the accuracy and recall of the class peak response locating unit:
Figure BDA0002223379210000071
wherein,
Figure BDA0002223379210000072
is composed of
Figure BDA0002223379210000073
The corresponding class response plot, which is the threshold, was selected to be 0.2 based on the control experiment.
The rules defined in equations (2) and (3) are: when the top-1 probability value predicted by the classification network is high, namely the prediction is more credible, only selecting a class response graph corresponding to the top-1 class, so that the extracted class peak value response does not contain noise; and when the probability value of top-1 predicted by the classification network is low, namely the sample is difficult to predict and cannot be credible, selecting a class response graph corresponding to the top-5 class, and ensuring the recall of class peak value response. In order to ensure the consistency of the training and testing process, a mode of predicting scores according to classification is selected instead of class labels.
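A sketch of this selection rule follows; the probability-weighted top-5 aggregation is written as described above, while the helper name and the softmax over raw scores are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_response_map(cams, logits, delta=0.2):
    """Sketch of equations (2)-(3): choose the map used for peak extraction."""
    probs = F.softmax(logits, dim=1)                  # prediction probabilities P
    top_p, top_c = probs.topk(5, dim=1)               # top-5 probabilities and classes
    # entropy of the top-5 probabilities, equation (2)
    entropy = -(top_p * top_p.clamp_min(1e-12).log()).sum(dim=1)
    out = []
    for b in range(cams.size(0)):
        if entropy[b] <= delta:                       # confident: top-1 map only
            out.append(cams[b, top_c[b, 0]])
        else:                                         # uncertain: weighted top-5 maps
            out.append((top_p[b, :, None, None] * cams[b, top_c[b]]).sum(dim=0))
    return torch.stack(out)                           # (B, H, W)
```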
To avoid numerical difficulties caused by the magnitude of the variables, the class response map R_o is normalized in max-min fashion, as in equation (4):

R = (R_o - min(R_o)) / (max(R_o) - min(R_o))    (4)

where R is the normalized class response map, R_o is the class response map obtained from equation (3), min(R_o) is the minimum of the class response map, and max(R_o) is its maximum.
Step 1.4: extracting local maximum values from the class response graph R in a window with a set size to obtain a class peak response position set T { (x)1,y1),(x2,y2),…,(xn,yn) And n is the number of peak-like responses.
In a preferred embodiment, the pane size is 3 x 3, 5 x 5, or 7 x 7, preferably the pane size is 3 x 3.
The number and position of the peak-like responses extracted in the process are self-adaptive to the image content and are not fixed, and the peak-like responses are distributed on a plurality of fine-grained image classification data sets as shown in FIG. 2. Thus, the proposed framework is more flexible and can be applied to different domains, such as birds, airplanes and cars, without the need to adjust the hyper-parameters for each specific task.
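Steps 1.3-1.4 can be sketched as below: min-max normalization per equation (4), then local maxima found by comparing each position against a max-pooled copy; the 3 × 3 window follows the preferred embodiment, and the returned triple format is an assumption.

```python
import torch
import torch.nn.functional as F

def extract_peaks(resp, window=3):
    """Sketch of equation (4) plus step 1.4: normalized local maxima."""
    r = (resp - resp.min()) / (resp.max() - resp.min() + 1e-12)  # equation (4)
    pooled = F.max_pool2d(r[None, None], window, stride=1, padding=window // 2)
    mask = r == pooled[0, 0]          # true where a point equals its window maximum
    ys, xs = torch.nonzero(mask, as_tuple=True)
    # T = {(x_1, y_1), ..., (x_n, y_n)}, with the response value attached
    return [(int(x), int(y), float(r[y, x])) for x, y in zip(xs, ys)]
```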
Step 2 of the invention: class peak response grouping. The class peak responses are grouped by response value, which prevents strong responses from dominating the learning process, allows the parts corresponding to weak responses to be learned, and improves the robustness of the features. In an embodiment, step 2 comprises the following sub-steps:
Step 2.1: Divide the class peak responses from step 1 into two sets T_d and T_c according to equations (5) and (6):

T_d = {(x, y) | (x, y) ∈ T, R_{x,y} ≥ ζ}    (5)
T_c = {(x, y) | (x, y) ∈ T, R_{x,y} < ζ}    (6)

where R_{x,y} is the response value of the class peak response at (x, y) and ζ is the division threshold; ζ may be chosen as a random number uniformly distributed in (0, 1), as the median of all class peak response values, etc. T_d is the discriminative class peak response set, i.e., the parts that are decisive for the class decision; T_c is the complementary class peak response set, i.e., the parts that play a complementary role in the class decision.
Step 2.2: Compute a corresponding attention map for each class peak response with a Gaussian kernel function as in equation (7); the two groups of class peak responses thus generate two groups of attention maps. The attention map A_i of equation (7) is a Gaussian kernel centered at the class peak response (x_i, y_i) whose amplitude is determined by the response value, with β, β_1, and β_2 as learnable parameters. The meaning of the formula is: the stronger the response value, the more the corresponding region is magnified.
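The grouping of equations (5)-(6) and the attention maps of equation (7) can be sketched as follows. Note that the text above specifies equation (7) only as a response-scaled Gaussian kernel with learnable β, β_1, β_2, so the exact parameterization below is an assumption for illustration, not the patented form.

```python
import torch

def group_peaks(peaks, zeta):
    """Sketch of equations (5)-(6): split peaks by response value against zeta."""
    t_d = [p for p in peaks if p[2] >= zeta]   # discriminative set T_d
    t_c = [p for p in peaks if p[2] < zeta]    # complementary set T_c
    return t_d, t_c

def peak_attention_map(peak, shape, beta, beta1, beta2):
    """Assumed form of equation (7): a Gaussian centered at the peak whose
    amplitude grows with the response value, so stronger responses are
    magnified more; beta, beta1, beta2 are learnable in the real model."""
    x_i, y_i, r_i = peak
    h, w = shape
    ys = torch.arange(h, dtype=torch.float32)[:, None]
    xs = torch.arange(w, dtype=torch.float32)[None, :]
    d2 = (xs - x_i) ** 2 + (ys - y_i) ** 2
    return r_i * beta1 * torch.exp(-d2 / (2.0 * beta ** 2)) + beta2
```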
Step 3 of the invention: resampling. Local magnification of the key parts is achieved by image resampling, so that the important details in the image are enhanced while the background information is retained, avoiding information loss.
In a preferred embodiment of the present invention, step 3 comprises the following sub-steps:
Step 3.1: Using the attention maps obtained in step 2, sum the attention maps within each group to obtain the saliency maps Q_d and Q_c that guide resampling, computed as in equations (8) and (9):

Q_d = Σ_i A_i, if (x_i, y_i) ∈ T_d    (8)
Q_c = Σ_i A_i, if (x_i, y_i) ∈ T_c    (9)

where Q_d denotes the discriminative-branch saliency map and Q_c denotes the complementary-branch saliency map.
Step 3.2: The two saliency maps computed in step 3.1 then guide the resampling of the original image.
The image I is regarded as a grid consisting of a point set V and an edge set E (the set of line segments between adjacent points in V), where V = [(x_0, y_0), (x_1, y_1), …, (x_end, y_end)] and (x_i, y_i) are the coordinates of image pixel i. The points and edges form criss-crossing grid lines. The goal of image resampling is to find a new set of coordinate points V′ = [(x′_0, y′_0), (x′_1, y′_1), …, (x′_end, y′_end)] such that, in the new coordinate system, the important regions of the original image are sampled uniformly and densely while the unimportant regions are allowed to be compressed to a certain degree. The problem can be translated into finding a mapping between the original image and the resampled image, consisting of two mapping functions f(x, y) and g(x, y); the resampled image is then I_new(x, y) = I(f(x, y), g(x, y)).
f(x, y) and g(x, y) distribute the saliency computed from the original image uniformly over the resampled image, i.e., the sampling density at every output location is proportional to the saliency of the region it is drawn from. Approximate solutions are given by equations (10) and (11):

f(x, y) = Σ_{x′,y′} k((x′, y′), (x, y)) Q(x′, y′) x′ / Σ_{x′,y′} k((x′, y′), (x, y)) Q(x′, y′)    (10)
g(x, y) = Σ_{x′,y′} k((x′, y′), (x, y)) Q(x′, y′) y′ / Σ_{x′,y′} k((x′, y′), (x, y)) Q(x′, y′)    (11)

where k((x′, y′), (x, y)) is a Gaussian kernel function acting as a regularization term to avoid extreme cases such as all pixel points converging to the same value. Substituting the saliency maps Q_d and Q_c computed by equations (8) and (9) into equations (10) and (11) yields the two resampled images. The image corresponding to Q_d is named the discriminative-branch resampled image; it highlights the regions that are decisive for classification. The image corresponding to Q_c is named the complementary-branch resampled image; it magnifies the regions that provide supplementary evidence for classification and can drive the model to learn more supporting evidence.
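A sketch of the resampling of equations (10)-(11) follows, implemented with grid_sample on normalized coordinates; the kernel width sigma and the per-image loop are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def saliency_resample(image, sal, sigma=0.3):
    """Sketch of equations (10)-(11): each output pixel's source coordinate is a
    kernel- and saliency-weighted average of grid coordinates, so salient
    regions are sampled densely while the background is gently compressed."""
    B, _, H, W = image.shape
    h, w = sal.shape[-2:]
    gy, gx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing='ij')
    pos = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1)   # (h*w, 2)
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)      # pairwise distances
    k = torch.exp(-d2 / (2 * sigma ** 2))                        # Gaussian kernel k
    out = []
    for b in range(B):
        q = sal[b].reshape(-1)
        wgt = k * q[None, :]                                     # k(...) * Q(x', y')
        f = (wgt @ pos[:, 0]) / (wgt.sum(dim=1) + 1e-12)         # equation (10)
        g = (wgt @ pos[:, 1]) / (wgt.sum(dim=1) + 1e-12)         # equation (11)
        grid = torch.stack([f, g], dim=1).reshape(1, h, w, 2)
        out.append(F.grid_sample(image[b:b + 1], grid, align_corners=True))
    return torch.cat(out)
```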
As shown in FIG. 3, the selective sparse sampling provided in the method prevents strong features from dominating the gradient-learning process, and thus promotes the network learning a more comprehensive feature representation.
The whole resampling process is realized by convolution operations, can be embedded into any neural network, and supports end-to-end learning and training, so that the classification loss computed from the resampled images can optimize the parameters β_1 and β_2.
Step 4 of the invention: completing feature fusion and classification model construction. The features of the resampled images are combined with the features of the original image (the resampled images are input into the step 1 classification network to extract features, which are concatenated with the original-image features extracted in step 1 to produce a new feature description of the image), realizing the fusion of the global and local features of the image.
Through steps 1, 2, and 3, two resampled images are derived from one input image. The two resampled images are input into the classification network used in step 1 to extract features. To aggregate global and local features, define the image feature as F_J = {F_O, F_D, F_C}, where F_O, F_D, and F_C are the original-image features, the discriminative-branch image features, and the complementary-branch image features, respectively. The features are concatenated and fed into a fully connected layer with softmax to obtain the image classification result. The resampled images and the original image share one feature extraction network, which realizes a special form of data augmentation and helps improve the generalization of the model.
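A sketch of this fusion step follows; the shared backbone and the joint fully connected classifier are as described above, while the function names are illustrative assumptions.

```python
import torch

def fuse_and_classify(backbone, joint_fc, img_o, img_d, img_c):
    """Sketch of step 4: one shared feature extractor, concatenated features."""
    f_o = backbone(img_o).mean(dim=(2, 3))   # F_O: original-image features (GAP)
    f_d = backbone(img_d).mean(dim=(2, 3))   # F_D: discriminative-branch features
    f_c = backbone(img_c).mean(dim=(2, 3))   # F_C: complementary-branch features
    f_j = torch.cat([f_o, f_d, f_c], dim=1)  # joint feature F_J
    return joint_fc(f_j)                     # softmax is applied inside the loss
```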
In a preferred embodiment, the fine-grained image classification method based on selective sparse sampling further includes a model optimization process, comprising the following steps:
Step 4.1: Design a cross-entropy loss function, compute the gradients of the classification network from the loss function, back-propagate the gradients through the whole classification network, and update the network parameters.
The model's classification cross-entropy loss function is defined as in equation (12):

L_cls = Σ_{i ∈ {O, D, C}} L(Y_i, Y*) + L(Y_j, Y*)    (12)

where L_cls denotes the cross-entropy loss, L(·, ·) the cross-entropy between a prediction vector and the label, Y_i the prediction vectors corresponding to the original image (O) and the resampled images (D, C), Y_j the prediction vector corresponding to the joint features, and Y* the image label.
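Equation (12) can be sketched as below, summing the cross-entropy of the three branch predictions and of the joint prediction; F.cross_entropy applies softmax internally, and the function name is an assumption.

```python
import torch.nn.functional as F

def s3n_loss(logits_o, logits_d, logits_c, logits_joint, target):
    """Sketch of equation (12): per-branch losses plus the joint-feature loss."""
    return (F.cross_entropy(logits_o, target)         # original image, O
            + F.cross_entropy(logits_d, target)       # discriminative branch, D
            + F.cross_entropy(logits_c, target)       # complementary branch, C
            + F.cross_entropy(logits_joint, target))  # joint features
```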
Step 4.2: and (3) judging whether the network is converged (namely the error is not reduced) or not according to the classification error obtained by calculating the cross entropy loss function, or judging whether the maximum iteration number is reached, stopping network training if the network is converged or the maximum iteration number is reached, and otherwise, skipping to the step 1.
Unknown images from the test set are input into the trained model to obtain the target classification results, as shown in FIG. 4. It can be seen that, compared with a generic classification network, the method improves classification performance by activating more regions.
The method improves both the accuracy of fine-grained image classification and the target localization capability. For target localization, the invention comprises the following steps:
step 1: computing the top-1 class response maps M_O, M_D, and M_C corresponding to the original image, the discriminative branch, and the complementary branch;
step 2: mapping the class response map M_D corresponding to the discriminative branch and the class response map M_C corresponding to the complementary branch into the space of the original class response map M_O by the corresponding inverse transformations, then adding M_O, M_D, and M_C to generate the final class response map M_final. The inverse transformation is the transformation that restores the locally magnified image to the original image;
step 3: upsampling the final class response map M_final to the size of the original image, thresholding the upsampled map at its mean value, and selecting the minimum bounding box of the largest connected component as the localization result of the target.
The localization results of the method are shown in FIG. 5. It can be seen that, compared with the baseline method, the localization is more accurate and more complete, clearly alleviating the information loss caused by overfitting in the baseline model.
Examples
Example 1
1. Database and sample classification
The method is used to classify fine-grained images. For accuracy and comparability of the experiments, the publicly available CUB-200-2011, Stanford Cars, and FGVC-Aircraft datasets, which are widely used in fine-grained image classification, are adopted. CUB-200-2011 is a bird dataset with 11,788 images of 200 species, split into a training part of 5,994 images and a test part of 5,794 images. Stanford Cars is a car dataset with 16,185 images of 196 car models, with 8,144 images for training and 8,044 for testing. FGVC-Aircraft is an airplane dataset with 10,000 images, of which 6,667 are for training and validation and 3,333 for testing. The method uses only the image class labels, without any additional strong annotation such as target bounding boxes, part keypoint annotations, or hierarchical semantic relations between class labels.
Model construction: with the method of the invention, a 50-layer residual convolutional neural network (ResNet-50) is used as the feature extractor. Training runs for 60 epochs using stochastic gradient descent with momentum, with a batch size of 16. The weight decay is set to 1e-4 and the momentum to 0.9. Parameters initialized from the model pre-trained on ImageNet use an initial learning rate of 0.001; the other parameters use an initial learning rate of 0.01. Input images are resized to 448 × 448 pixels.
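The stated training configuration corresponds to an optimizer setup like the following sketch; `model.backbone` and `model.head` are assumed names for the pretrained and newly added parameter groups.

```python
import torch

# Sketch of the stated schedule: SGD with momentum 0.9 and weight decay 1e-4,
# lr 0.001 for ImageNet-pretrained parameters and lr 0.01 for new parameters.
# `model` is assumed to be defined with .backbone / .head submodules.
optimizer = torch.optim.SGD(
    [{'params': model.backbone.parameters(), 'lr': 0.001},
     {'params': model.head.parameters(), 'lr': 0.01}],
    momentum=0.9, weight_decay=1e-4)
```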
2. Performance evaluation criteria
2.1 Classification Performance evaluation criterion
To evaluate algorithm performance and compare with other methods, we choose the evaluation measure widely used in image classification: top-1 classification accuracy. For a single image, the class corresponding to the maximum value of the predicted probability vector is taken as the prediction; the prediction is correct if it matches the class annotated for the image. Over a whole dataset, the proportion of correctly predicted images is the dataset's top-1 classification accuracy.
In addition, to evaluate the localization performance of the algorithm framework, the ground-truth boxes of the dataset are used during evaluation.
Evaluation of box localization performance: the method produces a predicted box for the target; if the IoU between this box and the annotated box of the target in the original image exceeds 0.5, the localization is considered correct, otherwise incorrect. For each category, the percentage of correctly localized images among all images is computed as the box localization performance.
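The IoU criterion used above can be sketched as below; the (x1, y1, x2, y2) box format is an assumption.

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes; localization counts as correct
    when the IoU with the ground-truth box exceeds 0.5."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)
```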
3. Results and analysis of the experiments
The quality of the class peak responses located by the model is evaluated, and the effectiveness of the discriminative branch, the complementary branch, and the sparse attention module is verified in turn.
3.1) Quality of class peak responses
A class peak response point is considered accurately located when it falls inside the annotated box of the target. The localization accuracy for a single image is the proportion of class peak response points falling inside the annotated box among all detected class peak response points; the accuracy over a dataset is the average accuracy over all its images.
Table 1. Accuracy (%) of class peak responses in locating parts on the datasets

Dataset                  CUB      Aircraft   Cars
Localization accuracy    94.63    97.22      98.76

As can be seen from Table 1, class peak responses locate the parts on the target very well.
3.2) Effectiveness of each branch
The effect of each branch, including the original branch (O-branch), the discriminative branch (D-branch), and the complementary branch (C-branch), was verified in experiments on CUB-200-2011 by measuring classification and box localization performance after removing different branches; the results are shown in Table 2.
Table 2. Classification and box localization results (%) of the various branches on CUB-200-2011

Setting        Localization   O-branch   D-branch   C-branch   Total
S3N (O)        57.7           86.0       -          -          86.0
S3N (O+D)      59.2           87.0       86.5       -          87.6
S3N (O+C)      56.6           86.8       -          85.3       87.3
S3N (D+C)      62.6           -          87.1       85.6       87.5
S3N (O+D+C)    65.2           87.9       86.7       85.1       88.5

The Localization column reports box localization performance, evaluated with the box localization criterion above; the O-branch through Total columns report top-1 classification performance, following the classification evaluation criterion.
As can be seen from Table 2, both the discriminative branch (D-branch) and the complementary branch (C-branch) improve the classification performance of the model, confirming that both promote the learning of fine features. The D-branch improves classification more than the C-branch, confirming that the discriminative branch focuses on parts decisive for classification while the complementary branch focuses on parts that weakly support it. Classification performance is best when the original branch (O-branch), D-branch, and C-branch are all present, showing that the features learned by the three branches are complementary. The presence of the D-branch and C-branch also improves the classification performance of the O-branch, showing that weight sharing in the backbone realizes a special form of data augmentation and improves the generalization of the network.
3.3) Effect of sparse attention
The effectiveness of the proposed sparse attention module for the selective sampling problem was verified by measuring the influence of several different attention mechanisms on classification performance on the CUB-200-2011 dataset; the results are shown in Table 3.
Table 3. Classification accuracy (%) of several different attention mechanisms

Attention mechanism                  Top-1 accuracy   Note
Saliency-based attention             85.9             class-independent
Class-response-map-based attention   87.8             class-related
Sparse attention                     88.5             part-related
Saliency-based attention is proposed in Recasens A, Kellnhofer P, Stent S, et al., "Learning to Zoom: a Saliency-Based Sampling Layer for Neural Networks," ECCV 2018.
Attention based on class response maps is proposed in Zhou B, Khosla A, Lapedriza A, et al., "Learning Deep Features for Discriminative Localization," CVPR 2016.
As can be seen from Table 3, as the correlation between attention and categories decreases, classification performance drops markedly, indicating that low-level saliency features of the network cannot guide feature learning from a high-level semantic perspective. Furthermore, compared with the attention mechanism based on class response maps, the sparse attention mechanism captures finer visual cues while discarding regions that are irrelevant or even harmful to the classification decision. The proposed algorithm thus locates the fine parts useful for fine-grained classification well, improving classification performance.
3.4) Comparison with fine-grained image classification methods
Existing fine-grained image classification methods are tested for comparison: feature-learning methods (B-CNN, Low-rank B-CNN, Boosted-CNN, HIHCA, DFL-CNN), attention-mechanism methods (RA-CNN, MA-CNN, DT-RAM), and NTS (a method based on a weakly supervised object detection framework). The CUB-200-2011, Stanford Cars, and FGVC-Aircraft datasets are used, and the image classification evaluation criterion is the same as in the embodiment above.
B-CNN is proposed in Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji, "Bilinear CNN Models for Fine-Grained Visual Recognition," ICCV 2015;
Low-rank B-CNN is proposed in Shu Kong and Charless C. Fowlkes, "Low-Rank Bilinear Pooling for Fine-Grained Classification," CVPR 2017;
HIHCA is proposed in Sijia Cai, Wangmeng Zuo, and Lei Zhang, "Higher-Order Integration of Hierarchical Convolutional Activations for Fine-Grained Visual Categorization," ICCV 2017;
Boosted-CNN is proposed in Mohammad Moghimi, Serge J. Belongie, Mohammad J. Saberian, Jian Yang, Nuno Vasconcelos, and Li-Jia Li, "Boosted Convolutional Neural Networks," BMVC 2016;
DFL-CNN is proposed in Yaming Wang, Vlad I. Morariu, and Larry S. Davis, "Learning a Discriminative Filter Bank Within a CNN for Fine-Grained Recognition," CVPR 2018, pages 4148-4157;
RA-CNN is proposed in Jianlong Fu, Heliang Zheng, and Tao Mei, "Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition," CVPR 2017;
MA-CNN is proposed in Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo, "Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition," ICCV 2017;
DT-RAM is proposed in Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu, "Dynamic Computational Time for Visual Attention," ICCV 2017, pages 1199-1209;
NTS is proposed in Ze Yang, Tiange Luo, Dong Wang, Zhiqiang Hu, Jun Gao, and Liwei Wang, "Learning to Navigate for Fine-Grained Classification," ECCV 2018, pages 438-454.
Table 4. Comparison results (%) of the selective-sparse-sampling-based fine-grained image classification method (S3N) and other methods on the three datasets
As can be seen from Table 4, the accuracy of the proposed method S3N in these tests is higher than that of the existing fine-grained image classification methods that use only image class labels (B-CNN, RA-CNN, MA-CNN, DPL-CNN, DFL-CNN, NTS-CNN). With selective sparse sampling, the proposed method mines more visual cues while learning features of the discriminative parts at a larger scale, so the model can learn finer features.
3.5) Influence of the threshold on classification
Results (%) of the selective-sparse-sampling-based fine-grained image classification method (S3N) on CUB-200-2011 under different values of the threshold δ are compared.
Table 5. Effect of different thresholds on classification

δ           0        0.05     0.1      0.15     0.2      0.25     0.3
top-1 (%)   88.14    88.23    88.40    88.50    88.52    88.47    88.43

As can be seen from Table 5, when computing the class response map used for extracting class peak responses, the value of the threshold has a certain influence on the classification result; classification performance is best when δ = 0.2.
The present invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and illustrative. Various substitutions and modifications may be made on this basis, and all of them fall within the protection scope of the invention.

Claims (9)

1. A fine-grained image classification method based on selective sparse sampling, comprising a process of training a classification model for target classification, wherein the training process of the classification model comprises the following steps:
step 1: key part localization: inputting the image into a classification network, outputting the class response map corresponding to the image, and extracting the class peak responses on the class response map;
step 2: class peak response grouping: grouping the class peak responses obtained in step 1 by response strength into a discriminative attention group and a complementary attention group, wherein each class peak response generates an attention map and the two groups of class peak responses generate two groups of attention maps;
step 2 comprising the following sub-steps:
step 2.1: dividing the class peak responses from step 1 into two sets T_d and T_c as follows:
T_d = {(x, y) | (x, y) ∈ T, R_{x,y} ≥ ζ},
T_c = {(x, y) | (x, y) ∈ T, R_{x,y} < ζ},
wherein R_{x,y} is the response value of the class peak response at (x, y), ζ is the division threshold, T_d is the discriminative class peak response set, T_c is the complementary class peak response set, T is the set of class peak response positions, and (x, y) is a class peak response position;
step 2.2: computing a corresponding attention map A_i for each class peak response with a Gaussian kernel function, the attention map being a Gaussian kernel centered at the class peak response (x_i, y_i), wherein β, β_1, and β_2 are learnable parameters for controlling the degree of local magnification;
step 3: resampling: aggregating the attention maps within each group to generate two saliency maps, resampling the image under the guidance of the saliency maps to locally magnify the corresponding key parts, and obtaining two resampled images;
step 4: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
2. The method of claim 1, wherein, in step 1, the class peak response is a local maximum on the class response map.
3. The method according to claim 1, wherein, in step 1, the image is given only an image-level label, and neither the whole target nor the positions of parts are annotated.
4. The method according to claim 1, wherein step 1 comprises the following sub-steps:
step 1.1: inputting the image into a classification network and computing the class response map, the class response map being defined as:
M_c = Σ_{d=1}^{D} w_d^c · S_d,
wherein W_fc is the weight of the fully connected layer, i.e., the classifier; each class c and each feature map S_d correspond to the entry w_d^c of W_fc; D is the number of feature channels, and d is a channel index of the feature map S;
step 1.2: obtaining the classification result with the classification network of step 1.1, computing the set P of predicted probabilities of the current image for each class, selecting the top five prediction probabilities p_1, …, p_5, and computing the entropy:
E = -Σ_{i=1}^{5} p_i log p_i;
step 1.3: defining the class response map used for extracting class peak responses:
R_o = M_{c_1} if E ≤ δ;  R_o = Σ_{i=1}^{5} p_i · M_{c_i} if E > δ,
wherein p_i is the probability of predicting the input image as the i-th class, M_{c_i} is the corresponding class response map, δ is the threshold, and M_{c_1} is the class response map corresponding to the class with the highest prediction probability;
step 1.4: extracting local maxima from the class response map within a window of set size to obtain the set of class peak response positions T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, wherein n is the number of class peak responses.
5. The method according to claim 4, wherein, in step 1.3, the class response map R_o is normalized in max-min fashion, and the set T of class peak response positions is obtained on the normalized class response map:
R = (R_o - min(R_o)) / (max(R_o) - min(R_o)),
wherein R is the normalized class response map, R_o is the class response map obtained in step 1.3, min(R_o) is the minimum of the class response map, and max(R_o) is the maximum of the class response map.
6. The method according to claim 1, wherein step 3 comprises the following sub-steps:
step 3.1: using the attention maps obtained in step 2, summing the attention maps of each group to obtain the saliency maps Q_d and Q_c that guide resampling:
Q_d = Σ_i A_i, if (x_i, y_i) ∈ T_d,
Q_c = Σ_i A_i, if (x_i, y_i) ∈ T_c,
wherein Q_d denotes the discriminative-branch saliency map, Q_c denotes the complementary-branch saliency map, A_i denotes the attention map of the i-th class peak response, T_d denotes the set of discriminative class peak responses, and T_c denotes the set of complementary class peak responses;
step 3.2: guiding the resampling of the original image with the two saliency maps computed in step 3.1 to obtain the resampled images, computed as:
f(x, y) = Σ_{x′,y′} k((x′, y′), (x, y)) Q(x′, y′) x′ / Σ_{x′,y′} k((x′, y′), (x, y)) Q(x′, y′),
g(x, y) = Σ_{x′,y′} k((x′, y′), (x, y)) Q(x′, y′) y′ / Σ_{x′,y′} k((x′, y′), (x, y)) Q(x′, y′),
wherein (x′, y′) are the coordinates of a pixel point in the original image, f(x, y) is its corresponding abscissa in the sampled image, g(x, y) is its corresponding ordinate in the sampled image, Q is Q_d or Q_c, and k((x′, y′), (x, y)) is a Gaussian kernel function.
7. The method of claim 6, wherein the resampling process is performed by a convolution operation.
8. The method according to claim 1, wherein the selective-sparse-sampling-based fine-grained image classification method further comprises a model optimization process, comprising the following steps:
step 4.1: designing a cross-entropy loss function, calculating the gradients of the classification network according to the loss function, back-propagating the gradients through the whole classification network, and updating the network parameters;
the model's classification cross-entropy loss function being defined as:
L_cls = Σ_{i ∈ {O, D, C}} L(Y_i, Y*) + L(Y_j, Y*),
wherein L_cls denotes the cross-entropy loss, Y_i are the prediction vectors corresponding to the original image and the resampled images, Y_j is the prediction vector corresponding to the joint features, Y* is the image label, O denotes the original image, D denotes the discriminative resampled image, and C denotes the complementary resampled image;
step 4.2: judging, from the classification error computed with the cross-entropy loss function, whether the network has converged or the maximum number of iterations has been reached; if so, stopping network training, otherwise returning to step 1.
9. The method of claim 1, wherein the selective-sparse-sampling-based fine-grained image classification method is further applied to target localization, comprising the steps of:
step 1: computing the top-1 class response maps M_O, M_D, and M_C corresponding to the original image, the discriminative branch, and the complementary branch;
step 2: mapping the class response map M_D corresponding to the discriminative branch and the class response map M_C corresponding to the complementary branch into the space of the original class response map M_O by the corresponding inverse transformations, then adding M_O, M_D, and M_C to generate the final class response map M_final;
step 3: upsampling the final class response map M_final to the size of the original image, thresholding the upsampled map at its mean value, and selecting the minimum bounding box of the largest connected component as the localization result of the target.
CN201910942790.8A 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling Active CN110738247B (en)

Priority Applications (1)

Application Number: CN201910942790.8A
Title: CN110738247B (en) Fine-grained image classification method based on selective sparse sampling

Publications (2)

Publication Number     Publication Date
CN110738247A (en)      2020-01-31
CN110738247B (en)      2020-08-25

Family

ID=69269842

Family Applications (1)

Application Number: CN201910942790.8A (Active)
Title: CN110738247B (en) Fine-grained image classification method based on selective sparse sampling

Country Status (1)

Country: CN; Link: CN110738247B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507403B (en) * 2020-04-17 2024-11-05 腾讯科技(深圳)有限公司 Image classification method, apparatus, computer device and storage medium
CN111368942B (en) * 2020-05-27 2020-08-25 深圳创新奇智科技有限公司 Commodity classification identification method and device, electronic equipment and storage medium
CN111915618B (en) * 2020-06-02 2024-05-14 华南理工大学 Peak response enhancement-based instance segmentation algorithm and computing device
CN111784673B (en) * 2020-06-30 2023-04-18 创新奇智(上海)科技有限公司 Defect detection model training and defect detection method, device and storage medium
CN111816308B (en) * 2020-07-13 2023-09-29 中国医学科学院阜外医院 System for predicting coronary heart disease onset risk through facial image analysis
CN111967527B (en) * 2020-08-21 2022-09-06 菏泽学院 Peony variety identification method and system based on artificial intelligence
KR102483693B1 (en) * 2020-12-02 2023-01-03 울산대학교 산학협력단 Method and apparatus of explainable multi electrocardiogram arrhythmia diagnosis
CN112906810B (en) * 2021-03-08 2024-04-16 共达地创新技术(深圳)有限公司 Target detection method, electronic device, and storage medium
CN113177546A (en) * 2021-04-30 2021-07-27 中国科学技术大学 Target detection method based on sparse attention module
CN113470029B (en) * 2021-09-03 2021-12-03 北京字节跳动网络技术有限公司 Training method and device, image processing method, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101729911A (en) * 2009-12-23 2010-06-09 宁波大学 Multi-view image color correction method based on visual perception
CN109583305A (en) * 2018-10-30 2019-04-05 南昌大学 A kind of advanced method that the vehicle based on critical component identification and fine grit classification identifies again
KR20190109194A * 2018-03-16 2019-09-25 주식회사 에이아이트릭스 Apparatus and method for learning neural network capable of modeling uncertainty

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10565305B2 (en) * 2016-11-18 2020-02-18 Salesforce.Com, Inc. Adaptive attention model for image captioning
CN109284749A (en) * 2017-07-19 2019-01-29 微软技术许可有限责任公司 Refine image recognition
CN108334901A (en) * 2018-01-30 2018-07-27 福州大学 A kind of flowers image classification method of the convolutional neural networks of combination salient region
CN110197202A (en) * 2019-04-30 2019-09-03 杰创智能科技股份有限公司 A kind of local feature fine granularity algorithm of target detection


Also Published As

Publication number Publication date
CN110738247A (en) 2020-01-31


Legal Events

Code    Title
PB01    Publication
SE01    Entry into force of request for substantive examination
GR01    Patent grant