CN110738247B - Fine-grained image classification method based on selective sparse sampling - Google Patents
Fine-grained image classification method based on selective sparse sampling
- Publication number: CN110738247B
- Application number: CN201910942790.8A
- Authority: CN (China)
- Prior art keywords: class, image, response, classification, peak
- Prior art date: 2019-09-30
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention provides a fine-grained image classification method based on selective sparse sampling, comprising the following steps: locating important parts by extracting a class response map from the image with a classification network, so that the key parts useful for classification on the target are covered as comprehensively as possible; locally magnifying the learned groups of key parts by sparse resampling; and extracting features from the locally magnified images and, together with the features of the original image, determining the image category through a classifier. The method exploits the fact that class peak responses correspond to visual cues to locate key parts quickly, which is faster and more effective than locating parts with detection boxes. By locally magnifying key parts through sparse resampling, image details are enhanced while background information is retained, avoiding information loss. The method therefore has good practicability and extensibility and is of significance for fine-grained image classification tasks.
Description
Technical Field
The invention relates to the field of computer vision and image processing, in particular to a fine-grained image classification method based on selective sparse sampling.
Background
Fine-grained image classification is one of the important problems in computer vision, with application value in fields such as animal and plant protection and medical image analysis. Conventional fine-grained image classification models often require the position of each object, and even of each part on the object, to be accurately labeled in the image. Although such methods can rely on a large amount of labeled information to learn discriminative object information, they place very high demands on dataset collection and production. Accurately labeling each target in an image dataset is time-consuming and labor-intensive, especially as datasets grow large, which greatly limits the application of such algorithms to large-scale fine-grained image datasets.
In order to reduce manual labeling and supervision during modeling, fine-grained image classification frameworks based only on image category labels have been proposed. Such a framework requires only a category label for the target in the image, and no other labeling information such as bounding boxes. This greatly reduces the labeling workload and allows large-scale datasets to be collected directly from massive internet image resources. However, because precise part-position information is missing, training with only image labels introduces large randomness into part localization, which affects the stability and accuracy of the algorithm and places higher demands on its ability to learn fine features.
Existing fine-grained image classification methods fall into three main types: 1. feature-learning methods, typified by bilinear models built on a classification network; 2. fine-feature learning models that locate discriminative parts, which mostly use weakly supervised object detection to locate the parts, crop them from the original image according to the localization result, extract features, and combine them with the features of the original image; 3. attention-mechanism methods, which first locate the most discriminative part by iterative learning and then fuse the intermediate outputs of the iterations, i.e. the part features at different scales. These methods are increasingly optimized and achieve state-of-the-art performance.
However, these methods have disadvantages. The first type is general-purpose but not optimized for the small inter-class differences characteristic of fine-grained classification. In the second type, part localization driven only by image labels is complex and time-consuming; the number of parts must be specified manually, so the method does not adapt to image content; moreover, because parts are extracted by cropping, a large amount of useful information can be lost when localization is inaccurate. The third type's iterative learning easily accumulates errors. These deficiencies limit the robustness and generalization of the learned models.
Disclosure of Invention
In order to overcome these problems, the inventors conducted intensive research and propose a fine-grained image classification method based on selective sparse sampling. The rich semantic information of the classification network's response map (class activation map) is used to locate discriminative parts, improving the efficiency and flexibility of the model; the discriminative parts are then learned at a larger scale through local magnification, avoiding information loss. Experiments show that the method improves the speed and accuracy of fine part localization and outperforms the best existing methods (such as NTS-Net), thereby completing the invention.
The invention aims to provide the following technical scheme:
the invention aims to provide a fine-grained image classification method based on selective sparse sampling, which comprises a process of training a classification model for target classification, wherein the training process of the classification model comprises the following steps:
step 1: key component positioning: inputting the image into a classification network, outputting a class response graph corresponding to the image, and extracting a class peak value response on the class response graph;
step 2: class peak response grouping: grouping the class peak responses obtained in the step 1 according to response strength, wherein the class peak responses are respectively a discriminant attention group and a complementary attention group, each class peak response generates an attention diagram, and the two corresponding groups of class peak responses generate two groups of attention diagrams;
and step 3: resampling: respectively aggregating the attention diagrams in the two groups to generate two saliency diagrams, resampling the images under the guidance of the saliency diagrams, realizing local amplification of corresponding key components, and obtaining two resampled images;
and 4, step 4: completing the construction of a feature fusion and classification model: inputting the resampled image obtained in the step 3 into the classification network in the step 1 to extract features, combining the features of the original image and the resampled image, and classifying by using a classifier to obtain a classification model.
The fine-grained image classification method based on selective sparse sampling provided by the invention has the following beneficial effects:
(1) the method learns from image category labels alone, without strong labeling data (target bounding boxes or part annotations); exploiting the rich semantics of class peak responses on the class response map, it locates key parts quickly, markedly improving feasibility and practicability;
(2) grouping the class peak responses by response value prevents strong responses from dominating the learning process, so that parts corresponding to weak responses can also be learned, improving the robustness of the features;
(3) local magnification of key parts through image resampling enhances the important details in the image while keeping the background information, avoiding information loss;
(4) the method combines the features of the resampled images with those of the original image, fusing local and global image features; moreover, the resampled images and the original image share a feature extraction network, which realizes a special form of data augmentation and benefits the generalization of the model;
(5) the class response map is updated as the network trains and image features are learned, and the updated class response map guides the generation of new resampled images; localization of discriminative parts and feature learning therefore reinforce each other, forming a special closed-loop iterative learning scheme.
Drawings
FIG. 1 is a schematic diagram of the model structure of the selective sparse sampling-based fine-grained image classification method;
FIG. 2 shows examples and the distribution of the number of class peak responses located by the model;
FIG. 3 shows a schematic of selective sparse sampling during model training;
FIG. 4 illustrates the target classification results of the proposed method;
FIG. 5 shows the target localization results of the proposed method.
Detailed Description
The invention is explained in further detail below with reference to the drawings. The features and advantages of the invention will become more apparent from the description.
As shown in FIG. 1, the present invention provides a fine-grained image classification method based on selective sparse sampling, which includes a process of training a classification model for target classification. The training process of the classification model comprises the following steps:
Step 1: key part localization: input the image into a classification network, output the class response map corresponding to the image, and extract the class peak responses on the class response map;
wherein the class peak responses correspond to key parts in the image, a key part being a region distinctive for classification; the class peak response is preferably a local maximum on the class response map;
Step 2: class peak response grouping: group the class peak responses obtained in step 1 by response strength into a discriminative attention group and a complementary attention group; each class peak response generates an attention map, so the two groups of class peak responses generate two groups of attention maps;
Step 3: resampling: aggregate the attention maps within each group to generate two saliency maps, resample the image under the guidance of the saliency maps to locally magnify the corresponding key parts, and obtain two resampled images;
Step 4: input the resampled images obtained in step 3 into the classification network of step 1 to extract features, combine the features of the original image and the resampled images, and classify with a classifier to obtain the classification model.
In step 1 of the invention, key parts are located. The key part localization algorithm builds on the rich semantic information of the class response map and integrates the features of the parts corresponding to the class peak response points on it, aiming to find the parts, and their positions, that matter for the classification decision. Compared with part localization in weakly supervised object detection frameworks, the method omits the search and screening of important parts, so classification is more efficient.
In a preferred embodiment of the invention, in step 1 the image carries only an image-level label; neither the whole target nor the positions of its parts are annotated, and the key parts are located quickly using the rich semantics of the class peak responses on the class response map.
In a preferred embodiment of the invention, step 1 comprises the following sub-steps:
step 1.1: inputting the image into a classification network, and calculating a class response graph;
the classification network is preferably a convolutional neural network, and may be selected from any one of AlexNet, ResNet, VGGNet, google lenet, and the like.
Define a fine-grained image classification dataset: C denotes the number of classes and N the number of samples, of which the training set contains N_train samples and the test set contains N_test samples. Given an image I in the training set, input it into the classification network and extract the feature map set S ∈ R^(D×H×W) output by the deepest convolutional layer. After a global average pooling layer, S is fed into the fully connected layer FC to obtain the network's prediction score for each class of the image, s ∈ R^C. Denote the weight of the fully connected layer FC as W_fc ∈ R^(D×C); the weight connecting each class c with each feature map S_d is the scalar w_{d,c}. The class response map M_c corresponding to class c in the set of class response maps M is then computed as in equation (1):

M_c = Σ_{d=1}^{D} w_{d,c} · S_d    (1)

This formula relates the features learned by the network to the image categories, and helps to understand intuitively which regions support the category decision.

Step 1.2: using the classification result s obtained in step 1.1, compute the set P of predicted probabilities of the image for each class, sort the predicted probabilities in descending order, and select the top five, {p̂_1, p̂_2, …, p̂_5}; compute their entropy as in equation (2):

E = -Σ_{i=1}^{5} p̂_i log p̂_i    (2)

This entropy measures the confidence of the current prediction of the classification network. The results of several classification networks on datasets such as CUB-200-2011, Stanford Cars and FGVC-Aircraft show that top-5 classification accuracy reaches 99.9%, i.e. one of the network's top five predictions is almost always correct, so selecting the top five predicted probabilities is both necessary and sufficient.

Step 1.3: considering both the precision and the recall of class peak response localization, compute the class response map used for extracting class peak responses according to equation (3):

R_o = M_{c_1} if E < δ; otherwise R_o = Σ_{i=1}^{5} p̂_i · M_{c_i}    (3)

where M_{c_i} is the class response map corresponding to the class c_i with the i-th highest predicted probability, and δ is a threshold, selected as 0.2 based on a control experiment.

The rule defined by equations (2) and (3) is: when the top-1 probability predicted by the classification network is high, i.e. the prediction is credible, only the class response map of the top-1 class is selected, so that the extracted class peak responses contain no noise; when the top-1 probability is low, i.e. the sample is hard to predict and the top-1 class cannot be trusted, the class response maps of the top-5 classes are selected to guarantee the recall of the class peak responses. To keep the training and testing processes consistent, the selection is based on the classification prediction scores rather than on class labels.

To avoid numerical difficulties caused by the scale of the variables, the class response map R_o is normalized in maximum-minimum fashion, as in equation (4):

R = (R_o - min(R_o)) / (max(R_o) - min(R_o))    (4)

where R is the normalized class response map, R_o is the class response map obtained from equation (3), min(R_o) is its minimum value, and max(R_o) is its maximum value.

Step 1.4: extract local maxima from the class response map R within a window of set size to obtain the set of class peak response positions T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where n is the number of class peak responses.

In a preferred embodiment, the window size is 3 × 3, 5 × 5, or 7 × 7, preferably 3 × 3.
The number and positions of the class peak responses extracted by this process adapt to the image content rather than being fixed; their distribution over several fine-grained image classification datasets is shown in FIG. 2. The proposed framework is therefore flexible and can be applied to different domains, such as birds, airplanes and cars, without tuning hyper-parameters for each specific task.
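To make step 1 concrete, the following sketch (our illustration, not code from the patent; function names, the softmax assumption, and the numerical-stability constants are ours) computes the class response map per equations (1)-(4) and extracts class peak responses with a maximum filter:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def class_response_map(S, W_fc, scores, delta=0.2):
    """Equations (1)-(4): S is D x H x W, W_fc is D x C, scores are the FC outputs."""
    p = np.exp(scores - scores.max())
    p /= p.sum()                                      # softmax over the C classes
    top5 = np.argsort(p)[::-1][:5]
    p5 = p[top5] / p[top5].sum()                      # renormalized top-5 probabilities
    entropy = -(p5 * np.log(p5 + 1e-12)).sum()        # equation (2)
    M = np.einsum('dc,dhw->chw', W_fc, S)             # equation (1): one map per class
    if entropy < delta:                               # confident: use the top-1 map
        R_o = M[top5[0]]
    else:                                             # uncertain: weighted top-5 maps
        R_o = (p5[:, None, None] * M[top5]).sum(axis=0)
    return (R_o - R_o.min()) / (R_o.max() - R_o.min() + 1e-12)  # equation (4)

def extract_class_peaks(R, window=3):
    """Step 1.4: local maxima of R within a window x window neighborhood."""
    ys, xs = np.nonzero(R == maximum_filter(R, size=window))
    return list(zip(xs, ys))                          # class peak position set T
```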
Step 2 of the invention: class peak response grouping. The class peak responses are grouped according to their response values, which prevents strong responses from dominating the learning process, lets parts corresponding to weak responses also be learned, and improves the robustness of the features. In an embodiment, step 2 comprises the following sub-steps:

Step 2.1: divide the class peak responses of step 1 into two sets T_d and T_c according to equations (5) and (6):

T_d = {(x, y) | (x, y) ∈ T if R_{x,y} ≥ ζ}    (5)
T_c = {(x, y) | (x, y) ∈ T if R_{x,y} < ζ}    (6)

where R_{x,y} is the response value of the class peak response at (x, y), and ζ is the division number; ζ can be chosen as a random number uniformly distributed in (0, 1), as the median of all class peak responses, etc. T_d is the discriminative class peak response set, i.e. the parts decisive for the class decision, and T_c is the complementary class peak response set, i.e. the parts that play a complementary role in the class decision.

Step 2.2: compute a corresponding attention map for each class peak response with a Gaussian kernel function as in equation (7); the two groups of class peak responses yield two groups of attention maps:

A_i(x, y) = β_1 · R_{x_i, y_i} · exp(-((x - x_i)² + (y - y_i)²) / β_2²)    (7)

where R_{x_i, y_i} is the response value of the class peak response (x_i, y_i), and β_1 and β_2 are learnable parameters controlling the degree of local magnification. The meaning of the formula is: the stronger the response value, the more the region is magnified.
Step 3 of the invention: resampling. Local magnification of the key parts is realized through image resampling, so that the important details in the image are enhanced while the background information is kept, avoiding information loss.
In a preferred embodiment of the present invention, step 3 comprises the following sub-steps:

Step 3.1: using the attention maps obtained in step 2, sum the attention maps within each group to obtain the saliency maps Q_d and Q_c used to guide resampling, computed as in equations (8) and (9):

Q_d = Σ A_i, if (x_i, y_i) ∈ T_d    (8)
Q_c = Σ A_i, if (x_i, y_i) ∈ T_c    (9)

where Q_d is the discriminative-branch saliency map and Q_c is the complementary-branch saliency map.

Step 3.2: the two saliency maps computed in step 3.1 then guide the resampling of the original image.

The image I is regarded as a grid consisting of a point set V and an edge set E (the set of line segments between adjacent points of V), where V = [(x_0, y_0), (x_1, y_1), …, (x_end, y_end)] and (x_i, y_i) are the coordinates of image pixel i. The points and edges form criss-crossing grid lines. The goal of image resampling is to find a new set of coordinate points V' = [(x'_0, y'_0), (x'_1, y'_1), …, (x'_end, y'_end)] such that, in the new coordinate system, the important regions of the original image are sampled uniformly and densely, while unimportant regions tolerate a certain degree of compression. The problem thus becomes finding a mapping between the original image and the resampled image, consisting of two mapping functions f(x, y) and g(x, y); the resampled image is then I_new(x, y) = I(f(x, y), g(x, y)).

f(x, y) and g(x, y) distribute the saliency computed from the original image uniformly over the resampled image, so that each resampled pixel receives an equal share of the saliency. Estimates of the solutions are given by equations (10) and (11):

f(x, y) = Σ_{(x', y')} Q(x', y') k((x', y'), (x, y)) x' / Σ_{(x', y')} Q(x', y') k((x', y'), (x, y))    (10)

g(x, y) = Σ_{(x', y')} Q(x', y') k((x', y'), (x, y)) y' / Σ_{(x', y')} Q(x', y') k((x', y'), (x, y))    (11)

where k((x', y'), (x, y)) is a Gaussian kernel function, used as a regularization term to avoid degenerate cases such as all pixels converging to the same value. Substituting the saliency maps Q_d and Q_c computed from equations (8) and (9) into equations (10) and (11) yields the two resampled images. The image corresponding to Q_d, named the discriminative-branch resampled image, highlights the regions decisive for classification; the image corresponding to Q_c, named the complementary-branch resampled image, magnifies the regions that provide supplementary evidence for classification and can push the model to learn more supporting evidence.
As shown in FIG. 3, the selective sparse sampling provided by the method prevents strong features from dominating the gradient-learning process, prompting the network to learn a more comprehensive feature representation.
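The resampling of equations (10) and (11) can be illustrated with a deliberately naive dense implementation (quadratic in the number of grid points, so only for small maps; the method itself realizes this with convolutions, and the kernel width sigma is our choice):

```python
import numpy as np

def resample(image, Q, sigma=0.3):
    """image: H x W (x C) array; Q: h x w saliency map; returns the warped image."""
    h, w = Q.shape
    gy, gx = np.mgrid[0:h, 0:w] / np.array([h - 1.0, w - 1.0])[:, None, None]
    sx, sy, q = gx.ravel(), gy.ravel(), Q.ravel()     # source grid (x', y') and saliency
    f = np.zeros((h, w))
    g = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            d2 = (sx - gx[i, j]) ** 2 + (sy - gy[i, j]) ** 2
            wgt = q * np.exp(-d2 / (2 * sigma ** 2))  # saliency times Gaussian kernel k
            f[i, j] = (wgt * sx).sum() / wgt.sum()    # equation (10)
            g[i, j] = (wgt * sy).sum() / wgt.sum()    # equation (11)
    H, W = image.shape[:2]
    xi = np.clip((f * (W - 1)).round().astype(int), 0, W - 1)
    yi = np.clip((g * (H - 1)).round().astype(int), 0, H - 1)
    return image[yi, xi]                              # I_new(x, y) = I(f(x, y), g(x, y))
```

High-saliency regions attract more sample points, which is exactly the local magnification described above.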
The whole resampling process is realized through convolution operations, can be embedded into any neural network, and supports end-to-end learning and training, so the classification loss computed on the resampled images can optimize the parameters β_1 and β_2.
Through steps 1, 2 and 3, two resampled images are derived from one input image. The two resampled images are input into the classification network of step 1 to extract features. To aggregate global and local features, define the joint image feature as F_J = {F_o, F_D, F_c}, where F_o, F_D and F_c are the original-image features, the discriminative-branch image features and the complementary-branch image features, respectively. The features are concatenated and fed into a fully connected layer with softmax to obtain the image classification result. The resampled images and the original image share a feature extraction network, which realizes a special form of data augmentation and benefits the generalization of the model.
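The fusion described above can be sketched as follows (module names, the single shared per-branch classifier, and the concatenation for the joint prediction are our illustrative assumptions):

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Three branches share one backbone; branch and joint predictions are returned."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone                      # shared feature extractor
        self.fc_branch = nn.Linear(feat_dim, num_classes)
        self.fc_joint = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, img, img_d, img_c):
        F_o, F_d, F_c = (self.backbone(x) for x in (img, img_d, img_c))
        logits = [self.fc_branch(f) for f in (F_o, F_d, F_c)]
        logits.append(self.fc_joint(torch.cat([F_o, F_d, F_c], dim=1)))
        return logits                                 # [Y_O, Y_D, Y_C, Y_joint]
```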
In a preferred embodiment, the selective sparse sampling-based fine-grained image classification method further includes a model optimization process, comprising the following steps:

Step 4.1: design a cross-entropy loss function, compute the gradient of the classification network from the loss function, back-propagate it through the whole classification network, and update the network parameters;

The model's classification cross-entropy loss is defined as in equation (12):

L_cls = -Σ_{i ∈ {O, D, C}} Y* · log Y_i - Y* · log Y_j    (12)

where L_cls denotes the cross-entropy loss, Y_i are the prediction vectors of the original image (O), the discriminative resampled image (D) and the complementary resampled image (C), Y_j is the prediction vector of the joint features, and Y* is the image label.

Step 4.2: judge, from the classification error computed by the cross-entropy loss function, whether the network has converged (i.e. the error no longer decreases) or the maximum number of iterations has been reached; if so, stop training, otherwise jump to step 1.
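Under the same reading, the loss of equation (12) sums the cross entropy of the three branch predictions and of the joint prediction; a minimal sketch:

```python
import torch.nn.functional as F

def classification_loss(logits, target):
    """logits: [Y_O, Y_D, Y_C, Y_joint] as above; target: ground-truth class indices."""
    return sum(F.cross_entropy(y, target) for y in logits)
```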
Unknown images from the test set are input into the trained model to obtain the target classification results, as shown in FIG. 4. It can be seen that, compared with a generic classification network, the method improves classification performance by activating more regions.
The method improves the accuracy of fine-grained image classification and also improves the target localization ability. For target localization, the invention comprises the following steps:

Step 1: compute the top-1 class response maps M_O, M_D and M_C corresponding to the original image, the discriminative branch and the complementary branch;

Step 2: map the class response map M_D of the discriminative branch and the class response map M_C of the complementary branch back to the space of the original class response map M_O through the corresponding inverse transformations, then add M_O, M_D and M_C to generate the final class response map M_final; the inverse transformation restores the locally magnified image to the original image;

Step 3: upsample the final class response map M_final to the size of the original image, segment the upsampled map using its mean value as threshold, and take the minimum bounding box of the largest connected component as the target localization result.
The localization results of the method are shown in FIG. 5. Compared with the baseline method, the localization is more accurate and more complete, clearly alleviating the information loss caused by overfitting in the baseline model.
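The localization procedure can be sketched as follows, assuming M_D and M_C have already been mapped back to the space of M_O; the helper names are ours:

```python
import numpy as np
from scipy.ndimage import label, zoom

def localize(M_o, M_d, M_c, image_hw):
    """Fuse maps, upsample, threshold at the mean, box the largest component."""
    M_final = M_o + M_d + M_c
    H, W = image_hw
    up = zoom(M_final, (H / M_final.shape[0], W / M_final.shape[1]), order=1)
    mask = up > up.mean()                             # segment with the mean value
    lab, n = label(mask)                              # connected components
    if n == 0:
        return 0, 0, W - 1, H - 1                     # degenerate case: whole image
    sizes = np.bincount(lab.ravel())
    sizes[0] = 0                                      # ignore the background label
    ys, xs = np.nonzero(lab == sizes.argmax())        # largest connected component
    return xs.min(), ys.min(), xs.max(), ys.max()     # minimum bounding box
```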
Examples
Example 1
1. Database and sample classification
The method is used to classify fine-grained images. For accuracy and comparability of the experiments, the public datasets CUB-200-2011, Stanford Cars and FGVC-Aircraft, widely used in fine-grained image classification, are adopted. CUB-200-2011 is a bird dataset with 11788 images of 200 species, split into a training part of 5994 images and a test part of 5794 images. Stanford Cars is a car dataset with 16185 images of 196 car models, with 8144 images for training and 8041 for testing. FGVC-Aircraft is an airplane dataset with 10000 images, 6667 for training and validation and 3333 for testing. The method uses only the image category labels, without any additional strong annotation such as target bounding boxes, part keypoint annotations, or hierarchical semantic associations of the category labels.
Constructing the model: with the method of the invention, a 50-layer residual convolutional neural network (ResNet-50) is used as the feature extractor. Training runs for 60 epochs using stochastic gradient descent with momentum and a batch size of 16. Weight decay is set to 1e-4 and momentum to 0.9. Parameters initialized from the ImageNet-pretrained model use an initial learning rate of 0.001; the other parameters use an initial learning rate of 0.01. Input images are resized to 448 × 448 pixels.
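This setup can be reproduced roughly as follows; the exact grouping of pretrained versus newly initialized parameters is our assumption:

```python
import torch
import torchvision

backbone = torchvision.models.resnet50(pretrained=True)   # ImageNet initialization
new_params = [p for n, p in backbone.named_parameters() if n.startswith('fc')]
pre_params = [p for n, p in backbone.named_parameters() if not n.startswith('fc')]
optimizer = torch.optim.SGD(
    [{'params': pre_params, 'lr': 0.001},   # pretrained layers
     {'params': new_params, 'lr': 0.01}],   # newly initialized layers
    momentum=0.9, weight_decay=1e-4)
# train for 60 epochs with batch size 16, inputs resized to 448 x 448
```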
2. Performance evaluation criteria
2.1 Classification Performance evaluation criterion
For evaluation of algorithm performance and comparison with other methods, we choose the evaluation measure widely used in image classification: top-1 classification accuracy. For a single image, the class corresponding to the maximum value of the predicted probability vector is taken as the prediction; the prediction is correct if it matches the class labeled for the image. Over the whole dataset, the proportion of correctly predicted images is the dataset's top-1 classification accuracy.
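For reference, top-1 accuracy over a dataset reduces to a one-liner on arrays of predicted probabilities and labels:

```python
import numpy as np

def top1_accuracy(probs, labels):
    """probs: N x C predicted probabilities; labels: N ground-truth class indices."""
    return 100.0 * (probs.argmax(axis=1) == np.asarray(labels)).mean()
```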
In addition, to evaluate the localization performance of the algorithm framework, the ground-truth boxes of the dataset are used in the evaluation process.
Evaluating the box localization performance: the method produces a predicted box for the target; if the IoU between this box and the labeled box of the target in the original image exceeds 0.5, the localization is considered correct, otherwise it is wrong. For each category, the percentage of correctly localized images among all images is computed as the box localization performance.
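This criterion amounts to a standard IoU test, e.g.:

```python
def iou(box_a, box_b):
    """Boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def box_accuracy(pred_boxes, gt_boxes):
    """Percentage of predictions whose IoU with the ground truth exceeds 0.5."""
    hits = sum(iou(p, g) > 0.5 for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)
```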
3. Results and analysis of the experiments
The quality of the class peak responses located by the model is evaluated, and the effectiveness of the discriminative branch, the complementary branch and the sparse attention module is verified in turn.
3.1) Quality of class peak responses
A class peak response point is considered accurately localized when it falls inside the labeled box of the target; the localization accuracy of a single image is the proportion of class peak response points falling inside the labeled box among all detected class peak response points. The accuracy over the whole dataset is the average accuracy of all images.
Table 1 Accuracy (%) of class peak responses in locating parts on each dataset

Dataset | CUB | Aircraft | Cars
---|---|---|---
Localization accuracy | 94.63 | 97.22 | 98.76
As can be seen from Table 1, class peak responses locate the parts on the target very well.
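The per-image metric behind Table 1 can be written as (our helper, with boxes as (x1, y1, x2, y2)):

```python
def peak_accuracy(peaks, box):
    """Fraction of detected class peak responses falling inside the labeled box."""
    x1, y1, x2, y2 = box
    inside = sum(x1 <= x <= x2 and y1 <= y <= y2 for (x, y) in peaks)
    return inside / len(peaks) if peaks else 0.0
```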
3.2) Effectiveness of each branch
The effect of each branch, including the original branch (O-branch), the discriminative branch (D-branch) and the complementary branch (C-branch), is verified; the experiments are conducted on CUB-200-2011. The effect on classification and box localization performance after removing different branches is measured, with the results shown in Table 2.
Table 2 Classification and box localization results (%) of each branch on CUB-200-2011

Setting | Localization | O-branch | D-branch | C-branch | Combined
---|---|---|---|---|---
S3N O | 57.7 | 86.0 | - | - | 86.0
S3N O+D | 59.2 | 87.0 | 86.5 | - | 87.6
S3N O+C | 56.6 | 86.8 | - | 85.3 | 87.3
S3N D+C | 62.6 | - | 87.1 | 85.6 | 87.5
S3N O+D+C | 65.2 | 87.9 | 86.7 | 85.1 | 88.5
The second column of Table 2 is the box localization performance, evaluated with the box localization criterion above. The third to sixth columns are top-1 classification performance, evaluated with the top-1 criterion.
As can be seen from Table 2, both the discriminative branch (D-branch) and the complementary branch (C-branch) improve the classification performance of the model, confirming that both promote the learning of fine features. Second, the discriminative branch improves classification more than the complementary branch, showing that the discriminative branch focuses on parts with decisive influence on classification while the complementary branch focuses on parts that weakly support it. Classification performance is best when the original-image branch (O-branch), the discriminative branch and the complementary branch all coexist, showing that the features learned by the three branches are complementary. The presence of the D-branch and C-branch also improves the classification performance of the O-branch, showing that weight sharing in the backbone realizes a special form of data augmentation and improves the generalization of the network.
3.3) Effect of sparse attention
The effectiveness of the proposed sparse attention module for selective sampling is verified; the influence of several different attention mechanisms on classification performance is measured on the CUB-200-2011 dataset, with results shown in Table 3.
Table 3 Classification accuracy (%) of several different attention mechanisms

Attention mechanism | Top-1 accuracy | Notes
---|---|---
Saliency-based attention | 85.9 | Class-independent
Attention based on class response map | 87.8 | Class-dependent
Sparse attention | 88.5 | Part-dependent
Among these, saliency-based attention is proposed in "Recasens A, Kellnhofer P, Stent S, et al. Learning to zoom: a saliency-based sampling layer for neural networks. ECCV 2018."

Attention based on class response maps is proposed in "Zhou B, Khosla A, Lapedriza A, et al. Learning deep features for discriminative localization. CVPR 2016."
As can be seen from Table 3, as the relevance between attention and categories decreases, classification performance drops significantly, indicating that saliency features from the bottom layers of the network cannot guide feature learning from the perspective of high-level semantics. Second, compared with the attention mechanism based on class response maps, the sparse attention mechanism captures finer visual cues while discarding regions that are irrelevant, or even harmful, to the classification decision. The proposed algorithm therefore localizes well the fine parts useful for fine-grained classification decisions, improving classification performance.
3.4) Comparison with fine-grained image classification methods
The method is compared with existing fine-grained image classification methods: feature-learning methods (B-CNN, Low-rank B-CNN, Boosted-CNN, HIHCA, DFL-CNN), attention-mechanism methods (RA-CNN, MA-CNN, DT-RAM), and NTS (a method based on a weakly supervised object detection framework). The CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets are used, with the same image classification evaluation criteria as above.
B-CNN is proposed in "Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. ICCV 2015, pages 1449-1457.";
Low-rank B-CNN is proposed in "Shu Kong and Charless C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. CVPR 2017.";
HIHCA is proposed in "Sijia Cai, Wangmeng Zuo, and Lei Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. ICCV 2017.";
Boosted-CNN is proposed in "Mohammad Moghimi, Serge J. Belongie, Mohammad J. Saberian, Jian Yang, Nuno Vasconcelos, and Li-Jia Li. Boosted convolutional neural networks. BMVC 2016.";
DFL-CNN is proposed in "Yaming Wang, Vlad I. Morariu, and Larry S. Davis. Learning a discriminative filter bank within a CNN for fine-grained recognition. CVPR 2018, pages 4148-4157.";
RA-CNN is proposed in "Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. CVPR 2017.";
MA-CNN is proposed in "Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. ICCV 2017.";
DT-RAM is proposed in "Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. ICCV Workshops 2017, pages 1199-1209.";
NTS is proposed in "Ze Yang, Tiange Luo, Dong Wang, Zhiqiang Hu, Jun Gao, and Liwei Wang. Learning to navigate for fine-grained classification. ECCV 2018, pages 438-454.".
Table 4 Comparison results (%) of the selective sparse sampling-based fine-grained image classification method (S3N) and other methods on the three datasets
As can be seen from Table 4, the accuracy of the proposed method "S3N" is higher than that of the existing fine-grained image classification methods that use only image category labels (B-CNN, RA-CNN, MA-CNN, DFL-CNN, NTS). With selective sparse sampling, the method mines more visual cues while learning features of the discriminative parts at a larger scale, so the model learns finer features.
3.5) Influence of the threshold on classification

The results (%) of the selective sparse sampling-based fine-grained image classification method (S3N) with different thresholds δ are compared on CUB-200-2011.
Table 5 Effect of different thresholds on classification

δ | 0 | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3
---|---|---|---|---|---|---|---
top-1 (%) | 88.14 | 88.23 | 88.40 | 88.50 | 88.52 | 88.47 | 88.43
As can be seen from Table 5, when computing the class response map used for extracting class peak responses, the value of the threshold δ has a certain influence on the classification result; performance is best when δ = 0.2.
The invention has been described above with reference to preferred embodiments, but these embodiments are merely exemplary and illustrative. Various substitutions and modifications can be made on this basis, and all such substitutions and modifications fall within the protection scope of the invention.
Claims (9)
1. A fine-grained image classification method based on selective sparse sampling, comprising a process of training a classification model for target classification, wherein the training process of the classification model comprises the following steps:
Step 1: key part localization: inputting the image into a classification network, outputting a class response map corresponding to the image, and extracting the class peak responses on the class response map;
Step 2: class peak response grouping: grouping the class peak responses obtained in step 1 by response strength into a discriminative attention group and a complementary attention group, wherein each class peak response generates an attention map, so that the two groups of class peak responses generate two groups of attention maps;
wherein step 2 comprises the following sub-steps:
Step 2.1: dividing the class peak responses of step 1 into two sets T_d and T_c as follows:
T_d = {(x, y) | (x, y) ∈ T if R_{x,y} ≥ ζ},
T_c = {(x, y) | (x, y) ∈ T if R_{x,y} < ζ},
wherein R_{x,y} is the response value at (x, y), ζ is the division number, T_d is the discriminative class peak response set, T_c is the complementary class peak response set, T is the set of class peak response positions, and (x, y) is a class peak response position;
Step 2.2: calculating a corresponding attention map for each class peak response with a Gaussian kernel function, as follows:
A_i(x, y) = β_1 · R_{x_i, y_i} · exp(-((x - x_i)² + (y - y_i)²) / β_2²),
wherein R_{x_i, y_i} is the response value of the class peak response (x_i, y_i), and β_1 and β_2 are learnable parameters controlling the degree of local magnification;
Step 3: resampling: aggregating the attention maps within each group to generate two saliency maps, resampling the image under the guidance of the saliency maps to locally magnify the corresponding key parts, and obtaining two resampled images;
Step 4: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
2. The method of claim 1, wherein, in step 1, the class peak response is a local maximum on the class response map.
3. The method according to claim 1, wherein, in step 1, the image is given only an image-level label, and neither the whole target nor the positions of its parts are labeled.
4. The method according to claim 1, characterized in that step 1 comprises the following sub-steps:
Step 1.1: inputting the image into a classification network and calculating the class response map, defined as:
M_c = Σ_{d=1}^{D} w_{d,c} · S_d,
wherein W_fc is the weight of the fully connected layer, i.e. the classifier; w_{d,c} is the scalar of W_fc connecting each class c with each feature map S_d; D is the number of feature channels; and d is a channel index of the feature map S;
Step 1.2: obtaining the classification result with the classification network of step 1.1, calculating the set P of predicted probabilities of the current image for each class, selecting the top five predicted probabilities {p̂_1, …, p̂_5}, and calculating the entropy:
E = -Σ_{i=1}^{5} p̂_i log p̂_i;
Step 1.3: defining the class response map used for extracting the class peak responses:
R_o = M_{c_1} if E < δ, and R_o = Σ_{i=1}^{5} p̂_i · M_{c_i} otherwise,
wherein p̂_i is the probability of predicting the input image as the i-th ranked class, M_{c_i} is the class response map corresponding to that class, δ is the threshold, and M_{c_1} is the class response map corresponding to the class with the highest predicted probability;
Step 1.4: extracting local maxima from the class response map within a window of set size to obtain the set of class peak response positions T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, wherein n is the number of class peak responses.
5. The method according to claim 4, characterized in that, in step 1.3, the class response map R_o is normalized in maximum-minimum fashion, and the set T of class peak response positions is obtained from the normalized class response map:
R = (R_o - min(R_o)) / (max(R_o) - min(R_o)),
wherein R is the normalized class response map, R_o is the class response map obtained in step 1.3, min(R_o) is the minimum value of the class response map, and max(R_o) is its maximum value.
6. The method according to claim 1, characterized in that step 3 comprises the following sub-steps:
Step 3.1: using the attention maps obtained in step 2, summing the attention maps of each group to obtain the saliency maps Q_d and Q_c used to guide resampling:
Q_d = Σ A_i, if (x_i, y_i) ∈ T_d,
Q_c = Σ A_i, if (x_i, y_i) ∈ T_c,
wherein Q_d is the discriminative-branch saliency map, Q_c is the complementary-branch saliency map, A_i is the attention map of the i-th class peak response, T_d is the set of discriminative class peak responses, and T_c is the set of complementary class peak responses;
Step 3.2: the two saliency maps computed in step 3.1 guide the resampling of the original image to obtain the resampled images, computed as:
f(x, y) = Σ_{(x', y')} Q(x', y') k((x', y'), (x, y)) x' / Σ_{(x', y')} Q(x', y') k((x', y'), (x, y)),
g(x, y) = Σ_{(x', y')} Q(x', y') k((x', y'), (x, y)) y' / Σ_{(x', y')} Q(x', y') k((x', y'), (x, y)),
wherein (x', y') are the coordinates of pixels in the original image, f(x, y) is the corresponding abscissa in the sampled image, g(x, y) is the corresponding ordinate in the sampled image, Q is Q_d or Q_c, and k((x', y'), (x, y)) is a Gaussian kernel function.
7. The method of claim 6, wherein the resampling process is implemented by convolution operations.
8. The method according to claim 1, wherein the selective sparse sampling-based fine-grained image classification method further comprises a model optimization process, comprising the following steps:
Step 4.1: designing a cross-entropy loss function, calculating the gradient of the classification network according to the loss function, back-propagating the gradient through the whole classification network, and updating the network parameters;
the model classification cross-entropy loss being defined as:
L_cls = -Σ_{i ∈ {O, D, C}} Y* · log Y_i - Y* · log Y_j,
wherein L_cls denotes the cross-entropy loss, Y_i are the prediction vectors corresponding to the original image and the resampled images, Y_j is the prediction vector corresponding to the joint features, Y* is the image label, O denotes the original image, D the discriminative resampled image, and C the complementary resampled image;
Step 4.2: judging, from the classification error calculated by the cross-entropy loss function, whether the network has converged or the maximum number of iterations has been reached; if so, stopping network training, otherwise jumping to step 1.
9. The method of claim 1, wherein the selective sparse sampling-based fine-grained image classification method is further applied to target localization, comprising the following steps:
Step 1: calculating the top-1 class response maps M_O, M_D and M_C corresponding to the original image, the discriminative branch and the complementary branch;
Step 2: mapping the class response map M_D corresponding to the discriminative branch and the class response map M_C corresponding to the complementary branch back to the space of the original class response map M_O through the corresponding inverse transformations, then adding M_O, M_D and M_C to generate the final class response map M_final;
Step 3: upsampling the final class response map M_final to the size of the original image, segmenting the upsampled map using its mean value as threshold, and selecting the minimum bounding box of the largest connected component as the target localization result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910942790.8A CN110738247B (en) | 2019-09-30 | 2019-09-30 | Fine-grained image classification method based on selective sparse sampling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110738247A CN110738247A (en) | 2020-01-31 |
CN110738247B true CN110738247B (en) | 2020-08-25 |
Family
ID=69269842
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910942790.8A Active CN110738247B (en) | 2019-09-30 | 2019-09-30 | Fine-grained image classification method based on selective sparse sampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110738247B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111507403B (en) * | 2020-04-17 | 2024-11-05 | 腾讯科技(深圳)有限公司 | Image classification method, apparatus, computer device and storage medium |
CN111368942B (en) * | 2020-05-27 | 2020-08-25 | 深圳创新奇智科技有限公司 | Commodity classification identification method and device, electronic equipment and storage medium |
CN111915618B (en) * | 2020-06-02 | 2024-05-14 | 华南理工大学 | Peak response enhancement-based instance segmentation algorithm and computing device |
CN111784673B (en) * | 2020-06-30 | 2023-04-18 | 创新奇智(上海)科技有限公司 | Defect detection model training and defect detection method, device and storage medium |
CN111816308B (en) * | 2020-07-13 | 2023-09-29 | 中国医学科学院阜外医院 | System for predicting coronary heart disease onset risk through facial image analysis |
CN111967527B (en) * | 2020-08-21 | 2022-09-06 | 菏泽学院 | Peony variety identification method and system based on artificial intelligence |
KR102483693B1 (en) * | 2020-12-02 | 2023-01-03 | 울산대학교 산학협력단 | Method and apparatus of explainable multi electrocardiogram arrhythmia diagnosis |
CN112906810B (en) * | 2021-03-08 | 2024-04-16 | 共达地创新技术(深圳)有限公司 | Target detection method, electronic device, and storage medium |
CN113177546A (en) * | 2021-04-30 | 2021-07-27 | 中国科学技术大学 | Target detection method based on sparse attention module |
CN113470029B (en) * | 2021-09-03 | 2021-12-03 | 北京字节跳动网络技术有限公司 | Training method and device, image processing method, electronic device and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101729911A (en) * | 2009-12-23 | 2010-06-09 | 宁波大学 | Multi-view image color correction method based on visual perception |
CN109583305A (en) * | 2018-10-30 | 2019-04-05 | 南昌大学 | A kind of advanced method that the vehicle based on critical component identification and fine grit classification identifies again |
KR20190109194A (en) * | 2018-03-16 | 2019-09-25 | 주식회사 에이아이트릭스 | Apparatus and method for learning neural network capable of modeling uncerrainty |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10565305B2 (en) * | 2016-11-18 | 2020-02-18 | Salesforce.Com, Inc. | Adaptive attention model for image captioning |
CN109284749A (en) * | 2017-07-19 | 2019-01-29 | 微软技术许可有限责任公司 | Refine image recognition |
CN108334901A (en) * | 2018-01-30 | 2018-07-27 | 福州大学 | A kind of flowers image classification method of the convolutional neural networks of combination salient region |
CN110197202A (en) * | 2019-04-30 | 2019-09-03 | 杰创智能科技股份有限公司 | A kind of local feature fine granularity algorithm of target detection |
- 2019-09-30: application CN201910942790.8A filed; patent CN110738247B granted, status Active
Also Published As
Publication number | Publication date |
---|---|
CN110738247A (en) | 2020-01-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110738247B (en) | Fine-grained image classification method based on selective sparse sampling | |
US11960568B2 (en) | Model and method for multi-source domain adaptation by aligning partial features | |
CN113378632B (en) | Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method | |
CN110443818B (en) | Graffiti-based weak supervision semantic segmentation method and system | |
CN107657279B (en) | Remote sensing target detection method based on small amount of samples | |
EP3478728B1 (en) | Method and system for cell annotation with adaptive incremental learning | |
CN109993102B (en) | Similar face retrieval method, device and storage medium | |
CN109871875B (en) | Building change detection method based on deep learning | |
CN107633226B (en) | Human body motion tracking feature processing method | |
US20240257423A1 (en) | Image processing method and apparatus, and computer readable storage medium | |
CN108959305A (en) | A kind of event extraction method and system based on internet big data | |
CN110929802A (en) | Information entropy-based subdivision identification model training and image identification method and device | |
CN106408030A (en) | SAR image classification method based on middle lamella semantic attribute and convolution neural network | |
CN110751027B (en) | Pedestrian re-identification method based on deep multi-instance learning | |
CN110728694A (en) | Long-term visual target tracking method based on continuous learning | |
CN112528022A (en) | Method for extracting characteristic words corresponding to theme categories and identifying text theme categories | |
CN112541884B (en) | Defect detection method and device, and computer readable storage medium | |
Zhang et al. | An efficient class-constrained DBSCAN approach for large-scale point cloud clustering | |
CN110287970B (en) | Weak supervision object positioning method based on CAM and covering | |
CN108805181B (en) | Image classification device and method based on multi-classification model | |
Jiang et al. | Dynamic proposal sampling for weakly supervised object detection | |
CN115311449A (en) | Weak supervision image target positioning analysis system based on class reactivation mapping chart | |
CN114445716B (en) | Key point detection method, key point detection device, computer device, medium, and program product | |
CN117371511A (en) | Training method, device, equipment and storage medium for image classification model | |
US20220319002A1 (en) | Tumor cell isolines |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |