CN114821180B

CN114821180B - A Weakly Supervised Fine-grained Image Classification Method Based on Soft Threshold Penalty Mechanism

Info

Publication number: CN114821180B
Application number: CN202210487333.6A
Authority: CN
Inventors: 董琴; 范浩楠; 刘柱; 杨国宇
Original assignee: Yancheng Institute of Technology
Current assignee: Beijing Aifenghuan Information Technology Co ltd
Priority date: 2022-05-06
Filing date: 2022-05-06
Publication date: 2022-12-06
Anticipated expiration: 2042-05-06
Also published as: CN114821180A

Abstract

The invention provides a weak supervision fine-grained image classification method based on a soft threshold punishment mechanism, which comprises the following steps: step 1: constructing a fine-grained image classification network of a secondary cascade network structure based on a soft threshold punishment mechanism; and 2, step: acquiring an image to be classified; and step 3: preprocessing the image to be classified; and 4, step 4: and classifying the images of the preprocessing result based on the fine-grained image classification network, and outputting the image classification result. The weak supervision fine-grained image classification method based on the soft threshold punishment mechanism optimizes the MMAL-Net network structure, reduces the calculated amount of the whole model by reducing the number of network branch layers, and also reduces the requirement of a training model on hardware; on the basis of a new branch structure, adding a soft threshold punishment mechanism module to resist noise appearing in the image; effectively shielding interference information, thereby improving the overall precision.

Description

A Weakly Supervised Fine-grained Image Classification Method Based on Soft Threshold Penalty Mechanism

技术领域technical field

本发明涉及图像分类和智能优化技术领域，特别涉及一种基于软阈值惩罚机制的弱监督细粒度图像分类方法。The invention relates to the technical field of image classification and intelligent optimization, in particular to a weakly supervised fine-grained image classification method based on a soft threshold penalty mechanism.

背景技术Background technique

MMAL-Net是多分支多尺度学习网络，基于全局特征的弱监督细粒度分类方法。沿用了另一种局部特征分类所用到的方法，以三级级联作为整体架构。MMAL-Net模型分类的准确性较高，在诸多的数据集上能够到达SOTA的准确率，甚至在飞行器数据集上达到94.7％，是目前该数据集的最高准确率。MMAL-Net的算法流程如图2所示。MMAL-Net is a multi-branch multi-scale learning network, a weakly supervised fine-grained classification method based on global features. Another method used in local feature classification is followed, with a three-level cascade as the overall architecture. The classification accuracy of the MMAL-Net model is high, and it can reach the accuracy rate of SOTA on many data sets, even reaching 94.7% on the aircraft data set, which is currently the highest accuracy rate of this data set. The algorithm flow of MMAL-Net is shown in Figure 2.

MMAL-Net以RA-CNN网络作为基础结构，选用了三级级联网络的方式来进行，在每一级网络上，采用ResNet进行特征的提取与分类，与之不同的是，在每一级网络之间，穿插有两个模块，分别是AOLM(Attention Object Location Module)和APPM(Attention PartProposal Module)模块，这两个模块将整个三级网络分成了三个分支，分别是原始图像分支、主体图像分支和部分图像分支。MMAL-Net uses the RA-CNN network as the basic structure, and uses a three-level cascaded network. On each level of the network, ResNet is used to extract and classify features. The difference is that at each level Between the network, there are two modules interspersed, which are AOLM (Attention Object Location Module) and APPM (Attention PartProposal Module) modules. These two modules divide the entire three-level network into three branches, which are the original image branch and the main body. Image branch and partial image branch.

ALOM用来预测物体的位置。采用了特征图的聚合操作，在特征提取的最后阶段，让特征图在通道维度进行聚合，设置相应阈值提取响应较高的最大连通域，并将该相应区域的感受野映射回原始图像，便可以找到原始图像中主体部分的位置，再将此部分的切割出来，然后再次进行特征的提取。ALOM is used to predict the position of objects. The aggregation operation of the feature map is adopted. In the final stage of feature extraction, the feature map is aggregated in the channel dimension, and the corresponding threshold is set to extract the largest connected domain with a higher response, and the receptive field of the corresponding area is mapped back to the original image, which is convenient. The position of the main part in the original image can be found, and then this part is cut out, and then the features are extracted again.

APPM在不需要边界边框或者标注的情况下，预测物体重点区域的信息。选用了一些固定大小的滑窗，并对滑窗中的数据进行池化操作，从而算出来每一个区域的计算结果，并对这些结果进行排序，选择结果比较大的区域，在对该区域进行非极大抑制操作后，再将该部件的图像输入网络。APPM predicts the information of the key area of the object without the need for border boxes or labels. Select some fixed-size sliding windows, and perform pooling operations on the data in the sliding windows, so as to calculate the calculation results of each area, and sort these results, select the area with relatively large results, and perform the calculation on the area. After the non-maximum suppression operation, the image of the part is fed into the network.

在MMAL-Net三级网络中所用到的Loss函数均是基础的Cross-entropy loss，最后对三级网络的loss值求和，得出最终的loss。The Loss function used in the MMAL-Net three-level network is the basic Cross-entropy loss, and finally the loss value of the three-level network is summed to obtain the final loss.

但是，MMAL-Net存在以下两点缺点：However, MMAL-Net has the following two disadvantages:

(1)在整个三级级联网络结构中，虽然通过共用一组参数让整体网络的参数数量减少，但计算量却因为复杂的三次级联结构显著提升，又因为前两级网络的输入都是448*448像素的图片，让整个网络的计算量又一步提升，因此训练网络的速度变慢很多，也占用了很大一部分显存。(1) In the entire three-level cascaded network structure, although the number of parameters in the overall network is reduced by sharing a set of parameters, the amount of calculation is significantly increased due to the complex three-level cascaded structure, and because the input of the first two levels of the network are both It is a 448*448 pixel picture, which further increases the calculation amount of the entire network, so the speed of training the network is much slower, and it also takes up a large part of the video memory.

(2)在中部分支的设置中，可以通过原图得到主体的定位信息，从而得到具有辨识性的区域。但是在一些不存在明显主体的细粒度图像分类任务中，这一分支的存在反而会削弱整个的特征提取能力。(2) In the setting of the middle branch, the positioning information of the main body can be obtained through the original image, so as to obtain a recognizable area. However, in some fine-grained image classification tasks where there is no obvious subject, the existence of this branch will weaken the entire feature extraction ability.

发明内容Contents of the invention

本发明提供一种基于软阈值惩罚机制的弱监督细粒度图像分类方法，优化了MMAL-Net网络结构，通过减少网络分支层数，来减少整体模型的计算量，也降低了训练模型对硬件的要求；在新分支结构的基础上，加入软阈值惩罚机制模块，来抵制图像中出现的噪声；有效屏蔽干扰信息，从而使得整体精度提升。The present invention provides a weakly supervised fine-grained image classification method based on a soft threshold penalty mechanism, optimizes the MMAL-Net network structure, reduces the amount of calculation of the overall model by reducing the number of network branch layers, and also reduces the training model on the hardware. Requirements: On the basis of the new branch structure, add a soft threshold penalty mechanism module to resist the noise in the image; effectively shield the interference information, so that the overall accuracy is improved.

本发明提供一种基于软阈值惩罚机制的弱监督细粒度图像分类方法，包括：The present invention provides a weakly supervised fine-grained image classification method based on a soft threshold penalty mechanism, including:

步骤1：基于软阈值惩罚机制，构建二级级联网络结构的细粒度图像分类网络；Step 1: Based on the soft threshold penalty mechanism, construct a fine-grained image classification network with a two-level cascaded network structure;

步骤2：获取待分类图像；Step 2: Obtain images to be classified;

步骤3：对所述待分类图像进行预处理；Step 3: Preprocessing the image to be classified;

步骤4：基于所述细粒度图像分类网络，对预处理结果进行图像分类，并输出图像分类结果。Step 4: Based on the fine-grained image classification network, perform image classification on the preprocessing result, and output the image classification result.

优选的，所述步骤1：基于软阈值惩罚机制，构建二级级联网络结构的细粒度图像分类网络，包括：Preferably, said step 1: based on a soft threshold penalty mechanism, constructing a fine-grained image classification network with a two-level cascaded network structure, including:

构建第一网络分支，所述第一网络分支包括：依次连接的Input448*448*3、第一ResNet50、Feature14*14*2048、第一GAP、第一FC和第一Softmax；Constructing a first network branch, the first network branch comprising: Input448*448*3, the first ResNet50, Feature14*14*2048, the first GAP, the first FC and the first Softmax connected in sequence;

构建第二网络分支，所述第二网络分支包括：依次连接的Input224*224*3*mult、第二ResNet50、Feature*7*7*2048*mult、第二GAP、第二FC和第二Softmax；Construct the second network branch, the second network branch includes: Input224*224*3*mult, the second ResNet50, Feature*7*7*2048*mult, the second GAP, the second FC and the second Softmax connected in sequence ;

将所述第二网络分支中的Input224*224*3*mult通过crop与所述第一网络分支中的Input448*448*3连接；Connect Input224*224*3*mult in the second network branch to Input448*448*3 in the first network branch through crop;

将所述第一网络分支中的Feature14*14*2048通过APPM与所述crop连接；Connect Feature14*14*2048 in the first network branch to the crop through APPM;

为所述第一网络分支设置第一损失函数RawLoss；setting a first loss function RawLoss for the first network branch;

为所述第二网络分支设置第二损失函数PartLoss；setting a second loss function PartLoss for the second network branch;

在所述APPM中设置软阈值惩罚机制；Set a soft threshold penalty mechanism in the APPM;

所述第一网络分支、第二网络分支、APPM和crop组成二级级联网络结构的细粒度图像分类网络。The first network branch, the second network branch, APPM and crop form a fine-grained image classification network with a two-level cascade network structure.

优选的，所述APPM是基于SCDA形成，将所述APPM对特征提取出来的所述Feature14*14*2048沿着池化层的通道方向进行合拢，得到14*14*1的二维图，用预设的多个不同尺寸的滑窗对所述二维图进行滑窗计算，计算过程如公式(2-1)所示：Preferably, the APPM is formed based on SCDA, and the Feature14*14*2048 extracted from the APPM is combined along the channel direction of the pooling layer to obtain a two-dimensional map of 14*14*1, which is used A plurality of preset sliding windows of different sizes perform sliding window calculation on the two-dimensional graph, and the calculation process is shown in formula (2-1):

其中，H和W分别为滑窗的高度和宽度，A(x，y)为合拢好的二维图的坐标位置对应的数值，a_w为滑窗计算结果。Among them, H and W are the height and width of the sliding window respectively, A(x, y) is the value corresponding to the coordinate position of the closed two-dimensional image, and a _w is the calculation result of the sliding window.

优选的，所述第一损失函数RawLoss的公式如公式(2-2)所示：Preferably, the formula of the first loss function RawLoss is shown in formula (2-2):

其中，m_i为第i个样本图像，n_i为第一卷积神经网络CNN对应于第i个样本图像的预测概率。Among them, m _i is the i-th sample image, and _ni is the predicted probability of the first convolutional neural network CNN corresponding to the i-th sample image.

优选的，所述第二损失函数PartLoss的公式如公式(2-3)所示：Preferably, the formula of the second loss function PartLoss is shown in formula (2-3):

其中，q为由所述第二ResNet50筛选出的局部特征区域个数，m_iq为第i个样本图像对应的第q个局部特征区域个数，n_iq为第二卷积神经网络CNN对应于第i个样本图像对应的第q个局部特征区域个数的预测概率。Wherein, q is the number of local feature regions screened out by the second ResNet50, _{miq is the number of qth local feature regions corresponding to the i-th sample image, and n iq} _is the second convolutional neural network CNN corresponding to The predicted probability of the number of qth local feature regions corresponding to the i-th sample image.

优选的，所述软阈值惩罚机制包括：Preferably, the soft threshold penalty mechanism includes:

设F(x，y)为不含噪声的图像，N(x，y)为噪声，G(x，y)为噪声影响之后的图像，选用L_1/2范式进行模型的建立如公式(2-5)所示：Let F(x, y) be the image without noise, N(x, y) be the noise, G(x, y) be the image after noise influence, choose the L _1/2 paradigm to establish the model as formula (2 -5) Shown:

i表示的是图像的序号，当进行图像迭代处理时，G(x_i,y_i)-F(x_i,y_i)出现残差，说明图像中出现噪声，并造成影响；i represents the serial number of the image. When the image is iteratively processed, G( _xi ,y _i )-F( _xi ,y _i ) has a residual error, indicating that noise appears in the image and affects it;

通过软阈值来限定残差状态，先构造惩罚因子||G(x_i,y_i)-F(x_i,y_i)||_h，来限制G(x_i,y_i)-F(x_i,y_i)不大于0，从而降低图像被噪声干扰的程度，如公式(2-6)所示：To limit the residual state by soft threshold, first construct the penalty factor ||G(x _i ,y _i )-F(x _i ,y _i )|| _h to limit G(x _i ,y _i )-F(x _i , y _i ) is not greater than 0, thereby reducing the degree of image interference by noise, as shown in formula (2-6):

公式中λ表示惩罚系数，调节该系数可以使结果接近真实值，由于又软阈值的限制，近似值也可能会出现在真实值的之上，因此软阈值可以有效降低噪声干扰的影响。In the formula, λ represents the penalty coefficient. Adjusting this coefficient can make the result close to the real value. Due to the limitation of the soft threshold, the approximate value may also appear above the real value, so the soft threshold can effectively reduce the influence of noise interference.

优选的，为进一步优化软阈值方法，可进一步修改目标函数，如公式(2-7)所示：Preferably, in order to further optimize the soft threshold method, the objective function can be further modified, as shown in formula (2-7):

其中，V为辅助变量，λ₁和λ₂均为惩罚系数，在计算过程中，使用软阈值算法对V进行迭代更新处理。Among them, V is an auxiliary variable, and λ ₁ and λ ₂ are penalty coefficients. In the calculation process, V is iteratively updated using the soft threshold algorithm.

优选的，所述步骤2：获取待分类图像，包括：Preferably, the step 2: obtaining images to be classified includes:

当用户在相册的查看界面触控圈选多个第一照片形成圈选框且所述圈选框在预设的第一时间内沿同一扩大方向扩大时，获取并输出预设的免触圈选提示信息，同时，控制所述圈选框沿所述扩大方向以预设的第一扩大速度继续进行扩大；When the user touches and selects a plurality of first photos on the viewing interface of the album to form a circle, and the circle expands along the same expansion direction within a preset first time, acquire and output the preset non-touch circle Select the prompt information, and at the same time, control the circled box to continue to expand along the expansion direction at a preset first expansion speed;

动态获取所述用户当前的眼部视线；Dynamically acquire the current eye sight of the user;

确定所述查看界面内对应于所述眼部视线的第一注视点位；determining a first fixation point corresponding to the eye line of sight in the viewing interface;

若所述第一注视点位落在所述查看界面内的所述圈选框内，获取所述第一注视点位与所述圈选框的所述扩大方向上的目标框边之间的第一垂直距离；If the first gaze point falls within the circled box in the viewing interface, obtain the distance between the first gaze point and the target frame edge in the expanding direction of the circled box first vertical distance;

基于所述第一垂直距离，对所述第一扩大速度进行调整，调整公式如公式(2-8)所示：Based on the first vertical distance, the first expansion speed is adjusted, and the adjustment formula is shown in formula (2-8):

其中，v₁′为调整后的所述第一扩大速度，v₁为调整前的所述第一扩大速度，l₁为所述第一垂直距离，

为预设的第一关系系数；Wherein, v ₁ ′ is the first expansion speed after adjustment, v ₁ is the first expansion speed before adjustment, l ₁ is the first vertical distance,

is the preset first relational coefficient;

若所述第一注视点位落在所述查看界面内的所述圈选框外所述扩大方向上的待圈选范围内且所述第一注视点位在预设的第二时间内发生变化，获取所述第一注视点位与所述圈选框的所述扩大方向上的目标框边之间的第二垂直距离；If the first gaze point falls within the range to be circled in the expansion direction outside the circled box in the viewing interface and the first gaze point occurs within the preset second time change, obtaining a second vertical distance between the first gaze point and the target frame edge in the expansion direction of the circled box;

基于所述第二垂直距离，对所述第一扩大速度进行调整，调整公式如公式(2-9)所示：Based on the second vertical distance, the first expansion speed is adjusted, and the adjustment formula is as shown in formula (2-9):

其中，v₂′为调整后的所述第一扩大速度，v₂为调整前的所述第一扩大速度，l₂为所述第二垂直距离，

为预设的第二关系系数；Wherein, v ₂ ′ is the first expansion speed after adjustment, v ₂ is the first expansion speed before adjustment, l ₂ is the second vertical distance,

is the preset second relational coefficient;

若所述第一注视点位落在所述查看界面内的所述圈选框外所述扩大方向上的待圈选范围内且所述第一注视点位在所述第二时间内未发生变化，获取所述查看界面内对应于所述第一注视点位的第二照片；If the first gaze point falls within the range to be circled in the expanding direction outside the circled box in the viewing interface and the first gaze point does not occur within the second time change, obtaining a second photo corresponding to the first gaze point in the viewing interface;

当所述第二照片刚好进入所述圈选框时，控制所述圈选框停止扩大；When the second photo just enters the circled box, control the circled box to stop expanding;

获取所述圈选框的所述扩大方向上的目标框边的移动类型；Acquiring the movement type of the target frame side in the expansion direction of the circled box;

当所述移动类型为向下换行时，控制所述圈选框退选所述查看界面内所述第二照片所在行内所述第二照片右侧的全部第三照片；When the movement type is line feed downward, control the circled box to deselect all the third photos on the right side of the second photo in the row where the second photo is located in the viewing interface;

当所述移动类型为向上换行时，控制所述圈选框退选所述查看界面内所述第二照片所在行内所述第二照片左侧的全部第四照片；When the movement type is upward line break, control the circled box to deselect all fourth photos on the left side of the second photo in the row where the second photo is located in the viewing interface;

当所述移动类型为向右换列时，控制所述圈选框退选所述查看界面内所述第二照片所在列内所述第二照片下侧的全部第五照片；When the movement type is changing columns to the right, control the circled box to deselect all the fifth photos below the second photo in the column where the second photo is located in the viewing interface;

当所述移动类型为向左换列时，控制所述圈选框退选所述查看界面内所述第二照片所在列内所述第二照片上侧的全部第六照片；When the movement type is changing columns to the left, control the circled box to deselect all the sixth photos above the second photo in the column where the second photo is located in the viewing interface;

退选完成后，将所述圈选框内圈选的全部第七照片作为待分类图像，完成获取。After the unselection is completed, all the seventh photos circled in the circled frame are used as images to be classified, and the acquisition is completed.

优选的，所述获取所述查看界面内对应于所述第一注视点位的第二照片，包括：Preferably, the acquiring the second photo corresponding to the first gaze point in the viewing interface includes:

获取显示界面当前的缩略比例；Get the current thumbnail ratio of the display interface;

获取对应于所述显示界面的大小的预设的缩略比例阈值；Acquiring a preset thumbnail ratio threshold corresponding to the size of the display interface;

若所述缩略比例大于等于所述缩略比例阈值，确定所述查看界面内位于所述第一注视点位的第八照片，并作为第二照片，完成获取；If the thumbnail ratio is greater than or equal to the thumbnail ratio threshold, determine the eighth photo located at the first gaze point in the viewing interface, and use it as the second photo to complete the acquisition;

否则，确定所述查看界面内位于所述第一注视点位的多个第九照片；Otherwise, determine a plurality of ninth photos located at the first gaze point in the viewing interface;

基于所述第九照片，生成截止照片放大确认框；Based on the ninth photo, a cut-off photo zoom-in confirmation box is generated;

在所述显示界面内悬浮显示所述截止照片放大确认框；A confirmation box for zooming in on the cut-off photo is suspended in the display interface;

确定所述查看界面内对应于所述用户当前的眼部视线的第二注视点位；determining a second fixation point corresponding to the user's current eye sight in the viewing interface;

当所述第二注视点位落在所述截止照片放大确认框内且在预设的第三时间内所述第二注视点位未发生改变时，将所述截止照片放大确认框内位于所述第二注视点位的所述第九照片，并作为第二照片，完成获取。When the second gaze point falls within the cut-off photo zoom-in confirmation frame and the second gaze point has not changed within the preset third time, place the cut-off photo zoom-in confirmation frame at the The ninth photo of the second gaze point is used as the second photo to complete the acquisition.

本发明提供一种基于软阈值惩罚机制的弱监督细粒度图像分类系统，包括：The present invention provides a weakly supervised fine-grained image classification system based on a soft threshold penalty mechanism, including:

构建模块，用于基于软阈值惩罚机制，构建二级级联网络结构的细粒度图像分类网络；A building block for constructing a fine-grained image classification network with a two-level cascaded network structure based on a soft threshold penalty mechanism;

获取模块，用于获取待分类图像；An acquisition module, configured to acquire images to be classified;

预处理模块，用于对所述待分类图像进行预处理；A preprocessing module, configured to preprocess the image to be classified;

分类模块，用于基于所述细粒度图像分类网络，对预处理结果进行图像分类，并输出图像分类结果。The classification module is configured to perform image classification on the preprocessing result based on the fine-grained image classification network, and output the image classification result.

本发明的其它特征和优点将在随后的说明书中阐述，并且，部分地从说明书中变得显而易见，或者通过实施本发明而了解。本发明的目的和其他优点可通过在所写的说明书、权利要求书、以及附图中所特别指出的结构来实现和获得。Additional features and advantages of the invention will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

下面通过附图和实施例，对本发明的技术方案做进一步的详细描述。The technical solutions of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments.

附图说明Description of drawings

附图用来提供对本发明的进一步理解，并且构成说明书的一部分，与本发明的实施例一起用于解释本发明，并不构成对本发明的限制。在附图中：The accompanying drawings are used to provide a further understanding of the present invention, and constitute a part of the description, and are used together with the embodiments of the present invention to explain the present invention, and do not constitute a limitation to the present invention. In the attached picture:

图1为本发明实施例中一种基于软阈值惩罚机制的弱监督细粒度图像分类方法的流程图；Fig. 1 is a flow chart of a weakly supervised fine-grained image classification method based on a soft threshold penalty mechanism in an embodiment of the present invention;

图2为本发明实施例中MMAL-Net的结构示意图；Fig. 2 is the structural representation of MMAL-Net in the embodiment of the present invention;

图3为本发明实施例中细粒度图像分类网络的结构示意图；3 is a schematic structural diagram of a fine-grained image classification network in an embodiment of the present invention;

图4为本发明实施例中一种基于软阈值惩罚机制的弱监督细粒度图像分类系统的示意图。FIG. 4 is a schematic diagram of a weakly supervised fine-grained image classification system based on a soft threshold penalty mechanism in an embodiment of the present invention.

具体实施方式detailed description

以下结合附图对本发明的优选实施例进行说明，应当理解，此处所描述的优选实施例仅用于说明和解释本发明，并不用于限定本发明。The preferred embodiments of the present invention will be described below in conjunction with the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present invention, and are not intended to limit the present invention.

本发明提供一种基于软阈值惩罚机制的弱监督细粒度图像分类方法，如图1所示，包括：The present invention provides a weakly supervised fine-grained image classification method based on a soft threshold penalty mechanism, as shown in Figure 1, including:

步骤2：获取待分类图像；Step 2: Obtain images to be classified;

所述步骤1：基于软阈值惩罚机制，构建二级级联网络结构的细粒度图像分类网络，包括：The step 1: based on the soft threshold penalty mechanism, construct a fine-grained image classification network with a two-level cascaded network structure, including:

所述APPM是基于SCDA形成，将所述APPM对特征提取出来的所述Feature14*14*2048沿着池化层的通道方向进行合拢，得到14*14*1的二维图，用预设的多个不同尺寸的滑窗对所述二维图进行滑窗计算，计算过程如公式(2-1)所示：The APPM is formed based on SCDA, and the Feature14*14*2048 extracted by the APPM is combined along the channel direction of the pooling layer to obtain a two-dimensional map of 14*14*1, using the preset A plurality of sliding windows of different sizes performs sliding window calculation on the two-dimensional graph, and the calculation process is shown in formula (2-1):

所述第一损失函数RawLoss的公式如公式(2-2)所示：The formula of the first loss function RawLoss is shown in formula (2-2):

所述第二损失函数PartLoss的公式如公式(2-3)所示：The formula of the second loss function PartLoss is shown in formula (2-3):

所述软阈值惩罚机制包括：The soft threshold penalty mechanism includes:

为进一步优化软阈值方法，可进一步修改目标函数，如公式(2-7)所示：In order to further optimize the soft threshold method, the objective function can be further modified, as shown in formula (2-7):

上述技术方案的工作原理及有益效果为：The working principle and beneficial effects of the above-mentioned technical scheme are:

本发明把MMAL-Net模型的三级级联网络修改成二级级联方式，把最后一级的推理任务放到了第一分支上。也就是说，在整个网络模型的运行过程中，仅需要通过一个普通的网络分类模型，就能得到分类细粒度网络的效果。其结构如图3所示。The invention modifies the three-level cascade network of the MMAL-Net model into a two-level cascade mode, and puts the reasoning task of the last level on the first branch. That is to say, during the operation of the entire network model, the effect of classifying fine-grained networks can be obtained only through an ordinary network classification model. Its structure is shown in Figure 3.

Input448*448*3为图像输入(图像大小为448像素*448像素，R(红)、G(绿)、B(蓝)三个颜色通道)；Input448*448*3 is the image input (the image size is 448 pixels*448 pixels, three color channels of R (red), G (green), and B (blue));

第一ResNet50为一种网络模型，深度残差网络，50是指网络层数；The first ResNet50 is a network model, a deep residual network, and 50 refers to the number of network layers;

Feature14*14*2048为2048张14像素*14像素特征二维图；Feature14*14*2048 is 2048 14-pixel*14-pixel feature two-dimensional images;

第一GAP为全局平均池化(Global Average Pooling)；The first GAP is Global Average Pooling (Global Average Pooling);

第一FC为全连接层(Fully Connection)；The first FC is a fully connected layer (Fully Connection);

第一Softmax为作为输出层的激励函数，在机器学习中常被看作是一种多分类器。通俗的意思就是，将一个物品输入，得出其中可能属于的类别概率。The first Softmax is the activation function as the output layer, which is often regarded as a multi-classifier in machine learning. The popular meaning is to input an item and get the probability of the category it may belong to.

Input224*224*3*mult为对Input448*448*3进行crop方法所得到的固定大小的图像输入(第一分支图像输入的部分区块，图像大小为224像素*224像素，R(红)、G(绿)、B(蓝)三个颜色通道)，mult是指图像裁剪后所得到固定区块数目。Input224*224*3*mult is the fixed-size image input obtained by performing the crop method on Input448*448*3 (partial blocks of the first branch image input, the image size is 224 pixels*224 pixels, R (red), G (green), B (blue) three color channels), mult refers to the fixed number of blocks obtained after image cropping.

第二ResNet50与第一ResNet50同理；The second ResNet50 is the same as the first ResNet50;

Feature*7*7*2048*mult为mult份2048张7像素*7像素特征二维图；Feature*7*7*2048*mult is 2048 7-pixel*7-pixel feature two-dimensional images for mult;

第二GAP、第二FC和第二Softmax分别与第一GAP、第一FC和第一Softmax同理；The second GAP, the second FC and the second Softmax are the same as the first GAP, the first FC and the first Softmax respectively;

Crop为裁剪，是直接从图像中截出一部分，保留原图像的真实尺寸比。根据图像裁剪的方法不同，所得到的裁剪图像的数量也不同。Crop is cropping, which cuts out a part of the image directly and retains the real size ratio of the original image. Depending on the method of image cropping, the number of cropped images obtained is also different.

经过图像预处理(滤波降噪、灰度化和缩至448像素*448像素大小)，将缩至448像素*448像素大小的图像作为网络的第一级输入。在网络模型的设计过程中，以ResNet50作为特征提取的主干网络。残差块经过特殊的结构设计得到ResNet50网络，其中50是指深度网络中卷积层和全连接层的总层数，整体精度适中，总体的计算量也适中，ResNet50的残差块采用了BottleNeck结构，既不会影响精度，也不会降低整体的计算量，特征通道的数量为2048，也更有助于选择出具有辨识性的区域。整体上，网络的两个分支特征提取采用了参数共享的方式，不仅减少了整个网络的参数量，还能使个网络适用于不同尺寸、不同部位的图像。After image preprocessing (filtering noise reduction, grayscale and shrinking to 448 pixels*448 pixels), the image shrunk to 448 pixels*448 pixels is used as the first-level input of the network. In the design process of the network model, ResNet50 is used as the backbone network for feature extraction. The residual block is specially designed to obtain the ResNet50 network, where 50 refers to the total number of convolutional layers and fully connected layers in the deep network. The overall accuracy is moderate, and the overall calculation amount is also moderate. The residual block of ResNet50 uses BottleNeck The structure will neither affect the accuracy nor reduce the overall calculation amount. The number of feature channels is 2048, and it is more helpful to select discriminative regions. On the whole, the feature extraction of the two branches of the network adopts the method of parameter sharing, which not only reduces the amount of parameters of the entire network, but also makes the network suitable for images of different sizes and different parts.

网络的局部特征提取选用了APPM结构，是基于SCDA(Selective ConvolutionalDescriptor Aggregation，选择性卷积描述符聚合)的研究形成的。首先，将模块对特征提取出来的14*14*2048沿着通道方向进行合拢，便得到了14*14*1的二维图。14*14*1的二维图为1张14像素*14像素大小的图像特征二维图。The local feature extraction of the network uses the APPM structure, which is formed based on the research of SCDA (Selective Convolutional Descriptor Aggregation, Selective Convolutional Descriptor Aggregation). First, the 14*14*2048 extracted by the module is combined along the channel direction to obtain a 14*14*1 two-dimensional map. The 14*14*1 two-dimensional map is a two-dimensional map of image features with a size of 14 pixels*14 pixels.

然后，用几种预先设定好的不同尺寸的滑窗来计算，滑窗内部的计算过程如公式(2-1)所示Then, use several pre-set sliding windows of different sizes to calculate, and the calculation process inside the sliding window is shown in formula (2-1)

预设好滑窗的高度和宽度分别是公式中的H和W，A(x,y)指合拢好的二维图坐标位置的对应数值，a_w即表示通过当前的滑窗所计算出来的数值。A(x，y)为合拢好的二维图的坐标位置对应的数值具体为14*14*1的二维图在对应待分类图像即原始图像中位置坐标值。其次，对不同位置的a_w值排序，其中输出值越大的区域就是具有辨识性特征的区域。最后，在存在特征区域重叠，不能简单排序的区域，使用NMS(non maximum suppression，非极大抑制)的方式选取多个高辨识性、低冗余度的候选区作为局部特征区域，成为第二层的输入图像。The height and width of the preset sliding window are H and W in the formula respectively, A(x,y) refers to the corresponding value of the coordinate position of the closed two-dimensional graph, and a _w means the value calculated by the current sliding window value. A(x, y) is the value corresponding to the coordinate position of the folded two-dimensional image, specifically the position coordinate value of the 14*14*1 two-dimensional image in the corresponding image to be classified, that is, the original image. Secondly, the _aw values of different positions are sorted, and the area with the larger output value is the area with discriminative features. Finally, in areas where feature areas overlap and cannot be simply sorted, NMS (non maximum suppression, non-maximum suppression) is used to select multiple high-discrimination, low-redundancy candidate areas as local feature areas, which become the second layer's input image.

本发明所采用的损失函数就是多分类问题中最常用的Cross-Entropy(交叉熵)损失。第一层Raw Loss公式如(2-2)所示The loss function adopted in the present invention is the most commonly used Cross-Entropy (cross-entropy) loss in multi-classification problems. The first layer Raw Loss formula is shown in (2-2)

其中，i表示的是样本图像，m_i表示的是样本图像的真实标号，通常代入1来计算，n_i指的是此类别神经网络的预测概率。m_i为第i个样本图像，该样本图像来源于上文滑窗截取的图像，共2048张，即i取值从1到2048。Among them, i represents the sample image, m _i represents the real label of the sample image, which is usually calculated by substituting 1, and _ni refers to the predicted probability of this type of neural network. m _i is the i-th sample image, which is derived from the image intercepted by the sliding window above, a total of 2048 images, that is, the value of i is from 1 to 2048.

第二层Part Loss公式如(2-3)所示The second layer Part Loss formula is shown in (2-3)

此处的q表示筛选出的局部特征区域个数，对每一块区域计算并得出损失值，最后求出平均值。m_iq为第i个样本图像对应的第q个局部特征区域具体为：该样本图像来源于第一分支预处理后的图像crop(裁剪)之后的图像，其对应的第q个局部特征区域。第二ResNet50筛选出的局部特征区域来自于待分类图像即原始图像预处理后的图像，即经过滤波降噪、灰度化后的图像，共2048个局部特征区域。Here, q represents the number of selected local feature regions, calculates and obtains the loss value for each region, and finally calculates the average value. m _iq is the qth local feature region corresponding to the i-th sample image, specifically: the sample image is derived from the image after the crop (cropping) of the image preprocessed by the first branch, and its corresponding qth local feature region. The local feature regions selected by the second ResNet50 come from the image to be classified, that is, the preprocessed image of the original image, that is, the image after filtering, noise reduction, and grayscale, with a total of 2048 local feature regions.

第一卷积神经网络CNN来自于第一ResNet50中，ResNet50就是卷积神经网络的一种网络模型；第二卷积神经网络与之同理。The first convolutional neural network CNN comes from the first ResNet50, and ResNet50 is a network model of the convolutional neural network; the second convolutional neural network is the same.

总的损失函数公式如(2-4)所示The total loss function formula is shown in (2-4)

Loss_total＝μLoss_raw+ωLoss_part， (2-4)Loss _total ＝μLoss _raw +ωLoss _part , (2-4)

Loss_raw，loss_part分别表示第一层和第二层的损失值，Loss_total即表示整体损失，μ和ω是取0-1之间的值，表示的是两层分支对整体网络的影响权重，用以调节。Loss _raw and loss _part respectively represent the loss values of the first layer and the second layer, Loss _total represents the overall loss, μ and ω are values between 0-1, representing the influence weight of the two-layer branch on the overall network , to adjust.

为降低网络的复杂度，能够更好的被使用，本发明使用的轻量化方法使用的是SqueezeNet(轻量化卷积神经网)结构，首先采用1*1的卷积核对局部特征卷积，起到降维作用即减少输入的通道数量，然后在Expand层分别用1*1和3*3的卷积核进行卷积运算，最后将运行结果拼接得到最终的输出特征数据。In order to reduce the complexity of the network, it can be used better. The lightweight method used in the present invention uses the SqueezeNet (lightweight convolutional neural network) structure. When it comes to dimensionality reduction, the number of input channels is reduced, and then convolution operations are performed with 1*1 and 3*3 convolution kernels in the Expand layer, and finally the operation results are spliced to obtain the final output feature data.

整体网络结构层次的减少，能够在识别准确率依旧高的条件下有效减少总的计算量。但是就原始数据集而言存在很多的环境噪声，又缺失主体监测的分支。本发明引入软阈值惩罚机制，来对局部特征提取能力进一步的提升。噪声对图像的影响可分为加法模型和乘法模型，设定F(x,y)为不含噪声的图像，N(x,y)为噪声，G(x,y)为噪声影响之后的图像，即原始图像，为了能够更好地提取特征，本发明选用L_1/2范式进行模型的建立如公式(2-5)所示The reduction of the overall network structure level can effectively reduce the total amount of calculation under the condition that the recognition accuracy is still high. However, as far as the original data set is concerned, there is a lot of environmental noise, and the branch of subject monitoring is missing. The present invention introduces a soft threshold penalty mechanism to further improve the local feature extraction capability. The influence of noise on an image can be divided into an additive model and a multiplicative model. Set F(x,y) as an image without noise, N(x,y) as noise, and G(x,y) as the image after noise influence , that is, the original image. In order to better extract features, the present invention uses the L _1/2 paradigm to establish the model, as shown in formula (2-5)

i表示的是样本图像,当进行图像迭代处理时，G(x_i,y_i)-F(x_i,y_i)出现残差，说明图像中出现噪声，并造成影响。本发明提出的软阈值方法，通过软阈值来限定残差状态。i represents the sample image. When the image is iteratively processed, G( _xi ,y _i )-F( _xi ,y _i ) has a residual error, indicating that noise appears in the image and affects it. In the soft threshold method proposed by the present invention, the residual state is limited by the soft threshold.

(x_i,y_i)为第i个特征图像的坐标；

为在数学上表示真实值，在式中表示后面式中取得最小值时(x_i,y_i)的值；arg为复数辐角，指的是复数的辐角主值，在此式中argmin即表示后面函数取最小值时，xi和yi的取值；F、N、G仅为函数名。(x _i , y _i ) is the coordinates of the i-th feature image;

In order to express the real value mathematically, the value of ( _xi , y _i ) when the minimum value is obtained in the following formula is expressed in the formula; arg is the argument angle of a complex number, which refers to the main value of the argument angle of a complex number. In this formula, argmin That is to say, when the subsequent function takes the minimum value, the values of xi and yi; F, N, and G are only function names.

先构造惩罚因子||G(x_i,y_i)-F(x_i,y_i)||_h，来限制G(x_i,y_i)-F(x_i,y_i)不大于0，从而降低图像被噪声干扰的程度，如公式(2-6)所示First construct the penalty factor ||G( _xi ,y _i )-F( _xi ,y _i )|| _h to limit G( _xi ,y _i )-F( _xi ,y _i ) not greater than 0, Thereby reducing the degree to which the image is disturbed by noise, as shown in formula (2-6)

公式中λ表示惩罚系数，调节该系数可以使结果接近真实值，由于又软阈值的限制，近似值也可能会出现在真实值的之上，因此软阈值可以有效降低噪声干扰的影响。s.t.全称subject to，意思是使得...满足...在式中的意思是：在N(x_i,y_i)>0,F(x_i,y_i)>0条件下使上式成立。In the formula, λ represents the penalty coefficient. Adjusting this coefficient can make the result close to the real value. Due to the limitation of the soft threshold, the approximate value may also appear above the real value, so the soft threshold can effectively reduce the influence of noise interference. The full name of st is subject to, which means to make...satisfy...In the formula, it means: make the above formula true under the conditions of N( _xi ,y _i )>0,F( _xi ,y _i )>0 .

h为预设的常数，优先取1/2。h is a preset constant, preferably 1/2.

为进一步优化软阈值方法，可进一步修改目标函数To further optimize the soft threshold method, the objective function can be further modified

其中V为辅助变量，在计算过程中，使用软阈值算法对V进行迭代更新处理。利用软阈值惩罚机制有效抵抗数据中的噪声。Among them, V is an auxiliary variable. In the calculation process, the soft threshold algorithm is used to iteratively update V. Use the soft threshold penalty mechanism to effectively resist the noise in the data.

本申请优化了MMAL-Net网络结构，通过减少网络分支层数，来减少整体模型的计算量，也降低了训练模型对硬件的要求；在新分支结构的基础上，加入软阈值惩罚机制模块，来抵制图像中出现的噪声；有效屏蔽干扰信息，从而使得整体精度提升。This application optimizes the network structure of MMAL-Net, reduces the calculation amount of the overall model by reducing the number of network branch layers, and also reduces the hardware requirements of the training model; on the basis of the new branch structure, a soft threshold penalty mechanism module is added, To resist the noise in the image; effectively shield the interference information, so that the overall accuracy is improved.

细粒度图像分类最初是采用强监督的深度学习来完成分类任务，强监督细粒度图像分类方法的监督信息依赖过多的标注，除了用于细粒度分类的网络外，整个算法框架中需要一个部件定位的目标检测网络或是语义分割网络。这导致了数据标注的成本与网络结构的成本都是高昂的，使得强监督方法无法在实际生产过程中得到较好的应用。因此，本申请仅使用类别标签，不需要额外标注信息，属于弱监督方法。Fine-grained image classification initially uses strong-supervised deep learning to complete the classification task. The supervision information of the strong-supervised fine-grained image classification method relies on too many labels. In addition to the network for fine-grained classification, a component is required in the entire algorithm framework localized object detection network or semantic segmentation network. This leads to the high cost of data labeling and network structure, which makes the strong supervision method unable to be better applied in the actual production process. Therefore, this application only uses category labels and does not require additional labeling information, which is a weakly supervised method.

在一个实施例中，所述步骤2：获取待分类图像，包括：In one embodiment, the step 2: obtaining images to be classified includes:

is the preset first relational coefficient;

is the preset second relational coefficient;

一般的，当用户选择需要进行图像分类的图像时，由于选择的量一定较多，在智能终端(例如：手机和平板等)上触控操作时，需要用户持续触控操作，体验较差，若时间过长，可能会造成用户手指不适；特别是一些从事图像分类工作的用户，更是体验不佳。例如：用户打开手机相册，手指在手机屏幕左上角向右侧触控移动，使得全选界面内需要选择的照片，当需要继续向下选择照片时，手指需向下触控移动，使得界面向下滑动并同时全选新的界面内出现需要选择的照片，但是，手指需要保持不动，直至到截止照片时，手指才能松开。因此，亟需进行解决。Generally, when a user selects an image that needs to be classified, since there must be a large amount of selection, when performing touch operations on smart terminals (such as mobile phones and tablets, etc.), the user needs to continue to touch the operation, and the experience is poor. If the time is too long, it may cause discomfort to the user's fingers; especially for some users who are engaged in image classification, the experience is not good. For example: when the user opens the mobile phone photo album, the finger touches the upper left corner of the mobile phone screen and moves to the right, so that the photos to be selected in the all-selection interface are selected. Swipe down and select all at the same time. The photos to be selected appear in the new interface, but the finger needs to keep still until the photo is cut off before the finger can be released. Therefore, urgently need to solve.

当用户触控圈选相册内的第一照片时，已选的第一照片组成圈选框，当圈选框在预设的第一时间(例如：2秒)内沿同一扩大方向(例如：竖直向下)扩大时，说明用户需要继续选择相册内更多的第一照片(例如：相册的显示界面在向下移动)，此时，可以进行自动圈选介入，输出预设的免触圈选提示信息(例如：在显示界面上显示“手指可以移开的哦！开始自动圈选”)，并控制圈选框沿扩大方向(例如：竖直向下)以预设的第一扩大速度(例如：1.2cm/s)继续进行扩大。When the user touches to circle the first photo in the album, the selected first photo forms a marquee frame, and when the marquee frame expands in the same direction within the preset first time (for example: 2 seconds) (for example: When it expands vertically downward), it means that the user needs to continue to select more first photos in the album (for example: the display interface of the album is moving downward). Circle selection prompt information (for example: display on the display interface "You can move your finger! Start automatic circle selection"), and control the circle box to expand along the expansion direction (for example: vertically downward) with the preset first expansion The speed (for example: 1.2cm/s) continues to expand.

此时，用户会不断查看圈选框已圈选的第一照片或查看圈选框即将圈选的第一照片，确定是否到截止照片。获取用户的眼部视线(视线获取属于现有技术范畴，不作赘述)，确定查看界面内对应于眼部视线的第一注视点位，第一注视点位为用户正在注视的位置。若第一注视点位落在圈选框内，说明用户在查看圈选框已圈选的第一照片，不太跟得上圈选框的第一扩大速度，基于第一注视点位与目标框边的第一垂直距离，对第一扩大速度进行降速，第一垂直距离越大，说明第一扩大速度较快，应越对第一扩大速度进行降速。若第一注视点位落在查看界面内圈选框外扩大方向上的待圈选范围内且第一注视点位在预设的第二时间(例如：3秒)内发生变化，说明用户在查看圈选框即将圈选的第一照片且未到截止照片，第一注视点位与目标框边的第二垂直距离越大，说明第一扩大速度较慢，越需要对第一扩大速度提速。充分保证了圈选框的扩大速度能够贴近用户，提升了人性化，更提升了免触圈选的智能化。当第一注视点位落在待圈选范围内且第一注视点位在第二时间未发生变化，说明已到达截止照片；但是，截止照片不一定是一行中最后一张照片或一列中最后一张照片，需要对截止照片之后的第一照片进行剔除。进一步提升免触圈选的智能化和人性化。获取为截止照片的第二照片。当第二照片刚好进入圈选框时，基于目标框边的移动类型，剔除无用照片。一般的，移动类型为向下换行，用户查看照片是以从左到右循环查看，因此，控制圈选框退选第二照片所在行右侧的全部第三照片。当移动类型为向上换行时，与之同理。当移动类型为向右换列时，用户查看照片是以从上到小循环查看，因此，控制圈选框退选第二照片下侧的全部第五照片。当移动类型为向左换列时，与之同理。At this point, the user will continuously check the first photo that has been circled by the marquee box or the first photo that will be circled by the marquee box to determine whether the deadline is reached. Obtain the user's eye line of sight (obtaining the line of sight belongs to the category of the prior art, and will not be described in detail), and determine the first fixation point corresponding to the eye sight in the viewing interface, and the first fixation point is the position where the user is gazing. If the first fixation point falls within the circled box, it means that the user is viewing the first photo that has been circled by the circled box, and cannot keep up with the first expansion speed of the circled box, based on the first gaze point and the target The first vertical distance of the frame edge slows down the first expansion speed. The larger the first vertical distance, the faster the first expansion speed, and the more the first expansion speed should be decelerated. If the first fixation point falls within the range to be circled in the direction of expansion outside the circled box in the viewing interface and the first fixation point changes within the preset second time (for example: 3 seconds), it means that the user is Check the first photo that is about to be selected by the circle box and the cut-off photo is not yet reached. The larger the second vertical distance between the first gaze point and the edge of the target frame, it means that the first expansion speed is slower, and the first expansion speed needs to be accelerated. . It fully ensures that the expansion speed of the circle selection box can be close to the user, which improves the humanization and the intelligence of the touch-free circle selection. When the first gaze point falls within the range to be circled and the first gaze point does not change in the second time, it means that the cutoff photo has been reached; however, the cutoff photo is not necessarily the last photo in a row or the last photo in a column For a photo, the first photo after the cut-off photo needs to be removed. Further enhance the intelligence and humanization of touch-free circle selection. Get the second photo that is the cutoff photo. When the second photo just enters the frame, useless photos are eliminated based on the movement type of the edge of the target frame. Generally, the movement type is line wrapping downward, and the user views photos in a cycle from left to right. Therefore, the control circle check box deselects all the third photos on the right side of the row where the second photo is located. The same is true when the movement type is line feed upward. When the movement type is changing columns to the right, the user views photos in a cycle from top to bottom. Therefore, the control circle check box deselects all the fifth photos below the second photo. The same is true when the movement type is column change to the left.

在一个实施例中，所述获取所述查看界面内对应于所述第一注视点位的第二照片，包括：In one embodiment, the obtaining the second photo corresponding to the first gaze point in the viewing interface includes:

获取为截止照片的第二照片时，一般的，只需确定查看界面内位于第一注视点位的照片即可。但是，由于一些智能终端的手机屏幕大小较小或者一些相册的缩略比例较小(例如：1：50，一个界面内显示50张照片)，视线获取的精度有限，使得第一注视点位无法用于精准定位用户注视的截止照片。When obtaining the second photo as the cut-off photo, generally, it is only necessary to determine the photo at the first gaze point in the viewing interface. However, due to the small size of the mobile phone screen of some smart terminals or the small thumbnail ratio of some photo albums (for example: 1:50, 50 photos are displayed in one interface), the accuracy of line of sight acquisition is limited, so that the first gaze point cannot Cutoff photo for pinpointing the user's gaze.

因此，获取显示界面当前的缩略比例，同时，获取显示界面的大小的缩略比例阈值；缩略比例阈值为该显示界面的大小下，第一注视点位可用于精准定位用户注视的截止照片最小缩略比例。若缩略比例大于等于缩略比例阈值，说明第一注视点位可用于精准定位用户注视的截止照片，确定位于第一注视点位的第八照片作为第二照片即可。否则(缩略比例小于缩略比例阈值)，确定第一注视点位的第九照片，基于第九照片，生成截止照片放大确认框，供用户进一步确认，当用户当前的第二注视点位落在截止照片放大确认框内且在预设的第三时间(例如：2秒)内未发生改变时，说明用户注视截止照片，确定位于第二注视点位的第九照片作为第二照片即可。极大程度上提升了应用于不同智能终端的适用性，更提升了截止照片确定的精准性。Therefore, the current thumbnail ratio of the display interface is obtained, and at the same time, the thumbnail ratio threshold of the size of the display interface is obtained; the thumbnail ratio threshold is the size of the display interface, and the first gaze point can be used to accurately locate the cut-off photo of the user's gaze Minimum thumbnail ratio. If the thumbnail ratio is greater than or equal to the thumbnail ratio threshold, it means that the first gaze point can be used to accurately locate the cut-off photo that the user gazes at, and the eighth photo at the first gaze point can be determined as the second photo. Otherwise (the thumbnail ratio is less than the thumbnail ratio threshold), determine the ninth photo of the first gaze point, and generate a cut-off photo zoom-in confirmation box based on the ninth photo for further confirmation by the user. When the user's current second gaze point falls within If there is no change within the preset third time (for example: 2 seconds) within the cut-off photo zoom-in confirmation box, it means that the user is looking at the cut-off photo, and the ninth photo at the second gaze point can be determined as the second photo. . It greatly improves the applicability of different smart terminals, and also improves the accuracy of the cut-off photo determination.

本发明提供一种基于软阈值惩罚机制的弱监督细粒度图像分类系统，如图4所示，包括：The present invention provides a weakly supervised fine-grained image classification system based on a soft threshold penalty mechanism, as shown in Figure 4, including:

构建模块1，用于基于软阈值惩罚机制，构建二级级联网络结构的细粒度图像分类网络；Building block 1 is used to construct a fine-grained image classification network with a two-level cascaded network structure based on a soft threshold penalty mechanism;

获取模块2，用于获取待分类图像；Obtaining module 2, for obtaining images to be classified;

预处理模块3，用于对所述待分类图像进行预处理；A preprocessing module 3, configured to preprocess the image to be classified;

分类模块4，用于基于所述细粒度图像分类网络，对预处理结果进行图像分类，并输出图像分类结果。The classification module 4 is configured to perform image classification on the preprocessing result based on the fine-grained image classification network, and output the image classification result.

显然，本领域的技术人员可以对本发明进行各种改动和变型而不脱离本发明的精神和范围。这样，倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内，则本发明也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention also intends to include these modifications and variations.

Claims

1. A weak supervision fine-grained image classification method based on a soft threshold punishment mechanism is characterized by comprising the following steps:

step 1: constructing a fine-grained image classification network of a secondary cascade network structure based on a soft threshold punishment mechanism;

step 2: acquiring an image to be classified;

and step 3: preprocessing the image to be classified;

and 4, step 4: based on the fine-grained image classification network, carrying out image classification on a preprocessing result and outputting an image classification result;

the step 1: based on a soft threshold punishment mechanism, a fine-grained image classification network of a two-level cascade network structure is constructed, and the method comprises the following steps:

constructing a first network branch, the first network branch comprising: input448 x 3, first ResNet50, feature14 x 2048, first GAP, first FC and first Softmax connected in sequence;

constructing a second network branch, the second network branch comprising: input224 × 3 × mult, second ResNet50, feature × 7 × 2048 × mult, second GAP, second FC, and second Softmax connected in this order;

connecting Input224 x 3 x mult in the second network branch with Input448 x 3 in the first network branch by crop;

connecting Feature14 x 2048 in the first network branch with the crop through APPM;

setting a first loss function RawLoss for the first network branch;

setting a second loss function PartLoss for the second network branch;

setting a soft threshold penalty mechanism in the APPM;

the first network branch, the second network branch, the APPM and the crop form a fine-grained image classification network of a two-level cascade network structure;

the step 2: acquiring an image to be classified, comprising:

when a user touches and selects a plurality of first photos to form a selection frame in a touch control manner on a viewing interface of a photo album and the selection frame is expanded along the same expansion direction within a preset first time, acquiring and outputting preset touch-free selection prompt information, and simultaneously controlling the selection frame to continue to expand along the expansion direction at a preset first expansion speed;

dynamically acquiring the current eye sight of the user;

determining a first point of gaze location within the viewing interface corresponding to the eye gaze;

if the first watching point position is located in the selection box in the viewing interface, acquiring a first vertical distance between the first watching point position and a target frame edge of the selection box in the expansion direction;

adjusting the first expansion speed based on the first vertical distance, wherein the adjustment formula is shown as a formula (2-8):

wherein v is ₁ ' adjusted said first expansion speed, v ₁ For said first expansion speed before adjustment,/ ₁ For the said first vertical distance, the distance is,

is a preset first relation coefficient;

if the first watching point location is located outside the selection circle frame in the viewing interface and within the range to be selected in the expansion direction and the first watching point location changes within a preset second time, acquiring a second vertical distance between the first watching point location and a target frame edge of the selection circle frame in the expansion direction;

adjusting the first expansion speed based on the second vertical distance, wherein the adjustment formula is shown as a formula (2-9):

wherein v is ₂ ' adjusted said first expansion speed, v ₂ For said first expansion speed before adjustment,/ ₂ Is the second vertical distance, and is,

is a preset second relation coefficient;

if the first watching point position is located in the to-be-circled range in the expansion direction outside the circled frame in the viewing interface and the first watching point position is not changed in the second time, acquiring a second photo corresponding to the first watching point position in the viewing interface;

when the second photo just enters the selection frame, controlling the selection frame to stop expanding;

acquiring the movement type of the target frame edge in the expansion direction of the selection frame;

when the moving type is downward line feed, controlling the selection frame to deselect all third photos on the right side of the second photo in the line where the second photo is located in the viewing interface;

when the moving type is upward line feed, controlling the selection frame to deselect all fourth photos on the left side of the second photo in the line where the second photo is located in the viewing interface;

when the moving type is changing columns to the right, controlling the selection circle frame to retreat all fifth photos on the lower side of the second photo in the column of the second photo in the viewing interface;

when the moving type is changing columns to the left, controlling the selection circle frame to retreat all sixth photos on the upper side of the second photo in the column of the second photo in the viewing interface;

after the selection is finished, all seventh pictures selected in the selection frame are taken as images to be classified, and the seventh pictures are obtained;

said obtaining a second photograph within the viewing interface corresponding to the first point of gaze location comprises:

acquiring the current thumbnail proportion of a display interface;

acquiring a preset thumbnail ratio threshold corresponding to the size of the display interface;

if the thumbnail proportion is larger than or equal to the thumbnail proportion threshold value, determining an eighth photo located at the first watching point position in the viewing interface, and taking the eighth photo as a second photo to finish the acquisition;

otherwise, determining a plurality of ninth photos within the viewing interface that are located at the first point of gaze location;

generating a cut-off picture enlargement confirmation frame based on the ninth picture;

displaying the cut-off photo amplification confirmation frame in a suspending manner in the display interface;

determining a second point of gaze location within the viewing interface that corresponds to the user's current eye gaze;

and when the second watching point position is located in the cut-off photo amplification confirmation frame and the second watching point position is not changed within a preset third time, the ninth photo located at the second watching point position in the cut-off photo amplification confirmation frame is taken as the second photo to finish the acquisition.

2. The method for classifying the weakly supervised fine grained image based on the soft threshold penalty mechanism according to claim 1, wherein the APPM is formed based on SCDA, the Feature14 x 2048 extracted from the features by the APPM is folded along the channel direction of the pooling layer to obtain a two-dimensional map of 14 x 1, and the two-dimensional map is subjected to sliding window calculation by using a plurality of preset sliding windows with different sizes, and the calculation process is shown in formula (2-1):

wherein H and W are respectively the height and width of the sliding window, A (x, y) is the numerical value corresponding to the coordinate position of the closed two-dimensional graph, and a _w The result is a sliding window calculation.

3. The method for classifying the weakly supervised fine grained image based on the soft threshold penalty mechanism as recited in claim 1, wherein the formula of the first loss function RawLoss is shown as formula (2-2):

wherein m is _i For the ith sample image, n _i The first convolutional neural network CNN corresponds to the predicted probability of the ith sample image.

4. The method for classifying the weakly supervised fine granular image based on the soft threshold penalty mechanism as recited in claim 1, wherein the formula of the second loss function PartLoss is shown as the formula (2-3):

wherein q is the number of local feature regions screened by the second ResNet50, m _iq The number of the q local feature regions corresponding to the ith sample image, n _iq And the prediction probability of the number of the q local feature areas corresponding to the ith sample image is the second convolutional neural network CNN.

5. The method of claim 1, wherein the soft-threshold penalty mechanism comprises:

let F (x, y) be the image without noise, N (x, y) be the noise, G (x, y) be the image after the influence of noise, select L _1/2 The paradigm is modeled as shown in equation (2-5):

i denotes the number of images, and G (x) when image iteration is performed _i ,y _i )-F(x _i ,y _i ) Residual errors occur, which indicates that noise occurs in the image and causes influence;

the residual error state is limited by a soft threshold, and a penalty factor G (x) is firstly constructed _i ,y _i )-F(x _i ,y _i )|| _h To limit G (x) _i ,y _i )-F(x _i ,y _i ) Not greater than 0, thereby reducing the degree to which the image is disturbed by noise, as shown in equation (2-6):

in the formula, lambda represents a penalty coefficient, and the result can be close to the true value by adjusting the coefficient.

6. The method for classifying the weakly supervised fine grained image based on the soft threshold penalty mechanism as claimed in claim 5, wherein to further optimize the soft threshold method, the objective function is further modified as shown in formula (2-7):

wherein V is an auxiliary variable, λ ₁ And λ ₂ And all the coefficients are penalty coefficients, and in the calculation process, the soft threshold algorithm is used for carrying out iterative update processing on the V.

7. A weakly supervised fine grained image classification system based on a soft threshold penalty mechanism, comprising:

the construction module is used for constructing a fine-grained image classification network of a secondary cascade network structure based on a soft threshold punishment mechanism;

the acquisition module is used for acquiring an image to be classified;

the preprocessing module is used for preprocessing the image to be classified;

the classification module is used for classifying the images of the preprocessing result based on the fine-grained image classification network and outputting the image classification result;

the building module performs the following operations:

constructing a second network branch, the second network branch comprising: input224 × 3 × mult, a second ResNet50, feature × 7 × 2048 × mult, a second GAP, a second FC, and a second Softmax connected in sequence;

connecting Input224 x 3 x mult in the second network branch with Input448 x 3 in the first network branch via crop;

setting a first loss function RawLoss for the first network branch;

setting a second loss function PartLoss for the second network branch;

setting a soft threshold penalty mechanism in the APPM;

the acquisition module performs the following operations:

when a user selects a plurality of first photos to form a selection frame in a touch control manner on a viewing interface of the photo album and the selection frame is expanded along the same expansion direction within a preset first time, acquiring and outputting preset touch-free selection prompt information, and simultaneously controlling the selection frame to continue to expand along the expansion direction at a preset first expansion speed;

dynamically acquiring the current eye sight of the user;

if the first watching point position is located in the selection frame in the viewing interface, acquiring a first vertical distance between the first watching point position and a target frame edge in the expansion direction of the selection frame;

is a preset first relation coefficient;

is a preset second relation coefficient;

when the second picture just enters the selection frame, controlling the selection frame to stop expanding;

when the moving type is line feed, controlling the selection circle frame to retreat all third photos on the right side of the second photo in the line where the second photo is located in the viewing interface;

when the moving type is the line feed, controlling the selection circle frame to retreat all fourth photos on the left side of the second photo in the line where the second photo is located in the viewing interface;

after the selection retreating is finished, all seventh photos selected in the circle selection frame are used as images to be classified, and the acquisition is finished;

the obtaining a second photograph within the viewing interface corresponding to the first point of gaze location includes:

acquiring the current thumbnail proportion of a display interface;

acquiring a preset thumbnail proportion threshold corresponding to the size of the display interface;

if the thumbnail proportion is larger than or equal to the thumbnail proportion threshold, determining an eighth photo located at the first watching point position in the viewing interface, and taking the eighth photo as a second photo to finish acquisition;