
CN116935100A - A multi-label image classification method based on feature fusion and self-attention mechanism - Google Patents

A multi-label image classification method based on feature fusion and self-attention mechanism

Info

Publication number
CN116935100A
CN116935100A (application CN202310728668.7A)
Authority
CN
China
Prior art keywords
image
matrix
feature
features
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310728668.7A
Other languages
Chinese (zh)
Inventor
高世杰 (Gao Shijie)
韩立新 (Han Lixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202310728668.7A priority Critical patent/CN116935100A/en
Publication of CN116935100A publication Critical patent/CN116935100A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-label image classification method based on feature fusion and a self-attention mechanism. The method mainly comprises the following steps: global image feature extraction, in which a deep convolutional neural network extracts the global features of the image; local image feature extraction, in which several convolutions with 1*1 kernels are applied to the feature map produced by an intermediate layer of that network to extract local features; feature fusion, in which the extracted global and local features are fused by a self-attention mechanism to produce a feature representation for each category; and multi-label image classification, in which the fused representations are passed through a fully connected layer and a sigmoid activation function to generate the image's labels. The multi-label image classification method of the invention provides a way of fusing local and global image features that can effectively model the visual characteristics of small targets in an image, takes the semantic correlation between labels into account, and improves multi-label classification accuracy.

Description

A multi-label image classification method based on feature fusion and self-attention mechanism

Technical Field

The invention belongs to the field of image recognition, and specifically relates to a multi-label image classification method based on the fusion of global and local image features and the introduction of a self-attention mechanism.

Background Art

In the information age, images have become a medium and carrier for conveying information and are widely used in many fields. Achieving fast and accurate classification of the massive numbers of digital images produced in the information age is a major research topic in current image applications. Although convolutional neural networks (CNNs) perform well on single-label image classification tasks, most real-world images contain more than one scene or object, so a single image can be annotated with multiple labels, which may correspond to different objects, scenes, actions, and attributes within it.

Extracting this rich semantic information from images requires multi-label generation techniques that identify all categories present in an image as accurately as possible. Traditional classification is usually hard classification: each sample is assigned to exactly one class, which in image annotation means a single label per image, a clear limitation. Moreover, in a typical multi-label image, objects of different categories appear at different locations with different scales and poses, and occlusion, overlap, and lighting variations all make multi-label images harder to recognize and classify. Multi-label image classification is therefore the more general and practical problem; modeling the rich semantic content of images and the dependencies among labels, and completing the classification and recognition of multi-label images efficiently and accurately, has become a key research direction (see "Ji Zhong, Li Huihui, He Yuqing. Zero-shot multi-label image classification based on deep example differentiation [J]. Computer Science and Exploration, 2019, 13(1): 9."), with wide application in image retrieval, portrait grouping, medical image recognition, scene understanding, and many other fields.

The success of CNNs on single-label image classification offers some guidance for the multi-label problem. Thanks to the translation invariance of the convolution operation, a CNN detects the same features and produces the same response wherever a target appears in the image, and this holds when several targets appear at once. One can therefore simply pass the vector output by the fully connected layer of a CNN model through a sigmoid function so that each dimension becomes a probability between 0 and 1, giving the probability that the sample belongs to each category. The per-category probability distributions output by the model are then independent: the multi-label problem is decomposed into several independent binary classification problems. However, this approach ignores the semantic correlation between labels, namely that when an image carries one label it often has a high probability of also carrying a particular other label. For example, sky and clouds usually appear together, while water and cars almost never do. Furthermore, in deep convolutional neural networks, repeated convolution and pooling operations reduce the parameter count through weight sharing and downsampling while steadily enlarging the neurons' receptive fields, so the deep feature maps increasingly reflect the global characteristics of the image. This is advantageous in single-label classification, where an image usually contains a single target; in multi-label classification, however, an image contains small targets of varying size, position, and shape, and the local features they carry are easily ignored or diluted under the large receptive fields of the deep layers. Extracting global features directly from the whole image therefore inevitably loses the visual features of small targets during feature extraction, which hurts multi-label classification accuracy.
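
For concreteness, the decomposition into independent binary problems described above looks like this in a minimal PyTorch sketch (ours, not the patent's; the batch size and class count are arbitrary):

```python
import torch
import torch.nn as nn

num_classes = 5
logits = torch.randn(8, num_classes)             # stand-in for the FC-layer output of a CNN (batch of 8)
probs = torch.sigmoid(logits)                    # each entry: an independent P(label present)
targets = torch.randint(0, 2, (8, num_classes)).float()
loss = nn.BCEWithLogitsLoss()(logits, targets)   # averages independent binary cross-entropy terms
```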

Xiao Lin et al. (see "Xiao Lin, Chen Boli, Huang Xin, et al. Multi-label text classification based on label semantic attention [J]. Journal of Software, 2020, 31(4): 11.") proposed a label-semantic-attention method for multi-label text classification that relies on a document's text and its corresponding labels, uses a bidirectional long short-term memory network to obtain a hidden representation of each word, and weights each word in the document via a label semantic attention mechanism; labels, moreover, tend to be correlated in the semantic space. Zhang Yong et al. (see "Zhang Yong, Liu Haoke, Zhang Jie. Multi-label classification algorithm based on label-specific features and instance correlations [J]. Pattern Recognition and Artificial Intelligence, 2020, 33(5): 10.") proposed a multi-label classification algorithm based on label-specific features and instance correlations that considers not only label correlations but also correlations among instance features, learning the similarity structure of the instance feature space by constructing a similarity graph. Mou Jiapeng et al. (see "Mou Jiapeng, Cai Jian, Yu Mengchi, Xu Jian. Label-specific multi-label classification algorithm based on label correlations [J]. Application Research of Computers, 2020, 37(9): 4.") proposed a label-specific multi-label classification algorithm that measures the correlation between labels by label distance and introduces label correlations by appending correlated labels to the label-specific feature space, thereby improving classification performance. Chen et al. (see "Chen Z M, Wei X S, Wang P, et al. Multi-label image recognition with graph convolutional networks [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 5177-5186.") proposed using a graph convolutional network (GCN) to explicitly model the correlations between category labels; a GCN-based mapping function learns interdependent per-class classifiers that can be applied to the image features learned by any CNN model, giving high scalability and flexibility. Lanchantin et al. (see "Lanchantin J, Wang T, Ordonez V, et al. General multi-label image classification with transformers [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 16478-16488.") proposed using a Transformer model with a Label Mask Training strategy that randomly masks some of the ground-truth labels during training and asks the model to predict them, thereby uncovering the complex dependencies between image features and labels and within the label set itself.

Given that typical methods lose the visual features of some small targets in the image while extracting global features, and that labels in multi-label classification problems are interdependent, it is necessary to design an efficient multi-label image classification model that can effectively model both the local features of small targets in an image and the dependencies among multiple labels.

Summary of the Invention

To overcome the above shortcomings of the prior art, the invention proposes a multi-label image classification method based on feature fusion and a self-attention mechanism. The method combines the global image features extracted by a deep convolutional neural network with a hidden representation of local image features taken from an intermediate layer of the network, and introduces a self-attention mechanism to model the dependencies among multiple labels, achieving higher multi-label classification accuracy.

To achieve the above objective, the invention adopts the following technical scheme:

Step 1: Initialize the ResNet50 model structure and parameters, and apply a 1*1 convolution to the feature map output by the third convolutional block of ResNet50 to extract local image features. The number of output channels of the 1*1 convolution kernel must equal the total number of categories in the multi-label classification task.

Step 2: The feature map output by the original ResNet50 path continues through the subsequent convolutional blocks and average pooling, yielding the global feature matrix of the image.

Step 3: To uncover the dependencies between labels, the features obtained above are fused through a self-attention mechanism, specifically comprising the following steps:

(1) Flatten each channel of the local feature matrix from step 1 into a one-dimensional vector; flatten the global feature matrix from step 2 into a one-dimensional vector; and concatenate these vectors row-wise into a matrix E, where the local feature vectors are first linearly transformed so that their dimension matches that of the global feature vector;

(2) Initialize the weight matrices W_Q, W_K, W_V;

(3) Multiply the feature matrix E from (1) by each of the weight matrices W_Q, W_K, W_V from (2) to obtain the Query, Key, and Value matrices; each row vector of these matrices is associated with one of the aforementioned one-dimensional global or local feature vectors.

(4) Compute the attention scores. Multiply the Query matrix by the transpose of the Key matrix to obtain the attention score matrix Score, divide it by √d_k (where d_k is the number of columns of the Key matrix), and then normalize each row of the matrix with the Softmax function. Each element of the Score matrix then represents the attention score between a pair of feature vectors in the feature matrix E;

(5) Multiply the attention score matrix Score by the Value matrix, so that each row vector of the result is a weighted sum, according to the attention scores, of the row vectors of the Value matrix.

Step 4: Feed the resulting matrix into a fully connected neural network and finally apply a Sigmoid activation function, producing for each image a vector whose dimension equals the number of categories; the value in each dimension is the probability that the image belongs to the corresponding category.

The beneficial effects of the invention are as follows:

(1) By taking both the global and local features of the image into account, the invention alleviates, to a degree, the loss of small-target feature information in traditional image feature extraction networks;

(2) The invention performs the convolution with a 1*1 kernel whose number of output channels equals the total number of categories, computing a feature map independently for each category; compared with the common approach of sharing one feature map, this improves classification accuracy;

(3) The invention uses a self-attention mechanism to model the dependencies among multiple labels, exploiting the semantic correlation of labels in multi-label classification problems to significantly improve the classification performance of the model;

(4) The invention has good anti-interference capability and strong robustness, and can meet the needs of practical multi-label image classification applications.

Description of the Drawings

Figure 1 is a flow chart of the multi-label image classification method based on feature fusion and a self-attention mechanism.

Figure 2 is a flow chart of the feature fusion process based on the self-attention mechanism.

Figure 3 is a schematic diagram of the neural network model of the multi-label image classification method based on feature fusion and a self-attention mechanism.

Detailed Description

The invention is further clarified below with reference to the drawings and specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; after reading this disclosure, modifications of the invention in various equivalent forms by those skilled in the art all fall within the scope defined by the claims appended to this application.

The following description takes a classification task with 5 categories in total as an example.

S1: Initialize the ResNet50 model structure and parameters, where the parameters are the weights obtained by pre-training ResNet50 on the ImageNet large-scale visual recognition dataset. Then apply a 1*1 convolution to the feature map output by the third convolutional block of ResNet50 to extract local image features. Specifically, convolving the above 512*28*28 feature map with a convolution kernel of shape 5*1*1 yields a 5*28*28 feature map, each channel of which corresponds to one category.
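
A minimal PyTorch sketch of this step (our illustration, not code from the patent; it assumes torchvision's pretrained ResNet50 and a 224*224 input, for which layer2 produces the 512*28*28 map described here, and all variable names are ours):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

num_classes = 5
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # ImageNet-pretrained weights

# Stem plus the first two residual stages; counting the stem as the first
# convolutional block, layer2 is the "third block" and emits 512*28*28.
stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                     backbone.maxpool, backbone.layer1, backbone.layer2)

# 1*1 convolution whose output channels equal the number of categories.
local_conv = nn.Conv2d(512, num_classes, kernel_size=1)

x = torch.randn(1, 3, 224, 224)   # dummy image batch
mid = stem(x)                     # (1, 512, 28, 28)
f_local = local_conv(mid)         # (1, 5, 28, 28): one local feature map per category
```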

S2: The feature map output by the original ResNet50 path continues through the subsequent convolutional blocks and average pooling, yielding the global feature map of the image with shape 1*1*2048.
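
Continuing the sketch above, the untouched main path runs through the remaining residual stages and average pooling to produce the 2048-dimensional global feature (same assumptions as before):

```python
# The original ResNet50 path continues past layer2, independent of the local branch.
deep = nn.Sequential(backbone.layer3, backbone.layer4, backbone.avgpool)

f_global = deep(mid)                   # (1, 2048, 1, 1)
f_global = torch.flatten(f_global, 1)  # (1, 2048): the global feature vector
```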

S3: To uncover the dependencies between labels, the features obtained above are fused through a self-attention mechanism, specifically comprising the following steps:

S31: Flatten each channel of the local feature matrix from step S1 into a one-dimensional vector, i.e. flatten the 5*28*28 feature map to 5*784, denoted F_regional; flatten the global feature matrix from step S2 into a 2048-dimensional one-dimensional vector, denoted F_global; concatenate these vectors row-wise into a matrix E, where the local feature vectors are first linearly transformed so that their dimension matches that of the global feature vector. The matrix E then has dimensions 6*2048;

Define a 784*2048 parameter matrix θ and apply a linear transformation to the feature matrix F_regional:

F′ = F_regional · θ (9)

E = concat(F′; F_global) (10)
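
Under the same running example, equations (9)-(10) can be sketched as follows (θ is modeled as a bias-free linear layer, whose weights, like W_Q, W_K, W_V below, would be learned during training):

```python
# Flatten each of the 5 per-class local maps to a 784-vector, lift it to 2048
# dimensions with the parameter matrix θ, and stack it with the global vector.
theta = nn.Linear(28 * 28, 2048, bias=False)  # realizes the 784*2048 matrix θ of eq. (9)

f_regional = f_local.flatten(2).squeeze(0)    # (5, 784): one row per category
f_prime = theta(f_regional)                   # (5, 2048): F′ = F_regional · θ
E = torch.cat([f_prime, f_global], dim=0)     # (6, 2048): eq. (10)
```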

S32: Initialize the 2048*512 weight matrices W_Q, W_K, W_V;

S33: Multiply the feature matrix E of S31 by each of the weight matrices W_Q, W_K, W_V of S32 to obtain the Query, Key, and Value matrices; each row vector of these matrices is associated with one of the aforementioned one-dimensional global or local feature vectors. All three matrices have dimensions 6*512.

The specific calculation is:

Query = E · W_Q (11)

Key = E · W_K (12)

Value = E · W_V (13)
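
A sketch of the three projections with the 2048*512 shapes of this embodiment (same running example):

```python
d_k = 512
W_Q = nn.Linear(2048, d_k, bias=False)  # the weight matrices W_Q, W_K, W_V
W_K = nn.Linear(2048, d_k, bias=False)
W_V = nn.Linear(2048, d_k, bias=False)

Q = W_Q(E)  # (6, 512): Query, eq. (11)
K = W_K(E)  # (6, 512): Key,   eq. (12)
V = W_V(E)  # (6, 512): Value, eq. (13)
```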

S34: Compute the attention scores. Multiply the Query matrix by the transpose of the Key matrix to obtain the attention score matrix Score (of dimensions 6*6), divide it by √d_k (where d_k is the number of columns of the Key matrix), and then normalize each row of the matrix with the Softmax function. Each element of the Score matrix then represents the attention score between a pair of feature vectors in the feature matrix E;

The specific calculation is:

Score = Softmax(Query · Key^T / √d_k) (14)

S35: Multiply the attention score matrix Score by the Value matrix; each row vector of the resulting matrix F_mixed is thus a weighted sum, according to the attention scores, of the corresponding row vectors of the Value matrix.

F_mixed = Score · Value (15)
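
Equations (14)-(15) amount to standard scaled dot-product attention over the six rows of E; a sketch:

```python
import math

scores = Q @ K.T / math.sqrt(d_k)       # (6, 6): raw pairwise scores
scores = torch.softmax(scores, dim=-1)  # row-wise normalization: eq. (14)
F_mixed = scores @ V                    # (6, 512): each row is a score-weighted
                                        # sum of the rows of Value, eq. (15)
```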

S4: Feed the matrix F_mixed into a fully connected neural network and finally apply a Sigmoid activation function, producing for each image a vector whose dimension equals the number of categories; the value in each dimension is the probability that the image belongs to the corresponding category. The fully connected network is defined as follows, where the parameter matrix ω_c has dimensions 512*256, ω_b has dimensions 256*64, and ω_a has dimensions 64*5:

out = Sigmoid(((F_mixed · ω_c)_relu · ω_b)_relu · ω_a) (16).
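
A sketch of the head in equation (16). Applied row by row to the 6*512 matrix F_mixed it yields a 6*5 matrix, and the text does not spell out how the six rows are reduced to the single 5-dimensional probability vector it describes; averaging over rows below is our assumption:

```python
omega_c = nn.Linear(512, 256, bias=False)
omega_b = nn.Linear(256, 64, bias=False)
omega_a = nn.Linear(64, num_classes, bias=False)

h = torch.relu(omega_c(F_mixed))   # (6, 256)
h = torch.relu(omega_b(h))         # (6, 64)
out = torch.sigmoid(omega_a(h))    # (6, 5): eq. (16), applied row by row

# How the 6 rows collapse to one per-image vector is not specified in the text;
# taking the mean over rows is an assumption of this sketch.
probs = out.mean(dim=0)            # (5,) per-category probabilities
```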

Claims (5)

1. A multi-label image classification method based on feature fusion and a self-attention mechanism, characterized by comprising the following steps:
step 1: initialize the model and extract local features of the image;
step 2: extract global features of the image;
step 3: fuse the global and local features of the image through a self-attention mechanism;
step 4: based on the fused features, perform multi-label image classification using a fully connected neural network.
2. The multi-label image classification method based on feature fusion and a self-attention mechanism according to claim 1, wherein in step 1 the model is initialized and local image features are extracted as follows:
the ResNet50 model structure and parameters are initialized, where the parameters are the weights obtained by pre-training ResNet50 on the ImageNet large-scale visual recognition dataset; a 1*1 convolution is then applied to the feature map output by the third convolutional block of ResNet50 to extract local image features.
3. The multi-label image classification method based on feature fusion and a self-attention mechanism according to claim 1, wherein in step 2 the global features of the image are extracted as follows:
the feature map output by the original ResNet50 path in step 1 continues through the subsequent convolutional blocks and average pooling, yielding the global feature matrix of the image.
4. The multi-label image classification method based on feature fusion and a self-attention mechanism according to claim 1, wherein in step 3 the global and local features of the image are fused by the self-attention mechanism as follows:
(1) flatten each channel of the local feature matrix obtained in step 1 into a one-dimensional vector, denoted F_regional; flatten the global feature matrix obtained in step 2 into a one-dimensional vector, denoted F_global; concatenate these vectors row-wise into a matrix E, the local feature vectors first undergoing a linear transformation so that their dimension matches that of the global feature vector; a parameter matrix θ is defined and the feature matrix F_regional is linearly transformed:
F′ = F_regional · θ (1)
E = concat(F′; F_global) (2)
(2) initialize the weight matrices W_Q, W_K, W_V;
(3) multiply the feature matrix E of (1) by each of the weight matrices W_Q, W_K, W_V of (2) to obtain the Query, Key, and Value matrices, each row vector of which is associated with one of the aforementioned one-dimensional global or local feature vectors;
the specific calculation is:
Query = E · W_Q (3)
Key = E · W_K (4)
Value = E · W_V (5)
(4) compute the attention scores: multiply the Query matrix by the transpose of the Key matrix to obtain the attention score matrix Score, divide it by √d_k (where d_k is the number of columns of the Key matrix), and then normalize each row of the matrix using the Softmax function, so that each element of the Score matrix represents the attention score between a pair of feature vectors in the feature matrix E;
the specific calculation is:
Score = Softmax(Query · Key^T / √d_k) (6)
(5) multiply the attention score matrix Score by the Value matrix, so that each row vector of the resulting matrix F_mixed is a weighted sum, according to the attention scores, of the corresponding row vectors of the Value matrix:
F_mixed = Score · Value (7)
5. The multi-label image classification method based on feature fusion and a self-attention mechanism according to claim 1, wherein in step 4 the image is classified based on the fused features using a fully connected neural network as follows:
the matrix F_mixed is input into a fully connected neural network, and finally, through a Sigmoid activation function, a vector whose dimension equals the number of categories is generated for each image, the value in each dimension representing the probability that the image belongs to the corresponding category; the fully connected neural network is defined as:
out = Sigmoid(((F_mixed · ω_c)_relu · ω_b)_relu · ω_a) (8).
CN202310728668.7A 2023-06-19 2023-06-19 A multi-label image classification method based on feature fusion and self-attention mechanism Pending CN116935100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310728668.7A CN116935100A (en) 2023-06-19 2023-06-19 A multi-label image classification method based on feature fusion and self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310728668.7A CN116935100A (en) 2023-06-19 2023-06-19 A multi-label image classification method based on feature fusion and self-attention mechanism

Publications (1)

Publication Number Publication Date
CN116935100A true CN116935100A (en) 2023-10-24

Family

ID=88388585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310728668.7A Pending CN116935100A (en) 2023-06-19 2023-06-19 A multi-label image classification method based on feature fusion and self-attention mechanism

Country Status (1)

Country Link
CN (1) CN116935100A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876797A (en) * 2024-03-11 2024-04-12 中国地质大学(武汉) Image multi-label classification method, device and storage medium
CN117876797B (en) * 2024-03-11 2024-06-04 中国地质大学(武汉) Image multi-label classification method, device and storage medium
CN118097719A (en) * 2024-04-17 2024-05-28 华南理工大学 Tongue image classification method considering global and local features
CN118631510A (en) * 2024-05-31 2024-09-10 中电信翼金科技有限公司 Malicious traffic classification method, device, equipment and medium based on big data

Similar Documents

Publication Publication Date Title
Han et al. A unified metric learning-based framework for co-saliency detection
CN110717526B (en) Unsupervised migration learning method based on graph convolution network
CN107066559B (en) Three-dimensional model retrieval method based on deep learning
CN113449736B (en) Photogrammetry point cloud semantic segmentation method based on deep learning
CN104599275B (en) The RGB-D scene understanding methods of imparametrization based on probability graph model
CN116935100A (en) A multi-label image classification method based on feature fusion and self-attention mechanism
CN106547880A (en) A kind of various dimensions geographic scenes recognition methodss of fusion geographic area knowledge
CN107545276B (en) Multi-view learning method combining low-rank representation and sparse regression
CN113642602B (en) A multi-label image classification method based on the relationship between global and local labels
CN110399518A (en) A Visual Question Answering Enhancement Method Based on Graph Convolution
Li et al. DAHP: Deep attention-guided hashing with pairwise labels
Xia et al. Weakly supervised multimodal kernel for categorizing aerial photographs
CN114780767B (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN111126464A (en) An Image Classification Method Based on Unsupervised Domain Adversarial Domain Adaptation
CN108595558A (en) A kind of image labeling method of data balancing strategy and multiple features fusion
CN110211127A (en) Image partition method based on bicoherence network
CN114782977A (en) Method for guiding pedestrian re-identification based on topological information and affinity information
Bello et al. FFPointNet: Local and global fused feature for 3D point clouds analysis
CN116206306A (en) Inter-category characterization contrast driven graph roll point cloud semantic annotation method
Liang et al. SC2Net: scale-aware crowd counting network with pyramid dilated convolution
Li et al. Multi-view convolutional vision transformer for 3D object recognition
CN116434010A (en) Multi-view pedestrian attribute identification method
Zheng et al. Learning from the web: Webly supervised meta-learning for masked face recognition
CN114565029A (en) Classification method and system for fusion graph embedding and structure enhancement in attribute-free network
Xu et al. Semi-supervised self-growing generative adversarial networks for image recognition

Legal Events

Date Code Title Description
PB01 Publication