CN111369563B - Semantic segmentation method based on pyramid void convolutional network - Google Patents
- Publication number: CN111369563B (application CN202010108637.8A)
- Authority: CN (China)
- Prior art keywords: convolution, image, pyramid, module, layer
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G06T7/10 — Image analysis: segmentation; edge detection
- G06N3/045 — Neural networks: combinations of networks
- G06T2207/20081 — Indexing scheme for image analysis: training; learning
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a semantic segmentation method based on a pyramid dilated-convolution network, comprising the following steps: acquiring a medical image data set containing ground-truth segmentation results, and applying preprocessing operations such as data augmentation to the data set; obtaining shallow image features from the preprocessed image through residual recursive convolution modules and pooling layers; obtaining deep image features through a network formed by connecting a pyramid pooling module and a dilated convolution module in parallel; decoding the deep image features through deconvolution layers, skip connections and residual recursive convolution modules; inputting the decoding result into a softmax layer to obtain the category of each pixel; training the pyramid dilated-convolution network by establishing a loss function and determining the network parameters from training samples; and inputting the test image into the trained network to obtain the semantic segmentation result of the image. Combining dilated convolution with pyramid pooling effectively extracts multi-scale semantic information and detail information and improves the segmentation quality of the network.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a semantic segmentation method based on a pyramid dilated-convolution network (rendered in the title as a "pyramid void convolutional network"; the terms "void", "hole" and "cavity" convolution in this translation all denote dilated, i.e. atrous, convolution).
Background
In recent years, with the rapid development of deep learning, its application in medical image analysis has become increasingly broad. Semantic segmentation in particular plays a major role in application scenarios such as treatment planning, disease diagnosis and pathological research. For medical images, accurately identifying the type of every object in an image requires specialist domain knowledge and is time-consuming even for expert practitioners. Research on semantic segmentation makes it possible to segment an input medical image automatically and accurately, helping doctors make more precise judgments and design better treatment plans.
Traditional semantic segmentation algorithms include watershed-based, clustering-based and statistical-feature-based methods, but with the development of deep learning, semantic segmentation based on CNN models has become mainstream. In particular, the proposal of the FCN (fully convolutional network) opened the door for the development of semantic segmentation, and many researchers have since proposed improved segmentation models based on the FCN. The U-Net model, which retains good performance even when the training set is small, is therefore widely used in medical image semantic segmentation.
In the encoder of the U-Net model, downsampling is performed by max pooling; pooling enlarges the receptive field, so deeper semantic information can be obtained. However, pooling also reduces the resolution of the feature map, causing a loss of detail. Although U-Net recovers multi-scale detail through skip connections, boundary position information is still lost and the spatial discrimination ability of the model is reduced.
In the course of making the present invention, the inventors found that dilated convolution is widely used because it can enlarge the receptive field without reducing the resolution of the feature map. Meanwhile, to further improve the U-Net model, techniques such as attention mechanisms, pyramid pooling modules, recursive convolution, residual connections and dense connections have been combined with it.
Disclosure of Invention
The invention aims to remedy the above defects in the prior art by providing a semantic segmentation method based on a pyramid dilated-convolution network, which extracts features at different scales using several residual recursive convolution modules, a dilated convolution module and a pyramid pooling module, and then restores the size of the feature map using multi-layer upsampling and skip connections.
The technical purpose of the invention is realized by the following technical scheme:
A semantic segmentation method based on a pyramid dilated-convolution network. The network comprises a first residual recursive convolution module, a second residual recursive convolution module, pooling layers, a pyramid pooling module, a dilated convolution module, deconvolution layers, a third residual recursive convolution module, a fourth residual recursive convolution module and a softmax prediction layer, connected as follows: the first residual recursive convolution module is connected in series with a pooling layer, the second residual recursive convolution module and a second pooling layer; the pyramid pooling module and the dilated convolution module are connected in parallel after the second pooling layer; these are followed in series by a deconvolution layer, the third residual recursive convolution module, another deconvolution layer, the fourth residual recursive convolution module and the softmax prediction layer. The semantic segmentation method comprises the following steps:
S1. Acquire a medical image data set containing ground-truth segmentation results, and apply preprocessing operations to the data set for data augmentation.
S2. Pass the preprocessed image sequentially through the first residual recursive convolution module, a pooling layer, the second residual recursive convolution module and a second pooling layer, extracting semantic information of the image at multiple scales and obtaining the shallow image features F11, F12, F21 and F22 respectively.
S3. Pass the image feature F22 through the network formed by the pyramid pooling module and the dilated convolution module in parallel: F22 passes through the pyramid pooling module to yield the image feature F3, and through the dilated convolution module to yield the image feature F4. Concatenate F3 and F4 channel-wise and apply a convolution layer with a 1×1 kernel to obtain the deep image feature F5, thereby further extracting deep semantic information.
S4. Pass the image feature F5 through a deconvolution layer, then concatenate it channel-wise with the shallow image feature F21 delivered through a skip connection to obtain the image feature F61; then pass F61 through the third residual recursive convolution module to obtain the image feature F62. A skip connection transmits a shallow feature directly and concatenates it channel-wise with the output of the deconvolution layer; using skip connections keeps more detail of the original image in the output features, so the boundary of the predicted segmentation is smoother.
S5. Pass the image feature F62 through a deconvolution layer, then concatenate it channel-wise with the shallow image feature F11 delivered through a skip connection to obtain the image feature F71; then pass F71 through the fourth residual recursive convolution module to obtain the image feature F72.
S6. Input the image feature F72 into the softmax prediction layer to obtain the category of each pixel in the original input image.
S7. Train the pyramid dilated-convolution network: establish a loss function and determine the network parameters from training samples.
S8. Input the test image to be segmented into the trained pyramid dilated-convolution network to obtain the semantic segmentation result of the image.
Further, the preprocessing operation in step S1 includes rotation, slicing, normalization, and adaptive histogram equalization.
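The preprocessing steps above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the patch size, the random rotation, and the use of plain histogram equalization (a simpler stand-in for the adaptive variant, which would typically require a CLAHE implementation such as OpenCV's) are our choices, not values fixed by the patent.

```python
import numpy as np

def preprocess(image, patch=48, rng=None):
    """Illustrative preprocessing: rotation, normalization, equalization, slicing."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Random 90-degree rotation (a simple form of augmentation).
    img = np.rot90(image, k=int(rng.integers(0, 4)))
    # Z-score normalization.
    img = (img - img.mean()) / (img.std() + 1e-8)
    # Plain histogram equalization via rank transform (stand-in for CLAHE).
    ranks = np.argsort(np.argsort(img.ravel()))
    img = (ranks / (img.size - 1)).reshape(img.shape)
    # Slice into non-overlapping square patches.
    h, w = img.shape
    patches = [img[i:i + patch, j:j + patch]
               for i in range(0, h - patch + 1, patch)
               for j in range(0, w - patch + 1, patch)]
    return np.stack(patches)
```

For a 96×96 input with 48×48 patches this yields four patches with values equalized into [0, 1].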
Furthermore, the first, second, third and fourth residual recursive convolution modules share the same structure: each passes its input through two recursive convolution layers connected in series and then adds the input back in a residual manner to obtain the output. A recursive convolution layer is connected as conv, ReLU, Add, conv, ReLU in sequence, where conv is a convolution layer with a 3×3 kernel and Add is a pixel-wise addition with the input. Compared with ordinary convolution layers, residual connections help train deeper networks, while recursive convolution better extracts the semantic information contained in the image.
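The residual recursive convolution module just described can be sketched in PyTorch. This is a minimal sketch, not the patented implementation: the channel count is kept constant (in the real encoder/decoder it would change per stage), and the class names are our own.

```python
import torch
import torch.nn as nn

class RecursiveConv(nn.Module):
    """One recursive convolution layer: conv -> ReLU -> Add(input) -> conv -> ReLU."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.relu(self.conv2(y + x))  # Add: pixel-wise addition with the input
        return y

class ResidualRecursiveBlock(nn.Module):
    """Two recursive conv layers in series, plus a residual connection from the input."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(RecursiveConv(ch), RecursiveConv(ch))

    def forward(self, x):
        return self.body(x) + x  # residual addition of the block input
```

Both additions require matching shapes, which the fixed channel count and `padding=1` guarantee.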
Further, the pyramid pooling module in step S3 comprises four adaptive average pooling layers with different pooling sizes, used to capture the image feature F22 obtained in step S2 at multiple scales. The four pooling layers use pooling sizes N, N/2, N/3 and N/6 respectively, where N denotes the resolution of F22, so their outputs form 1×1, 2×2, 3×3 and 6×6 grids. The differently sized features from the pooling layers each pass through a convolution layer with a 1×1 kernel and are then upsampled by transposed convolution to the same size as F22, giving the features F31, F32, F33 and F34. The upsampled result of each scale is concatenated with the input feature F22, and the concatenated features pass through a convolution layer with a 3×3 kernel to obtain the feature F3, i.e. F3 = Conv(Concatenate(F22, F31, F32, F33, F34)), where Concatenate is the aggregation operation and Conv is a 3×3 convolution. Pooling at multiple scales better captures both the detail and the deeper semantic information contained in the image.
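A PyTorch sketch of this pyramid pooling module follows. Assumptions are flagged in comments: pooling windows of N, N/2, N/3, N/6 over an N×N map correspond to 1×1, 2×2, 3×3 and 6×6 output grids; bilinear upsampling stands in for the transposed convolution; and the output channel count of the 3×3 fusion is our choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Four adaptive average poolings (1x1, 2x2, 3x3, 6x6 grids), each followed
    by a 1x1 conv, upsampled back to the input resolution, concatenated with the
    input, and fused by a 3x3 conv: F3 = Conv(Concat(F22, F31, F32, F33, F34))."""
    def __init__(self, ch):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(ch, ch, 1) for _ in range(4))
        self.fuse = nn.Conv2d(5 * ch, ch, 3, padding=1)  # output channels: assumed

    def forward(self, x):
        n = x.shape[-1]  # N: input resolution (assumed square)
        feats = [x]
        for bins, conv in zip((1, 2, 3, 6), self.reduce):
            p = F.adaptive_avg_pool2d(x, bins)
            # Bilinear upsampling as a simpler stand-in for transposed convolution.
            feats.append(F.interpolate(conv(p), size=(n, n), mode="bilinear",
                                       align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))
```

The concatenation gives 5·C channels, which the fusion convolution reduces back to C.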
Further, the dilated convolution module in step S3 is formed by connecting in series three dilated convolution units with different dilation rates; the rates of the three units are 1, 2 and 4, and the kernels are all 3×3. With input feature F22, the features obtained from the three units are F41, F42 and F43 respectively. The units are coupled in a dense manner: the input of each dilated convolution unit is added to that unit's output to form its result. After the dilated convolution module, a feature F4 with the same resolution as F22 is obtained: F4 = Add(F22, F41, F42, F43), where Add is a pixel-wise addition. Using dilated convolution instead of ordinary convolution plus pooling acquires deeper semantic information by enlarging the receptive field, while avoiding the loss of detail that the resolution reduction of pooling would cause.
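The dilated-convolution branch can be sketched in PyTorch as below. The ReLU after each unit is our assumption (the patent text does not name the activation); `padding=d` with dilation `d` preserves the resolution for a 3×3 kernel.

```python
import torch
import torch.nn as nn

class DilatedConvModule(nn.Module):
    """Three 3x3 dilated conv units (dilation 1, 2, 4) in series; each unit's
    input is added to its output, and F4 = Add(F22, F41, F42, F43)."""
    def __init__(self, ch):
        super().__init__()
        self.units = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in (1, 2, 4))
        self.relu = nn.ReLU(inplace=True)  # activation choice is an assumption

    def forward(self, x):
        outs, h = [], x
        for unit in self.units:
            h = self.relu(unit(h)) + h  # each unit adds its input to its output
            outs.append(h)
        # Pixel-wise sum of the input and all three unit outputs.
        return x + outs[0] + outs[1] + outs[2]
```

Because every unit keeps the spatial size, F4 has the same resolution as F22, as the text requires.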
Further, the deconvolution layers in steps S4 and S5 are implemented as transposed convolutions.
Further, in step S7, the established pyramid dilated-convolution network is trained end-to-end; the training strategy adopts the stochastic gradient descent algorithm, and the loss function uses categorical cross-entropy (categorical_crossentropy):

l_c = -(1/M) · Σ_{f_s ∈ F_s} Σ_{k=1}^{K} y_{f_s}^k · log(p_{f_s}^k)

where l_c denotes the categorical cross-entropy loss of the segmented feature map F_s, f_s denotes a voxel of the feature map F_s, M is the number of voxels of F_s, K is the number of classes, y_{f_s}^k indicates whether voxel f_s belongs to class k, and p_{f_s}^k denotes the predicted probability that voxel f_s belongs to class k.
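A minimal NumPy sketch of this categorical cross-entropy (unweighted; the embodiment later adds per-class weights), with voxels flattened into rows:

```python
import numpy as np

def categorical_cross_entropy(probs, labels):
    """probs: (M, K) softmax outputs per voxel; labels: (M,) integer class ids.
    Returns the mean cross-entropy over the M voxels: l_c above."""
    m, k = probs.shape
    onehot = np.eye(k)[labels]  # y_{f_s}^k: one-hot class indicators
    return float(-(onehot * np.log(probs + 1e-12)).sum() / m)
```

For two voxels predicted with probabilities 0.9 and 0.8 on their true classes, the loss is -(log 0.9 + log 0.8)/2.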
Compared with the prior art, the invention has the following advantages and effects:
(1) The method adopts a dilated convolution module to extract deep semantic information; compared with conventional convolution and pooling, dilated convolution enlarges the receptive field without reducing resolution. The module contains three dilated convolution layers with different dilation rates, coupled in a dense manner, so semantic information can be acquired at multiple scales.
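The receptive-field gain claimed here can be checked with a short computation (3×3 kernels, stride 1, dilation rates 1, 2, 4 as in the module; the formula is the standard one for stacked convolutions):

```python
def stacked_receptive_field(kernel=3, dilations=(1, 2, 4)):
    """Receptive field of serially stacked dilated convolutions with stride 1:
    each layer with dilation d enlarges the field by d * (kernel - 1)."""
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)
    return rf

# Three dilated 3x3 layers (d = 1, 2, 4) see a 15x15 window, versus 7x7 for
# three ordinary 3x3 layers -- with no loss of feature-map resolution.
```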
(2) The invention additionally uses a pyramid spatial pooling module to extract information at several scales, effectively capturing both the deep semantic information and the shallow detail information contained in the image.
(3) The invention uses residual recursive convolution in place of ordinary convolution, which helps train a deeper network structure and obtain feature representations better suited to the segmentation task.
(4) The residual recursive convolution, dilated convolution and pyramid pooling modules together form an algorithm that can be trained end to end; compared with a two-stage algorithm, it has fewer parameters and is easier to train.
Drawings
FIG. 1 is a flow chart of the semantic segmentation method based on a pyramid dilated-convolution network disclosed by the invention;
FIG. 2(a) is a schematic diagram of a residual recursive convolution module in an embodiment of the present invention, and FIG. 2(b) is a schematic diagram of the recursive convolution unit used in FIG. 2(a);
FIG. 3 is a schematic diagram of the spatial pyramid pooling module in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the dilated convolution module in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Embodiment. As shown in FIG. 1, this embodiment provides a semantic segmentation method based on a pyramid dilated-convolution network, comprising the following steps:
S1. Acquire a medical image data set containing ground-truth segmentation results, and apply data augmentation and other preprocessing to it. Since most medical image data sets are small and have low contrast, the images are first rotated, sliced, normalized and processed with adaptive histogram equalization.
S2. The preprocessed image passes sequentially through the first residual recursive convolution module, a pooling layer, the second residual recursive convolution module and a second pooling layer, extracting semantic information at multiple scales and yielding the shallow image features F11, F12, F21 and F22 respectively. Specifically: as shown in FIG. 2(a), the residual recursive convolution module passes the input through two cascaded recursive convolution layers and then adds the input back in a residual manner to obtain the output; as shown in FIG. 2(b), a recursive convolution unit is connected in the order conv, ReLU, Add, conv, ReLU, where conv is a convolution layer with a 3×3 kernel and Add is a pixel-wise addition with the input. The pooling layers are max pooling with stride 2.
S3. The image feature F22 passes through the network formed by the pyramid pooling module and the dilated convolution module in parallel: F22 passes through the pyramid pooling module to obtain the feature F3 and through the dilated convolution module to obtain the feature F4. F3 and F4 are then concatenated channel-wise and passed through a convolution layer with a 1×1 kernel to obtain the deep image feature F5, thereby further extracting deep semantic information. Specifically:
As shown in FIG. 3, the pyramid pooling module comprises four adaptive average pooling layers (avgpool in FIG. 3) with different pooling sizes, used to capture the feature F22 obtained in step S2 at multiple scales. The four pooling layers use pooling sizes N, N/2, N/3 and N/6 respectively, where N denotes the resolution of F22. The differently sized features from the pooling layers each pass through a convolution layer with a 1×1 kernel (conv 1×1 in FIG. 3) and are then upsampled by transposed convolution (up-conv in FIG. 3) to the same size as F22, giving the features F31, F32, F33 and F34. The upsampled result of each scale is concatenated with the input feature F22, and the concatenated features pass through a convolution layer with a 3×3 kernel to obtain the feature F3, i.e. F3 = Conv(Concatenate(F22, F31, F32, F33, F34)), where Concatenate is the aggregation operation and Conv is a 3×3 convolution.
As shown in FIG. 4, the dilated convolution module is formed by connecting in series three dilated convolution units with different dilation rates; the rates of the three units are 1, 2 and 4, and the kernels are all 3×3. With input feature F22, the features obtained from the three units are F41, F42 and F43 respectively. The units are coupled in a dense manner: the input of each unit is added to that unit's output to form its result. After the dilated convolution module, a feature F4 with the same resolution as F22 is obtained: F4 = Add(F22, F41, F42, F43), where Add is a pixel-wise addition.
In this embodiment, channel-wise aggregation means concatenation along the channel dimension: if the feature F3 has C1 channels and the feature F4 has C2 channels, the aggregated feature has C1 + C2 channels.
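The channel-wise aggregation rule above can be shown in two lines of NumPy; the channel counts and spatial size here are illustrative, not values from the patent.

```python
import numpy as np

# Channel-wise aggregation (concatenation): C1 + C2 output channels.
f3 = np.zeros((1, 64, 28, 28))  # C1 = 64 channels (illustrative)
f4 = np.zeros((1, 32, 28, 28))  # C2 = 32 channels (illustrative)
f = np.concatenate([f3, f4], axis=1)  # aggregate along the channel dimension
```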
S4. The image feature F5 passes through a deconvolution layer and is then concatenated channel-wise with the shallow image feature F21 delivered through a skip connection, giving the feature F61; F61 then passes through the third residual recursive convolution module to obtain the feature F62. Specifically: the deconvolution layer is a transposed convolution; the skip connection transmits the shallow feature directly and concatenates it channel-wise with the output of the deconvolution layer, the channel-wise aggregation being as described in step S3.
S5. The image feature F62 passes through a deconvolution layer and is then concatenated channel-wise with the shallow image feature F11 delivered through a skip connection, giving the feature F71; F71 then passes through the fourth residual recursive convolution module to obtain the feature F72. Specifically: the deconvolution layer is a transposed convolution; the skip connection transmits the shallow feature directly and concatenates it channel-wise with the output of the deconvolution layer, the channel-wise aggregation being as described in step S3.
S6. The image feature F72 is input into the softmax prediction layer to obtain the category to which each pixel in the original input image belongs.
S7. Train the pyramid dilated-convolution network: establish a loss function and determine the network parameters from training samples, specifically the learning rate, weight decay, momentum term and training strategy. The established network is trained end-to-end with stochastic gradient descent; the initial learning rate is set to 0.001, the weight decay to 10⁻⁴, and a momentum term of 0.9 is added. The loss function is a weighted categorical cross-entropy; it differs from the original cross-entropy loss in that the loss for voxels of the k-th class carries a weight v_k inversely proportional to the frequency of class k:

l_c = -(1/M) · Σ_{f_s ∈ F_s} Σ_{k=1}^{K} v_k · y_{f_s}^k · log(p_{f_s}^k)

where l_c denotes the categorical cross-entropy loss of the segmented feature map F_s, f_s denotes a voxel of F_s, M is the number of voxels of F_s, K is the number of classes, y_{f_s}^k indicates whether voxel f_s belongs to class k, and p_{f_s}^k denotes the predicted probability that voxel f_s belongs to class k.
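A NumPy sketch of this weighted loss, together with the stated SGD hyper-parameters. The normalization of the inverse-frequency weights (rescaled so they average to 1) is our assumption; the patent only states that v_k is inversely proportional to the class frequency.

```python
import numpy as np

def weighted_categorical_cross_entropy(probs, labels):
    """probs: (M, K) softmax outputs; labels: (M,) integer class ids.
    Class weight v_k is inversely proportional to the frequency of class k
    among the voxels (the rescaling so weights average to 1 is our choice)."""
    m, k = probs.shape
    freq = np.bincount(labels, minlength=k) / m
    v = 1.0 / np.maximum(freq, 1e-12)
    v = v / v.sum() * k  # rescale so the weights average to 1
    onehot = np.eye(k)[labels]
    return float(-(v * onehot * np.log(probs + 1e-12)).sum() / m)

# SGD hyper-parameters as stated in the embodiment.
SGD_CONFIG = {"lr": 1e-3, "weight_decay": 1e-4, "momentum": 0.9}
```

With perfectly balanced classes every v_k rescales to 1, and the loss reduces to the unweighted categorical cross-entropy.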
S8. Input the test image to be segmented into the trained pyramid dilated-convolution network to obtain the semantic segmentation result of the image.
In summary, the semantic segmentation method based on a pyramid dilated-convolution network disclosed in this embodiment builds and trains a pyramid dilated-convolution network, establishes a loss function, and determines the network parameters from training samples; the test image is then input into the trained network to obtain its semantic segmentation result. Combining dilated convolution with pyramid pooling effectively extracts multi-scale semantic and detail information and improves the segmentation quality of the network.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (7)
1. The semantic segmentation method based on the pyramid hole convolution network is characterized in that the pyramid hole convolution network comprises a first residual error recursive convolution module, a second residual error recursive convolution module, a pooling layer, a pyramid pooling module, a hole convolution module, an deconvolution layer, a third residual error recursive convolution module, a fourth residual error recursive convolution module and a softmax prediction layer, and the structural connection mode is as follows: the pyramid pooling module and the cavity convolution module are connected in parallel and then connected in series with the pooling layer, and then sequentially connected in series with the deconvolution layer, the third residual recursive convolution module, the deconvolution layer, the fourth residual recursive convolution module and the softmax prediction layer; the semantic segmentation method comprises the following steps:
S1, acquiring a medical image data set containing real segmentation results, and preprocessing the data set to achieve data augmentation;
S2, passing the preprocessed image sequentially through a first residual recursive convolution module, a pooling layer, a second residual recursive convolution module and a pooling layer to extract the semantic information of the image at multiple scales, obtaining the shallow image features F11, F12, F21 and F22 respectively;
S3, feeding the image feature F22 into a network in which a pyramid pooling module and a hole convolution module operate in parallel, wherein F22 passes through the pyramid pooling module to obtain the image feature F3, and through the hole convolution module to obtain the image feature F4; performing a channel-by-channel aggregation of F3 and F4 and passing the result through a convolution layer with a 1×1 kernel to obtain the deep image feature F5, thereby further extracting deep semantic information;
S4, passing the image feature F5 through a deconvolution layer and then performing a channel-by-channel aggregation with the shallow image feature F21 delivered through a skip connection to obtain the image feature F61; then passing F61 through a third residual recursive convolution module to obtain the image feature F62, wherein the skip connection directly forwards the shallow feature for channel-by-channel aggregation with the output of the deconvolution layer;
S5, passing the image feature F62 through a deconvolution layer and then performing a channel-by-channel aggregation with the shallow image feature F11 delivered through a skip connection to obtain the image feature F71; then passing F71 through a fourth residual recursive convolution module to obtain the image feature F72;
S6, inputting the image feature F72 into a softmax prediction layer to obtain the category of each pixel of the original input image;
S7, training the pyramid hole convolution network: establishing a loss function and determining the network parameters from the training samples;
S8, inputting the test image to be segmented into the trained pyramid hole convolution network to obtain the semantic segmentation result of the image.
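As a sanity check on the resolutions implied by steps S2-S6, the following sketch (not code from the patent; it assumes each pooling layer halves the spatial size and each deconvolution layer doubles it, and it ignores channel counts entirely) traces a square input through the network:

```python
# Illustrative resolution trace of steps S2-S6; the claims fix only the
# halving (pooling) and doubling (deconvolution) of spatial size.
def resolution_trace(h, w):
    t = {}
    t["F11"] = (h, w)            # after the first residual recursive conv module
    t["F12"] = (h // 2, w // 2)  # after the first pooling layer
    t["F21"] = (h // 2, w // 2)  # after the second residual recursive conv module
    t["F22"] = (h // 4, w // 4)  # after the second pooling layer
    t["F5"]  = t["F22"]          # pyramid pooling / hole conv keep the resolution
    t["F61"] = t["F21"]          # deconv doubles F5, aggregated with F21 (S4)
    t["F72"] = t["F11"]          # deconv doubles F62, aggregated with F11 (S5)
    return t

trace = resolution_trace(256, 256)
assert trace["F61"] == (128, 128)
assert trace["F72"] == (256, 256)  # softmax output matches the input size (S6)
```

The skip connections are consistent only because the deconvolution outputs land exactly on the resolutions of F21 and F11, which is why the claims pair them as they do.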
2. The semantic segmentation method based on the pyramid hole convolution network of claim 1, wherein the preprocessing in step S1 comprises rotation, slicing, normalization and adaptive histogram equalization.
3. The semantic segmentation method based on the pyramid hole convolution network of claim 1, wherein the first, second, third and fourth residual recursive convolution modules have the same structure; in each residual recursive convolution module, the input first passes through two recursive convolution layers connected in series, and the result is then added to the input in a residual manner to obtain the output; each recursive convolution layer is connected in the order conv, ReLU, Add, conv, ReLU, where conv is a convolution layer with a 3×3 kernel and Add denotes pixel-by-pixel addition with the layer input.
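The conv, ReLU, Add, conv, ReLU ordering of claim 3 can be sketched for a single channel as follows. This is a minimal NumPy illustration with hand-picked kernels, not the patent's implementation (which operates on multi-channel feature maps with learned weights):

```python
import numpy as np

def conv3x3(x, k):
    """'same'-padded 3x3 convolution on a 2D single-channel map."""
    p = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def recursive_conv_layer(x, k1, k2):
    # conv -> ReLU -> Add (with the layer input) -> conv -> ReLU, per claim 3
    y = relu(conv3x3(x, k1))
    y = y + x                       # pixel-by-pixel addition with the input
    return relu(conv3x3(y, k2))

def residual_recursive_module(x, kernels):
    # two recursive conv layers in series, then a residual add of the module input
    y = recursive_conv_layer(x, kernels[0], kernels[1])
    y = recursive_conv_layer(y, kernels[2], kernels[3])
    return y + x

# With a zero first kernel and an identity second kernel, each recursive
# layer passes a non-negative input through unchanged, so the module
# output is input + input:
x = np.ones((4, 4))
kz = np.zeros((3, 3))
ki = np.zeros((3, 3)); ki[1, 1] = 1.0
out = residual_recursive_module(x, [kz, ki, kz, ki])
assert np.allclose(out, 2 * x)
```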
4. The semantic segmentation method based on the pyramid hole convolution network of claim 1, wherein the pyramid pooling module in step S3 comprises four adaptive average pooling layers with different pooling sizes, which extract the image feature F22 obtained in step S2 at multiple scales; the four pooling layers adopt pooling sizes of N, N/2, N/3 and N/6 respectively, where N denotes the resolution of F22; the image features of different sizes produced by the pooling layers each pass through a convolution layer with a 1×1 kernel and are then restored by transposed convolution to the image features F31, F32, F33 and F34, which have the same size as F22; the upsampled result of each scale is then aggregated with the input feature F22, and the aggregated features pass through a convolution layer with a 3×3 kernel to obtain the image feature F3, i.e. F3 = Conv(Concatenate(F22, F31, F32, F33, F34)), where Concatenate is the aggregation operation and Conv is the 3×3 convolution operation.
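A single-channel sketch of the pooling arithmetic in claim 4, assuming N is divisible by 6 and using nearest-neighbour upsampling as a stand-in for the transposed convolution (the 1×1 and 3×3 convolutions are omitted). Note that pooling sizes N, N/2, N/3 and N/6 yield output grids of 1×1, 2×2, 3×3 and 6×6:

```python
import numpy as np

def adaptive_avg_pool(x, out):
    """Average-pool a square N x N map down to an out x out grid."""
    n = x.shape[0]
    return x.reshape(out, n // out, out, n // out).mean(axis=(1, 3))

def upsample_nearest(x, n):
    """Nearest-neighbour upsampling back to n x n (stand-in for the
    transposed convolution of claim 4)."""
    r = n // x.shape[0]
    return np.repeat(np.repeat(x, r, axis=0), r, axis=1)

def pyramid_pool(f22):
    n = f22.shape[0]                 # N: resolution of F22
    scales = [1, 2, 3, 6]            # grids from pooling sizes N, N/2, N/3, N/6
    branches = [upsample_nearest(adaptive_avg_pool(f22, s), n) for s in scales]
    # channel-by-channel aggregation of F22 with F31..F34; the final 3x3
    # convolution of claim 4 is omitted from this sketch
    return np.stack([f22] + branches, axis=0)

f3 = pyramid_pool(np.arange(144, dtype=float).reshape(12, 12))
assert f3.shape == (5, 12, 12)  # F22 plus four pooled-and-upsampled branches
```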
5. The semantic segmentation method based on the pyramid hole convolution network of claim 1, wherein in step S3 the hole convolution module is formed by connecting three hole convolution units with different dilation factors in series; the dilation factors of the three units are 1, 2 and 4 respectively, and all hole convolution kernels are 3×3; after the image feature F22 is input, the image features obtained from the three hole convolution units are F41, F42 and F43 respectively; the hole convolution units are densely connected, i.e. the input of each hole convolution unit is added to its output to form the output; after the hole convolution module, an image feature F4 with the same resolution as F22 is obtained, F4 = Add(F22, F41, F42, F43), where Add is the pixel-by-pixel addition operation.
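One consequence of stacking the three units in series is an enlarged receptive field at unchanged resolution: with a 3×3 kernel, each unit adds (3−1)·dilation pixels, so dilation factors 1, 2 and 4 yield a 15×15 field. A small sketch of that arithmetic (illustrative, not code from the patent):

```python
def receptive_field(kernel, dilations):
    """Receptive field of same-padded hole (dilated) convolutions stacked
    in series: each unit widens the field by (kernel - 1) * dilation."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

assert receptive_field(3, [1]) == 3        # a plain 3x3 convolution
assert receptive_field(3, [1, 2, 4]) == 15  # the module of claim 5
```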
6. The semantic segmentation method based on the pyramid hole convolution network of claim 1, wherein the deconvolution layers in steps S4 and S5 are transposed convolutions.
7. The semantic segmentation method based on the pyramid hole convolution network of claim 1, wherein in step S7 the established pyramid hole convolution network is trained end to end, a stochastic gradient descent algorithm is adopted as the training strategy, and the loss function is the categorical cross-entropy, given by

l_c = -(1/M) Σ_{f_s ∈ F_s} Σ_{k=1}^{K} y_{f_s,k} log(p_{f_s,k})

wherein l_c represents the categorical cross-entropy loss of the segmented feature map F_s, f_s denotes a voxel of the feature map F_s, M is the number of voxels in F_s, K is the number of classes, y_{f_s,k} indicates whether voxel f_s belongs to class k, and p_{f_s,k} represents the predicted probability that voxel f_s belongs to class k.
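A minimal NumPy sketch of the categorical cross-entropy above, assuming one-hot labels and a mean reduction over the M voxels (the eps guard is a numerical-safety detail added here, not part of the claim):

```python
import numpy as np

def categorical_cross_entropy(y, p, eps=1e-12):
    """l_c = -(1/M) * sum_m sum_k y[m, k] * log(p[m, k]),
    with y one-hot labels (M voxels x K classes) and p predicted
    class probabilities of the same shape."""
    m = y.shape[0]
    return -np.sum(y * np.log(p + eps)) / m

y = np.array([[1.0, 0.0], [0.0, 1.0]])          # two voxels, two classes
p_perfect = np.array([[1.0, 0.0], [0.0, 1.0]])
assert abs(categorical_cross_entropy(y, p_perfect)) < 1e-9  # perfect prediction
# a uniform prediction over two classes costs log 2 ~ 0.693 per voxel
assert categorical_cross_entropy(y, np.full((2, 2), 0.5)) > 0.69
```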
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010108637.8A CN111369563B (en) | 2020-02-21 | 2020-02-21 | Semantic segmentation method based on pyramid void convolutional network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111369563A CN111369563A (en) | 2020-07-03 |
CN111369563B true CN111369563B (en) | 2023-04-07 |
Family
ID=71208108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010108637.8A Active CN111369563B (en) | 2020-02-21 | 2020-02-21 | Semantic segmentation method based on pyramid void convolutional network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111369563B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145920A (en) * | 2018-08-21 | 2019-01-04 | 电子科技大学 | A kind of image, semantic dividing method based on deep neural network |
CN110232394A (en) * | 2018-03-06 | 2019-09-13 | 华南理工大学 | A kind of multi-scale image semantic segmentation method |
Non-Patent Citations (1)
Title |
---|
Edge-based image interpolation approach for video sensor network; Jinglun Shi et al.; 2011 8th International Conference on Information, Communications & Signal Processing; 2012-04-03; pp. 1-2 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||