CN118608792B - Mamba-based ultra-light image segmentation method and computer device - Google Patents
- Publication number
- CN118608792B (application CN202411082749.5A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- layer
- module
- parallel
- image segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention relates to the field of image segmentation, and in particular to a Mamba-based ultra-light image segmentation method and computer device that maintain excellent segmentation performance while optimizing computing resources to the greatest extent, yielding an ultra-light model better suited to mobile detection equipment. The technical solution comprises: acquiring an original image, preprocessing the original image to obtain an original image set, and dividing the original image set into a training set, a verification set and a test set according to a set proportion; constructing a Mamba-based ultra-light image segmentation model; taking the original images in the training set and the verification set as input to the Mamba-based ultra-light image segmentation model and performing image segmentation training on it; and inputting the original images in the test set into the trained Mamba-based ultra-light image segmentation model to obtain the image segmentation results. The invention is suitable for image segmentation.
Description
Technical Field
The invention relates to the field of image segmentation, in particular to an ultra-light image segmentation method based on Mamba and a computer device.
Background
Conventional image segmentation is typically implemented with deep learning networks built on convolutional or Transformer architectures. Convolution has excellent local feature extraction capability but falls short in establishing long-range dependencies. The self-attention mechanism solves long-range information extraction over a continuous patch sequence, but it also brings a significant computational load. To improve segmentation performance, most methods tend to add modules that increase complexity. However, this is unsuitable for practical application scenarios, especially mobile detection equipment, where limited computational resources make heavy-computation models impractical.
In recent years, state space models (SSMs), represented by Mamba, have become powerful competitors to traditional convolutional neural networks and Transformer architectures. SSMs exhibit linear complexity with respect to input size and memory footprint, which makes them a key foundation for lightweight models. Furthermore, SSMs are adept at capturing long-range dependencies, directly addressing the weakness of convolution in extracting information over long distances. In industrial inspection, practical computing power and memory constraints must often be considered.
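For reference, the standard continuous-time state space formulation that Mamba builds on, and its zero-order-hold discretization, can be written as follows (general SSM background given for context, not a formula reproduced from this patent):

```latex
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B
h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k
```

Because each step depends only on the previous hidden state, a sequence can be processed with computation and memory that grow linearly in its length, which is the property lightweight Mamba-based models exploit.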
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a Mamba-based ultra-light image segmentation method and computer device that maintain excellent segmentation performance while optimizing computing resources to the greatest extent, yielding an ultra-light model better suited to mobile detection equipment.
To achieve the above object, the present invention adopts the following technical scheme. In a first aspect, the present invention provides a Mamba-based ultra-light image segmentation method, comprising:
S1, acquiring an original image, preprocessing the original image to obtain an original image set, and dividing the original image set into a training set, a verification set and a test set according to a set proportion;
S2, constructing an ultra-light image segmentation model based on Mamba;
The Mamba-based ultra-light image segmentation model mainly comprises an encoder, a decoder, and skip connections between the encoder and the decoder;
The encoder comprises a first residual convolution module, a second residual convolution module, a first parallel convolution module and a second parallel convolution module; the decoder comprises a convolution module, a first parallel vision module, a second parallel vision module and a third parallel vision module; the skip connections perform multi-level, multi-scale information fusion through an attention mechanism module, which mainly consists of a spatial attention mechanism submodule and a channel attention mechanism submodule;
The first residual convolution module and the second residual convolution module have the same structure and mainly consist of three parallel convolution layers. The first parallel convolution module and the second parallel convolution module have the same structure and mainly consist of four parallel layers, each of which mainly consists of three residual-connected branches: the first branch consists of a visual state space block and a skip connection, the second branch consists of a standard convolution with kernel size 3 and a skip connection, and the third branch consists of a standard convolution with kernel size 5 and a skip connection;
The first parallel vision module, the second parallel vision module and the third parallel vision module have the same structure and mainly consist of four parallel layers, each of which mainly consists of a visual state space block and a skip connection. The visual state space block mainly consists of two branches: the first branch mainly consists of a linear layer and a SiLU activation function, and the second branch mainly consists of a linear layer, a depthwise convolution, a SiLU activation function, a state space model and a layer normalization layer; finally, the outputs of the two branches are combined by element-wise multiplication;
S3, taking the original images in the training set and the verification set as input of an ultra-light image segmentation model based on Mamba, and performing image segmentation training on the ultra-light image segmentation model based on Mamba;
S4, inputting the original image in the test set into a trained ultra-light image segmentation model based on Mamba to obtain an image segmentation result.
Further, S3 specifically includes:
Encoder training process: respectively inputting the original image into the 3×3 convolution layer and the 5×5 convolution layer of the first residual convolution module to obtain two corresponding branch results, and merging the two branch results to obtain a first feature map; inputting the original image into the 1×1 convolution layer of the first residual convolution module to obtain a feature map, fusing it with the first feature map, and outputting a fused second feature map; and inputting the second feature map into the second residual convolution module, which outputs a third feature map;
Inputting the third feature map, with C channels, into the layer normalization layer of the first parallel convolution module and dividing it into four corresponding feature maps with C/4 channels each; inputting each feature map into its own parallel layer; splicing the outputs of the three residual-connected branches in each parallel layer with their adjustment factors to obtain three corresponding feature maps; adding the three feature maps of each parallel layer element-wise to obtain four intermediate feature maps with C/4 channels; combining the four intermediate feature maps into a fourth feature map with C channels through a splicing operation; and finally outputting a fifth feature map through the layer normalization layer and a projection operation layer.
Further, S3 specifically further includes:
Decoder training process: inputting the feature map with C channels output by the encoder into the layer normalization layer of the third parallel vision module and dividing it into four corresponding feature maps with C/4 channels each; inputting each feature map into a visual state space block (VSS Block), then performing residual splicing and applying the adjustment factors; obtaining a feature map with C channels through feature map splicing; outputting the corresponding feature map through the layer normalization layer and a projection operation layer; inputting it into a convolution module with kernel size 1; and outputting the segmented image.
Further, the multi-level, multi-scale information fusion performed by the skip connections through the attention mechanism specifically includes:
Firstly, inputting the feature map into the spatial attention mechanism submodule and performing maximum pooling and average pooling separately; splicing the two pooled results along the channel dimension; performing a convolution operation with a convolution layer; limiting the output to the range [0, 1] with a Sigmoid activation function; and multiplying the input feature map by this result and adding it to the input feature map to obtain the spatial attention feature map;
Taking the spatial attention feature map as the input of the channel attention mechanism submodule; performing global average pooling on the input feature map through the channel attention mechanism submodule, compressing its spatial dimensions through adaptive pooling while retaining channel information; calculating global attention weights through a one-dimensional convolution; calculating the attention weight of each channel through a fully connected layer or a convolution layer; limiting the attention weights to the range [0, 1] with a Sigmoid activation function; applying the calculated attention weights to the corresponding input feature map; and adding the input feature map to return the final attention feature map.
In a second aspect, the present invention provides a computer apparatus comprising a memory storing program instructions that, when executed, perform the Mamba-based ultra-lightweight image segmentation method described above.
The beneficial effects of the invention are as follows:
The invention introduces the visual state space block as a basic block to capture extensive contextual information, combines it with the excellent local feature extraction capability of convolution, proposes a parallel processing method that splits channels and applies local convolution, greatly reducing the parameter count and computation, and constructs an asymmetric encoder-decoder structure. An ultra-lightweight model is obtained while excellent segmentation performance is maintained, so that computational resources are optimized to the greatest extent and the method is better suited to mobile detection equipment.
Drawings
FIG. 1 is a flowchart of an ultra-light image segmentation method based on Mamba according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an ultra-light image segmentation model based on Mamba according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a spatial attention mechanism submodule according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a channel attention mechanism submodule according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a first parallel convolution module according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of each parallel layer in the MPL (Multiple Parallel Vision Layer) module, i.e., the first parallel convolution module, provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of each parallel layer in the PVL (Parallel Vision Layer) module, i.e., the first parallel vision module, provided by an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a visual state space block according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
The invention provides a Mamba-based ultra-light image segmentation method, as shown in fig. 1, comprising the following steps:
S1, acquiring an original image, preprocessing the original image to obtain an original image set, and dividing the original image set into a training set, a verification set and a test set according to a set proportion.
S2, constructing an ultra-light image segmentation model based on Mamba.
As shown in fig. 2, the Mamba-based ultra-light image segmentation model has a four-layer structure and mainly comprises an encoder, a decoder, and skip connections between them;
The encoder comprises a first residual convolution module, a second residual convolution module, a first parallel convolution module and a second parallel convolution module; the decoder comprises a convolution module, a first parallel vision module, a second parallel vision module and a third parallel vision module; the skip connections perform multi-level, multi-scale information fusion through an attention mechanism module, which mainly consists of a spatial attention mechanism submodule and a channel attention mechanism submodule, whose structures are shown in figs. 3 and 4, respectively;
The first residual convolution module and the second residual convolution module have the same structure and mainly consist of three parallel convolution layers. The first parallel convolution module and the second parallel convolution module have the same structure and, as shown in fig. 5, mainly consist of four parallel layers. As shown in fig. 6, each parallel layer mainly consists of three residual-connected branches. The first branch consists of a visual state space block and a skip connection; a scale (a scalar value) is introduced to control the scaling of the skip connection, which mitigates the vanishing-gradient problem and accelerates training. The second branch consists of a standard convolution with kernel size 3 and a skip connection, and the third branch consists of a standard convolution with kernel size 5 and a skip connection; scales are likewise introduced to control the scaling of these skip connections.
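As a concrete illustration, the following is a minimal PyTorch-style sketch of one such parallel layer, assuming the scale factors act multiplicatively on the skip connections; the class and parameter names (MPLParallelLayer, scale1, vss_block, and so on) are illustrative rather than taken from the patent, and the visual state space block is injected as a placeholder (a sketch of it follows the next paragraph).

```python
import torch
import torch.nn as nn
from typing import Optional

class MPLParallelLayer(nn.Module):
    """One parallel layer of the parallel convolution module (fig. 6): three
    residual-connected branches (VSS block, 3x3 conv, 5x5 conv), each with a
    learnable scale on its skip connection, fused by element-wise addition."""
    def __init__(self, channels: int, vss_block: Optional[nn.Module] = None):
        super().__init__()
        # Placeholder for the visual state space block (identity if none is given).
        self.vss = vss_block if vss_block is not None else nn.Identity()
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        # Learnable scalars controlling the scaling of the skip connections.
        self.scale1 = nn.Parameter(torch.ones(1))
        self.scale2 = nn.Parameter(torch.ones(1))
        self.scale3 = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b1 = self.vss(x) + self.scale1 * x    # branch 1: VSS block + scaled skip
        b2 = self.conv3(x) + self.scale2 * x  # branch 2: 3x3 conv + scaled skip
        b3 = self.conv5(x) + self.scale3 * x  # branch 3: 5x5 conv + scaled skip
        return b1 + b2 + b3                   # element-wise fusion of the three branches
```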
The first parallel vision module, the second parallel vision module and the third parallel vision module are similar in structure to the first parallel convolution module and, as shown in fig. 5, mainly consist of four parallel layers. Unlike the first parallel convolution module, each parallel layer of the parallel vision modules consists of a single residual-connected branch: the two parallel convolution layers are absent and only the visual state space block remains, so that high-level features are preserved while the resolution of the feature map is restored. As shown in fig. 7, each such parallel layer mainly consists of a visual state space block and a skip connection. As shown in fig. 8, the visual state space block mainly consists of two branches: the first branch mainly consists of a linear layer and a SiLU activation function, and the second branch mainly consists of a linear layer, a depthwise convolution, a SiLU activation function, a state space model and a layer normalization layer; finally, the outputs of the two branches are combined by element-wise multiplication.
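A minimal sketch of the visual state space block along the same lines; the channel-last handling for the linear layers, the projection widths, and the placeholder for the state space core (in a real implementation a 2D selective scan such as the SS2D used by VMamba) are assumptions rather than details specified in the patent.

```python
import torch
import torch.nn as nn
from typing import Optional

class VSSBlock(nn.Module):
    """Visual state space block (fig. 8): branch 1 is linear + SiLU; branch 2 is
    linear -> depthwise conv -> SiLU -> state space model -> layer norm; the two
    branches are merged by element-wise multiplication."""
    def __init__(self, channels: int, ssm: Optional[nn.Module] = None):
        super().__init__()
        self.in_proj1 = nn.Linear(channels, channels)
        self.act1 = nn.SiLU()
        self.in_proj2 = nn.Linear(channels, channels)
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                                groups=channels)               # depthwise convolution
        self.act2 = nn.SiLU()
        self.ssm = ssm if ssm is not None else nn.Identity()   # placeholder for the 2D selective scan
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); the linear layers act on the channel dimension.
        y = x.permute(0, 2, 3, 1)                               # (B, H, W, C)
        b1 = self.act1(self.in_proj1(y))                        # branch 1: linear + SiLU
        b2 = self.in_proj2(y).permute(0, 3, 1, 2)               # branch 2: linear ...
        b2 = self.act2(self.dwconv(b2)).permute(0, 2, 3, 1)     # ... depthwise conv + SiLU
        b2 = self.norm(self.ssm(b2))                            # ... SSM core, then layer norm
        return (b1 * b2).permute(0, 3, 1, 2)                    # multiply and restore (B, C, H, W)
```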
S3, taking the original images in the training set and the verification set as input of an ultra-light image segmentation model based on Mamba, and performing image segmentation training on the ultra-light image segmentation model based on Mamba.
S4, inputting the original image in the test set into a trained ultra-light image segmentation model based on Mamba to obtain an image segmentation result.
In one embodiment of the present invention, S3 specifically includes:
Encoder training process: the original image I is input separately into the 3×3 convolution layer and the 5×5 convolution layer of the first residual convolution module to obtain two branch results, which are merged to obtain a first feature map F1; the original image I is then input into the 1×1 convolution layer of the first residual convolution module, the resulting feature map is fused with F1, and a second feature map F2 is output; F2 is input into the second residual convolution module, which outputs a third feature map F3. The operation can be expressed as:
F1 = Cat(Conv3×3(I), Conv5×5(I));
F2 = Add(Conv1×1(I), F1);
where Cat denotes the concatenation (join) operation and Add denotes element-wise addition.
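A minimal sketch of the residual convolution module corresponding to these formulas; splitting the output channels evenly between the 3×3 and 5×5 branches so that their concatenation matches the 1×1 branch is an assumption introduced to make the shapes line up.

```python
import torch
import torch.nn as nn

class ResidualConvModule(nn.Module):
    """Residual convolution module: the 3x3 and 5x5 branches are concatenated
    (Cat) and then fused with the 1x1 branch by element-wise addition (Add)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        assert out_channels % 2 == 0, "each concatenated branch gets half the output channels"
        self.conv3 = nn.Conv2d(in_channels, out_channels // 2, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_channels, out_channels // 2, kernel_size=5, padding=2)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = torch.cat([self.conv3(x), self.conv5(x)], dim=1)  # Cat of the two branches
        return self.conv1(x) + f1                              # Add with the 1x1 branch
```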
As shown in FIG. 5, the third feature map F3, with C channels, is first input into the layer normalization layer of the first parallel convolution module and then divided into four feature maps with C/4 channels each. Each feature map is input into its own parallel layer (containing the visual state space block and the 3×3 and 5×5 convolutions); within each parallel layer, the outputs of the three residual-connected branches are combined with their scale adjustment factors and added element-wise, giving four intermediate feature maps with C/4 channels. The four intermediate feature maps are combined into a fourth feature map F4 with C channels through a splicing operation, and a fifth feature map F5 is finally output through the layer normalization layer and the projection operation layer.
The operation can be expressed as:
{F3^1, F3^2, F3^3, F3^4} = Chunk4(LayerNorm(F3));
F4 = Cat(PL(F3^1), PL(F3^2), PL(F3^3), PL(F3^4));
F5 = Project(LayerNorm(F4));
where Chunk4 denotes dividing the input feature map into four parts along the channel dimension; LayerNorm denotes layer normalization; PL denotes a parallel layer; Project denotes the projection operation; and Reshape denotes changing the shape of a multi-dimensional array.
The fifth feature map F5 is input into the second parallel convolution module, and a sixth feature map F6 is obtained through the same operations.
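A minimal sketch of the chunk-and-merge wrapper shared by the parallel convolution modules (and, with a VSS-only parallel layer, by the parallel vision modules); the channel-last layer normalization and the use of a linear layer for the projection operation are assumptions, and `make_layer` is an illustrative factory argument rather than a name from the patent.

```python
import torch
import torch.nn as nn
from typing import Callable

class ParallelChunkModule(nn.Module):
    """Layer-normalize, split the C channels into four C/4 groups (Chunk4), run
    each group through its own parallel layer, concatenate, then apply a second
    layer norm and a projection, as in the formulas above."""
    def __init__(self, channels: int, make_layer: Callable[[int], nn.Module]):
        super().__init__()
        assert channels % 4 == 0
        self.norm_in = nn.LayerNorm(channels)
        self.layers = nn.ModuleList([make_layer(channels // 4) for _ in range(4)])
        self.norm_out = nn.LayerNorm(channels)
        self.proj = nn.Linear(channels, channels)   # projection operation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        y = self.norm_in(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        chunks = torch.chunk(y, 4, dim=1)                         # Chunk4 along the channel dim
        y = torch.cat([layer(c) for layer, c in zip(self.layers, chunks)], dim=1)
        y = self.proj(self.norm_out(y.permute(0, 2, 3, 1)))       # LayerNorm, then Project
        return y.permute(0, 3, 1, 2)

# Hypothetical usage: an encoder MPL stage could pass a factory building the three-branch
# parallel layer, e.g. ParallelChunkModule(64, lambda c: MPLParallelLayer(c)), while a
# decoder PVL stage would pass a VSS-only layer instead.
```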
Decoder training process: the sixth feature map F6, with C channels, is first input into the layer normalization layer of the third parallel vision module and then divided into four feature maps with C/4 channels each; each feature map is input into a visual state space block, after which residual splicing and the adjustment factors are applied; a feature map with C channels is obtained through feature map splicing, and a seventh feature map F7 is output through the layer normalization layer and the projection operation layer. F7 is then input into the second parallel vision module, which outputs an eighth feature map F8; F8 is input into the first parallel vision module, which outputs a ninth feature map F9; finally, F9 is input into a convolution module with kernel size 1, which outputs the segmented image.
In one embodiment of the present invention, the multi-level, multi-scale information fusion performed by the skip connections through the attention mechanism specifically includes:
As shown in fig. 3, the feature map is input into the spatial attention mechanism submodule and maximum pooling and average pooling are performed separately; the two pooled results are spliced along the channel dimension; a convolution operation is performed with a one-dimensional convolution layer; the output is limited to the range [0, 1] through a fully connected layer and a Sigmoid activation function; finally, the input feature map is multiplied by this result and added to the input feature map to obtain the spatial attention feature map.
As shown in fig. 4, the output of the spatial attention mechanism submodule is used as the input of the channel attention mechanism submodule. The channel attention mechanism submodule first performs global average pooling on the input feature map, compressing its spatial dimensions through adaptive pooling while retaining channel information; the result is then spliced with the feature maps of the other stages; global attention weights are calculated with a one-dimensional convolution, the attention weight of each channel is calculated with a fully connected layer or convolution layer, and the attention weights are limited to the range [0, 1] with a Sigmoid activation function; the calculated attention weights are applied to the corresponding input feature map, the input feature map is added, and the final attention feature map is returned. The operation can be expressed as:
W = σ(FCi(Conv1D(Concat(GAP(F1), GAP(F2), GAP(F3), GAP(F4)))));
F̃i = W ⊗ Fi + Fi;
where GAP denotes global average pooling; Fi are the feature maps of the different stages obtained from the encoder; Concat denotes the joining operation along the channel dimension; Conv1D denotes a one-dimensional convolution operation; FCi is the fully connected layer of stage i; σ is the Sigmoid function; and ⊗ denotes element-wise multiplication.
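A minimal sketch combining the two attention submodules used on the skip connections; the convolution kernel sizes and the single-stage simplification (the patent splices feature maps from several encoder stages before the one-dimensional convolution) are assumptions.

```python
import torch
import torch.nn as nn

class SkipAttention(nn.Module):
    """Spatial attention (channel-wise max/avg pooling -> concat -> conv ->
    Sigmoid, applied multiplicatively with a residual add) followed by channel
    attention (global average pooling -> 1D convolution across channels ->
    Sigmoid, applied multiplicatively with a residual add)."""
    def __init__(self, spatial_kernel: int = 7, channel_kernel: int = 3):
        super().__init__()
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.channel_conv = nn.Conv1d(1, 1, channel_kernel, padding=channel_kernel // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Spatial attention: pool along the channel dimension and weight each location.
        max_map, _ = x.max(dim=1, keepdim=True)
        avg_map = x.mean(dim=1, keepdim=True)
        s = self.sigmoid(self.spatial_conv(torch.cat([max_map, avg_map], dim=1)))
        x = x * s + x                                    # multiply, then residual add

        # Channel attention: GAP, 1D convolution over the channel axis, weight each channel.
        w = self.gap(x).squeeze(-1).transpose(1, 2)      # (B, 1, C)
        w = self.sigmoid(self.channel_conv(w)).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1)
        return x * w + x                                 # multiply, then residual add
```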
In order to verify the competitive performance of the Mamba-based ultra-light image segmentation model at this light weight, a comparison experiment was carried out between the Mamba-based lightweight model of the invention and classical medical image segmentation models. Specifically, the comparison objects include U-Net, VM-UNet, MALUNet, and UltraLight VM-UNet.
The models were evaluated on a segmentation dataset commonly used in medicine. Common indicators of model performance include the Dice Similarity Coefficient (DSC), Sensitivity (SE), Specificity (SP) and Accuracy (ACC). The specific data are shown in Table 1.
Table 1 parameter evaluation table
As shown in Table 1, the parameter count of the model of the invention is 99.94% lower than that of the conventional pure vision Mamba model (VM-UNet), 75.51% lower than the currently lightest vision Mamba model (UltraLight VM-UNet), 99.84% lower than the conventional U-Net model, and 93.14% lower than the MALUNet model. The GFLOPs of the model are 97.28% lower than VM-UNet; although GFLOPs rise slightly compared with UltraLight VM-UNet and MALUNet, the computation increases by only 32.58% relative to UltraLight VM-UNet while the parameter count falls by 75.51%, so the model overall remains superior to the currently lightest vision Mamba model. As for the other indicators, DSC is an accuracy index for evaluating the segmentation result, on which the model of the invention outperforms all of the above models; Sensitivity (SE) measures the ability of the model to correctly identify positive samples and Specificity (SP) the ability to correctly identify negative samples, a pair of opposing indicators; Accuracy measures the overall correct classification of samples. As can be seen from Table 1, the model is superior to the above models on these indicators as well. Despite such large reductions in parameters and GFLOPs, the performance of the model of the invention remains excellent and highly competitive.
The foregoing is merely a preferred embodiment of the invention. It should be understood that the invention is not limited to the form disclosed herein and is not to be construed as excluding other embodiments; it may be used in various other combinations, modifications and environments and may be altered within the scope of the inventive concept described herein, through the above teachings or through the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.
Claims (7)
1. A Mamba-based ultra-lightweight image segmentation method, comprising:
S1, acquiring an original image, preprocessing the original image to obtain an original image set, and dividing the original image set into a training set, a verification set and a test set according to a set proportion;
S2, constructing an ultra-light image segmentation model based on Mamba;
The Mamba-based ultra-light image segmentation model mainly comprises an encoder, a decoder, and skip connections between the encoder and the decoder;
The encoder comprises a first residual convolution module, a second residual convolution module, a first parallel convolution module and a second parallel convolution module; the decoder comprises a convolution module, a first parallel vision module, a second parallel vision module and a third parallel vision module; the skip connections perform multi-level, multi-scale information fusion through an attention mechanism module, which mainly consists of a spatial attention mechanism submodule and a channel attention mechanism submodule;
The first residual convolution module and the second residual convolution module have the same structure and mainly consist of three parallel convolution layers. The first parallel convolution module and the second parallel convolution module have the same structure and mainly consist of four parallel layers, each of which mainly consists of three residual-connected branches: the first branch consists of a visual state space block and a skip connection, the second branch consists of a standard convolution with kernel size 3 and a skip connection, and the third branch consists of a standard convolution with kernel size 5 and a skip connection;
The first parallel vision module, the second parallel vision module and the third parallel vision module have the same structure and mainly consist of four parallel layers, each of which mainly consists of a visual state space block and a skip connection. The visual state space block mainly consists of two branches: the first branch mainly consists of a linear layer and a SiLU activation function, and the second branch mainly consists of a linear layer, a depthwise convolution, a SiLU activation function, a state space model and a layer normalization layer; finally, the outputs of the two branches are combined by element-wise multiplication;
S3, taking the original images in the training set and the verification set as input of an ultra-light image segmentation model based on Mamba, and performing image segmentation training on the ultra-light image segmentation model based on Mamba;
S4, inputting the original image in the test set into a trained ultra-light image segmentation model based on Mamba to obtain an image segmentation result.
2. The Mamba-based ultra-lightweight image segmentation method as set forth in claim 1, wherein S3 specifically includes:
encoder training process: inputting the original image respectively into the 3×3 convolution layer and the 5×5 convolution layer of the first residual convolution module to obtain two corresponding branch results, and merging the two branch results to obtain a first feature map; fusing the feature map obtained by inputting the original image into the 1×1 convolution layer of the first residual convolution module with the first feature map, and outputting a fused second feature map; and inputting the second feature map into the second residual convolution module, which outputs a third feature map.
3. The Mamba-based ultra-lightweight image segmentation method as set forth in claim 2, wherein the encoder training process further includes:
Inputting the third feature map, with C channels, into the layer normalization layer of the first parallel convolution module and dividing it into four corresponding feature maps with C/4 channels each; inputting each feature map into its own parallel layer; splicing the outputs of the three residual-connected branches in each parallel layer with their adjustment factors to obtain three corresponding feature maps; adding the three feature maps of each parallel layer element-wise to obtain four intermediate feature maps with C/4 channels; combining the four intermediate feature maps into a fourth feature map with C channels through a splicing operation; and finally outputting a fifth feature map through the layer normalization layer and a projection operation layer.
4. The Mamba-based ultra-lightweight image segmentation method as set forth in claim 1, wherein S3 specifically further includes:
Decoder training process: inputting the feature map with C channels output by the encoder into the layer normalization layer of the third parallel vision module and dividing it into four corresponding feature maps with C/4 channels each; inputting each feature map into a visual state space block (VSS Block), then performing residual splicing and applying the adjustment factors; obtaining a feature map with C channels through feature map splicing; outputting the corresponding feature map through the layer normalization layer and a projection operation layer; inputting it into a convolution module with kernel size 1; and outputting the segmented image.
5. The Mamba-based ultra-lightweight image segmentation method as set forth in claim 1, wherein the multi-level, multi-scale information fusion performed by the skip connections through the attention mechanism module specifically includes:
Firstly, inputting the feature map into the spatial attention mechanism submodule and performing maximum pooling and average pooling separately; splicing the two pooled results along the channel dimension; performing a convolution operation with a convolution layer; limiting the output to the range [0, 1] with a Sigmoid activation function; and multiplying the input feature map by this result and adding it to the input feature map to obtain the spatial attention feature map.
6. The Mamba-based ultra-lightweight image segmentation method as set forth in claim 5, wherein the multi-level, multi-scale information fusion performed by the skip connections through the attention mechanism module further includes:
Taking the spatial attention feature map as the input of the channel attention mechanism submodule; performing global average pooling on the input feature map through the channel attention mechanism submodule, compressing its spatial dimensions through adaptive pooling while retaining channel information; calculating global attention weights through a one-dimensional convolution; calculating the attention weight of each channel through a fully connected layer or a convolution layer; limiting the attention weights to the range [0, 1] with a Sigmoid activation function; applying the calculated attention weights to the corresponding input feature map; and adding the input feature map to return the final attention feature map.
7. A computer apparatus comprising a memory storing program instructions that, when executed, perform the Mamba-based ultra-lightweight image segmentation method as set forth in any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411082749.5A CN118608792B (en) | 2024-08-08 | 2024-08-08 | Mamba-based ultra-light image segmentation method and computer device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118608792A CN118608792A (en) | 2024-09-06 |
CN118608792B (en) | 2024-10-01
Family
ID=92557473
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411082749.5A Active CN118608792B (en) | 2024-08-08 | 2024-08-08 | Mamba-based ultra-light image segmentation method and computer device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118608792B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116416432A (en) * | 2023-04-12 | 2023-07-11 | 西南石油大学 | Pipeline weld image segmentation method based on improved UNet |
CN116580192A (en) * | 2023-04-18 | 2023-08-11 | 湖北工业大学 | RGB-D semantic segmentation method and system based on self-adaptive context awareness network |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112465830B (en) * | 2020-11-11 | 2024-04-26 | 上海健康医学院 | Automatic segmentation method for polished glass-like lung nodule and computer equipment |
US20240062347A1 (en) * | 2022-08-22 | 2024-02-22 | Nanjing University Of Posts And Telecommunications | Multi-scale fusion defogging method based on stacked hourglass network |
CN118279319A (en) * | 2024-03-19 | 2024-07-02 | 福建理工大学 | Medical image segmentation method based on global attention and multi-scale features |
- 2024-08-08: CN application CN202411082749.5A granted as CN118608792B (active)
Also Published As
Publication number | Publication date |
---|---|
CN118608792A (en) | 2024-09-06 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |