CN117557579A - Method and system for unsupervised superpixel segmentation assisted by an atrous-pyramid collaborative attention mechanism - Google Patents
Method and system for unsupervised superpixel segmentation assisted by an atrous-pyramid collaborative attention mechanism
- Publication number
- CN117557579A (application CN202311567395.9A)
- Authority
- CN
- China
- Prior art keywords
- pixel
- image
- super
- attention mechanism
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention belongs to the technical field of digital image processing, and discloses a method and system for unsupervised superpixel segmentation assisted by an atrous-pyramid collaborative attention mechanism. The RGB channel information of the image is combined with the position information of the pixel points; a channel attention module is constructed using an attention mechanism; the output of the attention mechanism is processed with atrous spatial pyramid pooling; a loss function is constructed from a clustering loss term, a spatial smoothing loss term, and a reconstruction loss term; the model parameters are updated with an Adam optimizer, and the effective depth features obtained by the last separation are used for superpixel generation; the maximum over the channel dimension is obtained with the argmax function, the argmax result is converted into a two-dimensional array, and adaptive superpixel segmentation is completed in the CPU according to the size constraints. The invention has low complexity and strong adaptive and generalization capability, and provides effective support for improving image processing efficiency and accuracy.
Description
Technical Field
The invention belongs to the technical field of digital image processing, and particularly relates to a method and system for unsupervised superpixel segmentation assisted by an atrous-pyramid collaborative attention mechanism.
Background
Superpixel segmentation is an effective means of improving image processing efficiency and accuracy. Supervised superpixel segmentation methods based on deep learning rely on large amounts of labeled data, so they suffer from inaccurate segmentation results and insufficient generalization of the segmentation model caused by data bias. Traditional superpixel segmentation methods based on energy optimization, watersheds, graphs, and clustering depend on suitable parameter selection, cannot adaptively determine the number of superpixels according to the characteristics of the image, are sensitive to noise, and incur high computational complexity on large images. Unsupervised superpixel segmentation methods are not limited by labeled data, require no manual tuning of model parameters and structure, offer better generalization, and avoid noise interference and the complexity problems caused by changes in image size, making them an important alternative to supervised and traditional superpixel segmentation methods. However, unsupervised superpixel segmentation requires solving two key problems: effective depth feature extraction and superpixel generation.
Common feature extraction falls into two categories: hand-crafted features and depth features extracted by convolutional neural networks. The former typically converts the RGB channels of an image into a LAB representation and combines it with the positions of the pixels in the image to generate five-dimensional features for subsequent superpixel generation. The latter learns depth features automatically through a constructed convolutional model. Both approaches can extract useful features from the image, but hand-crafted features depend on the researchers' domain knowledge and experience with the problem, while CNN-based extraction depends on the constructed convolutional model and the choice of feature dimensions. Hand-crafted features are insufficient for superpixel segmentation of complex images; the depth features extracted by a convolutional neural network depend on the constructed model, where a simple convolutional model cannot extract useful features and a complex one is difficult to train.
Superpixel generation refers to segmenting the original image into small regions that are contiguous, compact, and have similar features, using the effective information obtained from feature extraction. Common methods are based on watersheds, graphs, and clustering. In watershed techniques, dark areas are generally regarded as valleys and lighter areas as ridges; the height of a ridge is defined by the gray value or gradient magnitude of a pixel, which enables superpixel generation but is not trainable. In graph-based techniques, pixels are treated as nodes and edge weights are defined by the similarity of adjacent pixels, but large images are difficult to process, and segmentation performance depends heavily on parameters such as merging rules, similarity measures, and the number of superpixels. Clustering techniques can generate superpixels quickly without additional labels, and trainable clustering algorithms (typically differentiable K-means clustering) have been developed, but many iterations are required for a good result, which increases the computational and time cost of the whole method; the remaining clustering techniques generally suffer from high computational complexity.
Through the above analysis, the problems and shortcomings of the prior art are as follows: existing superpixel segmentation methods have insufficient generalization capability, high complexity, difficulty in acquiring effective feature information, and a lack of adaptive capability.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a method and system for unsupervised superpixel segmentation assisted by an atrous-pyramid collaborative attention mechanism.
The invention is realized as follows: the image is preprocessed to introduce the spatial relationships among pixels; an attention mechanism is applied to the superpixel segmentation task for the first time, so that the model strengthens its attention to important feature channels and suppresses its response to irrelevant channels; atrous spatial pyramid pooling is used to enlarge the receptive field while reducing the number of parameters; an optimizer, combined with the constructed loss function, updates the parameters and extracts the final effective depth features; the argmax function is applied to the extracted effective depth features to convert the superpixel segmentation task into a classification problem; and final adaptive superpixel generation is achieved by adding size constraints. A clustering algorithm is thus avoided, and effective depth features are extracted with a small number of parameters, greatly reducing the complexity of the superpixel segmentation algorithm. The proposed superpixel segmentation method is unsupervised and therefore highly transferable.
Further, the unsupervised superpixel segmentation method promoted by the attention mechanism in cooperation with atrous spatial pyramid pooling comprises the following steps:
Step one, combining the RGB channel information of the image with the position information of the pixel points, converting the three-dimensional features into five-dimensional features. In image preprocessing, the image to be segmented is first assigned to a variable; the variable, given in array form, is expressed in tensor form, its dimensions are rearranged into [C, H, W] with the permute function, its data type is changed to floating point, and a batch dimension is added through a None operation, yielding a variable image of shape [1, C, H, W]. Height and width sequences are then generated with the torch.arange function and converted into two coordinate grids with the torch.meshgrid function; the grids are stacked with the torch.stack function, the result is concatenated with the [1, C, H, W] tensor in the channel dimension, and the whole is normalized to obtain the final image preprocessing result.
Step two, constructing a channel attention module using an attention mechanism. The image preprocessing result is applied to a point-by-point convolution layer to obtain a tensor of shape [1, 8, H, W]; channel global average pooling yields an aggregated feature; the result is processed by a fast one-dimensional convolution whose kernel size L is calculated automatically, followed by a sigmoid function, and is multiplied element-wise with the tensor from the point-by-point convolution to obtain the output of the attention mechanism. The weights of the point-by-point convolution and of the fast one-dimensional convolution associated with it are initialized with the Kaiming initialization method and the biases with a constant 0; for the instance normalization layers, the normalization weights are initialized with a constant 1, ensuring suitable initial parameter values during training.
Step three, processing the output of the attention mechanism with atrous spatial pyramid pooling and extracting depth features suitable for superpixel segmentation. The tensor obtained from the channel attention module is processed in parallel by: an aspp1 convolution layer with kernel size 1×1, padding 0, and sampling rate 1; an aspp2 convolution layer with kernel size 3×3, padding 2, and sampling rate 2; an aspp3 convolution layer with kernel size 3×3, padding 4, and sampling rate 4; an aspp4 convolution layer with kernel size 3×3, padding 6, and sampling rate 6; and an avgpool branch that resizes the input tensor to 15×15 with adaptive average pooling, applies a 1×1 convolution with stride 1, and applies a ReLU activation function. The output channels of all operations are set to 16, each output channel is normalized with instance normalization, and the results are concatenated in the channel dimension to obtain a tensor of shape [1, 80, H, W]. The concatenated tensor is processed by a convolution layer with kernel size 1×1, an instance normalization layer, and a ReLU function to obtain the final depth feature tensor of shape [1, 128, H, W]. The weights and biases of the convolution layers and instance normalization layers in the atrous spatial pyramid pooling are initialized in the same way as in the attention mechanism.
Step four, constructing the loss function: first a clustering loss term; then a spatial smoothing loss term that quantifies the differences between adjacent pixels; and finally a reconstruction loss term. The depth features obtained from atrous spatial pyramid pooling are separated into a tensor of shape [1, 125, H, W] for calculating the clustering and spatial smoothing loss terms, and a tensor of shape [1, 3, H, W] for calculating the reconstruction loss term. The channel dimension of the [1, 125, H, W] tensor is processed with the softmax function to convert it into class probabilities; the negative log likelihood of each pixel is calculated and averaged over all pixels to obtain the final loss value and the average probability of each class, which are used to construct the clustering loss term. The class probabilities obtained from the softmax function are regarded as a probability map; the difference between each element and its right neighbor along the W dimension and between each element and its lower neighbor along the H dimension are calculated for both the probability map and the variable image, yielding horizontal and vertical gradients, and these four gradients are used to construct the spatial smoothing loss term. The difference between the [1, 3, H, W] tensor and the variable image is measured with PyTorch's mean squared error function, which determines the reconstruction loss term.
Step five, updating the model parameters by setting the learning rate and number of iterations of the Adam optimizer, searching for the model parameters that minimize the loss function. Iteration is carried out through a defined optimize function, the Adam optimizer is selected to update the model parameters, the number of iterations is set to 500, and the learning rate is set to 1e-1. The coefficient of the clustering loss term in the loss function is set to the constant 2, and the weights of the spatial smoothing loss term and the reconstruction loss term are set to the constants 2 and 10 respectively, completing the construction of the whole loss function. In this way, the model automatically finds suitable parameters within the set number of iterations so as to minimize the overall loss function.
Step six, obtaining the maximum over the channel dimension with the argmax function, converting the final effective depth features into the most likely superpixel label index for each pixel point; converting the argmax result into a two-dimensional array, and completing adaptive superpixel segmentation in the CPU according to the size constraints.
Further, in the first step, image preprocessing is performed by using the following formula:
C_k = [R_k, G_k, B_k, i_k, j_k]^T

wherein R_k, G_k, B_k denote the color channel values of the k-th pixel in the image converted from integer to floating point; i_k, j_k denote the row and column numbers of the k-th pixel in the image; and C_k denotes the five-dimensional feature of the k-th pixel, which is used in subsequent processing.
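For illustration, a minimal PyTorch sketch of this preprocessing follows, assuming a NumPy H×W×3 RGB input and per-channel min-max normalization (the claims only state that the concatenated result is normalized); the function and variable names are illustrative, not taken from the patent.

import torch

def preprocess(img_np):
    # [H, W, 3] -> [3, H, W], floating point, plus a batch dimension -> [1, 3, H, W]
    image = torch.from_numpy(img_np).permute(2, 0, 1).float()[None]
    _, _, H, W = image.shape
    # Row/column sequences via arange, turned into coordinate grids and stacked.
    ys = torch.arange(H, dtype=torch.float32)
    xs = torch.arange(W, dtype=torch.float32)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_y, grid_x])[None]        # [1, 2, H, W]
    # Concatenate RGB and coordinates in the channel dimension -> [1, 5, H, W].
    feat = torch.cat([image, coords], dim=1)
    # Per-channel min-max normalization (assumed form of the "normalize" step).
    fmin = feat.amin(dim=(2, 3), keepdim=True)
    fmax = feat.amax(dim=(2, 3), keepdim=True)
    return (feat - fmin) / (fmax - fmin + 1e-8)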
Further, the construction in step two specifically comprises the following steps:
step 21, applying the preprocessed five-dimensional features to a point-by-point convolution layer, and under the condition of keeping the height and the width unchanged, carrying out linear combination and transformation on the five-dimensional features to realize eight-dimensional feature output;
step 22, applying channel global average pooling to χ ∈ R^(C×H×W) to obtain the aggregated feature g(χ):

g(χ)_c = (1/(H·W)) Σ_(i=1..H) Σ_(j=1..W) χ_(c,i,j)
Step 23, calculating a fast one-dimensional convolution of the aggregated features with kernel size L:
ω=σ(C1D L (g(χ)))
wherein sigma is a sigmoid function, C1D L The method is characterized in that the method is a quick one-dimensional convolution with the kernel size L, and omega is a learned channel weight; automatically calculating the kernel size L according to the number of channels:
where a=2, b=3, |t| odd Defined as the odd number closest to t.
Step 24, representing the result of processing the quick one-dimensional convolution with the kernel size L and the sigmoid function of the aggregation feature as ρ epsilon R C×1×1 And is combined with χ C×H×W The result of the attention mechanism processing obtained by the element product is expressed as f (χ) ε R C×H×W The shape is consistent with the eight-dimensional features of the input attention mechanism;
further, the eight-dimensional feature output is expressed as χ ε R C×H×W And directly apply global averaging pooling with channels, where H represents image height, WThe image width is represented, and C represents the number of channels of the image, where c=8.
Further, the atrous spatial pyramid pooling of step three specifically includes:
step 31, the depth features obtained after atrous spatial pyramid pooling are denoted F ∈ R^(C₁×H×W), where H and W remain unchanged and still correspond to the height and width of the image, and C₁ = 128; the intermediate convolution layers are calculated as follows:

y₁ = aspp1(f(χ)); y₂ = aspp2(f(χ)); y₃ = aspp3(f(χ)); y₄ = aspp4(f(χ)); y₅ = avgpool(f(χ))

wherein aspp1 denotes a convolution layer with kernel size 1×1, padding 0, and sampling rate 1; aspp2 a convolution layer with kernel size 3×3, padding 2, and sampling rate 2; aspp3 a convolution layer with kernel size 3×3, padding 4, and sampling rate 4; aspp4 a convolution layer with kernel size 3×3, padding 6, and sampling rate 6; and avgpool resizes the input tensor to 15×15 with an adaptive average pooling layer, applies a 1×1 convolution with stride 1 (input channels 8, output channels 16), normalizes each output channel with instance normalization, and finally introduces nonlinearity with a ReLU activation function. y₁, y₂, y₃, y₄ and y₅ are intermediate tensors;
step 32, the depth feature F is calculated as follows:

F = cbr[cat(y₁, y₂, y₃, y₄, y₅)]

wherein cat(·) denotes splicing y₁, y₂, y₃, y₄ and y₅ in the channel dimension into a larger tensor; since the output channels of aspp1, aspp2, aspp3, aspp4 and avgpool are all 16, the spliced tensor has 80 channels. cbr[·] denotes applying a 1×1 convolution to the spliced tensor with the output channels set to 128 as required for superpixel segmentation, normalizing each output channel with instance normalization, and finally applying a ReLU activation function to introduce nonlinearity.
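A sketch of this atrous spatial pyramid pooling under the stated hyper-parameters follows; the bilinear resize of the pooled branch back to H×W before splicing is an assumption, since the claims only state that the five branch outputs are spliced in the channel dimension.

import torch
import torch.nn as nn
import torch.nn.functional as Fn

def conv_in_relu(in_ch, out_ch, k, pad, dil):
    # convolution -> instance normalization -> ReLU, as used by every branch
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dil),
        nn.InstanceNorm2d(out_ch, affine=True),
        nn.ReLU(inplace=True),
    )

class ASPP(nn.Module):
    def __init__(self):
        super().__init__()
        self.aspp1 = conv_in_relu(8, 16, 1, 0, 1)
        self.aspp2 = conv_in_relu(8, 16, 3, 2, 2)
        self.aspp3 = conv_in_relu(8, 16, 3, 4, 4)
        self.aspp4 = conv_in_relu(8, 16, 3, 6, 6)
        self.pool_branch = nn.Sequential(           # avgpool branch: resize to 15x15, then 1x1 conv
            nn.AdaptiveAvgPool2d(15),
            conv_in_relu(8, 16, 1, 0, 1),
        )
        self.cbr = conv_in_relu(80, 128, 1, 0, 1)   # cbr[.]: 1x1 conv to 128 channels

    def forward(self, x):                           # x: f(χ), [1, 8, H, W]
        H, W = x.shape[2:]
        ys = [self.aspp1(x), self.aspp2(x), self.aspp3(x), self.aspp4(x)]
        y5 = Fn.interpolate(self.pool_branch(x), size=(H, W),
                            mode="bilinear", align_corners=False)  # assumed resize back to HxW
        return self.cbr(torch.cat(ys + [y5], dim=1))               # F: [1, 128, H, W]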
Further, the construction of the loss function in step four specifically includes:
step 41, the overall loss function consists of three parts: a clustering loss term, a spatial smoothing loss term, and a reconstruction loss term:

l_O = l_C + λ·l_S + μ·l_R

wherein l_O denotes the overall loss function; l_C the clustering loss term; l_S the spatial smoothing loss term; l_R the reconstruction loss term; λ and μ are constant coefficients with λ = 2 and μ = 10.
step 42, the clustering loss term is calculated as follows:

l_C = −(1/(H·W)) Σ_(i,j) Σ_n p_(i,j),n · log p_(i,j),n + β Σ_n p̄_n · log p̄_n

wherein β = 2; p̄_n denotes the average of the class probability vectors over all pixels; and p_(i,j),n denotes the class probability vector of the pixel in row i and column j, calculated as follows:

p_(i,j),n = softmax(F′_(i,j))_n

step 43, the first three channels of F in the channel dimension are separated for the subsequent reconstruction loss computation, and the features of the remaining 125 channels are denoted F′, so that n ∈ [0, 124]; softmax(F′_(i,j))_n denotes converting the channel dimension of F′ with the softmax function into the class probability of the corresponding pixel;
step 44, the spatial smoothing loss term is calculated in two parts, the smoothness losses in the x and y directions; the original input image is dimension-rearranged to obtain the image I. The term is defined by the absolute value of the probability difference weighted by an exponential function of the squared gradient of the image I, averaged over all pixels:

l_S = (1/(H·W)) Σ_(i,j) Σ_n ( |Δp_(x,ij),n| · exp(−(ΔI_(x,ij))²) + |Δp_(y,ij),n| · exp(−(ΔI_(y,ij))²) )

wherein n indicates that the channels are consistent with those of the clustering loss; Δp_(x,ij) denotes the pixel probability difference in the x direction; ΔI_(x,ij) the pixel intensity difference in the x direction; Δp_(y,ij) the pixel probability difference in the y direction; and ΔI_(y,ij) the pixel intensity difference in the y direction. These are calculated as follows:

Δp_(x,ij) = p_(i,j+1) − p_(i,j);  ΔI_(x,ij) = I_(i,j+1) − I_(i,j)
Δp_(y,ij) = p_(i+1,j) − p_(i,j);  ΔI_(y,ij) = I_(i+1,j) − I_(i,j)
step 45, the reconstruction loss term is calculated as follows:

l_R = (1/(H·W)) Σ_(i,j) ‖F″_(i,j) − I_(i,j)‖₂²

wherein F″ denotes the first three channels separated from the depth features when computing the clustering and spatial smoothing loss terms, which are used for image reconstruction, and ‖·‖₂ denotes the 2-norm.
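Putting the three terms together, the following is a minimal sketch of the overall loss under the formulas reconstructed above; the exact reductions (means versus sums) and the small ε added inside the logarithms are assumptions.

import torch
import torch.nn.functional as Fn

def total_loss(feats, img, beta=2.0, lam=2.0, mu=10.0):
    # feats: [1, 128, H, W] depth features; img: [1, 3, H, W] image I
    recon, rest = feats[:, :3], feats[:, 3:]      # F'' ([1,3,H,W]) and F' ([1,125,H,W])
    p = torch.softmax(rest, dim=1)                # class probabilities p_(i,j),n

    # clustering term: per-pixel entropy plus beta times the marginal-entropy term
    pbar = p.mean(dim=(2, 3))                     # average class probability over all pixels
    l_c = (-p * torch.log(p + 1e-8)).sum(1).mean() \
          + beta * (pbar * torch.log(pbar + 1e-8)).sum()

    # spatial smoothing term: |Δp| weighted by exp(-(ΔI)^2), in x and y
    dpx = (p[..., 1:] - p[..., :-1]).abs()
    dpy = (p[..., 1:, :] - p[..., :-1, :]).abs()
    dix = ((img[..., 1:] - img[..., :-1]) ** 2).sum(1, keepdim=True)
    diy = ((img[..., 1:, :] - img[..., :-1, :]) ** 2).sum(1, keepdim=True)
    l_s = (dpx * torch.exp(-dix)).mean() + (dpy * torch.exp(-diy)).mean()

    # reconstruction term: mean squared error between F'' and the image
    l_r = Fn.mse_loss(recon, img)
    return l_c + lam * l_s + mu * l_r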
Further, the original input image of step 44 has dimensions H×W×C and the image I has dimensions C×H×W.
Further, in step five, the depth features extracted under the minimized model parameters are the effective depth features, and the effective depth features obtained by the last separation are used for superpixel generation.
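As an illustration of this parameter update, a minimal sketch of such an optimize function follows, assuming a model that bundles the channel attention module and the atrous spatial pyramid pooling, and the total_loss sketched above; the names are illustrative.

import torch

def optimize(model, inputs, img, n_iter=500, lr=1e-1):
    # Adam updates the model parameters toward the minimum of the overall loss.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        feats = model(inputs)                 # [1, 128, H, W] depth features
        loss = total_loss(feats, img)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(inputs)                  # final effective depth features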
Further, in step six, the size constraints for superpixel generation are calculated as follows:

segment_size = size / n_spix

where segment_size denotes the average size of an ideal superpixel; size denotes the total number of pixels in the image; min_size and max_size are lower and upper thresholds, set in proportion to segment_size, for filtering superpixels that are too small or too large, ensuring that the finally generated superpixels are as uniform in size as possible; and the hyperparameter n_spix = 100.
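A sketch of the label generation and size filtering follows; the 0.06× and 3× factors for min_size and max_size are assumptions (the claims only describe minimum and maximum thresholds proportional to segment_size), and the merge/split pass that enforces them is only outlined.

import torch

def generate_superpixels(effective_feats, n_spix=100):
    # effective_feats: [1, 125, H, W] features F' from the last separation
    labels = torch.argmax(effective_feats, dim=1)[0]   # [H, W] superpixel label index per pixel
    labels_np = labels.cpu().numpy()                   # two-dimensional array on the CPU
    H, W = labels_np.shape
    segment_size = H * W / n_spix                      # average size of an ideal superpixel
    min_size = int(0.06 * segment_size)                # assumed lower threshold
    max_size = int(3.0 * segment_size)                 # assumed upper threshold
    # A connected-component pass would merge regions smaller than min_size into a
    # neighbor and relabel regions larger than max_size to even out superpixel sizes.
    return labels_np, min_size, max_size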
Another object of the present invention is to provide an unsupervised superpixel segmentation system promoted by an attention mechanism in cooperation with atrous spatial pyramid pooling, which includes:
the image preprocessing module is used for combining the RGB channel information of the image with the position information of the pixel points and converting the three-dimensional characteristics into five-dimensional characteristics;
the attention mechanism module is used for constructing a channel attention module by using an attention mechanism;
the atrous spatial pyramid pooling module is used for processing the output of the attention mechanism with atrous spatial pyramid pooling and extracting depth features suitable for superpixel segmentation;
the loss function construction module is used for constructing the loss function: first a clustering loss term; then a spatial smoothing loss term quantifying the differences between adjacent pixels; and finally a reconstruction loss term;
the parameter updating module is used for updating the parameters of the model by setting the learning rate and the iteration times of the Adam optimizer;
and the super-pixel segmentation module is used for converting the argmax function processing result into a two-dimensional array and completing self-adaptive super-pixel segmentation in the CPU according to the limiting condition.
In combination with the technical scheme and the technical problems to be solved, the technical scheme to be protected has the following advantages and positive effects:
First, with respect to the technical problems in the prior art and the difficulty of solving them, and in close combination with the claimed technical scheme and the results and data obtained during research and development, the technical problems solved by the invention and the creative technical effects brought by solving them are analyzed in detail below. The specific description is as follows:
The invention provides an unsupervised superpixel segmentation method promoted by an attention mechanism in cooperation with atrous spatial pyramid pooling. Through image preprocessing, attention in cooperation with atrous spatial pyramid pooling, and loss-function-driven parameter updating with the Adam optimizer, effective depth features are extracted, and superpixel segmentation is converted into a classification problem to realize adaptive superpixel generation. This solves the problems of existing superpixel segmentation methods: insufficient generalization capability, high complexity, difficulty in acquiring effective feature information, and poor adaptive capability.
Second, considering the technical scheme as a whole or from a product perspective, the technical scheme to be protected has the following technical effects and advantages:
image preprocessing: the use of coordinates of pixel locations in an image as additional features helps to improve the performance of the superpixel segmentation method, allowing the model to better capture the relationship between spatial structure and pixels in the image. Coordinate information can be generated on the row and column numbers corresponding to each pixel point in the image, and the generated coordinate information is added to each pixel value of the original image to create a new image containing five channels. The new image with five-dimensional features is input into the depth feature extraction network, the image can be better understood from the features, and the depth features more suitable for super-pixel segmentation are extracted.
Attention mechanism: in the super-pixel segmentation method, the coverage area of local cross-channel interaction is determined by using a method of adaptively selecting the size of a one-dimensional convolution kernel, and the high-efficiency channel attention is realized by using only a small amount of parameters, so that the complexity brought by the super-pixel segmentation performance can be balanced while the super-pixel segmentation performance is improved.
Atrous spatial pyramid pooling: in the superpixel segmentation task, each pixel must be classified to determine which superpixel it belongs to. Conventional convolution layers can only capture information within a limited range, so different scales and context information must be modeled. Atrous spatial pyramid pooling enlarges the receptive field by introducing dilation rates on the input, capturing information over a wider range without increasing the number of parameters. The constructed atrous spatial pyramid pooling can extract depth features suitable for superpixel segmentation from images of varying complexity.
Loss function: in order to ensure that the extracted depth features originate from the image itself, the original input image can be restored by introducing a reconstruction penalty term into the penalty function such that the depth feature extraction is forced to generate an output that matches the original image. The constructed cluster loss term is a clustering cost based on entropy and is similar to the mutual information term with regularized information maximization. Clustering is achieved by maximizing mutual information of pixel points under depth feature representation, and meanwhile complexity of a super-pixel segmentation method is controlled through regularization items, so that overfitting is avoided. The constructed space smoothing loss term is a main priori of an image processing task, can quantify the difference between adjacent pixels, ensures that a super-pixel segmentation method generates smooth output, and simultaneously improves the generalization capability of the super-pixel segmentation method and prevents overfitting.
Parameter updating and superpixel generation: the Adam optimizer is selected for gradient-descent optimization of the model parameters; it designs independent adaptive learning rates for different parameters by computing first- and second-moment estimates of the gradient. Parameter updates are unaffected by gradient rescaling, which alleviates the problem of very noisy gradients and makes updating simpler and more efficient. The final effective depth features are converted by the argmax function into the most likely superpixel label index for each pixel, turning the superpixel segmentation problem into a classification problem; combined with the size constraints, adaptive superpixel segmentation is completed quickly, and a complex clustering algorithm is avoided.
Third, the expected benefits and commercial value after the technical scheme of the invention is put into practice are as follows: the method can be applied to remote sensing image processing and greatly increases processing speed.
The technical scheme of the invention overcomes a technical bias: traditional superpixel segmentation is realized with clustering algorithms, whereas the invention converts superpixel segmentation into a classification task, avoiding clustering altogether. The method is highly adaptable to changes in image size.
Fourth, the significant technical progress brought by the unsupervised superpixel segmentation method provided by the present invention is mainly embodied in the following aspects:
1) Enhanced feature extraction capability:
By combining the attention mechanism with atrous spatial pyramid pooling, the method extracts the depth features of the image more effectively. The attention mechanism makes the model pay more attention to important feature channels, improving the accuracy of feature extraction.
2) Improved super-pixel segmentation quality:
Traditional superpixel segmentation methods may ignore detail information; by combining spatial relationships and depth features, the present method better preserves the details and structure of the image, producing more accurate and finer superpixel segmentation results.
3) Flexibility and adaptability improvement:
the method can flexibly adapt to different types and quality of images by adaptively generating superpixels, which is particularly important when processing diverse image datasets.
4) Parameter optimization and calculation efficiency:
By combining the Adam optimizer with the specific loss function, model parameters are optimized more effectively, reducing computational cost and time. In addition, atrous spatial pyramid pooling enlarges the receptive field while reducing the number of parameters, further improving computational efficiency.
5) Extensive application potential:
the method is not only suitable for standard image processing tasks, but also can be expanded to other fields, such as medical image analysis, machine vision, image recognition and the like, and shows wide application potential.
In general, the method provided by the invention realizes remarkable technical progress in the field of super pixel segmentation through innovative technical combination and optimization strategies thereof, improves segmentation quality, expands application range and improves processing efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of the unsupervised superpixel segmentation method promoted by an attention mechanism in cooperation with atrous spatial pyramid pooling, provided by an embodiment of the present invention;
FIG. 2 is a block diagram of the unsupervised superpixel segmentation system promoted by an attention mechanism in cooperation with atrous spatial pyramid pooling, provided by an embodiment of the present invention;
FIG. 3 is a schematic illustration of a constructed attention mechanism provided by an embodiment of the present invention;
FIG. 4 is a diagram of the constructed atrous spatial pyramid pooling provided by an embodiment of the present invention;
FIG. 5 is a graph of a super-pixel segmentation effect provided by an embodiment of the present invention;
FIG. 6 is a comparison of results provided by an embodiment of the present invention: A, ground-truth labels; B, SLIC; C, Algorithm 2; D, the superpixel segmentation result of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides an unsupervised superpixel segmentation method based on an attention mechanism and atrous spatial pyramid pooling. The method comprises six steps, specifically:
1) Image preprocessing and feature transformation:
Combine the RGB channel information of the image with the position information of the pixel points, converting the three-dimensional features into five-dimensional features.
2) Constructing a channel attention module:
By means of the attention mechanism, the model's attention to important feature channels is enhanced and its response to irrelevant channels is suppressed.
3) Using atrous spatial pyramid pooling:
Atrous spatial pyramid pooling is applied to the output of the attention mechanism to extract depth features suitable for superpixel segmentation.
4) Constructing and applying a loss function:
cluster loss terms, spatial smoothing loss terms (quantifying differences between neighboring pixels), and reconstruction loss terms are constructed.
5) Model parameter updating:
Update the model parameters by setting the learning rate and number of iterations of the Adam optimizer, so as to find the model parameters that minimize the loss function.
6) Super-pixel label generation and adaptive super-pixel segmentation:
The maximum over the channel dimension is obtained using the argmax function, converting the effective depth features into a superpixel label index for each pixel point. The argmax result is converted into a two-dimensional array, and adaptive superpixel segmentation is completed in the CPU according to the size constraints.
The following are two specific embodiments and implementations of the present invention. These embodiments cover specific applications of the method, such as specific types of image datasets (e.g., natural scene images, medical images) and specific application scenarios (e.g., image segmentation, object tracking):
Application example 1: super-pixel segmentation of natural scene images
In this embodiment, the method is applied to natural scene images. By analyzing features such as color and texture in the image, the model can effectively segment the image into groups of superpixels with similar features. This is very useful in image editing and enhancement applications.
Application example 2: medical image analysis
In medical images (such as MRI or CT scans), superpixel segmentation can be used to identify and distinguish between different tissue types or lesions. This approach may help doctors diagnose and plan treatments more accurately.
Specific implementations will involve selecting an appropriate image dataset, adjusting model parameters (e.g., learning rate, number of iterations, configuration of pooling layers, etc.), and post-processing steps such as merging or segmentation of super-pixel groups to improve segmentation accuracy and efficiency.
Aiming at the problems existing in the prior art, the invention provides a method and system for unsupervised superpixel segmentation assisted by an atrous-pyramid collaborative attention mechanism.
As shown in FIG. 1, an embodiment of the present invention provides an unsupervised superpixel segmentation method promoted by an attention mechanism in cooperation with atrous spatial pyramid pooling, comprising the following steps:
Step 1, image preprocessing
To ensure that effective depth features are extracted, the image is preprocessed: the RGB channel information of the image is directly combined with the position information of the pixel points, converting the three-dimensional features into five-dimensional features and ensuring the efficiency of the overall superpixel segmentation method.
C_k = [R_k, G_k, B_k, i_k, j_k]^T

In the above formula, R_k, G_k, B_k denote the color channel values of the k-th pixel converted from integer to floating point; the type conversion mainly ensures the consistency of subsequent data types and the accuracy of the overall method. i_k, j_k denote the row and column numbers of the k-th pixel in the image, and C_k denotes the five-dimensional feature of the k-th pixel, which is used in subsequent processing.
Step 2, attention mechanism
To enable the overall superpixel segmentation method to dynamically adjust pixel attention for different input images, an efficient channel attention module is constructed that performs local cross-channel interaction without dimensionality reduction and adaptively determines the size of its one-dimensional convolution kernel.
First, the preprocessed five-dimensional features are applied to a point-by-point convolution layer; keeping the height and width unchanged, the five-dimensional features are linearly combined and transformed into an eight-dimensional feature output, expressed as χ ∈ R^(C×H×W), to which channel global average pooling is applied directly, where H denotes the image height, W the image width, and C the number of channels of the image, here C = 8.
Channel global average pooling is applied to χ ∈ R^(C×H×W) to obtain the aggregated feature g(χ):

g(χ)_c = (1/(H·W)) Σ_(i=1..H) Σ_(j=1..W) χ_(c,i,j)
A fast one-dimensional convolution with kernel size L is then applied to the aggregated feature, realizing channel attention learning through local cross-channel interaction:

ω = σ(C1D_L(g(χ)))

wherein σ is the sigmoid function, C1D_L is a fast one-dimensional convolution with kernel size L, and ω is the learned channel weight. The kernel size L is calculated automatically from the number of channels:

L = |log₂(C)/a + b/a|_odd

where a = 2, b = 3, and |t|_odd is defined as the odd number closest to t; for C = 8 this gives L = 3.
The result of processing the aggregated feature with the fast one-dimensional convolution of kernel size L and the sigmoid function is denoted ρ ∈ R^(C×1×1); the element-wise product of ρ and χ ∈ R^(C×H×W) gives the attention output f(χ) ∈ R^(C×H×W), whose shape is consistent with the eight-dimensional features fed into the attention mechanism.
Step 3, atrous spatial pyramid pooling
The output of the attention mechanism is processed with atrous spatial pyramid pooling to extract depth features suitable for superpixel segmentation.
The depth features obtained after atrous spatial pyramid pooling are denoted F ∈ R^(C₁×H×W), where H and W remain unchanged and still correspond to the height and width of the image, and C₁ = 128. The intermediate convolution layers are calculated as follows:

y₁ = aspp1(f(χ)); y₂ = aspp2(f(χ)); y₃ = aspp3(f(χ)); y₄ = aspp4(f(χ)); y₅ = avgpool(f(χ))

aspp1 denotes a convolution layer with kernel size 1×1, padding 0, and sampling rate 1; aspp2 a convolution layer with kernel size 3×3, padding 2, and sampling rate 2; aspp3 a convolution layer with kernel size 3×3, padding 4, and sampling rate 4; aspp4 a convolution layer with kernel size 3×3, padding 6, and sampling rate 6. avgpool resizes the input tensor to 15×15 with an adaptive average pooling layer, applies a 1×1 convolution with stride 1 (input channels 8, output channels 16), normalizes each output channel with instance normalization, and finally introduces nonlinearity with a ReLU activation function. The input channels of aspp1, aspp2, aspp3 and aspp4 are all set to 8 and their output channels to 16, and each convolution layer normalizes each output channel with instance normalization. y₁, y₂, y₃, y₄ and y₅ are intermediate tensors.
The depth feature F is calculated as follows:

F = cbr[cat(y₁, y₂, y₃, y₄, y₅)]

wherein cat(·) denotes splicing y₁, y₂, y₃, y₄ and y₅ in the channel dimension into a larger tensor; since the output channels of aspp1, aspp2, aspp3, aspp4 and avgpool are all 16, the spliced tensor has 80 channels. cbr[·] denotes applying a 1×1 convolution to the spliced tensor with the output channels set to 128 as required for superpixel segmentation; as with avgpool, each output channel is normalized with instance normalization, and finally a ReLU activation function introduces nonlinearity.
Step 4, constructing a loss function
A clustering loss term is constructed to encourage deterministic superpixel assignment and superpixels of as uniform a size as possible; a spatial smoothing loss term is introduced to quantify the differences between adjacent pixels; and a reconstruction loss term is constructed to ensure that the extracted depth features originate from the image itself.
The overall loss function consists of three parts: a clustering loss term, a spatial smoothing loss term, and a reconstruction loss term:

l_O = l_C + λ·l_S + μ·l_R

wherein l_O denotes the overall loss function; l_C the clustering loss term; l_S the spatial smoothing loss term; l_R the reconstruction loss term; λ and μ are constant coefficients with λ = 2 and μ = 10.
The clustering loss term is calculated as follows:

l_C = −(1/(H·W)) Σ_(i,j) Σ_n p_(i,j),n · log p_(i,j),n + β Σ_n p̄_n · log p̄_n

wherein β = 2; p̄_n denotes the average of the class probability vectors over all pixels; and p_(i,j),n denotes the class probability vector of the pixel in row i and column j, calculated as follows:

p_(i,j),n = softmax(F′_(i,j))_n

First, the first three channels of F in the channel dimension are separated for the subsequent reconstruction loss computation, and the features of the remaining 125 channels are denoted F′, so that n ∈ [0, 124]; softmax(F′_(i,j))_n denotes converting the channel dimension of F′ with the softmax function into the class probability of the corresponding pixel.
The spatial smoothing loss term is calculated in two parts, the smoothness losses in the x and y directions; the original input image is dimension-rearranged to obtain the image I (the original input image has dimensions H×W×C and the image I has dimensions C×H×W). The term is defined by the absolute value of the probability difference weighted by an exponential function of the squared gradient of the image I, averaged over all pixels:

l_S = (1/(H·W)) Σ_(i,j) Σ_n ( |Δp_(x,ij),n| · exp(−(ΔI_(x,ij))²) + |Δp_(y,ij),n| · exp(−(ΔI_(y,ij))²) )

wherein n indicates that the channels are consistent with those of the clustering loss; Δp_(x,ij) denotes the pixel probability difference in the x direction; ΔI_(x,ij) the pixel intensity difference in the x direction; Δp_(y,ij) the pixel probability difference in the y direction; and ΔI_(y,ij) the pixel intensity difference in the y direction. These are calculated as follows:

Δp_(x,ij) = p_(i,j+1) − p_(i,j);  ΔI_(x,ij) = I_(i,j+1) − I_(i,j)
Δp_(y,ij) = p_(i+1,j) − p_(i,j);  ΔI_(y,ij) = I_(i+1,j) − I_(i,j)
the reconstruction loss term is calculated as follows:
wherein the first three channels separated from the depth features are used for image reconstruction and are represented as||·|| 2 Indicating the choice of 2-norms.
Step 5, parameter updating and superpixel generation
The model parameters are updated by setting the learning rate and number of iterations of the Adam optimizer, i.e., the model parameters that minimize the loss function are sought, and the depth features extracted under these parameters are regarded as the effective depth features. Since different depth features are obtained at each iteration, the effective depth features F′ obtained by the last separation are used for superpixel generation. The maximum over the channel dimension is obtained with the argmax function, so that each pixel point corresponds to one most likely superpixel label index. The argmax result is converted into a two-dimensional array, and adaptive superpixel segmentation is completed in the CPU according to the size constraints, which are calculated as follows:

segment_size = size / n_spix

where segment_size denotes the average size of an ideal superpixel; size denotes the total number of pixels in the image; min_size and max_size are lower and upper thresholds, set in proportion to segment_size, for filtering superpixels that are too small or too large, ensuring that the finally generated superpixels are as uniform in size as possible; and the hyperparameter n_spix = 100.
As shown in FIG. 2, an embodiment of the present invention provides an unsupervised superpixel segmentation system promoted by an attention mechanism in cooperation with atrous spatial pyramid pooling, for carrying out the above method, the system comprising:
the image preprocessing module is used for combining the RGB channel information of the image with the position information of the pixel points and converting the three-dimensional characteristics into five-dimensional characteristics;
the attention mechanism module is used for constructing a channel attention module by using an attention mechanism;
the atrous spatial pyramid pooling module is used for processing the output of the attention mechanism with atrous spatial pyramid pooling and extracting depth features suitable for superpixel segmentation;
the loss function construction module is used for constructing the loss function: first a clustering loss term; then a spatial smoothing loss term quantifying the differences between adjacent pixels; and finally a reconstruction loss term;
The parameter updating module is used for updating the parameters of the model by setting the learning rate and the iteration times of the Adam optimizer;
and the super-pixel segmentation module is used for converting the argmax function processing result into a two-dimensional array and completing self-adaptive super-pixel segmentation in the CPU according to the limiting condition.
Examples:
the deep learning framework used in this embodiment is PyTorch and the programming language is Python.
Step 1, the acquired Gaofen-2 (GF-2) 0.8 m resolution color image is assigned to a variable img; whether CUDA is available is checked, and if so, the subsequent code runs on the GPU, otherwise on the CPU. The acquired image is passed to the written image preprocessing function; through dimension rearrangement, conversion of the data type to floating point, and a None operation to add a batch dimension, a tensor of shape [1, 3, 1003, 773] is obtained;
the last two dimensions of the tensor are read to obtain the height and width of the image; height and width sequences are generated with the torch.arange function, converted into two coordinate grids with the torch.meshgrid function, and stacked with the torch.stack function; the image is concatenated with the coordinate grids and normalized to obtain a tensor of shape [1, 5, 1003, 773].
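A hedged PyTorch sketch of this preprocessing step is given below; the per-channel standardization at the end is one plausible reading of 'normalized', which the embodiment does not spell out.

```python
import torch

def preprocess(img):
    """img: NumPy array of shape [H, W, 3]; returns a [1, 5, H, W] tensor."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.from_numpy(img).permute(2, 0, 1).float()[None].to(device)  # [1,3,H,W]
    h, w = x.shape[-2:]
    ys = torch.arange(h, device=device, dtype=torch.float32)
    xs = torch.arange(w, device=device, dtype=torch.float32)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"))[None]      # [1,2,H,W]
    feat = torch.cat([x, grid], dim=1)                                   # [1,5,H,W]
    # standardize each channel (an assumed normalization scheme)
    mean = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.std(dim=(2, 3), keepdim=True)
    return (feat - mean) / (std + 1e-8)
```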
Step 2, the image preprocessing result is applied to a point-wise convolution layer to obtain a tensor of shape [1, 8, 1003, 773]; channel global average pooling is then used to obtain the aggregated feature; this result is processed by a fast one-dimensional convolution with kernel size L = 3 and a sigmoid function, and multiplied element-wise with the tensor from the point-wise convolution to obtain a processing result tensor of shape [1, 8, 1003, 773];
the weights of the point-wise convolution and the fast one-dimensional convolution are initialized with the Kaiming initialization method, and the biases are initialized with the constant 0; for the instance normalization layer, the normalization weights are initialized with the constant 1, ensuring proper initial parameter values during training.
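A sketch of this channel attention step, assuming an ECA-style design; the class and attribute names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, in_ch=5, out_ch=8, k=3):
        super().__init__()
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1)       # point-wise convolution
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        nn.init.kaiming_normal_(self.pw.weight)                 # Kaiming weight init
        nn.init.zeros_(self.pw.bias)                            # constant-0 bias init
        nn.init.kaiming_normal_(self.conv1d.weight)

    def forward(self, x):
        x = self.pw(x)                                 # [B,8,H,W]
        g = x.mean(dim=(2, 3))                         # channel global average pooling
        w = torch.sigmoid(self.conv1d(g[:, None, :]))  # fast 1-D conv + sigmoid, [B,1,8]
        return x * w.transpose(1, 2)[..., None]        # element-wise reweighting
```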
Step 3, the tensor obtained from the attention mechanism processing is applied to hole space pyramid pooling (shown in fig. 4): it passes through the aspp1 convolution layer with kernel size 1×1, padding 0 and sampling rate 1; the aspp2 convolution layer with kernel size 3×3, padding 2 and sampling rate 2; the aspp3 convolution layer with kernel size 3×3, padding 4 and sampling rate 4; the aspp4 convolution layer with kernel size 3×3, padding 6 and sampling rate 6; and an avgpool branch that adjusts the size of the input tensor to 15×15 by adaptive average pooling, applies a 1×1 convolution with stride 1, and introduces nonlinearity with a ReLU activation function;
the output channels of all these operations are set to 16, and after splicing a tensor of shape [1, 80, 1003, 773] is obtained; the spliced tensor is processed by a convolution layer with kernel size 1×1, an instance normalization layer and a ReLU to obtain a depth feature tensor with final shape [1, 128, 1003, 773]; instance normalization normalizes each output channel, and the ReLU activation function finally introduces nonlinearity; the weights and biases of the convolution layers and instance normalization layers in the hole space pyramid pooling are initialized in the same way as in the attention mechanism.
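A corresponding sketch of the hole space pyramid pooling block; the bilinear upsampling of the pooled branch back to the input resolution is an assumption needed for the channel-dimension splice to yield the [1, 80, 1003, 773] tensor. PyTorch's affine instance normalization already initializes its weights to 1 and biases to 0, matching the initialization described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=8, branch_ch=16, out_ch=128):
        super().__init__()
        def cbr(k, pad, rate):  # conv -> instance norm -> ReLU
            return nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, k, padding=pad, dilation=rate),
                nn.InstanceNorm2d(branch_ch, affine=True), nn.ReLU())
        self.aspp1 = cbr(1, 0, 1)
        self.aspp2 = cbr(3, 2, 2)
        self.aspp3 = cbr(3, 4, 4)
        self.aspp4 = cbr(3, 6, 6)
        self.pool = nn.Sequential(                     # avgpool branch
            nn.AdaptiveAvgPool2d(15),
            nn.Conv2d(in_ch, branch_ch, 1),
            nn.InstanceNorm2d(branch_ch, affine=True), nn.ReLU())
        self.out = nn.Sequential(                      # cbr over the 80-channel splice
            nn.Conv2d(5 * branch_ch, out_ch, 1),
            nn.InstanceNorm2d(out_ch, affine=True), nn.ReLU())

    def forward(self, x):
        h, w = x.shape[-2:]
        y5 = F.interpolate(self.pool(x), size=(h, w),
                           mode="bilinear", align_corners=False)
        ys = [self.aspp1(x), self.aspp2(x), self.aspp3(x), self.aspp4(x), y5]
        return self.out(torch.cat(ys, dim=1))          # [B,128,H,W]
```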
Step 4, the extracted depth features are separated into a tensor of shape [1, 3, 1003, 773] used to calculate the reconstruction loss term, and another tensor of shape [1, 125, 1003, 773] used to calculate the clustering loss term and the spatial smoothing loss term; the defined optimize function iterates, selecting the Adam optimizer to update the model parameters and minimize the loss function; the number of iterations is set to 500 and the learning rate to 1e-1; when the number of iterations is reached, the argmax function is applied to obtain the maximum value along the channel dimension, so that each pixel point corresponds to the most probable superpixel label index;
after the label index is obtained, the channel dimension is removed with the squeeze function, gradients are no longer computed, and the result is transferred from the GPU to the CPU for processing; the PyTorch tensor is converted into a NumPy array for subsequent superpixel segmentation, and finally an _enforce_label_connectivity_cython function is constructed to adaptively realize superpixel segmentation according to the defined superpixel size conditions.
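Putting steps 4 and 5 together, a hedged sketch of the optimization loop follows; model stands for the chained attention and pyramid-pooling sketches above, loss_fn for the three-term loss (a sketch of which appears after claim 7 below), and both names are illustrative.

```python
import torch

def fit_and_label(model, loss_fn, feat, image, iters=500, lr=1e-1):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        deep = model(feat)                    # [1,128,H,W] depth features
        loss_fn(deep, image).backward()
        opt.step()
    with torch.no_grad():                     # last pass: effective depth features
        deep = model(feat)
        labels = deep[:, 3:].argmax(dim=1)    # most probable label per pixel
    return labels.squeeze(0).cpu().numpy()    # 2-D array for connectivity enforcement
```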
In this embodiment, a face image obtained from the scipy library is processed; fig. 5 shows the superpixel segmentation effect of this embodiment.
As can be seen from the above embodiment, the invention realizes unsupervised superpixel segmentation, and the segmentation accuracy in testing reaches 94.7%. Without ground-truth labels or a preset superpixel number, the proposed method flexibly and efficiently extracts effective depth features and adaptively generates superpixels according to the characteristics of the image. Under different image sizes and image complexities, unsupervised superpixel segmentation can be realized quickly and accurately. The method has the advantages of low complexity, strong adaptive capability and strong generalization capability, and provides effective support for improving image processing efficiency and accuracy.
When applied to classification tasks, most researchers choose graph convolution to extract the features of irregularly distributed objects, because conventional convolution cannot do so effectively; however, obtaining the appropriate graph structure required for graph convolution is difficult. The proposed superpixel segmentation method can generate superpixels on intermediate features during network training to adaptively produce homogeneous regions, from which a graph structure is obtained: spatial descriptors generated from the regions serve as graph nodes, and an adjacency matrix is obtained by considering the relations between the descriptors, thereby meeting the requirements of graph convolution.
When applied to segmentation tasks, conventional per-pixel segmentation methods tend to cause over-segmentation when the processed image is too large, which increases computational overhead and makes the segmentation algorithm susceptible to noise. These problems are well addressed by the provided superpixel algorithm: superpixels reduce the number of units in the image, so compared with segmenting each pixel, segmenting superpixels significantly reduces the computational cost while maintaining the image structure; since superpixels tend to merge similar pixels together, they provide spatial consistency over a larger range, making the segmentation results more continuous while reducing over-segmentation; and by aggregating local similarity, the influence of any single pixel is reduced, so the various kinds of noise present in the image are suppressed to some extent.
In the experiments, the embodiment of the invention was compared with the published SLIC algorithm and a superpixel segmentation algorithm based on a convolutional neural network with regularized information maximization (hereinafter referred to as algorithm 2), with the related parameters kept consistent with those used by the algorithms' developers. Fig. 6 shows the ground-truth labels and the superpixel segmentation results generated by each algorithm: A, ground-truth labels; B, SLIC; C, algorithm 2; D, the superpixel segmentation result of the invention.
The results show that the SLIC algorithm and algorithm 2 adapt poorly to superpixel segmentation of large-size images. The SLIC algorithm exhibits a certain under-segmentation problem and poor boundary adherence, although it runs more efficiently than the proposed segmentation method; algorithm 2 suffers from over-segmentation and cannot generate good superpixels at image boundaries, and although it is unsupervised like the segmentation method of the invention, its time complexity is about 13 times that of the proposed method. The segmentation accuracy of both comparison algorithms is lower than that of the proposed superpixel segmentation method.
The foregoing is merely a specific embodiment of the present invention, and the protection scope of the invention is not limited thereto; any modifications, equivalent replacements, improvements and alternatives that can readily be conceived by those skilled in the art within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A method for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling, characterized in that: firstly, the image is preprocessed to introduce the spatial relations among pixel points, and the attention mechanism is applied to the superpixel segmentation task for the first time, so that the model strengthens its attention to important feature channels and suppresses its response to irrelevant channels; hole space pyramid pooling is used to expand the receptive field while reducing the number of parameters; an optimizer, combined with the constructed loss function, realizes parameter updating and extraction of the final effective depth features; the argmax function is applied to the extracted effective depth features to convert the superpixel segmentation task into a classification problem; and final adaptive superpixel generation is realized by adding size limiting conditions.
2. The method for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling of claim 1, comprising the following steps:
step one, combining the RGB channel information of the image with the position information of the pixel points, and converting the three-dimensional features into five-dimensional features;
step two, constructing a channel attention module using an attention mechanism;
step three, processing the result of the attention mechanism with hole space pyramid pooling, and extracting depth features suitable for superpixel segmentation;
step four, constructing the loss function: first constructing a clustering loss term; then quantifying the difference between adjacent pixels using a spatial smoothing loss term; and finally constructing a reconstruction loss term;
step five, updating the parameters of the model by setting the learning rate and the number of iterations of the Adam optimizer, and searching for the model parameters that minimize the loss function;
step six, obtaining the maximum value along the channel dimension using the argmax function, so that the final effective depth features yield the most probable superpixel label index for each pixel point; converting the argmax result into a two-dimensional array and completing adaptive superpixel segmentation on the CPU according to the limiting conditions.
3. The method for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling of claim 2, wherein step one performs image preprocessing using the following formula:
C_k = [R_k, G_k, B_k, i_k, j_k]^T

wherein R_k, G_k, B_k are the color channel values of the k-th pixel point in the image converted from integer to floating-point type; i_k, j_k are the row and column numbers of the k-th pixel point in the image; and C_k is the five-dimensional feature of the k-th pixel point used in subsequent processing.
4. The method for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling of claim 2, wherein the construction of step two specifically comprises:
step 21, applying the preprocessed five-dimensional features to a point-wise convolution layer; keeping the height and width unchanged, the five-dimensional features are linearly combined and transformed to produce an eight-dimensional feature output;
step 22, computing channel global average pooling of χ ∈ R^(C×H×W) to obtain the aggregated feature g(χ):

g(χ) = (1 / (H·W)) Σ_{i=1..H} Σ_{j=1..W} χ_{:,i,j}
Step 23, calculating a fast one-dimensional convolution of the aggregated features with kernel size L:
ω=σ(C1D L (g(χ)))
wherein sigma is a sigmoid function, C1D L The method is characterized in that the method is a quick one-dimensional convolution with the kernel size L, and omega is a learned channel weight; automatically calculating the kernel size L according to the number of channels:
Where a=2, b=3, |t| odd Defined as the odd number closest to t.
step 24, the result of processing the aggregated feature with the fast one-dimensional convolution of kernel size L and the sigmoid function is denoted ρ ∈ R^(C×1×1), and the result of the attention mechanism processing, obtained by the element-wise product of ρ with χ ∈ R^(C×H×W), is denoted f(χ) ∈ R^(C×H×W), whose shape is consistent with the eight-dimensional features input to the attention mechanism.
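The kernel-size formula of claim 4 can be checked with a short sketch; with C = 8 channels it yields L = 3, matching the embodiment.

```python
import math

def kernel_size(C, a=2, b=3):
    t = math.log2(C) / a + b / a
    return 2 * int(t // 2) + 1   # odd number closest to t (ties round up)

assert kernel_size(8) == 3
```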
5. The method for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling of claim 4, wherein the eight-dimensional feature output of step 21 is denoted χ ∈ R^(C×H×W) and is applied directly to channel global average pooling, where H is the image height, W the image width, and C the number of channels, with C = 8.
6. The method for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling of claim 2, wherein the hole space pyramid pooling processing of step three specifically comprises:
step 31, denoting the depth features obtained after hole space pyramid pooling as F ∈ R^(C_1×H×W), where H and W remain the height and width of the image and C_1 = 128; the intermediate convolution layers are calculated as follows:

y_1 = aspp1(f(χ)), y_2 = aspp2(f(χ)), y_3 = aspp3(f(χ)), y_4 = aspp4(f(χ)), y_5 = avgpool(f(χ))

wherein aspp1 denotes a convolution layer with kernel size 1×1, padding 0 and sampling rate 1; aspp2 a convolution layer with kernel size 3×3, padding 2 and sampling rate 2; aspp3 a convolution layer with kernel size 3×3, padding 4 and sampling rate 4; aspp4 a convolution layer with kernel size 3×3, padding 6 and sampling rate 6; avgpool adjusts the size of the input tensor to 15×15 with an adaptive average pooling layer, applies a 1×1 convolution with stride 1, input channel 8 and output channel 16, normalizes each output channel with instance normalization, and finally introduces nonlinearity with a ReLU activation function; y_1, y_2, y_3, y_4 and y_5 are intermediate tensors;
step 32, the depth feature F is calculated as follows:

F = cbr[cat(y_1, y_2, y_3, y_4, y_5)]

wherein cat(·) denotes splicing y_1, y_2, y_3, y_4 and y_5 in the channel dimension to form a larger tensor; since the output channels of aspp1, aspp2, aspp3, aspp4 and avgpool are all 16, the spliced tensor has 80 channels; cbr[·] denotes applying a 1×1 convolution to the spliced tensor, with the output channel set to 128 to fit the superpixel segmentation requirement, normalizing each output channel with instance normalization, and finally applying a ReLU activation function to introduce nonlinearity.
7. The method for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling of claim 2, wherein constructing the loss function in step four specifically comprises:
step 41, the overall loss function consists of three parts: a clustering loss term, a spatial smoothing loss term and a reconstruction loss term:

L_O = L_C + λ·L_S + φ·L_R

wherein L_O is the overall loss function; L_C the clustering loss term; L_S the spatial smoothing loss term; L_R the reconstruction loss term; λ and φ are constant coefficients, with λ = 2.
step 42, the clustering loss term is calculated as follows:

L_C = (1 / (H·W)) Σ_{i,j} H(p_{(i,j)}) − β·H(p̄)

wherein β = 2; H(·) denotes the entropy of a probability vector, H(q) = −Σ_n q_n log q_n; p̄ is the average of the class probability vectors over all pixels; and p_{(i,j),n} is the class probability vector of the pixel located in row i and column j, calculated as follows:

p_{(i,j),n} = softmax(F'_{(i,j)})_n

step 43, the first three channels of F in the channel dimension are separated for the calculation of the reconstruction loss term, and the features of the remaining 125 channels are denoted F', with n ∈ [0, 124] an integer; softmax(F') denotes converting the channel dimension of F' into the class probabilities of the corresponding pixel using the softmax function;
step 44, the spatial smoothing loss term is calculated in two parts, the smoothness loss in the x direction and in the y direction; the original input image is dimension-rearranged to obtain an image I, and the term is defined by the absolute value of the probability difference weighted by an exponential function of the squared gradient of the image I, averaged over all pixels:

L_S = (1 / (H·W)) Σ_{i,j} Σ_n ( |Δp_{x,ij,n}| · exp(−(ΔI_{x,ij})²) + |Δp_{y,ij,n}| · exp(−(ΔI_{y,ij})²) )

wherein n indexes the channels, consistent with the clustering loss; Δp_{x,ij} denotes the pixel probability difference in the x direction; ΔI_{x,ij} the pixel intensity difference in the x direction; Δp_{y,ij} the pixel probability difference in the y direction; and ΔI_{y,ij} the pixel intensity difference in the y direction, i.e. the differences between each pixel and its neighbor:

Δp_{x,ij} = p_{(i,j)} − p_{(i,j−1)}, ΔI_{x,ij} = I_{i,j} − I_{i,j−1}, Δp_{y,ij} = p_{(i,j)} − p_{(i−1,j)}, ΔI_{y,ij} = I_{i,j} − I_{i−1,j}
step 45, the reconstruction loss term is calculated as follows:

L_R = || I − Î ||_2

wherein Î denotes the image reconstructed from the first three channels separated from the depth features, which are used for image reconstruction while the clustering loss term and the spatial smoothing loss term are computed; ||·||_2 denotes the 2-norm; in step 44, the original input image has dimensions H×W×C and the image I has dimensions C×H×W.
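A hedged sketch of the three-term loss of claim 7 follows; the entropy form of the clustering term and the unit bandwidth of the exponential weighting are reconstructions where the published formulas were not fully legible, and the value of φ (phi) is an assumption.

```python
import torch

def total_loss(deep, image, lam=2.0, phi=1.0, beta=2.0):  # phi assumed
    recon, logits = deep[:, :3], deep[:, 3:]   # 3 reconstruction + 125 class channels
    p = logits.softmax(dim=1)                  # class probabilities per pixel

    # clustering term: mean pixel entropy minus beta * entropy of the mean
    p_bar = p.mean(dim=(2, 3))
    l_c = -(p * torch.log(p + 1e-8)).sum(1).mean() \
          + beta * (p_bar * torch.log(p_bar + 1e-8)).sum()

    # spatial smoothing term: probability gradients damped by image gradients
    di_x = (image[:, :, :, 1:] - image[:, :, :, :-1]).pow(2).sum(1, keepdim=True)
    di_y = (image[:, :, 1:, :] - image[:, :, :-1, :]).pow(2).sum(1, keepdim=True)
    dp_x = (p[:, :, :, 1:] - p[:, :, :, :-1]).abs()
    dp_y = (p[:, :, 1:, :] - p[:, :, :-1, :]).abs()
    l_s = (dp_x * torch.exp(-di_x)).mean() + (dp_y * torch.exp(-di_y)).mean()

    # reconstruction term: 2-norm between the input image and the first three channels
    l_r = torch.norm(image - recon, p=2)
    return l_c + lam * l_s + phi * l_r
```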
8. The method for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling of claim 2, wherein in step five the depth features extracted under the minimized model parameters are the effective depth features, and the effective depth features obtained by the last separation are used for superpixel generation.
9. The method for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling of claim 2, wherein the size limiting condition for superpixel generation in step six is calculated as follows:

segment_size = size / n_spix

where segment_size represents the average size of an ideal superpixel; size represents the total number of pixels in the image; min_size and max_size are thresholds bounding the minimum and maximum superpixel sizes, respectively, used for filtering out superpixels that are too small or too large so that the finally generated superpixels are as uniform in size as possible; the hyperparameter n_spix = 100.
10. A system for promoting unsupervised superpixel segmentation by an attention mechanism cooperating with hole space pyramid pooling, implementing the method of any one of claims 1-9, wherein the system comprises:
The image preprocessing module is used for combining the RGB channel information of the image with the position information of the pixel points and converting the three-dimensional characteristics into five-dimensional characteristics;
the attention mechanism module is used for constructing a channel attention module by using an attention mechanism;
the cavity space pyramid pooling module is used for processing the result of the attention mechanism by utilizing cavity space pyramid pooling and extracting depth characteristics suitable for super-pixel segmentation;
the loss function construction module is used for constructing the loss function: first constructing a clustering loss term; then quantifying the difference between adjacent pixels using a spatial smoothing loss term; and finally constructing a reconstruction loss term;
the parameter updating module is used for updating the parameters of the model by setting the learning rate and the iteration times of the Adam optimizer;
and the super-pixel segmentation module is used for converting the argmax function processing result into a two-dimensional array and completing self-adaptive super-pixel segmentation in the CPU according to the limiting condition.