
WO2024140642A1 - Image processing method and apparatus, and computing device - Google Patents

Image processing method and apparatus, and computing device

Info

Publication number
WO2024140642A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
feature
image
category
fusion
Prior art date
Application number
PCT/CN2023/141801
Other languages
French (fr)
Chinese (zh)
Inventor
唐泉
刘传建
韩凯
王云鹤
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024140642A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • CNN convolutional neural networks
  • VGG visual geometry group
  • ResNet residual neural network
  • the decoder is mainly used to perform pixel-level classification tasks based on the feature map extracted by the encoder to achieve semantic segmentation. It can enhance the expressiveness of the features and restore the resolution of the feature map, and finally produce a segmentation result consistent with the resolution of the input image.
  • encoders usually use existing, pre-trained image classification models, so an excellent decoder design can effectively improve the accuracy of semantic segmentation. Therefore, how to provide an excellent decoder is a technical problem that needs to be solved urgently.
  • obtaining a vector representation of the category to which the pixels contained in the feature map obtained by the (i-1)th fusion belong specifically includes: obtaining the mask of the category to which the pixels contained in the target feature map belong, to obtain a category mask set, the target feature map being the feature map obtained by the (i-1)th fusion; obtaining a vector representation of the category to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map.
  • the vector representation of the category to which the pixels contained in the feature map belong can be obtained, thereby facilitating the subsequent "point-class" fusion.
  • a second image is obtained based on a feature map obtained by the last fusion, specifically including: upsampling k third feature maps to obtain k fourth feature maps, the k third feature maps being feature maps obtained by each fusion except the feature map obtained by the last fusion, and at least one of the low-scale feature maps of the two feature maps fused for the first time, and the scale of each fourth feature map is the same as the scale of the feature map obtained by the last fusion; the feature map obtained by the last fusion and the k fourth feature maps are spliced in the channel dimension to obtain a fifth feature map; and the second image is obtained based on the fifth feature map.
  • upsampling is performed on the k third feature maps, specifically including: for any one of the k third feature maps, based on a preset upsampling multiple, expanding the number of channels of any one of the k third feature maps to obtain a sixth feature map; splicing any one of the feature maps and the sixth feature map in the channel dimension, and rearranging pixels of the spliced feature map to obtain a fourth feature map corresponding to any one of the feature maps.
  • obtaining a vector representation of a category to which pixels contained in a first feature map of a first scale belong specifically includes: obtaining a mask of the category to which the pixels contained in the first feature map of the first scale belong, to obtain a category mask set; and obtaining a vector representation of the category to which the pixels contained in the first feature map of the first scale belong based on a weight of each mask in the category mask set and the first feature map of the first scale.
  • when the processing module obtains the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)th fusion belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the target feature map belong, so as to obtain a category mask set, where the target feature map is the feature map obtained by the (i-1)th fusion; and obtain the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map.
  • when the processing module obtains the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map, it is specifically used to: obtain the transposed matrix of the features of the pixels contained in the target feature map; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
  • the present application provides an image processing device, comprising: a communication module for acquiring a first image; a processing module for performing feature extraction on the first image to obtain a plurality of first feature maps of different scales; the processing module is also used to acquire a vector representation of a category to which pixels contained in the first feature map of the first scale belong; the processing module is also used to fuse the vector representation and the first feature map of a second scale to obtain a second feature map, wherein the second scale is larger than the first scale; the processing module is also used to obtain a second image based on the second feature map, wherein the second image is used to characterize the category to which pixels contained in the first image belong.
  • when the processing module obtains the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the first feature map of the first scale belong, so as to obtain a category mask set; and obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale.
  • when the processing module fuses the vector representation and the first feature map of the second scale to obtain the second feature map, it is specifically used to: obtain a key vector based on the vector representation and the first weight matrix, and obtain a value vector based on the vector representation and the second weight matrix; obtain a query vector based on the first feature map of the second scale and the third weight matrix; perform attention calculation based on the key vector, the value vector and the query vector to obtain a third feature map; and fuse the third feature map with the first feature map of the second scale to obtain the second feature map.
  • when the processing module obtains the second image based on the second feature map, it is specifically used to: upsample the first feature map of the first scale to obtain a fourth feature map, the scale of the fourth feature map being the same as the scale of the second feature map; splice the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.
  • the present application provides a computing device comprising: at least one memory for storing programs; and at least one processor for executing the programs stored in the memory; wherein, when the program stored in the memory is executed, the processor is used to execute the method described in the first aspect or the second aspect.
  • the present application provides a computer-readable storage medium, which stores a computer program.
  • the computer program runs on a processor, the processor executes the method described in the first aspect or the second aspect.
  • FIG1 is a schematic diagram of the structure of an image semantic segmentation neural network model provided in an embodiment of the present application.
  • FIG2 is a schematic diagram of the structure of the category feature calculation module shown in FIG1;
  • FIG3 is a schematic diagram of the structure of the category feature fusion module shown in FIG1;
  • FIG4 is a schematic diagram of the structure of another image semantic segmentation neural network model provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of another image semantic segmentation neural network model provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of a flow chart of an image processing method provided in an embodiment of the present application.
  • the category feature calculation module 121 is mainly used to calculate the vector representation of a preset number of semantic segmentation categories based on the low-scale feature map, or calculate the vector representation of the category of each target contained in the original image.
  • the category feature calculation module 121 may include a normalization layer 1211, a linear mapping layer 1212, a linear mapping layer 1213 and a category feature calculation layer 1214.
  • the linear mapping layer 1221 is mainly used to calculate the key value K (i.e., the key sequence, also called the "key vector") required by MHA 1225 based on the vector representation of each category (i.e., the category features): the key value K is obtained by multiplying the category features calculated by the category feature calculation module 121 with the mapping matrix W K, which can be preset.
  • the linear mapping layer 1221 can be implemented by, but not limited to, a fully connected layer or a convolutional layer.
  • the attention calculation can be written as Attention(Q, K, V) = softmax(QK^T / d)·V, where Q is the query value, K is the key value, V is the value, and d is the regularization factor.
  • Q represents a high-resolution matrix
  • K and V both represent low-resolution matrices
  • after computing QK^T, a high-resolution matrix is obtained; multiplying that matrix with V again yields a high-resolution matrix, thereby improving the resolution of the resulting feature map.
  • if K and V were instead calculated from the large-scale feature map and Q from the category features, Q would represent a low-resolution matrix while K and V would both represent high-resolution matrices.
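As an illustration only, a minimal PyTorch-style sketch of this attention calculation follows; the batched tensor layout and taking the regularization factor d as the square root of the channel dimension are assumptions, not fixed by the text.

```python
import torch

def attention(q, k, v):
    # q: [B, Nq, C] queries; k, v: [B, Nk, C] keys and values
    d = q.shape[-1] ** 0.5                    # regularization factor (assumed sqrt(C))
    scores = q @ k.transpose(1, 2) / d        # [B, Nq, Nk]
    return torch.softmax(scores, dim=-1) @ v  # [B, Nq, C]
```

Note that the output has as many rows as Q, which is why computing Q from the high-resolution feature map yields a high-resolution result.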
  • for any multi-scale feature fusion module in the decoder other than the first one, the small-scale feature map that it needs to fuse is the feature map output by the multi-scale feature fusion module at the previous level.
  • the small-scale feature map that the multi-scale feature fusion module (n-2) needs to fuse is the feature map output by the multi-scale feature fusion module (n-3).
  • FIG6 shows a flow chart of an image processing method. It should be understood that the method can be executed by any device, equipment, platform, or device cluster with computing and processing capabilities. For example, it can be executed by a mobile phone, a computer, a vehicle terminal, a cloud server, etc. As shown in FIG6, the image processing method may include the following steps:
  • the k third feature maps, namely the feature maps obtained by each fusion except the last one, together with the low-scale feature map of the two feature maps fused the first time, can be upsampled to obtain k fourth feature maps.
  • the scale of each fourth feature map is the same as the scale of the feature map obtained by the last fusion; then, the feature map obtained by the last fusion and the k fourth feature maps can be spliced in the channel dimension to obtain the fifth feature map.
  • the second image is obtained by processing the fifth feature map, such as classifying it through a classifier, etc. This improves the effect of semantic segmentation.
  • the number of channels of any one of the k feature maps can be expanded based on a preset upsampling multiple to obtain a sixth feature map. Then, that feature map and the obtained sixth feature map are spliced in the channel dimension. Finally, the pixels of the spliced feature map are rearranged to obtain the fourth feature map corresponding to that feature map. See the above description of upsampling for details.
  • S702 Perform feature extraction on the first image to obtain a plurality of first feature maps of different scales.
  • S704 Fuse the vector representation and the first feature map of the second scale to obtain a second feature map, where the second scale is larger than the first scale.
  • the features of the categories to which the pixels contained in the feature map belong (i.e., the vector representation) are first obtained from the small-scale feature map, and the features of the category to which each pixel belongs are then fused with the large-scale feature map, thereby realizing the fusion of "points and classes" (i.e., the fusion of pixels and the categories to which the pixels belong), avoiding the influence of noise points or abnormal point features in the feature map, and improving the accuracy of semantic segmentation of the image.
  • the methods shown in FIG. 6 or FIG. 7 can be applied to, but are not limited to, tasks such as autonomous driving, image editing, satellite remote sensing, medical diagnosis, augmented/virtual reality, and the like.
  • an embodiment of the present application provides an image processing device.
  • FIG8 shows a schematic diagram of the structure of an image processing device.
  • the image processing device 800 may include: a communication module 801 and a processing module 802.
  • the communication module 801 is used to obtain a first image;
  • the processing module 802 is used to extract features from the first image to obtain a plurality of first feature maps of different scales;
  • the processing module 802 is also used to fuse n first feature maps among the plurality of first feature maps, wherein, at the i-th fusion, a vector representation of the category to which the pixels contained in the feature map obtained by the (i-1)th fusion belong is obtained, and the vector representation and the first feature map required to be fused for the i-th time are fused to obtain the i-th fused feature map, 2≤i≤n-1;
  • the processing module 802 is also used to obtain a second image based on the feature map obtained by the last fusion, and the second image is used to characterize the category to which the pixels contained in the first image belong.
  • when the processing module 802 obtains the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)th fusion belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the target feature map belong, so as to obtain a category mask set, where the target feature map is the feature map obtained by the (i-1)th fusion; and, based on the weight of each mask in the category mask set and the target feature map, obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
  • when the processing module 802 obtains a vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map, it is specifically used to: obtain the transposed matrix of the features of the pixels contained in the target feature map; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
  • when the processing module 802 performs upsampling on the k third feature maps, it is specifically used to: for any one of the k third feature maps, expand the number of channels of that feature map based on a preset upsampling multiple to obtain a sixth feature map; splice that feature map and the sixth feature map in the channel dimension; and rearrange the pixels of the spliced feature map to obtain the fourth feature map corresponding to that feature map.
  • the above-mentioned device is used to execute the method in the above-mentioned embodiment.
  • the implementation principle and technical effect of the corresponding program module in the device are similar to those described in the above-mentioned method.
  • the working process of the device can refer to the corresponding process in the above-mentioned method, which will not be repeated here.
  • when the processing module 902 obtains the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the first feature map of the first scale belong, so as to obtain a category mask set; and, based on the weight of each mask in the category mask set and the first feature map of the first scale, obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.
  • when the processing module 902 fuses the vector representation and the first feature map of the second scale to obtain a second feature map, it is specifically used to: obtain a key vector based on the vector representation and the first weight matrix, and obtain a value vector based on the vector representation and the second weight matrix; obtain a query vector based on the first feature map of the second scale and the third weight matrix; perform attention calculation based on the key vector, the value vector and the query vector to obtain a third feature map; and fuse the third feature map and the first feature map of the second scale to obtain the second feature map.
  • when the processing module 902 obtains the second image based on the second feature map, it is specifically used to: upsample the first feature map of the first scale to obtain a fourth feature map, where the scale of the fourth feature map is the same as that of the second feature map; splice the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.
  • the above-mentioned device is used to execute the method in the above-mentioned embodiment.
  • the implementation principle and technical effect of the corresponding program module in the device are similar to those described in the above-mentioned method.
  • the working process of the device can refer to the corresponding process in the above-mentioned method, which will not be repeated here.
  • the embodiment of the present application also provides a computing device 1000.
  • FIG10 shows a schematic diagram of the structure of a computing device.
  • computing device 1000 includes: bus 1002, processor 1004, memory 1006 and communication interface 1008.
  • Processor 1004, memory 1006 and communication interface 1008 communicate through bus 1002.
  • Computing device 1000 can be a server or a terminal device. It should be understood that the present application does not limit the number of processors and memories in computing device 1000.
  • the bus 1002 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus may be divided into an address bus, a data bus, a control bus, etc.
  • the bus in FIG. 10 is represented by only one line, but this does not mean that there is only one bus or only one type of bus.
  • the bus 1002 may include a path for transmitting information between various components of the computing device 1000 (e.g., the memory 1006, the processor 1004, and the communication interface 1008).
  • the memory 1006 may include a volatile memory, such as a random access memory (RAM).
  • the memory 1006 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • ROM read-only memory
  • HDD hard disk drive
  • SSD solid state drive
  • the memory 1006 stores executable program codes, and the processor 1004 executes the executable program codes to respectively implement the functions of the communication module 801 and the processing module 802 in FIG8 , or to implement the functions of the communication module 901 and the processing module 902 in FIG9 , thereby implementing all or part of the steps of the method in the above embodiment. That is, the memory 1006 stores instructions for executing all or part of the steps in the method in the above embodiment.
  • the memory 1006 stores executable codes
  • the processor 1004 executes the executable codes to respectively implement the functions of the aforementioned image processing apparatus 800 or 900, thereby implementing all or part of the steps in the above-mentioned embodiment method. That is, the memory 1006 stores instructions for executing all or part of the steps in the above-mentioned embodiment method.
  • an embodiment of the present application provides a computer-readable storage medium, which stores a computer program.
  • the computer program runs on a processor, the processor executes the method in the above embodiment.
  • the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media.
  • the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state drive (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

An image processing method, comprising: acquiring a first image; performing feature extraction on the first image; fusing n extracted first feature maps, wherein during the i-th fusion, a vector representation of the category to which pixels contained in the feature map obtained in the (i-1)-th fusion belong is obtained, and the vector representation is fused with the first feature map to be fused in the i-th fusion to obtain the feature map of the i-th fusion; and obtaining a second image on the basis of the feature map obtained in the last fusion, the second image being used for representing the category to which pixels contained in the first image belong. Therefore, when feature maps are fused, a vector representation of the category to which pixels contained in a small-scale feature map belong is first obtained from the small-scale feature map, and the obtained vector representation is then fused with a large-scale feature map, realizing fusion of pixels and the categories to which the pixels belong, avoiding the effect of noise points or abnormal point features in the feature maps, and improving the accuracy of semantic segmentation on images.

Description

Image processing method, device and computing device

This application claims priority to the Chinese patent application filed with the State Intellectual Property Office of China on December 26, 2022, with application number 202211673092.0 and application name "An image processing method, device and computing device", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of artificial intelligence (AI) technology, and in particular to an image processing method, apparatus and computing device.

Background

Semantic segmentation is one of the basic tasks in computer vision; it distinguishes objects of different categories in an image at the pixel level. Semantic segmentation is usually regarded as a pixel classification task, that is, a predefined semantic category needs to be predicted for each pixel of the image. At present, neural network models used for semantic segmentation often follow the "encoder-decoder" design paradigm. In this paradigm, the encoder is mainly used for feature representation learning, to learn feature maps of different scales of the image input to the encoder. It generally uses convolutional neural networks (CNN), such as the visual geometry group (VGG) network or the residual neural network (ResNet), or uses a visual attention model (vision transformer), a multi-layer perceptron model (multi-layer perception), and the like to learn the features of the image. The decoder is mainly used to perform the pixel-level classification task on the basis of the feature maps extracted by the encoder, so as to realize semantic segmentation; it can enhance the expressiveness of the features while restoring the resolution of the feature maps, finally producing a segmentation result whose resolution is consistent with that of the input image. In wide practice, encoders usually use existing, pre-trained image classification models, so an excellent decoder design can effectively improve the accuracy of semantic segmentation. Therefore, how to provide an excellent decoder is a technical problem that urgently needs to be solved.

Summary of the Invention

The present application provides an image processing method, apparatus, computing device, computer storage medium and computer program product, which can effectively improve the accuracy of semantic segmentation.
In a first aspect, the present application provides an image processing method, the method comprising: acquiring a first image; performing feature extraction on the first image to obtain a plurality of first feature maps of different scales; fusing n first feature maps among the plurality of first feature maps, wherein, in the i-th fusion, a vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong is obtained, and the vector representation and the first feature map to be fused for the i-th time are fused to obtain the i-th fused feature map, 2≤i≤n-1; and obtaining a second image based on the feature map obtained by the last fusion, the second image being used to characterize the categories to which the pixels contained in the first image belong.

In this way, during image processing, each time feature maps are fused, the features of the categories to which the pixels contained in the small-scale feature map belong (i.e., the vector representation) are first obtained from that feature map, and these category features are then fused with the large-scale feature map. This realizes the fusion of "points and classes" (i.e., the fusion of pixels and the categories to which they belong), avoids the influence of noise points or abnormal point features in the feature map, and improves the accuracy of semantic segmentation of the image.
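As an illustration only (not part of the claimed method), the fusion schedule above can be sketched as a loop in which the output of the (i-1)-th fusion drives the i-th fusion. The following minimal Python sketch assumes a caller-supplied fuse_step function standing in for the class-vector computation and fusion detailed below; the names are hypothetical.

```python
def decode(first_feature_maps, fuse_step):
    """first_feature_maps: first feature maps ordered from smallest to largest scale.
    fuse_step(prev_fused, next_map): obtains the vector representation of the
    categories of the pixels in prev_fused and fuses it with next_map."""
    fused = first_feature_maps[0]           # smallest-scale first feature map
    results = [fused]
    for next_map in first_feature_maps[1:]:
        # the i-th fusion consumes the feature map obtained by the (i-1)-th fusion
        fused = fuse_step(fused, next_map)
        results.append(fused)
    return results  # results[-1] is the feature map obtained by the last fusion
```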
In a possible implementation, obtaining the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong specifically includes: obtaining the masks of the categories to which the pixels contained in the target feature map belong, to obtain a category mask set, the target feature map being the feature map obtained by the (i-1)-th fusion; and obtaining the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map. Thus, by computing category masks, the vector representation of the categories to which the pixels contained in the feature map belong can be obtained, which facilitates the subsequent "point-class" fusion.

In a possible implementation, obtaining the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map specifically includes: obtaining the transposed matrix of the features of the pixels contained in the target feature map; and multiplying the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
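As an illustration only, a minimal PyTorch-style sketch of this mask-weighted computation follows. The softmax normalization of the mask weights and the tensor names are assumptions; the text only fixes that the class vectors come from multiplying the mask weights by the transposed pixel-feature matrix.

```python
import torch

def class_vectors(feat, mask_logits):
    # feat: [B, C, H, W] target feature map; mask_logits: [B, K, H, W], one mask per category
    B, C, H, W = feat.shape
    K = mask_logits.shape[1]
    # weight of each mask: normalized over all pixels (softmax is an assumed choice)
    weights = torch.softmax(mask_logits.reshape(B, K, H * W), dim=-1)  # [B, K, HW]
    pixels = feat.reshape(B, C, H * W)                                 # [B, C, HW]
    # multiply the mask weights by the transposed pixel-feature matrix
    return torch.bmm(weights, pixels.transpose(1, 2))                  # [B, K, C]
```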
In a possible implementation, fusing the vector representation and the first feature map to be fused for the i-th time to obtain the i-th fused feature map specifically includes: obtaining a key vector based on the vector representation and the first weight matrix, and obtaining a value vector based on the vector representation and the second weight matrix; obtaining a query vector based on the first feature map to be fused for the i-th time and the third weight matrix; performing attention calculation based on the key vector, the value vector and the query vector to obtain a second feature map; and fusing the second feature map with the first feature map to be fused for the i-th time to obtain the i-th fused feature map.
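As an illustration only, the three weight matrices and the attention calculation can be sketched in PyTorch as below; treating the final fusion of the two maps as a residual addition is an assumption, since the text does not fix that operation.

```python
import torch
import torch.nn as nn

class PointClassFusion(nn.Module):
    """Fuse a large-scale feature map (queries) with per-category
    vectors (keys/values) via scaled dot-product attention."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)   # third weight matrix (queries)
        self.w_k = nn.Linear(dim, dim)   # first weight matrix (keys)
        self.w_v = nn.Linear(dim, dim)   # second weight matrix (values)

    def forward(self, feat, class_vec):
        # feat: [B, C, H, W] large-scale map; class_vec: [B, K, C] category vectors
        B, C, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)                         # [B, HW, C]
        q = self.w_q(tokens)                                             # [B, HW, C]
        k = self.w_k(class_vec)                                          # [B, K, C]
        v = self.w_v(class_vec)                                          # [B, K, C]
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)   # [B, HW, K]
        fused = tokens + attn @ v                                        # residual fusion (assumed)
        return fused.transpose(1, 2).reshape(B, C, H, W)
```

Because the keys and values come from only K category vectors, the attention matrix has shape [HW, K] rather than [HW, HW], keeping the cost linear in the number of pixels.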
In a possible implementation, obtaining the second image based on the feature map obtained by the last fusion specifically includes: upsampling k third feature maps to obtain k fourth feature maps, the k third feature maps being the feature maps obtained by each fusion except the last, and at least one of the low-scale feature maps of the two feature maps fused the first time, the scale of each fourth feature map being the same as the scale of the feature map obtained by the last fusion; splicing the feature map obtained by the last fusion and the k fourth feature maps in the channel dimension to obtain a fifth feature map; and obtaining the second image based on the fifth feature map.
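As an illustration only, a sketch of this splice-and-classify step follows. Bilinear interpolation stands in for the upsampling here (the channel-expansion variant of the next paragraph is sketched separately), and the 1x1 convolution classifier is an assumption.

```python
import torch
import torch.nn.functional as F

def segmentation_head(last_feat, earlier_feats, classifier):
    # last_feat: [B, C, H, W], the feature map obtained by the last fusion
    # earlier_feats: the k third feature maps at smaller scales
    ups = [F.interpolate(f, size=last_feat.shape[-2:], mode="bilinear",
                         align_corners=False) for f in earlier_feats]  # k fourth feature maps
    fifth = torch.cat([last_feat] + ups, dim=1)   # splice in the channel dimension
    return classifier(fifth)                      # e.g. nn.Conv2d(fifth_channels, num_classes, 1)
```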
In a possible implementation, upsampling the k third feature maps specifically includes: for any one of the k third feature maps, expanding the number of channels of that feature map based on a preset upsampling multiple to obtain a sixth feature map; splicing that feature map and the sixth feature map in the channel dimension; and rearranging the pixels of the spliced feature map to obtain the fourth feature map corresponding to that feature map.
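As an illustration only, this channel-expansion plus pixel-rearrangement step matches the shape contract of PixelShuffle; using a 1x1 convolution for the channel expansion is an assumption.

```python
import torch
import torch.nn as nn

class ShuffleUpsample(nn.Module):
    """Upsample by channel expansion followed by pixel rearrangement."""
    def __init__(self, channels, scale):
        super().__init__()
        # Expand channels so that, after splicing with the input,
        # the total equals channels * scale**2 as PixelShuffle requires.
        self.expand = nn.Conv2d(channels, channels * (scale ** 2 - 1), 1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):                      # x: [B, C, H, W]
        expanded = self.expand(x)              # the "sixth feature map"
        spliced = torch.cat([x, expanded], 1)  # [B, C*scale^2, H, W]
        return self.shuffle(spliced)           # [B, C, H*scale, W*scale]
```

PixelShuffle performs exactly the pixel rearrangement described: it moves blocks of channels into spatial positions, so no pixel values are interpolated away.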
In a second aspect, the present application provides an image processing method, the method comprising: acquiring a first image; performing feature extraction on the first image to obtain a plurality of first feature maps of different scales; acquiring a vector representation of the categories to which the pixels contained in the first feature map of a first scale belong; fusing the vector representation with the first feature map of a second scale to obtain a second feature map, the second scale being larger than the first scale; and obtaining a second image based on the second feature map, the second image being used to characterize the categories to which the pixels contained in the first image belong.

In this way, during image processing, when feature maps are fused, the features of the categories to which the pixels contained in the small-scale feature map belong (i.e., the vector representation) are first obtained from that feature map, and these category features are then fused with the large-scale feature map. This realizes the fusion of "points and classes" (i.e., the fusion of pixels and the categories to which they belong), avoids the influence of noise points or abnormal point features in the feature map, and improves the accuracy of semantic segmentation of the image.

In a possible implementation, acquiring the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong specifically includes: obtaining the masks of the categories to which the pixels contained in the first feature map of the first scale belong, to obtain a category mask set; and obtaining the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale.

In a possible implementation, obtaining the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale specifically includes: obtaining the transposed matrix of the features of the pixels contained in the first feature map of the first scale; and multiplying the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.

In a possible implementation, fusing the vector representation with the first feature map of the second scale to obtain the second feature map specifically includes: obtaining a key vector based on the vector representation and the first weight matrix, and obtaining a value vector based on the vector representation and the second weight matrix; obtaining a query vector based on the first feature map of the second scale and the third weight matrix; performing attention calculation based on the key vector, the value vector and the query vector to obtain a third feature map; and fusing the third feature map with the first feature map of the second scale to obtain the second feature map.

In a possible implementation, obtaining the second image based on the second feature map specifically includes: upsampling the first feature map of the first scale to obtain a fourth feature map, the scale of the fourth feature map being the same as the scale of the second feature map; splicing the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtaining the second image based on the fifth feature map.

In a possible implementation, upsampling the first feature map of the first scale specifically includes: expanding the number of channels of the first feature map of the first scale based on a preset upsampling multiple to obtain a sixth feature map; splicing the first feature map of the first scale and the sixth feature map in the channel dimension; and rearranging the pixels of the spliced feature map to obtain the fifth feature map.
In a third aspect, the present application provides an image processing device, the device comprising: a communication module, used to acquire a first image; and a processing module, used to perform feature extraction on the first image to obtain a plurality of first feature maps of different scales; the processing module is also used to fuse n first feature maps among the plurality of first feature maps, wherein, at the i-th fusion, a vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong is obtained, and the vector representation and the first feature map to be fused for the i-th time are fused to obtain the i-th fused feature map, 2≤i≤n-1; the processing module is also used to obtain a second image based on the feature map obtained by the last fusion, the second image being used to characterize the categories to which the pixels contained in the first image belong.

In a possible implementation, when the processing module obtains the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the target feature map belong, so as to obtain a category mask set, the target feature map being the feature map obtained by the (i-1)-th fusion; and obtain the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map.

In a possible implementation, when the processing module obtains the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map, it is specifically used to: obtain the transposed matrix of the features of the pixels contained in the target feature map; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.

In a possible implementation, when the processing module fuses the vector representation and the first feature map to be fused for the i-th time to obtain the i-th fused feature map, it is specifically used to: obtain a key vector based on the vector representation and the first weight matrix, and obtain a value vector based on the vector representation and the second weight matrix; obtain a query vector based on the first feature map to be fused for the i-th time and the third weight matrix; perform attention calculation based on the key vector, the value vector and the query vector to obtain a second feature map; and fuse the second feature map with the first feature map to be fused for the i-th time to obtain the i-th fused feature map.

In a possible implementation, when the processing module obtains the second image based on the feature map obtained by the last fusion, it is specifically used to: upsample k third feature maps to obtain k fourth feature maps, the k third feature maps being the feature maps obtained by each fusion except the last, and at least one of the low-scale feature maps of the two feature maps fused the first time, the scale of each fourth feature map being the same as the scale of the feature map obtained by the last fusion; splice the feature map obtained by the last fusion and the k fourth feature maps in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.

In a possible implementation, when the processing module performs upsampling on the k third feature maps, it is specifically used to: for any one of the k third feature maps, expand the number of channels of that feature map based on a preset upsampling multiple to obtain a sixth feature map; splice that feature map and the sixth feature map in the channel dimension; and rearrange the pixels of the spliced feature map to obtain the fourth feature map corresponding to that feature map.

In a fourth aspect, the present application provides an image processing device, the device comprising: a communication module, used to acquire a first image; and a processing module, used to perform feature extraction on the first image to obtain a plurality of first feature maps of different scales; the processing module is also used to acquire a vector representation of the categories to which the pixels contained in the first feature map of a first scale belong; the processing module is also used to fuse the vector representation with the first feature map of a second scale to obtain a second feature map, the second scale being larger than the first scale; the processing module is also used to obtain a second image based on the second feature map, the second image being used to characterize the categories to which the pixels contained in the first image belong.

In a possible implementation, when the processing module acquires the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the first feature map of the first scale belong, so as to obtain a category mask set; and obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale.

In a possible implementation, when the processing module obtains the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale, it is specifically used to: obtain the transposed matrix of the features of the pixels contained in the first feature map of the first scale; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.

In a possible implementation, when the processing module fuses the vector representation and the first feature map of the second scale to obtain the second feature map, it is specifically used to: obtain a key vector based on the vector representation and the first weight matrix, and obtain a value vector based on the vector representation and the second weight matrix; obtain a query vector based on the first feature map of the second scale and the third weight matrix; perform attention calculation based on the key vector, the value vector and the query vector to obtain a third feature map; and fuse the third feature map with the first feature map of the second scale to obtain the second feature map.

In a possible implementation, when the processing module obtains the second image based on the second feature map, it is specifically used to: upsample the first feature map of the first scale to obtain a fourth feature map, the scale of the fourth feature map being the same as the scale of the second feature map; splice the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.

In a possible implementation, when the processing module performs upsampling on the first feature map of the first scale, it is specifically used to: expand the number of channels of the first feature map of the first scale based on a preset upsampling multiple to obtain a sixth feature map; splice the first feature map of the first scale and the sixth feature map in the channel dimension; and rearrange the pixels of the spliced feature map to obtain the fifth feature map.
In a fifth aspect, the present application provides a computing device, comprising: at least one memory, used to store a program; and at least one processor, used to execute the program stored in the memory; wherein, when the program stored in the memory is executed, the processor is used to execute the method described in the first aspect or the second aspect.

In a sixth aspect, the present application provides a computer-readable storage medium storing a computer program. When the computer program runs on a processor, the processor executes the method described in the first aspect or the second aspect.

In a seventh aspect, the present application provides a computer program product. When the computer program product runs on a processor, the processor executes the method described in the first aspect or the second aspect.

It can be understood that, for the beneficial effects of the second to seventh aspects above, reference can be made to the relevant descriptions in the first aspect or the second aspect, which are not repeated here.
Brief Description of the Drawings

The following briefly introduces the drawings required for describing the embodiments or the prior art.

FIG1 is a schematic diagram of the structure of an image semantic segmentation neural network model provided in an embodiment of the present application;

FIG2 is a schematic diagram of the structure of the category feature calculation module shown in FIG1;

FIG3 is a schematic diagram of the structure of the category feature fusion module shown in FIG1;

FIG4 is a schematic diagram of the structure of another image semantic segmentation neural network model provided in an embodiment of the present application;

FIG5 is a schematic diagram of the structure of yet another image semantic segmentation neural network model provided in an embodiment of the present application;

FIG6 is a schematic flow chart of an image processing method provided in an embodiment of the present application;

FIG7 is a schematic flow chart of another image processing method provided in an embodiment of the present application;

FIG8 is a schematic diagram of the structure of an image processing device provided in an embodiment of the present application;

FIG9 is a schematic diagram of the structure of another image processing device provided in an embodiment of the present application;

FIG10 is a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
具体实施方式Detailed ways
In this document, the term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent three cases: only A exists, both A and B exist, and only B exists. The symbol "/" indicates an "or" relationship between the associated objects; for example, "A/B" means A or B.
In the specification and claims herein, the terms "first", "second", and the like are used to distinguish different objects rather than to describe a specific order of the objects. For example, a first response message and a second response message are used to distinguish different response messages rather than to describe a specific order of the response messages.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate an example, an illustration, or a description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application should not be interpreted as being preferred over or more advantageous than other embodiments or designs. Rather, the use of such words is intended to present a related concept in a concrete manner.
In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more. For example, multiple processing units means two or more processing units, and multiple elements means two or more elements.
By way of example, FIG. 1 shows the structure of an image semantic segmentation neural network model. As shown in FIG. 1, the image semantic segmentation neural network model 100 includes an encoder 110 and a decoder 120.
The encoder 110 is mainly used to learn features of the image input into the image semantic segmentation neural network model 100, so as to obtain feature maps of multiple different scales (shapes). By way of example, a scale (shape) may be understood as follows: the computation result of a layer in a neural network is a tensor, and the shape of the tensor indicates how the data is laid out in memory (its dimensions) and the data length of each dimension. For example, after a typical 3-channel image is converted into a tensor, its shape is [H, W, C], where H is the image height dimension, W is the width dimension, and C is the number of channels. The number of channels in the neural network may be preset. By way of example, the feature maps of different scales may include at least two of a 4x feature map, an 8x feature map, a 16x feature map, and a 32x feature map, where the resolution of an Nx feature map is 1/N of the resolution of the original image. The encoder 110 may be a convolutional neural network, a vision transformer, or a multi-layer perceptron, which may be determined according to the actual situation and is not limited here. For ease of description, the encoder 110 is introduced below as a convolutional neural network. Still referring to FIG. 1, the encoder 110 may include multiple convolutional layers, namely convolutional layer 1 to convolutional layer n shown in FIG. 1, each of which extracts a feature map of one scale. Each convolutional layer may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which can be obtained by training on given training data. During the convolution of an image, the weight matrix is typically slid across the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), thereby extracting specific features from the image. The size of the weight matrix is independent of the size of the image and is specified manually. It should be noted that the channel dimension of the weight matrix is the same as the channel dimension of the input image, and during the convolution operation the weight matrix extends through the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension; in most cases, however, multiple weight matrices of the same dimensions are applied instead of a single one, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another to extract a specific color of the image, and yet another to blur unwanted noise in the image. The multiple weight matrices have the same dimensions, the feature maps extracted by them also have the same dimensions, and the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation. It should be understood that the encoder 110 may also be configured with a pooling layer, a fully connected layer, and the like, which may be determined according to the actual situation and is not limited here. It should also be understood that, when the convolutional neural network is replaced with another network or model, each convolutional layer in the encoder 110 in FIG. 1 may be replaced with a layer or module used for extracting image features in that other network or model, and the replaced solution still falls within the protection scope of the present application.
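Purely as an illustration of the multi-scale feature extraction described above, the following is a minimal PyTorch-style sketch. The class name, channel widths, and the use of strided 3x3 convolutions are assumptions made for this example and are not part of the application; any backbone that yields 4x, 8x, 16x, and 32x feature maps fits the description.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Produces feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution."""
    def __init__(self, in_ch: int = 3, base_ch: int = 32):
        super().__init__()
        # Two stride-2 convolutions bring the input down to 1/4 resolution.
        self.stage1 = nn.Sequential(
            nn.Conv2d(in_ch, base_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(base_ch, base_ch, 3, stride=2, padding=1), nn.ReLU())
        # Each further stage halves the resolution and doubles the channels.
        self.stage2 = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)      # 1/8
        self.stage3 = nn.Conv2d(base_ch * 2, base_ch * 4, 3, stride=2, padding=1)  # 1/16
        self.stage4 = nn.Conv2d(base_ch * 4, base_ch * 8, 3, stride=2, padding=1)  # 1/32

    def forward(self, x):
        f4 = self.stage1(x)
        f8 = self.stage2(f4)
        f16 = self.stage3(f8)
        f32 = self.stage4(f16)
        return [f4, f8, f16, f32]  # multi-scale feature maps handed to the decoder
```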
The decoder 120 may include (n-1) multi-scale feature fusion modules, namely multi-scale feature fusion module 1 to multi-scale feature fusion module (n-1) shown in FIG. 1. Each multi-scale feature fusion module is mainly used to fuse two feature maps of different scales, and each may include a category feature calculation module 121 and a category feature fusion module 122.
The category feature calculation module 121 is mainly used to calculate, based on the low-scale feature map, vector representations of a preset number of semantic segmentation categories, or vector representations of the categories of the targets contained in the original image. By way of example, as shown in FIG. 2, the category feature calculation module 121 may include a normalization layer 1211, a linear mapping layer 1212, a linear mapping layer 1213, and a category feature calculation layer 1214.
The normalization layer 1211 is mainly used to perform a layer normalization operation on the features F ∈ R^{C×(HW)} contained in the small-scale feature map input into the category feature calculation module 121, so as to concentrate the features contained in the feature map in a region centered on a certain value, thereby improving the stability of the model training stage and the accuracy of subsequent calculations, where C denotes the number of channels of the feature map, and H and W denote the height and width of the feature map, respectively. By way of example, the normalization layer 1211 may be implemented by, but is not limited to, various normalization layers such as layer normalization (LN), batch normalization (BN), or group normalization (GN).
The linear mapping layer 1212 is mainly used to transform all the features in the feature map output by the normalization layer 1211 into one domain, so as to align the features contained in the feature map. The linear mapping layer 1212 may be a fully connected layer or a convolutional layer, which may be determined according to the actual situation and is not limited here.
The linear mapping layer 1213 is mainly used to classify all the features in the feature map output by the normalization layer 1211, so as to obtain a category mask set M ∈ R^{L×(HW)}, where L denotes the number of semantic segmentation categories. By way of example, the linear mapping layer 1213 can be understood as a classifier. In this embodiment, after the category masks of the categories are obtained, a Softmax function may be applied to each category mask to obtain the attention weight of each category mask, that is, the probability of each category mask appearing.
The category feature calculation layer 1214 is mainly used to perform matrix multiplication between the attention weights of the category masks and the feature map output by the linear mapping layer 1212, so as to obtain the vector representation of each category, that is, the category features.
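For illustration only, the following sketch shows one possible realization of the pipeline formed by layers 1211 to 1214. The class and parameter names are assumptions for this example; using linear layers for the two mappings is one of the options mentioned above, and the axis over which the softmax is taken here (over pixels, per class) is an assumed implementation detail.

```python
import torch
import torch.nn as nn

class CategoryFeatureCalc(nn.Module):
    """Computes one C-dimensional vector representation per semantic category."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)                 # normalization layer 1211
        self.value_proj = nn.Linear(channels, channels)    # linear mapping layer 1212
        self.mask_proj = nn.Linear(channels, num_classes)  # linear mapping layer 1213

    def forward(self, feat):
        # feat: [B, C, H, W], the small-scale feature map.
        b, c, h, w = feat.shape
        x = self.norm(feat.flatten(2).transpose(1, 2))     # [B, HW, C]
        v = self.value_proj(x)                             # aligned pixel features
        masks = self.mask_proj(x)                          # per-pixel class logits, [B, HW, L]
        attn = masks.softmax(dim=1)                        # weight of each class mask over pixels
        # Layer 1214: mask weights times (transposed) pixel features -> [B, L, C].
        return attn.transpose(1, 2) @ v
```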
The category feature fusion module 122 is mainly used to fuse the large-scale feature map with the vector representations of the categories output by the category feature calculation module 121, so as to obtain a fused feature map, thereby realizing "point-to-class" feature fusion, reducing the negative impact of noise points on the feature expression capability, and effectively fusing feature maps of different scales. By way of example, as shown in FIG. 3, the category feature fusion module 122 may include: a linear mapping layer 1221, a linear mapping layer 1222, a normalization layer 1223, a linear mapping layer 1224, a multi-headed self-attention (MHA) module 1225, a feature fusion layer 1226, a normalization layer 1227, a feed-forward network (FFN) 1228, and a feature fusion layer 1229.
The linear mapping layer 1221 is mainly used to calculate, based on the vector representations of the categories (that is, the category features), the key value K (that is, the key sequence, which may also be called the "key vector") required by the MHA 1225, where K = ZW_K, Z being the category features calculated by the category feature calculation module 121 and W_K being a mapping matrix that may be preset. In this embodiment, the linear mapping layer 1221 may be implemented by, but is not limited to, a fully connected layer or a convolutional layer.
The linear mapping layer 1222 is mainly used to calculate, based on the vector representations of the categories (that is, the category features), the value V (that is, the value sequence, which may also be called the "value vector") required by the MHA 1225, where V = ZW_V, Z being the category features calculated by the category feature calculation module 121 and W_V being a mapping matrix that may be preset. By way of example, the linear mapping layer 1222 may be implemented by, but is not limited to, a fully connected layer or a convolutional layer.
The normalization layer 1223 is mainly used to perform a layer normalization operation on the features X ∈ R^{C×(H′W′)} contained in the large-scale feature map input into the category feature fusion module 122, so as to concentrate the features contained in the feature map in a region centered on a certain value, thereby improving the stability of the model training stage and the accuracy of subsequent calculations. By way of example, the normalization layer 1223 may be implemented by, but is not limited to, various normalization layers such as LN, BN, or GN.
The linear mapping layer 1224 is mainly used to calculate, based on the feature map output by the normalization layer 1223, the query value Q (that is, the query sequence, which may also be called the "query vector") required by the MHA 1225, where Q = Norm(X)W_Q, X being the large-scale feature map, Norm(·) denoting the normalization operation, and W_Q being a mapping matrix that may be preset. By way of example, the linear mapping layer 1224 may be implemented by, but is not limited to, a fully connected layer or a convolutional layer.
The multi-head self-attention module MHA 1225 is mainly used to perform attention calculation based on the determined key value K, value V, and query value Q. The MHA 1225 may include multiple heads, each of which performs one attention calculation. For one head, the calculation can be expressed as:

A = Softmax(αQK^T)V

where Q is the query value, K is the key value, V is the value, and α is a regularization factor. In some embodiments, since Q represents a high-resolution matrix while K and V both represent low-resolution matrices, multiplying Q by K^T yields a high-resolution matrix, and multiplying that high-resolution matrix by V again yields a high-resolution matrix, so the resolution of the resulting feature map is raised. Conversely, if K and V were calculated from the large-scale feature map while Q was calculated from the category features, Q would represent a low-resolution matrix and K and V would represent high-resolution matrices; QK^T would still be a high-resolution matrix, but multiplying it by V would eliminate the high resolution and produce a low-resolution matrix, lowering the resolution of the resulting feature map. Therefore, in this embodiment, K and V are calculated from the category features, and Q is calculated from the large-scale feature map.
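To make the resolution argument concrete, assume for illustration that Q ∈ R^{(H′W′)×C} comes from the large-scale feature map and K, V ∈ R^{L×C} come from the category features, with L much smaller than H′W′. Then:

QK^T ∈ R^{(H′W′)×L}, and A = Softmax(αQK^T)V ∈ R^{(H′W′)×C}

that is, one C-dimensional vector is obtained for every high-resolution position. With the roles swapped, A ∈ R^{L×C} would contain only one vector per category, which is why Q is taken from the large-scale feature map.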
After all the heads have completed the attention calculation, the calculation results of the heads can be concatenated to obtain the final result. By way of example, if [A_1, …, A_h] denotes the concatenation of the outputs of the heads in the MHA 1225, the result obtained by the MHA 1225 through the attention calculation can be:

Y = W_0[A_1, …, A_h]

where Y denotes the output of the MHA 1225, and W_0 is a mapping matrix that may be preset.
The feature fusion layer 1226 is mainly used to add the features in the large-scale feature map and the features in the feature map output by the MHA 1225, so as to complete feature fusion. By way of example, the feature fusion layer 1226 may be implemented by, but is not limited to, a convolutional layer.
The normalization layer 1227 is mainly used to perform a layer normalization operation on the features contained in the feature map obtained through fusion by the feature fusion layer 1226, so as to concentrate the features contained in the feature map in a region centered on a certain value, thereby improving the stability of the model training stage and the accuracy of subsequent calculations. By way of example, the normalization layer 1227 may be implemented by, but is not limited to, various normalization layers such as LN, BN, or GN.
The feed-forward network FFN 1228 is mainly used to spatially transform the features in the feature map output by the normalization layer 1227, so as to complete nonlinear processing, thereby mining the nonlinear relationships among the features and enhancing their expressive capability.
The feature fusion layer 1229 is mainly used to add the features output by the FFN 1228 and the features output by the feature fusion layer 1226, so as to complete feature fusion, obtain the fused feature map, and output that feature map.
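As an illustration, the sketch below assembles layers 1221 to 1229 into one module; the names and hyperparameters are assumptions for this example, and PyTorch's built-in multi-head attention stands in for the MHA 1225. Note that K and V come from the category features while Q comes from the large-scale feature map, as explained above.

```python
import torch
import torch.nn as nn

class CategoryFeatureFusion(nn.Module):
    """Fuses a large-scale feature map with per-category vectors via attention."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.k_proj = nn.Linear(channels, channels)      # layer 1221: K = Z @ Wk
        self.v_proj = nn.Linear(channels, channels)      # layer 1222: V = Z @ Wv
        self.norm1 = nn.LayerNorm(channels)              # normalization layer 1223
        self.q_proj = nn.Linear(channels, channels)      # layer 1224: Q = Norm(X) @ Wq
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)  # MHA 1225
        self.norm2 = nn.LayerNorm(channels)              # normalization layer 1227
        self.ffn = nn.Sequential(                        # feed-forward network 1228
            nn.Linear(channels, channels * 4), nn.GELU(),
            nn.Linear(channels * 4, channels))

    def forward(self, feat, class_feat):
        # feat: [B, C, H', W'] large-scale map; class_feat: [B, L, C] category features.
        b, c, h, w = feat.shape
        x = feat.flatten(2).transpose(1, 2)              # [B, H'W', C]
        q = self.q_proj(self.norm1(x))
        k, v = self.k_proj(class_feat), self.v_proj(class_feat)
        y, _ = self.attn(q, k, v)                        # one fused vector per position
        x = x + y                                        # feature fusion layer 1226
        x = x + self.ffn(self.norm2(x))                  # layers 1227 to 1229
        return x.transpose(1, 2).reshape(b, c, h, w)     # back to map form
```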
In addition, the decoder 120 may further include a classification layer 123. The classification layer 123 is mainly used to perform feature classification processing on the feature map output by the last (that is, the deepest) multi-scale feature fusion module, so as to obtain the semantic segmentation result. By way of example, the classification layer 123 may be, but is not limited to being, mainly composed of a convolutional layer or a fully connected layer. By way of example, the output of the classification layer 123 may be an image that contains the category to which each pixel belongs.
The above is an introduction to the image semantic segmentation neural network model 100 provided in the embodiments of the present application. In FIG. 1, the large-scale feature map to be fused by each multi-scale feature fusion module in the decoder 120 is one of the feature maps output by the convolutional layers of the encoder 110 other than the n-th (that is, the deepest) convolutional layer. The small-scale feature map to be fused by the first multi-scale feature fusion module in the decoder 120 (that is, multi-scale feature fusion module 1) is the feature map output by the n-th (deepest) convolutional layer of the encoder 110. For any multi-scale feature fusion module in the decoder other than the first one, the small-scale feature map to be fused is the feature map output by the multi-scale feature fusion module one level above it. For example, the small-scale feature map to be fused by multi-scale feature fusion module (n-2) is the feature map output by multi-scale feature fusion module (n-3). By way of example, the feature maps to be fused by the multi-scale feature fusion modules in FIG. 1 can be understood as follows: the feature maps to be fused by multi-scale feature fusion module 1 are the feature maps output by convolutional layer n and convolutional layer (n-1) of the encoder 110; the feature maps to be fused by multi-scale feature fusion module m are the feature map output by convolutional layer (n-m) of the encoder 110 and the feature map output by multi-scale feature fusion module (m-1), where 1 < m ≤ (n-1).
In some embodiments, the small-scale feature map input into each multi-scale feature fusion module may further be upsampled, the feature maps obtained through upsampling may be concatenated, in the channel dimension, with the feature map output by the last multi-scale feature fusion module, and the concatenated feature map may be input into the classification layer 123 to obtain the final semantic segmentation result, thereby improving the semantic segmentation effect. As for the upsampling, in this embodiment, the small-scale feature map may first be linearly mapped by a convolutional layer based on the upsampling factor, so as to expand the number of channels of the feature map; then, the feature map obtained by expanding the number of channels and the original feature map are concatenated in the channel dimension; finally, pixel-shuffle is performed on the concatenated feature map to obtain the upsampled feature map. For example, assuming the number of channels of the current feature map is C, if upsampling by a factor of γ is required, the existing feature map can be transformed by a linear mapping to obtain C(γ²-1) feature maps; the original feature map and the transformed feature maps can then be concatenated in the channel dimension, yielding a feature map with Cγ² channels; finally, a pixel-shuffle operation can be used to obtain the upsampled feature map. In this embodiment, the new feature maps used for channel-dimension concatenation are obtained through linear mapping, that is, through convolution over the original feature map, rather than by fitting a previously non-existent feature from multiple features as in traditional approaches such as linear interpolation. The features in the feature map produced by this upsampling method therefore all actually exist, so it performs better than traditional upsampling methods.
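A minimal sketch of this upsampling step follows, assuming the class name and the use of a 1x1 convolution for the linear mapping; pixel_shuffle is PyTorch's pixel-rearrangement operation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuffleUpsample(nn.Module):
    """Upsamples a [B, C, H, W] map by `gamma` without interpolating new values."""
    def __init__(self, channels: int, gamma: int):
        super().__init__()
        self.gamma = gamma
        # Linear mapping that produces C*(gamma^2 - 1) additional channels.
        self.expand = nn.Conv2d(channels, channels * (gamma ** 2 - 1), kernel_size=1)

    def forward(self, feat):
        extra = self.expand(feat)                  # transformed feature maps
        stacked = torch.cat([feat, extra], dim=1)  # C * gamma^2 channels in total
        # Rearranges channels into space: [B, C, H*gamma, W*gamma].
        return F.pixel_shuffle(stacked, self.gamma)
```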
In some embodiments, the number of multi-scale fusion modules in the decoder 120 shown in FIG. 1 is not necessarily (n-1); it may also be smaller than (n-1), but it is at least 1.
When the number of multi-scale fusion modules in the decoder 120 is 1, as shown in FIG. 4, the multi-scale fusion module in the decoder 120 may select any two feature maps of different scales for fusion, for example, the feature maps output by convolutional layer n and convolutional layer (n-1) of the encoder 110, or the feature maps output by convolutional layer n and convolutional layer 2 of the encoder 110, and so on.
When the number of multi-scale fusion modules in the decoder 120 is greater than 1 but smaller than the number of convolutional layers in the encoder 110, as shown in FIG. 5, the decoder 120 is provided with m multi-scale feature fusion modules, where 1 < m < n and n is the number of convolutional layers in the encoder 110. In this case, the feature maps output by any (n-m-1) convolutional layers of the encoder 110 may be left out of the fusion, that is, those (n-m-1) feature maps output by the encoder 110 are not fused. For example, in FIG. 5, when m = n-2, the first multi-scale feature fusion module in the decoder 120 may fuse the feature maps output by the n-th and (n-1)-th convolutional layers of the encoder 110, and the i-th (1 < i ≤ m) multi-scale feature fusion module may fuse the feature map output by the (i-1)-th multi-scale feature fusion module with the feature map output by the (n-i)-th convolutional layer of the encoder 110.
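For illustration, the chaining of fusion modules described above can be sketched as follows, reusing the earlier CategoryFeatureCalc and CategoryFeatureFusion sketches. It assumes, purely for simplicity, that all scales share one channel width; a real encoder would need per-scale channel projections.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Fuses encoder outputs from the smallest scale upward, as in FIG. 1."""
    def __init__(self, channels: int, num_classes: int, num_fusions: int):
        super().__init__()
        self.calc = nn.ModuleList(
            [CategoryFeatureCalc(channels, num_classes) for _ in range(num_fusions)])
        self.fuse = nn.ModuleList(
            [CategoryFeatureFusion(channels) for _ in range(num_fusions)])
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)  # layer 123

    def forward(self, feats):
        # feats: encoder outputs ordered from largest to smallest scale.
        small = feats[-1]                            # deepest, smallest-scale map
        for i, large in enumerate(reversed(feats[:-1])):
            class_feat = self.calc[i](small)         # per-category vectors
            small = self.fuse[i](large, class_feat)  # fuse into the next larger map
        return self.classifier(small)                # per-pixel class scores
```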
In some embodiments, the upsampling in FIG. 1, FIG. 4, and FIG. 5 may be selected according to actual requirements and is not strictly limited here.
The above is an introduction to the image semantic segmentation neural network model provided in the embodiments of the present application. Next, based on the above content, an image processing method provided in an embodiment of the present application is introduced.
By way of example, FIG. 6 shows the flow of an image processing method. It should be understood that the method may be executed by any apparatus, device, platform, or device cluster having computing and processing capabilities, for example, by a mobile phone, a computer, a vehicle-mounted terminal, or a cloud server. As shown in FIG. 6, the image processing method may include the following steps:
S601. Acquire a first image.
In this embodiment, the first image may be captured by an image acquisition apparatus such as a camera. After the image acquisition apparatus captures the first image, the image may be transmitted to a device associated with the image acquisition apparatus (for example, a mobile phone or a vehicle-mounted terminal) so that the device processes the image; in this way, the device acquires the first image. In some embodiments, when the method is executed by a device other than the device associated with the image acquisition apparatus (for example, a cloud server), the device associated with the image acquisition apparatus may transmit the first image captured by the image acquisition apparatus to that other device over a network or the like, so that the other device processes the first image; in this way, the other device acquires the first image.
S602. Perform feature extraction on the first image to obtain multiple first feature maps of different scales.
In this embodiment, after the first image is acquired, feature extraction may be performed on the first image, for example, by the encoder 110 in FIG. 1, so as to obtain multiple first feature maps of different scales.
S603. Fuse n first feature maps among the multiple first feature maps, where, during the i-th fusion, a vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong is acquired, and the vector representation is fused with the first feature map to be fused at the i-th time, so as to obtain the i-th fused feature map, with 2 ≤ i ≤ n-1.
In this embodiment, after the multiple first feature maps of different scales are obtained, all of the first feature maps may be fused, or only some of them. For ease of description, the fusion of n (n ≥ 3) first feature maps is described, where n may be the number of all the first feature maps or the number of some of them. In addition, different first feature maps have different scales. By way of example, the n first feature maps may be fused by the decoder 120 shown in FIG. 1. In this embodiment, when the n first feature maps are fused, the feature maps may be fused one by one in ascending order of scale, where the scale of the feature map obtained by the (i-1)-th fusion is smaller than the scale of the first feature map to be fused at the i-th time.
In some embodiments, during the i-th fusion, the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong may first be acquired, where 2 ≤ i ≤ n-1. For example, the required vector representation may be acquired by the category feature calculation module 121 of the decoder 120 shown in FIG. 1. By way of example, the features of all the pixels in the target feature map (that is, the feature map obtained by the (i-1)-th fusion) may first be classified to obtain the masks of the categories to which the pixels contained in the target feature map belong, so as to obtain a category mask set. Then, each mask in the category mask set may be operated on, by a function such as Softmax but not limited thereto, to obtain the weight of each mask. Finally, by multiplying the weight of each mask by the transposed matrix of the features of the pixels contained in the target feature map, the vector representation of the categories to which the pixels contained in the target feature map belong is obtained. In other words, the vector representation of the categories to which the pixels contained in the target feature map belong can be obtained based on the weight of each mask in the category mask set and the target feature map.
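Using symbols assumed here for illustration, consistent with the module description above, this step can be written compactly as:

Z = Softmax(M)F^T ∈ R^{L×C}

where M ∈ R^{L×(HW)} collects the masks of the L categories over the HW pixels, Softmax(M) gives the weight of each mask, and F ∈ R^{C×(HW)} collects the pixel features of the target feature map; each row of Z is then the vector representation of one category.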
After the vector representation is acquired, the acquired vector representation may be fused with the first feature map to be fused at the i-th time, so as to obtain the i-th fused feature map; for example, the fusion may be performed by the category feature fusion module 122 of the decoder 120 shown in FIG. 1. By way of example, the acquired vector representation may be multiplied by a first weight matrix (that is, the aforementioned W_K) to obtain a key vector (that is, the aforementioned key value K), and the acquired vector representation may be multiplied by a second weight matrix (that is, the aforementioned W_V) to obtain a value vector (that is, the aforementioned value V); then, a query vector (that is, the aforementioned query value Q) is obtained based on the first feature map to be fused at the i-th time and a third weight matrix (that is, the aforementioned W_Q); next, attention calculation is performed based on the key vector, the value vector, and the query vector to obtain a second feature map; finally, the second feature map and the first feature map to be fused at the i-th time are fused to obtain the i-th fused feature map. For details, see the introduction to the category feature fusion module 122 in FIG. 1 above, which is not repeated here.
S604. Obtain a second image based on the feature map obtained by the last fusion, where the second image is used to represent the categories to which the pixels contained in the first image belong.
In this embodiment, after the last feature map fusion is completed, the feature map obtained by the last fusion may be processed, for example, classified by a classifier, to obtain the second image. The second image is used to represent the categories to which the pixels contained in the first image belong, and can be understood as the result of semantic segmentation.
In this way, during image processing, each time feature maps are fused, the features (that is, the vector representations) of the categories to which the pixels of the small-scale feature map belong are first acquired from that small-scale feature map, and the acquired per-category features are then fused with the large-scale feature map. This realizes point-to-class fusion (that is, fusion between pixels and the categories to which the pixels belong), avoids the influence of noise-point or outlier features in the feature map, and improves the accuracy of semantic segmentation of the image.
In some embodiments, in S604, k feature maps, namely the feature maps obtained by each fusion other than the last fusion and the low-scale one of the two feature maps of the first fusion, may first be upsampled to obtain k fourth feature maps, where the scale of each fourth feature map is the same as the scale of the feature map obtained by the last fusion. Next, the feature map obtained by the last fusion and the k fourth feature maps may be concatenated in the channel dimension to obtain a fifth feature map. Finally, the second image is obtained by processing the fifth feature map, for example, by classifying it with a classifier. This improves the semantic segmentation effect.
By way of example, when any one of the k feature maps is upsampled, the number of channels of that feature map may first be expanded based on a preset upsampling factor to obtain a sixth feature map; then, that feature map and the obtained sixth feature map are concatenated in the channel dimension; finally, pixel rearrangement (pixel-shuffle) is performed on the concatenated feature map to obtain the fourth feature map corresponding to that feature map. For details, see the foregoing description of upsampling.
By way of example, FIG. 7 shows the flow of another image processing method. It should be understood that the method may be executed by any apparatus, device, platform, or device cluster having computing and processing capabilities, for example, by a mobile phone, a computer, a vehicle-mounted terminal, or a cloud server. The main difference between FIG. 7 and FIG. 6 is that in FIG. 7 only two feature maps of different scales are fused and the final semantic segmentation result is obtained from the fusion result, whereas in FIG. 6 three or more feature maps of different scales are fused and the final semantic segmentation result is obtained from the fusion result. For the content of FIG. 7, reference may be made to the foregoing description of FIG. 6, which is not repeated here. As shown in FIG. 7, the image processing method may include the following steps:
S701. Acquire a first image.
S702. Perform feature extraction on the first image to obtain multiple first feature maps of different scales.
S703. Acquire a vector representation of the categories to which the pixels contained in the first feature map of a first scale belong.
S704. Fuse the vector representation with the first feature map of a second scale to obtain a second feature map, where the second scale is larger than the first scale.
S705. Obtain a second image based on the second feature map, where the second image is used to represent the categories to which the pixels contained in the first image belong.
In this way, during image processing, when feature maps are fused, the features (that is, the vector representations) of the categories to which the pixels of the small-scale feature map belong are first acquired from that small-scale feature map, and the acquired per-category features are then fused with the large-scale feature map. This realizes point-to-class fusion (that is, fusion between pixels and the categories to which the pixels belong), avoids the influence of noise-point or outlier features in the feature map, and improves the accuracy of semantic segmentation of the image.
It should be understood that the methods described in FIG. 6 or FIG. 7 above may be applied to, but are not limited to, tasks such as autonomous driving, image editing, satellite telemetry, medical diagnosis, and augmented/virtual reality.
Based on the methods in the foregoing embodiments, an embodiment of the present application provides an image processing apparatus.
By way of example, FIG. 8 shows a schematic structural diagram of an image processing apparatus. As shown in FIG. 8, the image processing apparatus 800 may include a communication module 801 and a processing module 802. The communication module 801 is configured to acquire a first image. The processing module 802 is configured to perform feature extraction on the first image to obtain multiple first feature maps of different scales. The processing module 802 is further configured to fuse n first feature maps among the multiple first feature maps, where, during the i-th fusion, a vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong is acquired, and the vector representation is fused with the first feature map to be fused at the i-th time to obtain the i-th fused feature map, with 2 ≤ i ≤ n-1. The processing module 802 is further configured to obtain a second image based on the feature map obtained by the last fusion, where the second image is used to represent the categories to which the pixels contained in the first image belong.
In some embodiments, when acquiring the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong, the processing module 802 is specifically configured to: acquire the masks of the categories to which the pixels contained in a target feature map belong, so as to obtain a category mask set, where the target feature map is the feature map obtained by the (i-1)-th fusion; and obtain, based on the weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels contained in the target feature map belong.
In some embodiments, when obtaining, based on the weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels contained in the target feature map belong, the processing module 802 is specifically configured to: acquire the transposed matrix of the features of the pixels contained in the target feature map; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
In some embodiments, when fusing the vector representation with the first feature map to be fused at the i-th time to obtain the i-th fused feature map, the processing module 802 is specifically configured to: obtain a key vector based on the vector representation and a first weight matrix, and obtain a value vector based on the vector representation and a second weight matrix; obtain a query vector based on the first feature map to be fused at the i-th time and a third weight matrix; perform attention calculation based on the key vector, the value vector, and the query vector to obtain a second feature map; and fuse the second feature map with the first feature map to be fused at the i-th time to obtain the i-th fused feature map.
In some embodiments, when obtaining the second image based on the feature map obtained by the last fusion, the processing module 802 is specifically configured to: upsample k third feature maps to obtain k fourth feature maps, where the k third feature maps are the feature maps obtained by each fusion other than the last fusion and at least one of the low-scale ones of the two feature maps of the first fusion, and the scale of each fourth feature map is the same as the scale of the feature map obtained by the last fusion; concatenate the feature map obtained by the last fusion and the k fourth feature maps in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.
In some embodiments, when upsampling the k third feature maps, the processing module 802 is specifically configured to: for any one of the k third feature maps, expand the number of channels of that feature map based on a preset upsampling factor to obtain a sixth feature map; concatenate that feature map and the sixth feature map in the channel dimension; and perform pixel rearrangement on the concatenated feature map to obtain the fourth feature map corresponding to that feature map.
It should be understood that the above apparatus is configured to perform the methods in the foregoing embodiments. The implementation principles and technical effects of the corresponding program modules in the apparatus are similar to those described in the foregoing methods; for the working process of the apparatus, reference may be made to the corresponding processes in the foregoing methods, which are not repeated here.
Based on the methods in the foregoing embodiments, an embodiment of the present application further provides an image processing apparatus.
By way of example, FIG. 9 shows a schematic structural diagram of an image processing apparatus. As shown in FIG. 9, the image processing apparatus 900 may include a communication module 901 and a processing module 902. The communication module 901 is configured to acquire a first image. The processing module 902 is configured to perform feature extraction on the first image to obtain multiple first feature maps of different scales. The processing module 902 is further configured to acquire a vector representation of the categories to which the pixels contained in the first feature map of a first scale belong. The processing module 902 is further configured to fuse the vector representation with the first feature map of a second scale to obtain a second feature map, where the second scale is larger than the first scale. The processing module 902 is further configured to obtain a second image based on the second feature map, where the second image is used to represent the categories to which the pixels contained in the first image belong.
In some embodiments, when acquiring the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, the processing module 902 is specifically configured to: acquire the masks of the categories to which the pixels contained in the first feature map of the first scale belong, so as to obtain a category mask set; and obtain, based on the weight of each mask in the category mask set and the first feature map of the first scale, the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.
In some embodiments, when obtaining, based on the weight of each mask in the category mask set and the first feature map of the first scale, the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, the processing module 902 is specifically configured to: acquire the transposed matrix of the features of the pixels contained in the first feature map of the first scale; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.
In some embodiments, when fusing the vector representation with the first feature map of the second scale to obtain the second feature map, the processing module 902 is specifically configured to: obtain a key vector based on the vector representation and a first weight matrix, and obtain a value vector based on the vector representation and a second weight matrix; obtain a query vector based on the first feature map of the second scale and a third weight matrix; perform attention calculation based on the key vector, the value vector, and the query vector to obtain a third feature map; and fuse the third feature map with the first feature map of the second scale to obtain the second feature map.
In some embodiments, when obtaining the second image based on the second feature map, the processing module 902 is specifically configured to: upsample the first feature map of the first scale to obtain a fourth feature map, where the scale of the fourth feature map is the same as the scale of the second feature map; concatenate the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.
In some embodiments, when upsampling the first feature map of the first scale, the processing module 902 is specifically configured to: expand the number of channels of the first feature map of the first scale based on a preset upsampling factor to obtain a sixth feature map; concatenate the first feature map of the first scale and the sixth feature map in the channel dimension; and perform pixel rearrangement on the concatenated feature map to obtain the fourth feature map.
It should be understood that the above apparatus is configured to perform the methods in the foregoing embodiments. The implementation principles and technical effects of the corresponding program modules in the apparatus are similar to those described in the foregoing methods; for the working process of the apparatus, reference may be made to the corresponding processes in the foregoing methods, which are not repeated here.
Based on the methods in the foregoing embodiments, an embodiment of the present application further provides a computing device 1000.
By way of example, FIG. 10 shows a schematic structural diagram of a computing device. As shown in FIG. 10, the computing device 1000 includes: a bus 1002, a processor 1004, a memory 1006, and a communication interface 1008. The processor 1004, the memory 1006, and the communication interface 1008 communicate with one another through the bus 1002. The computing device 1000 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors or memories in the computing device 1000.
The bus 1002 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be classified into an address bus, a data bus, a control bus, and so on. For ease of representation, only one line is used in FIG. 10, but this does not mean that there is only one bus or one type of bus. The bus 1002 may include a path for transferring information between the components of the computing device 1000 (for example, the memory 1006, the processor 1004, and the communication interface 1008).
The processor 1004 may include any one or more of processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 1006 may include a volatile memory, for example a random access memory (RAM). The memory 1006 may also include a non-volatile memory, for example a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 1006 stores executable program code, and the processor 1004 executes the executable program code to implement the functions of the communication module 801 and the processing module 802 in FIG. 8, or the functions of the communication module 901 and the processing module 902 in FIG. 9, thereby implementing all or some of the steps of the methods in the foregoing embodiments. That is, the memory 1006 stores instructions for performing all or some of the steps of the methods in the foregoing embodiments.
Alternatively, the memory 1006 stores executable code, and the processor 1004 executes the executable code to implement the functions of the image processing apparatus 800 or 900, thereby implementing all or some of the steps of the methods in the foregoing embodiments. That is, the memory 1006 stores instructions for performing all or some of the steps of the methods in the foregoing embodiments.
The communication interface 1008 uses a transceiver module, such as but not limited to a network interface card or a transceiver, to implement communication between the computing device 1000 and other devices or a communication network.
Based on the methods in the foregoing embodiments, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when run on a processor, causes the processor to perform the methods in the foregoing embodiments.
Based on the methods in the foregoing embodiments, an embodiment of the present application provides a computer program product which, when run on a processor, causes the processor to perform the methods in the foregoing embodiments.
It can be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
The method steps in the embodiments of this application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and a software module may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may alternatively be a component of the processor. The processor and the storage medium may be located in an ASIC.
In the foregoing embodiments, the implementation may be, in whole or in part, by software, hardware, firmware, or any combination thereof. When software is used, the implementation may be, in whole or in part, in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (SSD)).
It can be understood that the various numerals in the embodiments of this application are merely used for ease of description and are not intended to limit the scope of the embodiments of this application.

Claims (17)

1. An image processing method, characterized in that the method comprises:
    obtaining a first image;
    performing feature extraction on the first image to obtain a plurality of first feature maps of different scales;
    fusing n first feature maps among the plurality of first feature maps, wherein, in an i-th fusion, a vector representation of categories to which pixels comprised in a feature map obtained by an (i-1)-th fusion belong is obtained, and the vector representation and a first feature map to be fused in the i-th fusion are fused to obtain an i-th fused feature map, where 2 ≤ i ≤ n-1; and
    obtaining a second image based on a feature map obtained by a last fusion, wherein the second image is used to represent the categories to which the pixels comprised in the first image belong.
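To make the loop in claim 1 concrete, here is a minimal Python sketch of the coarse-to-fine fusion, assuming the first feature maps are ordered from smallest to largest scale and that the first fusion combines the two smallest maps; class_vectors() and fuse() are hypothetical stand-ins for the operations recited in claims 2 to 4.

    def fuse_pyramid(first_fms, class_vectors, fuse):
        # first_fms: n first feature maps, ordered from the smallest scale
        # (deepest) to the largest. class_vectors() and fuse() stand in for
        # the mask-based vector extraction (claims 2-3) and the attention
        # fusion (claim 4); both interfaces are assumptions, not fixed APIs.
        fused = fuse(class_vectors(first_fms[0]), first_fms[1])  # 1st fusion
        for i in range(2, len(first_fms)):
            vecs = class_vectors(fused)        # from the (i-1)-th fused map
            fused = fuse(vecs, first_fms[i])   # i-th fusion
        return fused  # map from the last fusion, from which the second image is built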
2. The method according to claim 1, characterized in that obtaining the vector representation of the categories to which the pixels comprised in the feature map obtained by the (i-1)-th fusion belong specifically comprises:
    obtaining masks of categories to which pixels comprised in a target feature map belong, to obtain a category mask set, wherein the target feature map is the feature map obtained by the (i-1)-th fusion; and
    obtaining, based on a weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels comprised in the target feature map belong.
3. The method according to claim 2, characterized in that obtaining, based on the weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels comprised in the target feature map belong specifically comprises:
    obtaining a transposed matrix of features of the pixels comprised in the target feature map; and
    multiplying the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels comprised in the target feature map belong.
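Claims 2 and 3 together reduce the vector extraction to a single matrix product: per-category mask weights multiplied by the transposed pixel-feature matrix. A PyTorch-style sketch follows; the 1x1-convolution mask head and the softmax normalization of the mask weights are assumptions, since the claims fix only the multiplication itself.

    import torch

    def category_vectors(target_fm, mask_head):
        # target_fm: (C, H, W), the feature map from the (i-1)-th fusion.
        # mask_head: assumed to be e.g. a 1x1 conv emitting L category masks.
        C, H, W = target_fm.shape
        masks = mask_head(target_fm.unsqueeze(0)).squeeze(0)  # (L, H, W)
        weights = masks.flatten(1).softmax(dim=-1)            # mask weights, (L, H*W)
        feats_t = target_fm.flatten(1).transpose(0, 1)        # transposed features, (H*W, C)
        return weights @ feats_t                              # (L, C): one vector per category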
4. The method according to any one of claims 1 to 3, characterized in that fusing the vector representation and the first feature map to be fused in the i-th fusion to obtain the i-th fused feature map specifically comprises:
    obtaining a key vector based on the vector representation and a first weight matrix, and obtaining a value vector based on the vector representation and a second weight matrix;
    obtaining a query vector based on the first feature map to be fused in the i-th fusion and a third weight matrix;
    performing attention computation based on the key vector, the value vector and the query vector to obtain a second feature map; and
    fusing the second feature map and the first feature map to be fused in the i-th fusion to obtain the i-th fused feature map.
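Claim 4 describes a cross-attention step in which the category vectors supply the keys and values while the feature map to be fused supplies the queries. A sketch under two assumptions the claim does not fix: square weight matrices, so that the residual addition at the end is shape-compatible, and scaling by the square root of the channel count.

    import torch
    import torch.nn.functional as F

    def cross_attention_fuse(class_vecs, fm_i, Wk, Wv, Wq):
        # class_vecs: (L, C) category vectors from the (i-1)-th fusion.
        # fm_i: (C, H, W), the first feature map fused in the i-th fusion.
        # Wk, Wv, Wq: the first, second and third weight matrices, (C, C).
        C, H, W = fm_i.shape
        k = class_vecs @ Wk                           # key vectors,   (L, C)
        v = class_vecs @ Wv                           # value vectors, (L, C)
        q = fm_i.flatten(1).transpose(0, 1) @ Wq      # query vectors, (H*W, C)
        attn = F.softmax(q @ k.T / C ** 0.5, dim=-1)  # attention over L categories
        second_fm = (attn @ v).T.reshape(C, H, W)     # second feature map
        return second_fm + fm_i                       # i-th fused map (residual add assumed)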
5. The method according to any one of claims 1 to 4, characterized in that obtaining the second image based on the feature map obtained by the last fusion specifically comprises:
    upsampling each of k third feature maps to obtain k fourth feature maps, wherein the k third feature maps are at least one of: the feature maps obtained by each fusion other than the last fusion, and the lower-scale feature map of the two feature maps in the first fusion, and a scale of each fourth feature map is the same as a scale of the feature map obtained by the last fusion;
    concatenating the feature map obtained by the last fusion and the k fourth feature maps in a channel dimension to obtain a fifth feature map; and
    obtaining the second image based on the fifth feature map.
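Claim 5 aggregates the intermediate scales into a single prediction. In the sketch below, the upsample callable is hypothetical (one possible form appears after claim 6), and the 1x1-convolution head plus argmax decoding into a per-pixel category map are assumptions about how the second image might be obtained.

    import torch

    def decode(last_fused, third_fms, upsample, head):
        # last_fused: (C0, H, W), the feature map obtained by the last fusion.
        # third_fms: the k third feature maps named in claim 5.
        # upsample: hypothetical callable raising a map to size (H, W).
        # head: assumed to be a 1x1 convolution emitting category logits.
        target = last_fused.shape[1:]                        # (H, W)
        fourth_fms = [upsample(fm, target) for fm in third_fms]
        fifth_fm = torch.cat([last_fused] + fourth_fms, 0)   # channel-dim concat
        logits = head(fifth_fm.unsqueeze(0))                 # (1, num_categories, H, W)
        return logits.argmax(dim=1).squeeze(0)               # second image: category per pixel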
6. The method according to claim 5, characterized in that upsampling each of the k third feature maps specifically comprises:
    for any one of the k third feature maps, expanding a channel count of the feature map based on a preset upsampling multiple to obtain a sixth feature map; and
    concatenating the feature map and the sixth feature map in the channel dimension, and performing pixel rearrangement on the concatenated feature map to obtain a fourth feature map corresponding to the feature map.
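Claim 6 reads as a pixel-shuffle style of upsampling: expand channels, concatenate, then rearrange pixels into a higher-resolution map. In the PyTorch sketch below, the 1x1 convolution for the channel expansion and nn.PixelShuffle for the rearrangement are plausible choices, not ones the claim requires.

    import torch
    import torch.nn as nn

    def upsample_by_rearrange(fm, r):
        # fm: (C, H, W) third feature map; r: the preset upsampling multiple.
        C = fm.shape[0]
        expand = nn.Conv2d(C, C * (r * r - 1), kernel_size=1)  # channel expansion
        sixth_fm = expand(fm.unsqueeze(0))                     # (1, C*(r*r-1), H, W); in
        # practice the conv would be a trained module, not freshly initialized here
        cat = torch.cat([fm.unsqueeze(0), sixth_fm], dim=1)    # (1, C*r*r, H, W)
        fourth_fm = nn.PixelShuffle(r)(cat)                    # pixel rearrangement
        return fourth_fm.squeeze(0)                            # (C, r*H, r*W)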
7. An image processing method, characterized in that the method comprises:
    obtaining a first image;
    performing feature extraction on the first image to obtain a plurality of first feature maps of different scales;
    obtaining a vector representation of categories to which pixels comprised in the first feature map of a first scale belong;
    fusing the vector representation and the first feature map of a second scale to obtain a second feature map, wherein the second scale is larger than the first scale; and
    obtaining a second image based on the second feature map, wherein the second image is used to represent the categories to which the pixels comprised in the first image belong.
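Claim 7 is the two-scale special case and composes directly from the earlier sketches. A minimal outline, in which every callable is a hypothetical stand-in for the corresponding claimed step, with its weights assumed to be bound elsewhere:

    def two_scale_segment(image, extract, class_vectors, fuse, classify):
        # extract: backbone producing first feature maps, smallest scale first.
        # class_vectors, fuse: the operations sketched after claims 3 and 4.
        # classify: maps the second feature map to per-pixel categories.
        fms = extract(image)             # first feature maps of different scales
        vecs = class_vectors(fms[0])     # vector representation at the first scale
        second_fm = fuse(vecs, fms[-1])  # fused with the larger, second-scale map
        return classify(second_fm)       # the second image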
8. An image processing apparatus, characterized in that the apparatus comprises:
    a communication module, configured to obtain a first image; and
    a processing module, configured to perform feature extraction on the first image to obtain a plurality of first feature maps of different scales;
    wherein the processing module is further configured to fuse n first feature maps among the plurality of first feature maps, wherein, in an i-th fusion, a vector representation of categories to which pixels comprised in a feature map obtained by an (i-1)-th fusion belong is obtained, and the vector representation and a first feature map to be fused in the i-th fusion are fused to obtain an i-th fused feature map, where 2 ≤ i ≤ n-1; and
    the processing module is further configured to obtain a second image based on a feature map obtained by a last fusion, wherein the second image is used to represent the categories to which the pixels comprised in the first image belong.
9. The apparatus according to claim 8, characterized in that, when obtaining the vector representation of the categories to which the pixels comprised in the feature map obtained by the (i-1)-th fusion belong, the processing module is specifically configured to:
    obtain masks of categories to which pixels comprised in a target feature map belong, to obtain a category mask set, wherein the target feature map is the feature map obtained by the (i-1)-th fusion; and
    obtain, based on a weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels comprised in the target feature map belong.
10. The apparatus according to claim 9, characterized in that, when obtaining, based on the weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels comprised in the target feature map belong, the processing module is specifically configured to:
    obtain a transposed matrix of features of the pixels comprised in the target feature map; and
    multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels comprised in the target feature map belong.
11. The apparatus according to any one of claims 8 to 10, characterized in that, when fusing the vector representation and the first feature map to be fused in the i-th fusion to obtain the i-th fused feature map, the processing module is specifically configured to:
    obtain a key vector based on the vector representation and a first weight matrix, and obtain a value vector based on the vector representation and a second weight matrix;
    obtain a query vector based on the first feature map to be fused in the i-th fusion and a third weight matrix;
    perform attention computation based on the key vector, the value vector and the query vector to obtain a second feature map; and
    fuse the second feature map and the first feature map to be fused in the i-th fusion to obtain the i-th fused feature map.
12. The apparatus according to any one of claims 8 to 11, characterized in that, when obtaining the second image based on the feature map obtained by the last fusion, the processing module is specifically configured to:
    upsample each of k third feature maps to obtain k fourth feature maps, wherein the k third feature maps are at least one of: the feature maps obtained by each fusion other than the last fusion, and the lower-scale feature map of the two feature maps in the first fusion, and a scale of each fourth feature map is the same as a scale of the feature map obtained by the last fusion;
    concatenate the feature map obtained by the last fusion and the k fourth feature maps in a channel dimension to obtain a fifth feature map; and
    obtain the second image based on the fifth feature map.
13. The apparatus according to claim 12, characterized in that, when upsampling each of the k third feature maps, the processing module is specifically configured to:
    for any one of the k third feature maps, expand a channel count of the feature map based on a preset upsampling multiple to obtain a sixth feature map; and
    concatenate the feature map and the sixth feature map in the channel dimension, and perform pixel rearrangement on the concatenated feature map to obtain a fourth feature map corresponding to the feature map.
14. An image processing apparatus, characterized in that the apparatus comprises:
    a communication module, configured to obtain a first image; and
    a processing module, configured to perform feature extraction on the first image to obtain a plurality of first feature maps of different scales;
    wherein the processing module is further configured to obtain a vector representation of categories to which pixels comprised in the first feature map of a first scale belong;
    the processing module is further configured to fuse the vector representation and the first feature map of a second scale to obtain a second feature map, wherein the second scale is larger than the first scale; and
    the processing module is further configured to obtain a second image based on the second feature map, wherein the second image is used to represent the categories to which the pixels comprised in the first image belong.
15. A computing device, characterized by comprising:
    at least one memory, configured to store a program; and
    at least one processor, configured to execute the program stored in the memory;
    wherein, when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1 to 7.
16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when run on a processor, causes the processor to perform the method according to any one of claims 1 to 7.
17. A computer program product, characterized in that, when the computer program product runs on a processor, the processor is caused to perform the method according to any one of claims 1 to 7.
PCT/CN2023/141801 2022-12-26 2023-12-26 Image processing method and apparatus, and computing device WO2024140642A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211673092.0 2022-12-26
CN202211673092.0A CN116229143A (en) 2022-12-26 2022-12-26 Image processing method and device and computing equipment

Publications (1)

Publication Number Publication Date
WO2024140642A1

Family

ID=86583370

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/141801 WO2024140642A1 (en) 2022-12-26 2023-12-26 Image processing method and apparatus, and computing device

Country Status (2)

Country Link
CN (1) CN116229143A (en)
WO (1) WO2024140642A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229143A (en) * 2022-12-26 2023-06-06 华为技术有限公司 Image processing method and device and computing equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220327815A1 (en) * 2019-09-05 2022-10-13 Basf Se System and method for identification of plant species
CN111444365A (en) * 2020-03-27 2020-07-24 Oppo广东移动通信有限公司 Image classification method and device, electronic equipment and storage medium
CN112183507A (en) * 2020-11-30 2021-01-05 北京沃东天骏信息技术有限公司 Image segmentation method, device, equipment and storage medium
CN114820633A (en) * 2022-04-11 2022-07-29 北京三快在线科技有限公司 Semantic segmentation method, training device and training equipment of semantic segmentation model
CN115205535A (en) * 2022-06-14 2022-10-18 深圳市正浩创新科技股份有限公司 Image processing method, computer readable medium and electronic device
CN116229143A (en) * 2022-12-26 2023-06-06 华为技术有限公司 Image processing method and device and computing equipment

Also Published As

Publication number Publication date
CN116229143A (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111104962B (en) Semantic segmentation method and device for image, electronic equipment and readable storage medium
US11328172B2 (en) Method for fine-grained sketch-based scene image retrieval
US11176415B2 (en) Assisted image annotation
KR101865102B1 (en) Systems and methods for visual question answering
JP7559063B2 (en) FACE PARSING METHOD AND RELATED DEVICE
US12062158B2 (en) Image denoising method and apparatus
CN111340077B (en) Attention mechanism-based disparity map acquisition method and device
US20220108478A1 (en) Processing images using self-attention based neural networks
CN109886066A (en) Fast target detection method based on the fusion of multiple dimensioned and multilayer feature
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
WO2020098257A1 (en) Image classification method and device and computer readable storage medium
CN111783779B (en) Image processing method, apparatus and computer readable storage medium
US20220237896A1 (en) Method for training a model to be used for processing images by generating feature maps
WO2024140642A1 (en) Image processing method and apparatus, and computing device
TW201913557A (en) Image cutting method and device
CN113343981A (en) Visual feature enhanced character recognition method, device and equipment
CN113869371A (en) Model training method, clothing fine-grained segmentation method and related device
CN107274425A (en) A kind of color image segmentation method and device based on Pulse Coupled Neural Network
JP2023543964A (en) Image processing method, image processing device, electronic device, storage medium and computer program
CN114463335A (en) Weak supervision semantic segmentation method and device, electronic equipment and storage medium
EP4047547A1 (en) Method and system for removing scene text from images
CN113515920B (en) Method, electronic device and computer readable medium for extracting formulas from tables
CN113076755A (en) Keyword extraction method, device, equipment and storage medium
WO2024199101A1 (en) Lane line detection method and apparatus, and electronic device
US20240135684A1 (en) Systems and methods for annotating 3d data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23910567

Country of ref document: EP

Kind code of ref document: A1