
WO2024140642A1 - Image processing method and apparatus, and computing device - Google Patents

Image processing method and apparatus, and computing device

Info

Publication number
WO2024140642A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
feature
image
category
fusion
Prior art date
Application number
PCT/CN2023/141801
Other languages
French (fr)
Chinese (zh)
Inventor
唐泉
刘传建
韩凯
王云鹤
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024140642A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • CNN convolutional neural networks
  • VGG visual geometry group
  • ResNet residual neural network
  • the decoder is mainly used to perform pixel-level classification tasks based on the feature map extracted by the encoder to achieve semantic segmentation. It can enhance the expressiveness of the features and restore the resolution of the feature map, and finally produce a segmentation result consistent with the resolution of the input image.
  • encoders usually use existing, pre-trained image classification models, so an excellent decoder design can effectively improve the accuracy of semantic segmentation. Therefore, how to provide an excellent decoder is a technical problem that needs to be solved urgently.
  • obtaining a vector representation of the category to which the pixels contained in the feature map obtained by the (i-1)th fusion belong specifically includes: obtaining the mask of the category to which the pixels contained in the target feature map belong, to obtain a category mask set, the target feature map being the feature map obtained by the (i-1)th fusion; obtaining a vector representation of the category to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map.
  • the vector representation of the category to which the pixels contained in the feature map belong can be obtained, thereby facilitating the subsequent "point-class" fusion.
  • a second image is obtained based on a feature map obtained by the last fusion, specifically including: upsampling k third feature maps to obtain k fourth feature maps, the k third feature maps being feature maps obtained by each fusion except the feature map obtained by the last fusion, and at least one of the low-scale feature maps of the two feature maps fused for the first time, and the scale of each fourth feature map is the same as the scale of the feature map obtained by the last fusion; the feature map obtained by the last fusion and the k fourth feature maps are spliced in the channel dimension to obtain a fifth feature map; and the second image is obtained based on the fifth feature map.
  • upsampling is performed on the k third feature maps, specifically including: for any one of the k third feature maps, based on a preset upsampling multiple, expanding the number of channels of any one of the k third feature maps to obtain a sixth feature map; splicing any one of the feature maps and the sixth feature map in the channel dimension, and rearranging pixels of the spliced feature map to obtain a fourth feature map corresponding to any one of the feature maps.
  • obtaining a vector representation of a category to which pixels contained in a first feature map of a first scale belong specifically includes: obtaining a mask of the category to which the pixels contained in the first feature map of the first scale belong, to obtain a category mask set; and obtaining a vector representation of the category to which the pixels contained in the first feature map of the first scale belong based on a weight of each mask in the category mask set and the first feature map of the first scale.
  • when the processing module obtains the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)th fusion belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the target feature map belong, so as to obtain a category mask set, where the target feature map is the feature map obtained by the (i-1)th fusion; and obtain the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map.
  • when the processing module obtains the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map, it is specifically used to: obtain the transposed matrix of the features of the pixels contained in the target feature map; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
  • the present application provides an image processing device, comprising: a communication module for acquiring a first image; a processing module for performing feature extraction on the first image to obtain a plurality of first feature maps of different scales; the processing module is also used to acquire a vector representation of a category to which pixels contained in the first feature map of the first scale belong; the processing module is also used to fuse the vector representation and the first feature map of a second scale to obtain a second feature map, wherein the second scale is larger than the first scale; the processing module is also used to obtain a second image based on the second feature map, wherein the second image is used to characterize the category to which pixels contained in the first image belong.
  • when the processing module obtains the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the first feature map of the first scale belong, so as to obtain a category mask set; and obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale.
  • when the processing module fuses the vector representation and the first feature map of the second scale to obtain the second feature map, it is specifically used to: obtain a key vector based on the vector representation and the first weight matrix, and obtain a value vector based on the vector representation and the second weight matrix; obtain a query vector based on the first feature map of the second scale and the third weight matrix; perform attention calculation based on the key vector, the value vector and the query vector to obtain a third feature map; and fuse the third feature map with the first feature map of the second scale to obtain the second feature map.
  • when the processing module obtains the second image based on the second feature map, it is specifically used to: upsample the first feature map of the first scale to obtain a fourth feature map, the scale of the fourth feature map being the same as the scale of the second feature map; splice the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.
  • the present application provides a computing device comprising: at least one memory for storing programs; and at least one processor for executing the programs stored in the memory; wherein, when the program stored in the memory is executed, the processor is used to execute the method described in the first aspect or the second aspect.
  • the present application provides a computer-readable storage medium, which stores a computer program.
  • the computer program runs on a processor, the processor executes the method described in the first aspect or the second aspect.
  • FIG1 is a schematic diagram of the structure of an image semantic segmentation neural network model provided in an embodiment of the present application.
  • FIG2 is a schematic diagram of the structure of the category feature calculation module shown in FIG1;
  • FIG3 is a schematic diagram of the structure of the category feature fusion module shown in FIG1;
  • FIG4 is a schematic diagram of the structure of another image semantic segmentation neural network model provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of another image semantic segmentation neural network model provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of a flow chart of an image processing method provided in an embodiment of the present application.
  • the category feature calculation module 121 is mainly used to calculate the vector representation of a preset number of semantic segmentation categories based on the low-scale feature map, or calculate the vector representation of the category of each target contained in the original image.
  • the category feature calculation module 121 may include a normalization layer 1211, a linear mapping layer 1212, a linear mapping layer 1213 and a category feature calculation layer 1214.
  • the linear mapping layer 1221 is mainly used to calculate the key value K (i.e., the key sequence, also called the "key vector") required by MHA 1225 based on the vector representation of each category (i.e., the category features): the key value K is obtained by multiplying the category features calculated by the category feature calculation module 121 with the mapping matrix W K, which can be preset.
  • the linear mapping layer 1221 can be implemented by, but not limited to, a fully connected layer or a convolutional layer.
  • the attention calculation can be written as Attention(Q, K, V) = softmax(QK^T / d)·V, where Q is the query value, K is the key value, V is the value, and d is the regularization factor.
  • Q represents a high-resolution matrix
  • K and V both represent low-resolution matrices
  • after computing QK^T, a high-resolution matrix is obtained; multiplying that matrix with V again yields a high-resolution matrix, thereby improving the resolution of the resulting feature map.
  • if K and V were instead calculated from the large-scale feature map and Q from the category features, Q would represent a low-resolution matrix while K and V would both represent high-resolution matrices.
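As an illustration only, a minimal PyTorch-style sketch of this attention calculation follows; the batched tensor layout and taking the regularization factor d as the square root of the channel dimension are assumptions, not fixed by the text.

```python
import torch

def attention(q, k, v):
    # q: [B, Nq, C] queries; k, v: [B, Nk, C] keys and values
    d = q.shape[-1] ** 0.5                    # regularization factor (assumed sqrt(C))
    scores = q @ k.transpose(1, 2) / d        # [B, Nq, Nk]
    return torch.softmax(scores, dim=-1) @ v  # [B, Nq, C]
```

Note that the output has as many rows as Q, which is why computing Q from the high-resolution feature map yields a high-resolution result.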
  • for any multi-scale feature fusion module in the decoder other than the first one, the small-scale feature map that it needs to fuse is the feature map output by the multi-scale feature fusion module at the previous level.
  • the small-scale feature map that the multi-scale feature fusion module (n-2) needs to fuse is the feature map output by the multi-scale feature fusion module (n-3).
  • FIG6 shows a flow chart of an image processing method. It should be understood that the method can be executed by any device, equipment, platform, or device cluster with computing and processing capabilities. For example, it can be executed by a mobile phone, a computer, a vehicle terminal, a cloud server, etc. As shown in FIG6, the image processing method may include the following steps:
  • the k third feature maps, namely the feature maps obtained by each fusion except the last one, together with the low-scale feature map of the two feature maps fused the first time, can be upsampled to obtain k fourth feature maps.
  • the scale of each fourth feature map is the same as the scale of the feature map obtained by the last fusion; then, the feature map obtained by the last fusion and the k fourth feature maps can be spliced in the channel dimension to obtain the fifth feature map.
  • the second image is obtained by processing the fifth feature map, such as classifying it through a classifier, etc. This improves the effect of semantic segmentation.
  • the number of channels of any one of the k feature maps can be expanded based on a preset upsampling multiple to obtain a sixth feature map. Then, that feature map and the obtained sixth feature map are spliced in the channel dimension. Finally, the pixels of the spliced feature map are rearranged to obtain the fourth feature map corresponding to that feature map. See the above description of upsampling for details.
  • S702 Perform feature extraction on the first image to obtain a plurality of first feature maps of different scales.
  • S704 Fuse the vector representation and the first feature map of the second scale to obtain a second feature map, where the second scale is larger than the first scale.
  • the features of the categories to which the pixels contained in the feature map belong (i.e., the vector representation) are first obtained from the small-scale feature map, and the features of the category to which each pixel belongs are then fused with the large-scale feature map, thereby realizing the fusion of "points and classes" (i.e., the fusion of pixels and the categories to which the pixels belong), avoiding the influence of noise points or abnormal point features in the feature map, and improving the accuracy of semantic segmentation of the image.
  • the methods shown in FIG. 6 or FIG. 7 can be applied to, but are not limited to, tasks such as autonomous driving, image editing, satellite remote sensing, medical diagnosis, augmented/virtual reality, and the like.
  • an embodiment of the present application provides an image processing device.
  • FIG8 shows a schematic diagram of the structure of an image processing device.
  • the image processing device 800 may include: a communication module 801 and a processing module 802.
  • the communication module 801 is used to obtain a first image;
  • the processing module 802 is used to extract features from the first image to obtain a plurality of first feature maps of different scales;
  • the processing module 802 is also used to fuse n first feature maps among the plurality of first feature maps, wherein, at the i-th fusion, a vector representation of the category to which the pixels contained in the feature map obtained by the (i-1)th fusion belong is obtained, and the vector representation and the first feature map required to be fused for the i-th time are fused to obtain the i-th fused feature map, 2≤i≤n-1;
  • the processing module 802 is also used to obtain a second image based on the feature map obtained by the last fusion, and the second image is used to characterize the category to which the pixels contained in the first image belong.
  • when the processing module 802 obtains the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)th fusion belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the target feature map belong, so as to obtain a category mask set, where the target feature map is the feature map obtained by the (i-1)th fusion; and, based on the weight of each mask in the category mask set and the target feature map, obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
  • when the processing module 802 obtains a vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map, it is specifically used to: obtain the transposed matrix of the features of the pixels contained in the target feature map; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
  • when the processing module 802 performs upsampling on the k third feature maps, it is specifically used to: for any one of the k third feature maps, expand the number of channels of that feature map based on a preset upsampling multiple to obtain a sixth feature map; splice that feature map and the sixth feature map in the channel dimension; and rearrange the pixels of the spliced feature map to obtain the fourth feature map corresponding to that feature map.
  • the above-mentioned device is used to execute the method in the above-mentioned embodiment.
  • the implementation principle and technical effect of the corresponding program module in the device are similar to those described in the above-mentioned method.
  • the working process of the device can refer to the corresponding process in the above-mentioned method, which will not be repeated here.
  • when the processing module 902 obtains the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the first feature map of the first scale belong, so as to obtain a category mask set; and, based on the weight of each mask in the category mask set and the first feature map of the first scale, obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.
  • when the processing module 902 fuses the vector representation and the first feature map of the second scale to obtain a second feature map, it is specifically used to: obtain a key vector based on the vector representation and the first weight matrix, and obtain a value vector based on the vector representation and the second weight matrix; obtain a query vector based on the first feature map of the second scale and the third weight matrix; perform attention calculation based on the key vector, the value vector and the query vector to obtain a third feature map; and fuse the third feature map and the first feature map of the second scale to obtain the second feature map.
  • when the processing module 902 obtains the second image based on the second feature map, it is specifically used to: upsample the first feature map of the first scale to obtain a fourth feature map, where the scale of the fourth feature map is the same as that of the second feature map; splice the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.
  • the above-mentioned device is used to execute the method in the above-mentioned embodiment.
  • the implementation principle and technical effect of the corresponding program module in the device are similar to those described in the above-mentioned method.
  • the working process of the device can refer to the corresponding process in the above-mentioned method, which will not be repeated here.
  • the embodiment of the present application also provides a computing device 1000.
  • FIG10 shows a schematic diagram of the structure of a computing device.
  • computing device 1000 includes: bus 1002, processor 1004, memory 1006 and communication interface 1008.
  • Processor 1004, memory 1006 and communication interface 1008 communicate through bus 1002.
  • Computing device 1000 can be a server or a terminal device. It should be understood that the present application does not limit the number of processors and memories in computing device 1000.
  • the bus 1002 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus may be divided into an address bus, a data bus, a control bus, etc.
  • the bus in FIG. 10 is represented by only one line, but this does not mean that there is only one bus or only one type of bus.
  • the bus 1002 may include a path for transmitting information between various components of the computing device 1000 (e.g., the memory 1006, the processor 1004, and the communication interface 1008).
  • the memory 1006 may include a volatile memory, such as a random access memory (RAM).
  • the memory 1006 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • ROM read-only memory
  • HDD hard disk drive
  • SSD solid state drive
  • the memory 1006 stores executable program codes, and the processor 1004 executes the executable program codes to respectively implement the functions of the communication module 801 and the processing module 802 in FIG8 , or to implement the functions of the communication module 901 and the processing module 902 in FIG9 , thereby implementing all or part of the steps of the method in the above embodiment. That is, the memory 1006 stores instructions for executing all or part of the steps in the method in the above embodiment.
  • the memory 1006 stores executable codes
  • the processor 1004 executes the executable codes to respectively implement the functions of the aforementioned image processing apparatus 800 or 900, thereby implementing all or part of the steps in the above-mentioned embodiment method. That is, the memory 1006 stores instructions for executing all or part of the steps in the above-mentioned embodiment method.
  • an embodiment of the present application provides a computer-readable storage medium, which stores a computer program.
  • the computer program runs on a processor, the processor executes the method in the above embodiment.
  • the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media.
  • the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state drive (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

An image processing method, comprising: acquiring a first image; performing feature extraction on the first image; fusing n extracted first feature maps, wherein during the i-th fusion, a vector representation of the category to which pixels contained in the feature map obtained in the (i-1)-th fusion belong is obtained, and the vector representation is fused with the first feature map to be fused in the i-th fusion to obtain the feature map of the i-th fusion; and obtaining a second image on the basis of the feature map obtained in the last fusion, the second image being used for representing the category to which pixels contained in the first image belong. Therefore, when feature maps are fused, a vector representation of the category to which pixels contained in a small-scale feature map belong is first obtained from the small-scale feature map, and the obtained vector representation is then fused with a large-scale feature map, realizing fusion of pixels and the categories to which the pixels belong, avoiding the effect of noise points or abnormal point features in the feature maps, and improving the accuracy of semantic segmentation on images.

Description

Image processing method, device and computing device

This application claims priority to the Chinese patent application filed with the State Intellectual Property Office of China on December 26, 2022, with application number 202211673092.0 and application name "An image processing method, device and computing device", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of artificial intelligence (AI) technology, and in particular to an image processing method, apparatus and computing device.

Background

Semantic segmentation is one of the basic tasks in computer vision; it distinguishes objects of different categories in an image at the pixel level. Semantic segmentation is usually regarded as a pixel classification task, that is, a predefined semantic category needs to be predicted for each pixel of the image. At present, neural network models used for semantic segmentation often follow the "encoder-decoder" design paradigm. In this paradigm, the encoder is mainly used for feature representation learning, to learn feature maps of different scales of the image input to the encoder. It generally uses convolutional neural networks (CNN), such as the visual geometry group (VGG) network or the residual neural network (ResNet), or uses a visual attention model (vision transformer), a multi-layer perceptron model (multi-layer perception), and the like to learn the features of the image. The decoder is mainly used to perform the pixel-level classification task on the basis of the feature maps extracted by the encoder, so as to realize semantic segmentation; it can enhance the expressiveness of the features while restoring the resolution of the feature maps, finally producing a segmentation result whose resolution is consistent with that of the input image. In wide practice, encoders usually use existing, pre-trained image classification models, so an excellent decoder design can effectively improve the accuracy of semantic segmentation. Therefore, how to provide an excellent decoder is a technical problem that urgently needs to be solved.

Summary of the Invention

The present application provides an image processing method, apparatus, computing device, computer storage medium and computer program product, which can effectively improve the accuracy of semantic segmentation.
In a first aspect, the present application provides an image processing method, the method comprising: acquiring a first image; performing feature extraction on the first image to obtain a plurality of first feature maps of different scales; fusing n first feature maps among the plurality of first feature maps, wherein, in the i-th fusion, a vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong is obtained, and the vector representation and the first feature map to be fused for the i-th time are fused to obtain the i-th fused feature map, 2≤i≤n-1; and obtaining a second image based on the feature map obtained by the last fusion, the second image being used to characterize the categories to which the pixels contained in the first image belong.

In this way, during image processing, each time feature maps are fused, the features of the categories to which the pixels contained in the small-scale feature map belong (i.e., the vector representation) are first obtained from that feature map, and these category features are then fused with the large-scale feature map. This realizes the fusion of "points and classes" (i.e., the fusion of pixels and the categories to which they belong), avoids the influence of noise points or abnormal point features in the feature map, and improves the accuracy of semantic segmentation of the image.
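As an illustration only (not part of the claimed method), the fusion schedule above can be sketched as a loop in which the output of the (i-1)-th fusion drives the i-th fusion. The following minimal Python sketch assumes a caller-supplied fuse_step function standing in for the class-vector computation and fusion detailed below; the names are hypothetical.

```python
def decode(first_feature_maps, fuse_step):
    """first_feature_maps: first feature maps ordered from smallest to largest scale.
    fuse_step(prev_fused, next_map): obtains the vector representation of the
    categories of the pixels in prev_fused and fuses it with next_map."""
    fused = first_feature_maps[0]           # smallest-scale first feature map
    results = [fused]
    for next_map in first_feature_maps[1:]:
        # the i-th fusion consumes the feature map obtained by the (i-1)-th fusion
        fused = fuse_step(fused, next_map)
        results.append(fused)
    return results  # results[-1] is the feature map obtained by the last fusion
```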
In a possible implementation, obtaining the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong specifically includes: obtaining the masks of the categories to which the pixels contained in the target feature map belong, to obtain a category mask set, the target feature map being the feature map obtained by the (i-1)-th fusion; and obtaining the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map. Thus, by computing category masks, the vector representation of the categories to which the pixels contained in the feature map belong can be obtained, which facilitates the subsequent "point-class" fusion.

In a possible implementation, obtaining the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map specifically includes: obtaining the transposed matrix of the features of the pixels contained in the target feature map; and multiplying the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
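As an illustration only, a minimal PyTorch-style sketch of this mask-weighted computation follows. The softmax normalization of the mask weights and the tensor names are assumptions; the text only fixes that the class vectors come from multiplying the mask weights by the transposed pixel-feature matrix.

```python
import torch

def class_vectors(feat, mask_logits):
    # feat: [B, C, H, W] target feature map; mask_logits: [B, K, H, W], one mask per category
    B, C, H, W = feat.shape
    K = mask_logits.shape[1]
    # weight of each mask: normalized over all pixels (softmax is an assumed choice)
    weights = torch.softmax(mask_logits.reshape(B, K, H * W), dim=-1)  # [B, K, HW]
    pixels = feat.reshape(B, C, H * W)                                 # [B, C, HW]
    # multiply the mask weights by the transposed pixel-feature matrix
    return torch.bmm(weights, pixels.transpose(1, 2))                  # [B, K, C]
```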
In a possible implementation, fusing the vector representation and the first feature map to be fused for the i-th time to obtain the i-th fused feature map specifically includes: obtaining a key vector based on the vector representation and the first weight matrix, and obtaining a value vector based on the vector representation and the second weight matrix; obtaining a query vector based on the first feature map to be fused for the i-th time and the third weight matrix; performing attention calculation based on the key vector, the value vector and the query vector to obtain a second feature map; and fusing the second feature map with the first feature map to be fused for the i-th time to obtain the i-th fused feature map.
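As an illustration only, the three weight matrices and the attention calculation can be sketched in PyTorch as below; treating the final fusion of the two maps as a residual addition is an assumption, since the text does not fix that operation.

```python
import torch
import torch.nn as nn

class PointClassFusion(nn.Module):
    """Fuse a large-scale feature map (queries) with per-category
    vectors (keys/values) via scaled dot-product attention."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)   # third weight matrix (queries)
        self.w_k = nn.Linear(dim, dim)   # first weight matrix (keys)
        self.w_v = nn.Linear(dim, dim)   # second weight matrix (values)

    def forward(self, feat, class_vec):
        # feat: [B, C, H, W] large-scale map; class_vec: [B, K, C] category vectors
        B, C, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)                         # [B, HW, C]
        q = self.w_q(tokens)                                             # [B, HW, C]
        k = self.w_k(class_vec)                                          # [B, K, C]
        v = self.w_v(class_vec)                                          # [B, K, C]
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)   # [B, HW, K]
        fused = tokens + attn @ v                                        # residual fusion (assumed)
        return fused.transpose(1, 2).reshape(B, C, H, W)
```

Because the keys and values come from only K category vectors, the attention matrix has shape [HW, K] rather than [HW, HW], keeping the cost linear in the number of pixels.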
In a possible implementation, obtaining the second image based on the feature map obtained by the last fusion specifically includes: upsampling k third feature maps to obtain k fourth feature maps, the k third feature maps being the feature maps obtained by each fusion except the last, and at least one of the low-scale feature maps of the two feature maps fused the first time, the scale of each fourth feature map being the same as the scale of the feature map obtained by the last fusion; splicing the feature map obtained by the last fusion and the k fourth feature maps in the channel dimension to obtain a fifth feature map; and obtaining the second image based on the fifth feature map.
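As an illustration only, a sketch of this splice-and-classify step follows. Bilinear interpolation stands in for the upsampling here (the channel-expansion variant of the next paragraph is sketched separately), and the 1x1 convolution classifier is an assumption.

```python
import torch
import torch.nn.functional as F

def segmentation_head(last_feat, earlier_feats, classifier):
    # last_feat: [B, C, H, W], the feature map obtained by the last fusion
    # earlier_feats: the k third feature maps at smaller scales
    ups = [F.interpolate(f, size=last_feat.shape[-2:], mode="bilinear",
                         align_corners=False) for f in earlier_feats]  # k fourth feature maps
    fifth = torch.cat([last_feat] + ups, dim=1)   # splice in the channel dimension
    return classifier(fifth)                      # e.g. nn.Conv2d(fifth_channels, num_classes, 1)
```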
In a possible implementation, upsampling the k third feature maps specifically includes: for any one of the k third feature maps, expanding the number of channels of that feature map based on a preset upsampling multiple to obtain a sixth feature map; splicing that feature map and the sixth feature map in the channel dimension; and rearranging the pixels of the spliced feature map to obtain the fourth feature map corresponding to that feature map.
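As an illustration only, this channel-expansion plus pixel-rearrangement step matches the shape contract of PixelShuffle; using a 1x1 convolution for the channel expansion is an assumption.

```python
import torch
import torch.nn as nn

class ShuffleUpsample(nn.Module):
    """Upsample by channel expansion followed by pixel rearrangement."""
    def __init__(self, channels, scale):
        super().__init__()
        # Expand channels so that, after splicing with the input,
        # the total equals channels * scale**2 as PixelShuffle requires.
        self.expand = nn.Conv2d(channels, channels * (scale ** 2 - 1), 1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):                      # x: [B, C, H, W]
        expanded = self.expand(x)              # the "sixth feature map"
        spliced = torch.cat([x, expanded], 1)  # [B, C*scale^2, H, W]
        return self.shuffle(spliced)           # [B, C, H*scale, W*scale]
```

PixelShuffle performs exactly the pixel rearrangement described: it moves blocks of channels into spatial positions, so no pixel values are interpolated away.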
In a second aspect, the present application provides an image processing method, the method comprising: acquiring a first image; performing feature extraction on the first image to obtain a plurality of first feature maps of different scales; acquiring a vector representation of the categories to which the pixels contained in the first feature map of a first scale belong; fusing the vector representation with the first feature map of a second scale to obtain a second feature map, the second scale being larger than the first scale; and obtaining a second image based on the second feature map, the second image being used to characterize the categories to which the pixels contained in the first image belong.

In this way, during image processing, when feature maps are fused, the features of the categories to which the pixels contained in the small-scale feature map belong (i.e., the vector representation) are first obtained from that feature map, and these category features are then fused with the large-scale feature map. This realizes the fusion of "points and classes" (i.e., the fusion of pixels and the categories to which they belong), avoids the influence of noise points or abnormal point features in the feature map, and improves the accuracy of semantic segmentation of the image.

In a possible implementation, acquiring the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong specifically includes: obtaining the masks of the categories to which the pixels contained in the first feature map of the first scale belong, to obtain a category mask set; and obtaining the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale.

In a possible implementation, obtaining the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale specifically includes: obtaining the transposed matrix of the features of the pixels contained in the first feature map of the first scale; and multiplying the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.

In a possible implementation, fusing the vector representation with the first feature map of the second scale to obtain the second feature map specifically includes: obtaining a key vector based on the vector representation and the first weight matrix, and obtaining a value vector based on the vector representation and the second weight matrix; obtaining a query vector based on the first feature map of the second scale and the third weight matrix; performing attention calculation based on the key vector, the value vector and the query vector to obtain a third feature map; and fusing the third feature map with the first feature map of the second scale to obtain the second feature map.

In a possible implementation, obtaining the second image based on the second feature map specifically includes: upsampling the first feature map of the first scale to obtain a fourth feature map, the scale of the fourth feature map being the same as the scale of the second feature map; splicing the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtaining the second image based on the fifth feature map.

In a possible implementation, upsampling the first feature map of the first scale specifically includes: expanding the number of channels of the first feature map of the first scale based on a preset upsampling multiple to obtain a sixth feature map; splicing the first feature map of the first scale and the sixth feature map in the channel dimension; and rearranging the pixels of the spliced feature map to obtain the fifth feature map.
In a third aspect, the present application provides an image processing device, the device comprising: a communication module, used to acquire a first image; and a processing module, used to perform feature extraction on the first image to obtain a plurality of first feature maps of different scales; the processing module is also used to fuse n first feature maps among the plurality of first feature maps, wherein, at the i-th fusion, a vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong is obtained, and the vector representation and the first feature map to be fused for the i-th time are fused to obtain the i-th fused feature map, 2≤i≤n-1; the processing module is also used to obtain a second image based on the feature map obtained by the last fusion, the second image being used to characterize the categories to which the pixels contained in the first image belong.

In a possible implementation, when the processing module obtains the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the target feature map belong, so as to obtain a category mask set, the target feature map being the feature map obtained by the (i-1)-th fusion; and obtain the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map.

In a possible implementation, when the processing module obtains the vector representation of the categories to which the pixels contained in the target feature map belong based on the weight of each mask in the category mask set and the target feature map, it is specifically used to: obtain the transposed matrix of the features of the pixels contained in the target feature map; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.

In a possible implementation, when the processing module fuses the vector representation and the first feature map to be fused for the i-th time to obtain the i-th fused feature map, it is specifically used to: obtain a key vector based on the vector representation and the first weight matrix, and obtain a value vector based on the vector representation and the second weight matrix; obtain a query vector based on the first feature map to be fused for the i-th time and the third weight matrix; perform attention calculation based on the key vector, the value vector and the query vector to obtain a second feature map; and fuse the second feature map with the first feature map to be fused for the i-th time to obtain the i-th fused feature map.

In a possible implementation, when the processing module obtains the second image based on the feature map obtained by the last fusion, it is specifically used to: upsample k third feature maps to obtain k fourth feature maps, the k third feature maps being the feature maps obtained by each fusion except the last, and at least one of the low-scale feature maps of the two feature maps fused the first time, the scale of each fourth feature map being the same as the scale of the feature map obtained by the last fusion; splice the feature map obtained by the last fusion and the k fourth feature maps in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.

In a possible implementation, when the processing module performs upsampling on the k third feature maps, it is specifically used to: for any one of the k third feature maps, expand the number of channels of that feature map based on a preset upsampling multiple to obtain a sixth feature map; splice that feature map and the sixth feature map in the channel dimension; and rearrange the pixels of the spliced feature map to obtain the fourth feature map corresponding to that feature map.

In a fourth aspect, the present application provides an image processing device, the device comprising: a communication module, used to acquire a first image; and a processing module, used to perform feature extraction on the first image to obtain a plurality of first feature maps of different scales; the processing module is also used to acquire a vector representation of the categories to which the pixels contained in the first feature map of a first scale belong; the processing module is also used to fuse the vector representation with the first feature map of a second scale to obtain a second feature map, the second scale being larger than the first scale; the processing module is also used to obtain a second image based on the second feature map, the second image being used to characterize the categories to which the pixels contained in the first image belong.

In a possible implementation, when the processing module acquires the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, it is specifically used to: obtain the masks of the categories to which the pixels contained in the first feature map of the first scale belong, so as to obtain a category mask set; and obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale.

In a possible implementation, when the processing module obtains the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong based on the weight of each mask in the category mask set and the first feature map of the first scale, it is specifically used to: obtain the transposed matrix of the features of the pixels contained in the first feature map of the first scale; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.

In a possible implementation, when the processing module fuses the vector representation and the first feature map of the second scale to obtain the second feature map, it is specifically used to: obtain a key vector based on the vector representation and the first weight matrix, and obtain a value vector based on the vector representation and the second weight matrix; obtain a query vector based on the first feature map of the second scale and the third weight matrix; perform attention calculation based on the key vector, the value vector and the query vector to obtain a third feature map; and fuse the third feature map with the first feature map of the second scale to obtain the second feature map.

In a possible implementation, when the processing module obtains the second image based on the second feature map, it is specifically used to: upsample the first feature map of the first scale to obtain a fourth feature map, the scale of the fourth feature map being the same as the scale of the second feature map; splice the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.

In a possible implementation, when the processing module performs upsampling on the first feature map of the first scale, it is specifically used to: expand the number of channels of the first feature map of the first scale based on a preset upsampling multiple to obtain a sixth feature map; splice the first feature map of the first scale and the sixth feature map in the channel dimension; and rearrange the pixels of the spliced feature map to obtain the fifth feature map.
In a fifth aspect, the present application provides a computing device, comprising: at least one memory, used to store a program; and at least one processor, used to execute the program stored in the memory; wherein, when the program stored in the memory is executed, the processor is used to execute the method described in the first aspect or the second aspect.

In a sixth aspect, the present application provides a computer-readable storage medium storing a computer program. When the computer program runs on a processor, the processor executes the method described in the first aspect or the second aspect.

In a seventh aspect, the present application provides a computer program product. When the computer program product runs on a processor, the processor executes the method described in the first aspect or the second aspect.

It can be understood that, for the beneficial effects of the second to seventh aspects above, reference can be made to the relevant descriptions in the first aspect or the second aspect, which are not repeated here.
Brief Description of the Drawings

The following briefly introduces the drawings required for describing the embodiments or the prior art.

FIG1 is a schematic diagram of the structure of an image semantic segmentation neural network model provided in an embodiment of the present application;

FIG2 is a schematic diagram of the structure of the category feature calculation module shown in FIG1;

FIG3 is a schematic diagram of the structure of the category feature fusion module shown in FIG1;

FIG4 is a schematic diagram of the structure of another image semantic segmentation neural network model provided in an embodiment of the present application;

FIG5 is a schematic diagram of the structure of yet another image semantic segmentation neural network model provided in an embodiment of the present application;

FIG6 is a schematic flow chart of an image processing method provided in an embodiment of the present application;

FIG7 is a schematic flow chart of another image processing method provided in an embodiment of the present application;

FIG8 is a schematic diagram of the structure of an image processing device provided in an embodiment of the present application;

FIG9 is a schematic diagram of the structure of another image processing device provided in an embodiment of the present application;

FIG10 is a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
具体实施方式Detailed ways
In this document, the term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent three cases: only A exists, both A and B exist, and only B exists. The symbol "/" indicates an "or" relationship between the associated objects; for example, "A/B" means A or B.
In the specification and claims herein, the terms "first", "second", and the like are used to distinguish different objects rather than to describe a specific order of the objects. For example, a first response message and a second response message are used to distinguish different response messages rather than to describe a specific order of the response messages.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate an example, an illustration, or a description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application should not be interpreted as being preferred over or more advantageous than other embodiments or designs. Rather, the use of such words is intended to present a related concept in a concrete manner.
In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more. For example, multiple processing units means two or more processing units, and multiple elements means two or more elements.
By way of example, FIG. 1 shows the structure of an image semantic segmentation neural network model. As shown in FIG. 1, the image semantic segmentation neural network model 100 includes an encoder 110 and a decoder 120.
The encoder 110 is mainly used to learn features of the image input into the image semantic segmentation neural network model 100, so as to obtain feature maps of multiple different scales (shapes). By way of example, a scale (shape) may be understood as follows: the computation result of a layer in a neural network is a tensor, and the shape of the tensor indicates how the data is laid out in memory (its dimensions) and the data length of each dimension. For example, after a typical 3-channel image is converted into a tensor, its shape is [H, W, C], where H is the image height dimension, W is the width dimension, and C is the number of channels. The number of channels in the neural network may be preset. By way of example, the feature maps of different scales may include at least two of a 4x feature map, an 8x feature map, a 16x feature map, and a 32x feature map, where the resolution of an Nx feature map is 1/N of the resolution of the original image. The encoder 110 may be a convolutional neural network, a vision transformer, or a multi-layer perceptron, which may be determined according to the actual situation and is not limited here. For ease of description, the encoder 110 is introduced below as a convolutional neural network. Still referring to FIG. 1, the encoder 110 may include multiple convolutional layers, namely convolutional layer 1 to convolutional layer n shown in FIG. 1, each of which extracts a feature map of one scale. Each convolutional layer may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which can be obtained by training on given training data. During the convolution of an image, the weight matrix is typically slid across the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), thereby extracting specific features from the image. The size of the weight matrix is independent of the size of the image and is specified manually. It should be noted that the channel dimension of the weight matrix is the same as the channel dimension of the input image, and during the convolution operation the weight matrix extends through the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension; in most cases, however, multiple weight matrices of the same dimensions are applied instead of a single one, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another to extract a specific color of the image, and yet another to blur unwanted noise in the image. The multiple weight matrices have the same dimensions, the feature maps extracted by them also have the same dimensions, and the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation. It should be understood that the encoder 110 may also be configured with a pooling layer, a fully connected layer, and the like, which may be determined according to the actual situation and is not limited here. It should also be understood that, when the convolutional neural network is replaced with another network or model, each convolutional layer in the encoder 110 in FIG. 1 may be replaced with a layer or module used for extracting image features in that other network or model, and the replaced solution still falls within the protection scope of the present application.
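Purely as an illustration of the multi-scale feature extraction described above, the following is a minimal PyTorch-style sketch. The class name, channel widths, and the use of strided 3x3 convolutions are assumptions made for this example and are not part of the application; any backbone that yields 4x, 8x, 16x, and 32x feature maps fits the description.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Produces feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution."""
    def __init__(self, in_ch: int = 3, base_ch: int = 32):
        super().__init__()
        # Two stride-2 convolutions bring the input down to 1/4 resolution.
        self.stage1 = nn.Sequential(
            nn.Conv2d(in_ch, base_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(base_ch, base_ch, 3, stride=2, padding=1), nn.ReLU())
        # Each further stage halves the resolution and doubles the channels.
        self.stage2 = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)      # 1/8
        self.stage3 = nn.Conv2d(base_ch * 2, base_ch * 4, 3, stride=2, padding=1)  # 1/16
        self.stage4 = nn.Conv2d(base_ch * 4, base_ch * 8, 3, stride=2, padding=1)  # 1/32

    def forward(self, x):
        f4 = self.stage1(x)
        f8 = self.stage2(f4)
        f16 = self.stage3(f8)
        f32 = self.stage4(f16)
        return [f4, f8, f16, f32]  # multi-scale feature maps handed to the decoder
```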
The decoder 120 may include (n-1) multi-scale feature fusion modules, namely multi-scale feature fusion module 1 to multi-scale feature fusion module (n-1) shown in FIG. 1. Each multi-scale feature fusion module is mainly used to fuse two feature maps of different scales, and each may include a category feature calculation module 121 and a category feature fusion module 122.
The category feature calculation module 121 is mainly used to calculate, based on the low-scale feature map, vector representations of a preset number of semantic segmentation categories, or vector representations of the categories of the targets contained in the original image. By way of example, as shown in FIG. 2, the category feature calculation module 121 may include a normalization layer 1211, a linear mapping layer 1212, a linear mapping layer 1213, and a category feature calculation layer 1214.
The normalization layer 1211 is mainly used to perform a layer normalization operation on the features F ∈ R^{C×(HW)} contained in the small-scale feature map input into the category feature calculation module 121, so as to concentrate the features contained in the feature map in a region centered on a certain value, thereby improving the stability of the model training stage and the accuracy of subsequent calculations, where C denotes the number of channels of the feature map, and H and W denote the height and width of the feature map, respectively. By way of example, the normalization layer 1211 may be implemented by, but is not limited to, various normalization layers such as layer normalization (LN), batch normalization (BN), or group normalization (GN).
The linear mapping layer 1212 is mainly used to transform all the features in the feature map output by the normalization layer 1211 into one domain, so as to align the features contained in the feature map. The linear mapping layer 1212 may be a fully connected layer or a convolutional layer, which may be determined according to the actual situation and is not limited here.
The linear mapping layer 1213 is mainly used to classify all the features in the feature map output by the normalization layer 1211, so as to obtain a category mask set M ∈ R^{L×(HW)}, where L denotes the number of semantic segmentation categories. By way of example, the linear mapping layer 1213 can be understood as a classifier. In this embodiment, after the category masks of the categories are obtained, a Softmax function may be applied to each category mask to obtain the attention weight of each category mask, that is, the probability of each category mask appearing.
The category feature calculation layer 1214 is mainly used to perform matrix multiplication between the attention weights of the category masks and the feature map output by the linear mapping layer 1212, so as to obtain the vector representation of each category, that is, the category features.
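For illustration only, the following sketch shows one possible realization of the pipeline formed by layers 1211 to 1214. The class and parameter names are assumptions for this example; using linear layers for the two mappings is one of the options mentioned above, and the axis over which the softmax is taken here (over pixels, per class) is an assumed implementation detail.

```python
import torch
import torch.nn as nn

class CategoryFeatureCalc(nn.Module):
    """Computes one C-dimensional vector representation per semantic category."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)                 # normalization layer 1211
        self.value_proj = nn.Linear(channels, channels)    # linear mapping layer 1212
        self.mask_proj = nn.Linear(channels, num_classes)  # linear mapping layer 1213

    def forward(self, feat):
        # feat: [B, C, H, W], the small-scale feature map.
        b, c, h, w = feat.shape
        x = self.norm(feat.flatten(2).transpose(1, 2))     # [B, HW, C]
        v = self.value_proj(x)                             # aligned pixel features
        masks = self.mask_proj(x)                          # per-pixel class logits, [B, HW, L]
        attn = masks.softmax(dim=1)                        # weight of each class mask over pixels
        # Layer 1214: mask weights times (transposed) pixel features -> [B, L, C].
        return attn.transpose(1, 2) @ v
```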
The category feature fusion module 122 is mainly used to fuse the large-scale feature map with the vector representations of the categories output by the category feature calculation module 121, so as to obtain a fused feature map, thereby realizing "point-to-class" feature fusion, reducing the negative impact of noise points on the feature expression capability, and effectively fusing feature maps of different scales. By way of example, as shown in FIG. 3, the category feature fusion module 122 may include: a linear mapping layer 1221, a linear mapping layer 1222, a normalization layer 1223, a linear mapping layer 1224, a multi-headed self-attention (MHA) module 1225, a feature fusion layer 1226, a normalization layer 1227, a feed-forward network (FFN) 1228, and a feature fusion layer 1229.
The linear mapping layer 1221 is mainly used to calculate, based on the vector representations of the categories (that is, the category features), the key value K (that is, the key sequence, which may also be called the "key vector") required by the MHA 1225, where K = ZW_K, Z being the category features calculated by the category feature calculation module 121 and W_K being a mapping matrix that may be preset. In this embodiment, the linear mapping layer 1221 may be implemented by, but is not limited to, a fully connected layer or a convolutional layer.
The linear mapping layer 1222 is mainly used to calculate, based on the vector representations of the categories (that is, the category features), the value V (that is, the value sequence, which may also be called the "value vector") required by the MHA 1225, where V = ZW_V, Z being the category features calculated by the category feature calculation module 121 and W_V being a mapping matrix that may be preset. By way of example, the linear mapping layer 1222 may be implemented by, but is not limited to, a fully connected layer or a convolutional layer.
The normalization layer 1223 is mainly used to perform a layer normalization operation on the features X ∈ R^{C×(H′W′)} contained in the large-scale feature map input into the category feature fusion module 122, so as to concentrate the features contained in the feature map in a region centered on a certain value, thereby improving the stability of the model training stage and the accuracy of subsequent calculations. By way of example, the normalization layer 1223 may be implemented by, but is not limited to, various normalization layers such as LN, BN, or GN.
The linear mapping layer 1224 is mainly used to calculate, based on the feature map output by the normalization layer 1223, the query value Q (that is, the query sequence, which may also be called the "query vector") required by the MHA 1225, where Q = Norm(X)W_Q, X being the large-scale feature map, Norm(·) denoting the normalization operation, and W_Q being a mapping matrix that may be preset. By way of example, the linear mapping layer 1224 may be implemented by, but is not limited to, a fully connected layer or a convolutional layer.
The multi-head self-attention module MHA 1225 is mainly used to perform attention calculation based on the determined key value K, value V, and query value Q. The MHA 1225 may include multiple heads, each of which performs one attention calculation. For one head, the calculation can be expressed as:

A = Softmax(αQK^T)V

where Q is the query value, K is the key value, V is the value, and α is a regularization factor. In some embodiments, since Q represents a high-resolution matrix while K and V both represent low-resolution matrices, multiplying Q by K^T yields a high-resolution matrix, and multiplying that high-resolution matrix by V again yields a high-resolution matrix, so the resolution of the resulting feature map is raised. Conversely, if K and V were calculated from the large-scale feature map while Q was calculated from the category features, Q would represent a low-resolution matrix and K and V would represent high-resolution matrices; QK^T would still be a high-resolution matrix, but multiplying it by V would eliminate the high resolution and produce a low-resolution matrix, lowering the resolution of the resulting feature map. Therefore, in this embodiment, K and V are calculated from the category features, and Q is calculated from the large-scale feature map.
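To make the resolution argument concrete, assume for illustration that Q ∈ R^{(H′W′)×C} comes from the large-scale feature map and K, V ∈ R^{L×C} come from the category features, with L much smaller than H′W′. Then:

QK^T ∈ R^{(H′W′)×L}, and A = Softmax(αQK^T)V ∈ R^{(H′W′)×C}

that is, one C-dimensional vector is obtained for every high-resolution position. With the roles swapped, A ∈ R^{L×C} would contain only one vector per category, which is why Q is taken from the large-scale feature map.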
After all the heads have completed the attention calculation, the calculation results of the heads can be concatenated to obtain the final result. By way of example, if [A_1, …, A_h] denotes the concatenation of the outputs of the heads in the MHA 1225, the result obtained by the MHA 1225 through the attention calculation can be:

Y = W_0[A_1, …, A_h]

where Y denotes the output of the MHA 1225, and W_0 is a mapping matrix that may be preset.
The feature fusion layer 1226 is mainly used to add the features in the large-scale feature map and the features in the feature map output by the MHA 1225, so as to complete feature fusion. By way of example, the feature fusion layer 1226 may be implemented by, but is not limited to, a convolutional layer.
The normalization layer 1227 is mainly used to perform a layer normalization operation on the features contained in the feature map obtained through fusion by the feature fusion layer 1226, so as to concentrate the features contained in the feature map in a region centered on a certain value, thereby improving the stability of the model training stage and the accuracy of subsequent calculations. By way of example, the normalization layer 1227 may be implemented by, but is not limited to, various normalization layers such as LN, BN, or GN.
The feed-forward network FFN 1228 is mainly used to spatially transform the features in the feature map output by the normalization layer 1227, so as to complete nonlinear processing, thereby mining the nonlinear relationships among the features and enhancing their expressive capability.
The feature fusion layer 1229 is mainly used to add the features output by the FFN 1228 and the features output by the feature fusion layer 1226, so as to complete feature fusion, obtain the fused feature map, and output that feature map.
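As an illustration, the sketch below assembles layers 1221 to 1229 into one module; the names and hyperparameters are assumptions for this example, and PyTorch's built-in multi-head attention stands in for the MHA 1225. Note that K and V come from the category features while Q comes from the large-scale feature map, as explained above.

```python
import torch
import torch.nn as nn

class CategoryFeatureFusion(nn.Module):
    """Fuses a large-scale feature map with per-category vectors via attention."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.k_proj = nn.Linear(channels, channels)      # layer 1221: K = Z @ Wk
        self.v_proj = nn.Linear(channels, channels)      # layer 1222: V = Z @ Wv
        self.norm1 = nn.LayerNorm(channels)              # normalization layer 1223
        self.q_proj = nn.Linear(channels, channels)      # layer 1224: Q = Norm(X) @ Wq
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)  # MHA 1225
        self.norm2 = nn.LayerNorm(channels)              # normalization layer 1227
        self.ffn = nn.Sequential(                        # feed-forward network 1228
            nn.Linear(channels, channels * 4), nn.GELU(),
            nn.Linear(channels * 4, channels))

    def forward(self, feat, class_feat):
        # feat: [B, C, H', W'] large-scale map; class_feat: [B, L, C] category features.
        b, c, h, w = feat.shape
        x = feat.flatten(2).transpose(1, 2)              # [B, H'W', C]
        q = self.q_proj(self.norm1(x))
        k, v = self.k_proj(class_feat), self.v_proj(class_feat)
        y, _ = self.attn(q, k, v)                        # one fused vector per position
        x = x + y                                        # feature fusion layer 1226
        x = x + self.ffn(self.norm2(x))                  # layers 1227 to 1229
        return x.transpose(1, 2).reshape(b, c, h, w)     # back to map form
```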
In addition, the decoder 120 may further include a classification layer 123. The classification layer 123 is mainly used to perform feature classification processing on the feature map output by the last (that is, the deepest) multi-scale feature fusion module, so as to obtain the semantic segmentation result. By way of example, the classification layer 123 may be, but is not limited to being, mainly composed of a convolutional layer or a fully connected layer. By way of example, the output of the classification layer 123 may be an image that contains the category to which each pixel belongs.
The above is an introduction to the image semantic segmentation neural network model 100 provided in the embodiments of the present application. In FIG. 1, the large-scale feature map to be fused by each multi-scale feature fusion module in the decoder 120 is one of the feature maps output by the convolutional layers of the encoder 110 other than the n-th (that is, the deepest) convolutional layer. The small-scale feature map to be fused by the first multi-scale feature fusion module in the decoder 120 (that is, multi-scale feature fusion module 1) is the feature map output by the n-th (deepest) convolutional layer of the encoder 110. For any multi-scale feature fusion module in the decoder other than the first one, the small-scale feature map to be fused is the feature map output by the multi-scale feature fusion module one level above it. For example, the small-scale feature map to be fused by multi-scale feature fusion module (n-2) is the feature map output by multi-scale feature fusion module (n-3). By way of example, the feature maps to be fused by the multi-scale feature fusion modules in FIG. 1 can be understood as follows: the feature maps to be fused by multi-scale feature fusion module 1 are the feature maps output by convolutional layer n and convolutional layer (n-1) of the encoder 110; the feature maps to be fused by multi-scale feature fusion module m are the feature map output by convolutional layer (n-m) of the encoder 110 and the feature map output by multi-scale feature fusion module (m-1), where 1 < m ≤ (n-1).
In some embodiments, the small-scale feature map input into each multi-scale feature fusion module may further be upsampled, the feature maps obtained through upsampling may be concatenated, in the channel dimension, with the feature map output by the last multi-scale feature fusion module, and the concatenated feature map may be input into the classification layer 123 to obtain the final semantic segmentation result, thereby improving the semantic segmentation effect. As for the upsampling, in this embodiment, the small-scale feature map may first be linearly mapped by a convolutional layer based on the upsampling factor, so as to expand the number of channels of the feature map; then, the feature map obtained by expanding the number of channels and the original feature map are concatenated in the channel dimension; finally, pixel-shuffle is performed on the concatenated feature map to obtain the upsampled feature map. For example, assuming the number of channels of the current feature map is C, if upsampling by a factor of γ is required, the existing feature map can be transformed by a linear mapping to obtain C(γ²-1) feature maps; the original feature map and the transformed feature maps can then be concatenated in the channel dimension, yielding a feature map with Cγ² channels; finally, a pixel-shuffle operation can be used to obtain the upsampled feature map. In this embodiment, the new feature maps used for channel-dimension concatenation are obtained through linear mapping, that is, through convolution over the original feature map, rather than by fitting a previously non-existent feature from multiple features as in traditional approaches such as linear interpolation. The features in the feature map produced by this upsampling method therefore all actually exist, so it performs better than traditional upsampling methods.
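A minimal sketch of this upsampling step follows, assuming the class name and the use of a 1x1 convolution for the linear mapping; pixel_shuffle is PyTorch's pixel-rearrangement operation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuffleUpsample(nn.Module):
    """Upsamples a [B, C, H, W] map by `gamma` without interpolating new values."""
    def __init__(self, channels: int, gamma: int):
        super().__init__()
        self.gamma = gamma
        # Linear mapping that produces C*(gamma^2 - 1) additional channels.
        self.expand = nn.Conv2d(channels, channels * (gamma ** 2 - 1), kernel_size=1)

    def forward(self, feat):
        extra = self.expand(feat)                  # transformed feature maps
        stacked = torch.cat([feat, extra], dim=1)  # C * gamma^2 channels in total
        # Rearranges channels into space: [B, C, H*gamma, W*gamma].
        return F.pixel_shuffle(stacked, self.gamma)
```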
In some embodiments, the number of multi-scale fusion modules in the decoder 120 shown in FIG. 1 is not necessarily (n-1); it may also be smaller than (n-1), but it is at least 1.
When the number of multi-scale fusion modules in the decoder 120 is 1, as shown in FIG. 4, the multi-scale fusion module in the decoder 120 may select any two feature maps of different scales for fusion, for example, the feature maps output by convolutional layer n and convolutional layer (n-1) of the encoder 110, or the feature maps output by convolutional layer n and convolutional layer 2 of the encoder 110, and so on.
When the number of multi-scale fusion modules in the decoder 120 is greater than 1 but smaller than the number of convolutional layers in the encoder 110, as shown in FIG. 5, the decoder 120 is provided with m multi-scale feature fusion modules, where 1 < m < n and n is the number of convolutional layers in the encoder 110. In this case, the feature maps output by any (n-m-1) convolutional layers of the encoder 110 may be left out of the fusion, that is, those (n-m-1) feature maps output by the encoder 110 are not fused. For example, in FIG. 5, when m = n-2, the first multi-scale feature fusion module in the decoder 120 may fuse the feature maps output by the n-th and (n-1)-th convolutional layers of the encoder 110, and the i-th (1 < i ≤ m) multi-scale feature fusion module may fuse the feature map output by the (i-1)-th multi-scale feature fusion module with the feature map output by the (n-i)-th convolutional layer of the encoder 110.
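For illustration, the chaining of fusion modules described above can be sketched as follows, reusing the earlier CategoryFeatureCalc and CategoryFeatureFusion sketches. It assumes, purely for simplicity, that all scales share one channel width; a real encoder would need per-scale channel projections.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Fuses encoder outputs from the smallest scale upward, as in FIG. 1."""
    def __init__(self, channels: int, num_classes: int, num_fusions: int):
        super().__init__()
        self.calc = nn.ModuleList(
            [CategoryFeatureCalc(channels, num_classes) for _ in range(num_fusions)])
        self.fuse = nn.ModuleList(
            [CategoryFeatureFusion(channels) for _ in range(num_fusions)])
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)  # layer 123

    def forward(self, feats):
        # feats: encoder outputs ordered from largest to smallest scale.
        small = feats[-1]                            # deepest, smallest-scale map
        for i, large in enumerate(reversed(feats[:-1])):
            class_feat = self.calc[i](small)         # per-category vectors
            small = self.fuse[i](large, class_feat)  # fuse into the next larger map
        return self.classifier(small)                # per-pixel class scores
```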
In some embodiments, the upsampling in FIG. 1, FIG. 4, and FIG. 5 may be selected according to actual requirements and is not strictly limited here.
The above is an introduction to the image semantic segmentation neural network model provided in the embodiments of the present application. Next, based on the above content, an image processing method provided in an embodiment of the present application is introduced.
By way of example, FIG. 6 shows the flow of an image processing method. It should be understood that the method may be executed by any apparatus, device, platform, or device cluster having computing and processing capabilities, for example, by a mobile phone, a computer, a vehicle-mounted terminal, or a cloud server. As shown in FIG. 6, the image processing method may include the following steps:
S601. Acquire a first image.
In this embodiment, the first image may be captured by an image acquisition apparatus such as a camera. After the image acquisition apparatus captures the first image, the image may be transmitted to a device associated with the image acquisition apparatus (for example, a mobile phone or a vehicle-mounted terminal) so that the device processes the image; in this way, the device acquires the first image. In some embodiments, when the method is executed by a device other than the device associated with the image acquisition apparatus (for example, a cloud server), the device associated with the image acquisition apparatus may transmit the first image captured by the image acquisition apparatus to that other device over a network or the like, so that the other device processes the first image; in this way, the other device acquires the first image.
S602. Perform feature extraction on the first image to obtain multiple first feature maps of different scales.
In this embodiment, after the first image is acquired, feature extraction may be performed on the first image, for example, by the encoder 110 in FIG. 1, so as to obtain multiple first feature maps of different scales.
S603. Fuse n first feature maps among the multiple first feature maps, where, during the i-th fusion, a vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong is acquired, and the vector representation is fused with the first feature map to be fused at the i-th time, so as to obtain the i-th fused feature map, with 2 ≤ i ≤ n-1.
In this embodiment, after the multiple first feature maps of different scales are obtained, all of the first feature maps may be fused, or only some of them. For ease of description, the fusion of n (n ≥ 3) first feature maps is described, where n may be the number of all the first feature maps or the number of some of them. In addition, different first feature maps have different scales. By way of example, the n first feature maps may be fused by the decoder 120 shown in FIG. 1. In this embodiment, when the n first feature maps are fused, the feature maps may be fused one by one in ascending order of scale, where the scale of the feature map obtained by the (i-1)-th fusion is smaller than the scale of the first feature map to be fused at the i-th time.
In some embodiments, during the i-th fusion, the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong may first be acquired, where 2 ≤ i ≤ n-1. For example, the required vector representation may be acquired by the category feature calculation module 121 of the decoder 120 shown in FIG. 1. By way of example, the features of all the pixels in the target feature map (that is, the feature map obtained by the (i-1)-th fusion) may first be classified to obtain the masks of the categories to which the pixels contained in the target feature map belong, so as to obtain a category mask set. Then, each mask in the category mask set may be operated on, by a function such as Softmax but not limited thereto, to obtain the weight of each mask. Finally, by multiplying the weight of each mask by the transposed matrix of the features of the pixels contained in the target feature map, the vector representation of the categories to which the pixels contained in the target feature map belong is obtained. In other words, the vector representation of the categories to which the pixels contained in the target feature map belong can be obtained based on the weight of each mask in the category mask set and the target feature map.
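Using symbols assumed here for illustration, consistent with the module description above, this step can be written compactly as:

Z = Softmax(M)F^T ∈ R^{L×C}

where M ∈ R^{L×(HW)} collects the masks of the L categories over the HW pixels, Softmax(M) gives the weight of each mask, and F ∈ R^{C×(HW)} collects the pixel features of the target feature map; each row of Z is then the vector representation of one category.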
After the vector representation is acquired, the acquired vector representation may be fused with the first feature map to be fused at the i-th time, so as to obtain the i-th fused feature map; for example, the fusion may be performed by the category feature fusion module 122 of the decoder 120 shown in FIG. 1. By way of example, the acquired vector representation may be multiplied by a first weight matrix (that is, the aforementioned W_K) to obtain a key vector (that is, the aforementioned key value K), and the acquired vector representation may be multiplied by a second weight matrix (that is, the aforementioned W_V) to obtain a value vector (that is, the aforementioned value V); then, a query vector (that is, the aforementioned query value Q) is obtained based on the first feature map to be fused at the i-th time and a third weight matrix (that is, the aforementioned W_Q); next, attention calculation is performed based on the key vector, the value vector, and the query vector to obtain a second feature map; finally, the second feature map and the first feature map to be fused at the i-th time are fused to obtain the i-th fused feature map. For details, see the introduction to the category feature fusion module 122 in FIG. 1 above, which is not repeated here.
S604. Obtain a second image based on the feature map obtained by the last fusion, where the second image is used to represent the categories to which the pixels contained in the first image belong.
In this embodiment, after the last feature map fusion is completed, the feature map obtained by the last fusion may be processed, for example, classified by a classifier, to obtain the second image. The second image is used to represent the categories to which the pixels contained in the first image belong, and can be understood as the result of semantic segmentation.
In this way, during image processing, each time feature maps are fused, the features (that is, the vector representations) of the categories to which the pixels of the small-scale feature map belong are first acquired from that small-scale feature map, and the acquired per-category features are then fused with the large-scale feature map. This realizes point-to-class fusion (that is, fusion between pixels and the categories to which the pixels belong), avoids the influence of noise-point or outlier features in the feature map, and improves the accuracy of semantic segmentation of the image.
In some embodiments, in S604, k feature maps, namely the feature maps obtained by each fusion other than the last fusion and the low-scale one of the two feature maps of the first fusion, may first be upsampled to obtain k fourth feature maps, where the scale of each fourth feature map is the same as the scale of the feature map obtained by the last fusion. Next, the feature map obtained by the last fusion and the k fourth feature maps may be concatenated in the channel dimension to obtain a fifth feature map. Finally, the second image is obtained by processing the fifth feature map, for example, by classifying it with a classifier. This improves the semantic segmentation effect.
By way of example, when any one of the k feature maps is upsampled, the number of channels of that feature map may first be expanded based on a preset upsampling factor to obtain a sixth feature map; then, that feature map and the obtained sixth feature map are concatenated in the channel dimension; finally, pixel rearrangement (pixel-shuffle) is performed on the concatenated feature map to obtain the fourth feature map corresponding to that feature map. For details, see the foregoing description of upsampling.
By way of example, FIG. 7 shows the flow of another image processing method. It should be understood that the method may be executed by any apparatus, device, platform, or device cluster having computing and processing capabilities, for example, by a mobile phone, a computer, a vehicle-mounted terminal, or a cloud server. The main difference between FIG. 7 and FIG. 6 is that in FIG. 7 only two feature maps of different scales are fused and the final semantic segmentation result is obtained from the fusion result, whereas in FIG. 6 three or more feature maps of different scales are fused and the final semantic segmentation result is obtained from the fusion result. For the content of FIG. 7, reference may be made to the foregoing description of FIG. 6, which is not repeated here. As shown in FIG. 7, the image processing method may include the following steps:
S701. Acquire a first image.
S702. Perform feature extraction on the first image to obtain multiple first feature maps of different scales.
S703. Acquire a vector representation of the categories to which the pixels contained in the first feature map of a first scale belong.
S704. Fuse the vector representation with the first feature map of a second scale to obtain a second feature map, where the second scale is larger than the first scale.
S705. Obtain a second image based on the second feature map, where the second image is used to represent the categories to which the pixels contained in the first image belong.
In this way, during image processing, when feature maps are fused, the features (that is, the vector representations) of the categories to which the pixels of the small-scale feature map belong are first acquired from that small-scale feature map, and the acquired per-category features are then fused with the large-scale feature map. This realizes point-to-class fusion (that is, fusion between pixels and the categories to which the pixels belong), avoids the influence of noise-point or outlier features in the feature map, and improves the accuracy of semantic segmentation of the image.
It should be understood that the methods described in FIG. 6 or FIG. 7 above may be applied to, but are not limited to, tasks such as autonomous driving, image editing, satellite telemetry, medical diagnosis, and augmented/virtual reality.
Based on the methods in the foregoing embodiments, an embodiment of the present application provides an image processing apparatus.
By way of example, FIG. 8 shows a schematic structural diagram of an image processing apparatus. As shown in FIG. 8, the image processing apparatus 800 may include a communication module 801 and a processing module 802. The communication module 801 is configured to acquire a first image. The processing module 802 is configured to perform feature extraction on the first image to obtain multiple first feature maps of different scales. The processing module 802 is further configured to fuse n first feature maps among the multiple first feature maps, where, during the i-th fusion, a vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong is acquired, and the vector representation is fused with the first feature map to be fused at the i-th time to obtain the i-th fused feature map, with 2 ≤ i ≤ n-1. The processing module 802 is further configured to obtain a second image based on the feature map obtained by the last fusion, where the second image is used to represent the categories to which the pixels contained in the first image belong.
In some embodiments, when acquiring the vector representation of the categories to which the pixels contained in the feature map obtained by the (i-1)-th fusion belong, the processing module 802 is specifically configured to: acquire the masks of the categories to which the pixels contained in a target feature map belong, so as to obtain a category mask set, where the target feature map is the feature map obtained by the (i-1)-th fusion; and obtain, based on the weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels contained in the target feature map belong.
In some embodiments, when obtaining, based on the weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels contained in the target feature map belong, the processing module 802 is specifically configured to: acquire the transposed matrix of the features of the pixels contained in the target feature map; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the target feature map belong.
In some embodiments, when fusing the vector representation with the first feature map to be fused at the i-th time to obtain the i-th fused feature map, the processing module 802 is specifically configured to: obtain a key vector based on the vector representation and a first weight matrix, and obtain a value vector based on the vector representation and a second weight matrix; obtain a query vector based on the first feature map to be fused at the i-th time and a third weight matrix; perform attention calculation based on the key vector, the value vector, and the query vector to obtain a second feature map; and fuse the second feature map with the first feature map to be fused at the i-th time to obtain the i-th fused feature map.
In some embodiments, when obtaining the second image based on the feature map obtained by the last fusion, the processing module 802 is specifically configured to: upsample k third feature maps to obtain k fourth feature maps, where the k third feature maps are the feature maps obtained by each fusion other than the last fusion and at least one of the low-scale ones of the two feature maps of the first fusion, and the scale of each fourth feature map is the same as the scale of the feature map obtained by the last fusion; concatenate the feature map obtained by the last fusion and the k fourth feature maps in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.
In some embodiments, when upsampling the k third feature maps, the processing module 802 is specifically configured to: for any one of the k third feature maps, expand the number of channels of that feature map based on a preset upsampling factor to obtain a sixth feature map; concatenate that feature map and the sixth feature map in the channel dimension; and perform pixel rearrangement on the concatenated feature map to obtain the fourth feature map corresponding to that feature map.
It should be understood that the above apparatus is configured to perform the methods in the foregoing embodiments. The implementation principles and technical effects of the corresponding program modules in the apparatus are similar to those described in the foregoing methods; for the working process of the apparatus, reference may be made to the corresponding processes in the foregoing methods, which are not repeated here.
Based on the methods in the foregoing embodiments, an embodiment of the present application further provides an image processing apparatus.
By way of example, FIG. 9 shows a schematic structural diagram of an image processing apparatus. As shown in FIG. 9, the image processing apparatus 900 may include a communication module 901 and a processing module 902. The communication module 901 is configured to acquire a first image. The processing module 902 is configured to perform feature extraction on the first image to obtain multiple first feature maps of different scales. The processing module 902 is further configured to acquire a vector representation of the categories to which the pixels contained in the first feature map of a first scale belong. The processing module 902 is further configured to fuse the vector representation with the first feature map of a second scale to obtain a second feature map, where the second scale is larger than the first scale. The processing module 902 is further configured to obtain a second image based on the second feature map, where the second image is used to represent the categories to which the pixels contained in the first image belong.
In some embodiments, when acquiring the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, the processing module 902 is specifically configured to: acquire the masks of the categories to which the pixels contained in the first feature map of the first scale belong, so as to obtain a category mask set; and obtain, based on the weight of each mask in the category mask set and the first feature map of the first scale, the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.
In some embodiments, when obtaining, based on the weight of each mask in the category mask set and the first feature map of the first scale, the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong, the processing module 902 is specifically configured to: acquire the transposed matrix of the features of the pixels contained in the first feature map of the first scale; and multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels contained in the first feature map of the first scale belong.
In some embodiments, when fusing the vector representation with the first feature map of the second scale to obtain the second feature map, the processing module 902 is specifically configured to: obtain a key vector based on the vector representation and a first weight matrix, and obtain a value vector based on the vector representation and a second weight matrix; obtain a query vector based on the first feature map of the second scale and a third weight matrix; perform attention calculation based on the key vector, the value vector, and the query vector to obtain a third feature map; and fuse the third feature map with the first feature map of the second scale to obtain the second feature map.
In some embodiments, when obtaining the second image based on the second feature map, the processing module 902 is specifically configured to: upsample the first feature map of the first scale to obtain a fourth feature map, where the scale of the fourth feature map is the same as the scale of the second feature map; concatenate the second feature map and the fourth feature map in the channel dimension to obtain a fifth feature map; and obtain the second image based on the fifth feature map.
In some embodiments, when upsampling the first feature map of the first scale, the processing module 902 is specifically configured to: expand the number of channels of the first feature map of the first scale based on a preset upsampling factor to obtain a sixth feature map; concatenate the first feature map of the first scale and the sixth feature map in the channel dimension; and perform pixel rearrangement on the concatenated feature map to obtain the fourth feature map.
It should be understood that the above apparatus is configured to perform the methods in the foregoing embodiments. The implementation principles and technical effects of the corresponding program modules in the apparatus are similar to those described in the foregoing methods; for the working process of the apparatus, reference may be made to the corresponding processes in the foregoing methods, which are not repeated here.
Based on the methods in the foregoing embodiments, an embodiment of the present application further provides a computing device 1000.
By way of example, FIG. 10 shows a schematic structural diagram of a computing device. As shown in FIG. 10, the computing device 1000 includes: a bus 1002, a processor 1004, a memory 1006, and a communication interface 1008. The processor 1004, the memory 1006, and the communication interface 1008 communicate with one another through the bus 1002. The computing device 1000 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors or memories in the computing device 1000.
The bus 1002 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be classified into an address bus, a data bus, a control bus, and so on. For ease of representation, only one line is used in FIG. 10, but this does not mean that there is only one bus or one type of bus. The bus 1002 may include a path for transferring information between the components of the computing device 1000 (for example, the memory 1006, the processor 1004, and the communication interface 1008).
The processor 1004 may include any one or more of processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 1006 may include a volatile memory, for example a random access memory (RAM). The memory 1006 may also include a non-volatile memory, for example a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 1006 stores executable program code, and the processor 1004 executes the executable program code to implement the functions of the communication module 801 and the processing module 802 in FIG. 8, or the functions of the communication module 901 and the processing module 902 in FIG. 9, thereby implementing all or some of the steps of the methods in the foregoing embodiments. That is, the memory 1006 stores instructions for performing all or some of the steps of the methods in the foregoing embodiments.
Alternatively, the memory 1006 stores executable code, and the processor 1004 executes the executable code to implement the functions of the image processing apparatus 800 or 900, thereby implementing all or some of the steps of the methods in the foregoing embodiments. That is, the memory 1006 stores instructions for performing all or some of the steps of the methods in the foregoing embodiments.
The communication interface 1008 uses a transceiver module, such as but not limited to a network interface card or a transceiver, to implement communication between the computing device 1000 and other devices or a communication network.
Based on the methods in the foregoing embodiments, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when run on a processor, causes the processor to perform the methods in the foregoing embodiments.
Based on the methods in the foregoing embodiments, an embodiment of the present application provides a computer program product which, when run on a processor, causes the processor to perform the methods in the foregoing embodiments.
It can be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
The method steps in the embodiments of this application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and a software module may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may alternatively be a component of the processor. The processor and the storage medium may be located in an ASIC.
In the foregoing embodiments, the implementation may be, in whole or in part, by software, hardware, firmware, or any combination thereof. When software is used, the implementation may be, in whole or in part, in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (SSD)).
It can be understood that the various numerals in the embodiments of this application are merely used for ease of description and are not intended to limit the scope of the embodiments of this application.

Claims (17)

1. An image processing method, characterized in that the method comprises:
    obtaining a first image;
    performing feature extraction on the first image to obtain a plurality of first feature maps of different scales;
    fusing n first feature maps among the plurality of first feature maps, wherein, in an i-th fusion, a vector representation of categories to which pixels comprised in a feature map obtained by an (i-1)-th fusion belong is obtained, and the vector representation and a first feature map to be fused in the i-th fusion are fused to obtain an i-th fused feature map, where 2 ≤ i ≤ n-1; and
    obtaining a second image based on a feature map obtained by a last fusion, wherein the second image is used to represent the categories to which the pixels comprised in the first image belong.
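To make the loop in claim 1 concrete, here is a minimal Python sketch of the coarse-to-fine fusion, assuming the first feature maps are ordered from smallest to largest scale and that the first fusion combines the two smallest maps; class_vectors() and fuse() are hypothetical stand-ins for the operations recited in claims 2 to 4.

    def fuse_pyramid(first_fms, class_vectors, fuse):
        # first_fms: n first feature maps, ordered from the smallest scale
        # (deepest) to the largest. class_vectors() and fuse() stand in for
        # the mask-based vector extraction (claims 2-3) and the attention
        # fusion (claim 4); both interfaces are assumptions, not fixed APIs.
        fused = fuse(class_vectors(first_fms[0]), first_fms[1])  # 1st fusion
        for i in range(2, len(first_fms)):
            vecs = class_vectors(fused)        # from the (i-1)-th fused map
            fused = fuse(vecs, first_fms[i])   # i-th fusion
        return fused  # map from the last fusion, from which the second image is built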
2. The method according to claim 1, characterized in that obtaining the vector representation of the categories to which the pixels comprised in the feature map obtained by the (i-1)-th fusion belong specifically comprises:
    obtaining masks of categories to which pixels comprised in a target feature map belong, to obtain a category mask set, wherein the target feature map is the feature map obtained by the (i-1)-th fusion; and
    obtaining, based on a weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels comprised in the target feature map belong.
3. The method according to claim 2, characterized in that obtaining, based on the weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels comprised in the target feature map belong specifically comprises:
    obtaining a transposed matrix of features of the pixels comprised in the target feature map; and
    multiplying the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels comprised in the target feature map belong.
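Claims 2 and 3 together reduce the vector extraction to a single matrix product: per-category mask weights multiplied by the transposed pixel-feature matrix. A PyTorch-style sketch follows; the 1x1-convolution mask head and the softmax normalization of the mask weights are assumptions, since the claims fix only the multiplication itself.

    import torch

    def category_vectors(target_fm, mask_head):
        # target_fm: (C, H, W), the feature map from the (i-1)-th fusion.
        # mask_head: assumed to be e.g. a 1x1 conv emitting L category masks.
        C, H, W = target_fm.shape
        masks = mask_head(target_fm.unsqueeze(0)).squeeze(0)  # (L, H, W)
        weights = masks.flatten(1).softmax(dim=-1)            # mask weights, (L, H*W)
        feats_t = target_fm.flatten(1).transpose(0, 1)        # transposed features, (H*W, C)
        return weights @ feats_t                              # (L, C): one vector per category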
4. The method according to any one of claims 1 to 3, characterized in that fusing the vector representation and the first feature map to be fused in the i-th fusion to obtain the i-th fused feature map specifically comprises:
    obtaining a key vector based on the vector representation and a first weight matrix, and obtaining a value vector based on the vector representation and a second weight matrix;
    obtaining a query vector based on the first feature map to be fused in the i-th fusion and a third weight matrix;
    performing attention computation based on the key vector, the value vector and the query vector to obtain a second feature map; and
    fusing the second feature map and the first feature map to be fused in the i-th fusion to obtain the i-th fused feature map.
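Claim 4 describes a cross-attention step in which the category vectors supply the keys and values while the feature map to be fused supplies the queries. A sketch under two assumptions the claim does not fix: square weight matrices, so that the residual addition at the end is shape-compatible, and scaling by the square root of the channel count.

    import torch
    import torch.nn.functional as F

    def cross_attention_fuse(class_vecs, fm_i, Wk, Wv, Wq):
        # class_vecs: (L, C) category vectors from the (i-1)-th fusion.
        # fm_i: (C, H, W), the first feature map fused in the i-th fusion.
        # Wk, Wv, Wq: the first, second and third weight matrices, (C, C).
        C, H, W = fm_i.shape
        k = class_vecs @ Wk                           # key vectors,   (L, C)
        v = class_vecs @ Wv                           # value vectors, (L, C)
        q = fm_i.flatten(1).transpose(0, 1) @ Wq      # query vectors, (H*W, C)
        attn = F.softmax(q @ k.T / C ** 0.5, dim=-1)  # attention over L categories
        second_fm = (attn @ v).T.reshape(C, H, W)     # second feature map
        return second_fm + fm_i                       # i-th fused map (residual add assumed)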
5. The method according to any one of claims 1 to 4, characterized in that obtaining the second image based on the feature map obtained by the last fusion specifically comprises:
    upsampling each of k third feature maps to obtain k fourth feature maps, wherein the k third feature maps are at least one of: the feature maps obtained by each fusion other than the last fusion, and the lower-scale feature map of the two feature maps in the first fusion, and a scale of each fourth feature map is the same as a scale of the feature map obtained by the last fusion;
    concatenating the feature map obtained by the last fusion and the k fourth feature maps in a channel dimension to obtain a fifth feature map; and
    obtaining the second image based on the fifth feature map.
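Claim 5 aggregates the intermediate scales into a single prediction. In the sketch below, the upsample callable is hypothetical (one possible form appears after claim 6), and the 1x1-convolution head plus argmax decoding into a per-pixel category map are assumptions about how the second image might be obtained.

    import torch

    def decode(last_fused, third_fms, upsample, head):
        # last_fused: (C0, H, W), the feature map obtained by the last fusion.
        # third_fms: the k third feature maps named in claim 5.
        # upsample: hypothetical callable raising a map to size (H, W).
        # head: assumed to be a 1x1 convolution emitting category logits.
        target = last_fused.shape[1:]                        # (H, W)
        fourth_fms = [upsample(fm, target) for fm in third_fms]
        fifth_fm = torch.cat([last_fused] + fourth_fms, 0)   # channel-dim concat
        logits = head(fifth_fm.unsqueeze(0))                 # (1, num_categories, H, W)
        return logits.argmax(dim=1).squeeze(0)               # second image: category per pixel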
6. The method according to claim 5, characterized in that upsampling each of the k third feature maps specifically comprises:
    for any one of the k third feature maps, expanding a channel count of the feature map based on a preset upsampling multiple to obtain a sixth feature map; and
    concatenating the feature map and the sixth feature map in the channel dimension, and performing pixel rearrangement on the concatenated feature map to obtain a fourth feature map corresponding to the feature map.
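Claim 6 reads as a pixel-shuffle style of upsampling: expand channels, concatenate, then rearrange pixels into a higher-resolution map. In the PyTorch sketch below, the 1x1 convolution for the channel expansion and nn.PixelShuffle for the rearrangement are plausible choices, not ones the claim requires.

    import torch
    import torch.nn as nn

    def upsample_by_rearrange(fm, r):
        # fm: (C, H, W) third feature map; r: the preset upsampling multiple.
        C = fm.shape[0]
        expand = nn.Conv2d(C, C * (r * r - 1), kernel_size=1)  # channel expansion
        sixth_fm = expand(fm.unsqueeze(0))                     # (1, C*(r*r-1), H, W); in
        # practice the conv would be a trained module, not freshly initialized here
        cat = torch.cat([fm.unsqueeze(0), sixth_fm], dim=1)    # (1, C*r*r, H, W)
        fourth_fm = nn.PixelShuffle(r)(cat)                    # pixel rearrangement
        return fourth_fm.squeeze(0)                            # (C, r*H, r*W)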
7. An image processing method, characterized in that the method comprises:
    obtaining a first image;
    performing feature extraction on the first image to obtain a plurality of first feature maps of different scales;
    obtaining a vector representation of categories to which pixels comprised in the first feature map of a first scale belong;
    fusing the vector representation and the first feature map of a second scale to obtain a second feature map, wherein the second scale is larger than the first scale; and
    obtaining a second image based on the second feature map, wherein the second image is used to represent the categories to which the pixels comprised in the first image belong.
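Claim 7 is the two-scale special case and composes directly from the earlier sketches. A minimal outline, in which every callable is a hypothetical stand-in for the corresponding claimed step, with its weights assumed to be bound elsewhere:

    def two_scale_segment(image, extract, class_vectors, fuse, classify):
        # extract: backbone producing first feature maps, smallest scale first.
        # class_vectors, fuse: the operations sketched after claims 3 and 4.
        # classify: maps the second feature map to per-pixel categories.
        fms = extract(image)             # first feature maps of different scales
        vecs = class_vectors(fms[0])     # vector representation at the first scale
        second_fm = fuse(vecs, fms[-1])  # fused with the larger, second-scale map
        return classify(second_fm)       # the second image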
8. An image processing apparatus, characterized in that the apparatus comprises:
    a communication module, configured to obtain a first image; and
    a processing module, configured to perform feature extraction on the first image to obtain a plurality of first feature maps of different scales;
    wherein the processing module is further configured to fuse n first feature maps among the plurality of first feature maps, wherein, in an i-th fusion, a vector representation of categories to which pixels comprised in a feature map obtained by an (i-1)-th fusion belong is obtained, and the vector representation and a first feature map to be fused in the i-th fusion are fused to obtain an i-th fused feature map, where 2 ≤ i ≤ n-1; and
    the processing module is further configured to obtain a second image based on a feature map obtained by a last fusion, wherein the second image is used to represent the categories to which the pixels comprised in the first image belong.
9. The apparatus according to claim 8, characterized in that, when obtaining the vector representation of the categories to which the pixels comprised in the feature map obtained by the (i-1)-th fusion belong, the processing module is specifically configured to:
    obtain masks of categories to which pixels comprised in a target feature map belong, to obtain a category mask set, wherein the target feature map is the feature map obtained by the (i-1)-th fusion; and
    obtain, based on a weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels comprised in the target feature map belong.
10. The apparatus according to claim 9, characterized in that, when obtaining, based on the weight of each mask in the category mask set and the target feature map, the vector representation of the categories to which the pixels comprised in the target feature map belong, the processing module is specifically configured to:
    obtain a transposed matrix of features of the pixels comprised in the target feature map; and
    multiply the weight of each mask by the transposed matrix to obtain the vector representation of the categories to which the pixels comprised in the target feature map belong.
11. The apparatus according to any one of claims 8 to 10, characterized in that, when fusing the vector representation and the first feature map to be fused in the i-th fusion to obtain the i-th fused feature map, the processing module is specifically configured to:
    obtain a key vector based on the vector representation and a first weight matrix, and obtain a value vector based on the vector representation and a second weight matrix;
    obtain a query vector based on the first feature map to be fused in the i-th fusion and a third weight matrix;
    perform attention computation based on the key vector, the value vector and the query vector to obtain a second feature map; and
    fuse the second feature map and the first feature map to be fused in the i-th fusion to obtain the i-th fused feature map.
12. The apparatus according to any one of claims 8 to 11, characterized in that, when obtaining the second image based on the feature map obtained by the last fusion, the processing module is specifically configured to:
    upsample each of k third feature maps to obtain k fourth feature maps, wherein the k third feature maps are at least one of: the feature maps obtained by each fusion other than the last fusion, and the lower-scale feature map of the two feature maps in the first fusion, and a scale of each fourth feature map is the same as a scale of the feature map obtained by the last fusion;
    concatenate the feature map obtained by the last fusion and the k fourth feature maps in a channel dimension to obtain a fifth feature map; and
    obtain the second image based on the fifth feature map.
13. The apparatus according to claim 12, characterized in that, when upsampling each of the k third feature maps, the processing module is specifically configured to:
    for any one of the k third feature maps, expand a channel count of the feature map based on a preset upsampling multiple to obtain a sixth feature map; and
    concatenate the feature map and the sixth feature map in the channel dimension, and perform pixel rearrangement on the concatenated feature map to obtain a fourth feature map corresponding to the feature map.
14. An image processing apparatus, characterized in that the apparatus comprises:
    a communication module, configured to obtain a first image; and
    a processing module, configured to perform feature extraction on the first image to obtain a plurality of first feature maps of different scales;
    wherein the processing module is further configured to obtain a vector representation of categories to which pixels comprised in the first feature map of a first scale belong;
    the processing module is further configured to fuse the vector representation and the first feature map of a second scale to obtain a second feature map, wherein the second scale is larger than the first scale; and
    the processing module is further configured to obtain a second image based on the second feature map, wherein the second image is used to represent the categories to which the pixels comprised in the first image belong.
15. A computing device, characterized by comprising:
    at least one memory, configured to store a program; and
    at least one processor, configured to execute the program stored in the memory;
    wherein, when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1 to 7.
16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when run on a processor, causes the processor to perform the method according to any one of claims 1 to 7.
17. A computer program product, characterized in that, when the computer program product runs on a processor, the processor is caused to perform the method according to any one of claims 1 to 7.
PCT/CN2023/141801 2022-12-26 2023-12-26 Image processing method and apparatus, and computing device WO2024140642A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211673092.0 2022-12-26
CN202211673092.0A CN116229143A (en) 2022-12-26 2022-12-26 Image processing method and device and computing equipment

Publications (1)

Publication Number Publication Date
WO2024140642A1

Family

ID=86583370

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/141801 WO2024140642A1 (en) 2022-12-26 2023-12-26 Image processing method and apparatus, and computing device

Country Status (2)

Country Link
CN (1) CN116229143A (en)
WO (1) WO2024140642A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229143A (en) * 2022-12-26 2023-06-06 华为技术有限公司 Image processing method and device and computing equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220327815A1 (en) * 2019-09-05 2022-10-13 Basf Se System and method for identification of plant species
CN111444365A (en) * 2020-03-27 2020-07-24 Oppo广东移动通信有限公司 Image classification method and device, electronic equipment and storage medium
CN112183507A (en) * 2020-11-30 2021-01-05 北京沃东天骏信息技术有限公司 Image segmentation method, device, equipment and storage medium
CN114820633A (en) * 2022-04-11 2022-07-29 北京三快在线科技有限公司 Semantic segmentation method, training device and training equipment of semantic segmentation model
CN115205535A (en) * 2022-06-14 2022-10-18 深圳市正浩创新科技股份有限公司 Image processing method, computer readable medium and electronic device
CN116229143A (en) * 2022-12-26 2023-06-06 华为技术有限公司 Image processing method and device and computing equipment

Also Published As

Publication number Publication date
CN116229143A (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111104962B (en) Semantic segmentation method and device for image, electronic equipment and readable storage medium
US11328172B2 (en) Method for fine-grained sketch-based scene image retrieval
US11176415B2 (en) Assisted image annotation
KR101865102B1 (en) Systems and methods for visual question answering
JP7559063B2 (en) FACE PARSING METHOD AND RELATED DEVICE
US12062158B2 (en) Image denoising method and apparatus
CN111340077B (en) Attention mechanism-based disparity map acquisition method and device
US20220108478A1 (en) Processing images using self-attention based neural networks
CN109886066A (en) Fast target detection method based on the fusion of multiple dimensioned and multilayer feature
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
WO2020098257A1 (en) Image classification method and device and computer readable storage medium
CN111783779B (en) Image processing method, apparatus and computer readable storage medium
US20220237896A1 (en) Method for training a model to be used for processing images by generating feature maps
WO2024140642A1 (en) Image processing method and apparatus, and computing device
TW201913557A (en) Image cutting method and device
CN113343981A (en) Visual feature enhanced character recognition method, device and equipment
CN113869371A (en) Model training method, clothing fine-grained segmentation method and related device
CN107274425A (en) A kind of color image segmentation method and device based on Pulse Coupled Neural Network
JP2023543964A (en) Image processing method, image processing device, electronic device, storage medium and computer program
CN114463335A (en) Weak supervision semantic segmentation method and device, electronic equipment and storage medium
EP4047547A1 (en) Method and system for removing scene text from images
CN113515920B (en) Method, electronic device and computer readable medium for extracting formulas from tables
CN113076755A (en) Keyword extraction method, device, equipment and storage medium
WO2024199101A1 (en) Lane line detection method and apparatus, and electronic device
US20240135684A1 (en) Systems and methods for annotating 3d data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23910567

Country of ref document: EP

Kind code of ref document: A1