CN118351299A - Image segmentation method and device based on open vocabulary segmentation - Google Patents
Image segmentation method and device based on open vocabulary segmentation
- Publication number
- CN118351299A (application CN202410250116.4A)
- Authority
- CN
- China
- Prior art keywords
- point
- text
- target image
- vector corresponding
- preset
- Prior art date
- Legal status
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
Description
Technical Field

The present invention relates to the field of image segmentation, and in particular to an image segmentation method and device based on open vocabulary segmentation.

Background Art

To overcome the limitations of closed-vocabulary segmentation, open vocabulary segmentation has been proposed. Open vocabulary segmentation uses text embeddings of category names expressed in natural language as label embeddings, instead of learning them from a training dataset. By doing so, the model can classify a much wider vocabulary, improving its ability to handle a broader range of categories. To ensure that meaningful embeddings are provided, a pre-trained text encoder is usually used; such an encoder effectively captures the semantic meaning of words and phrases, which is critical for open vocabulary segmentation. Multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown strong potential for open vocabulary segmentation because they learn aligned image-text feature representations from large-scale Internet data.

Current approaches to open-vocabulary image segmentation usually rely on image-mask-text triplets, but annotating the correspondence between masks and texts requires substantial manual effort and leads to expensive annotation costs. Although some weakly supervised methods have been proposed in the prior art, such as text supervision, to reduce annotation costs, the incompleteness of the supervision severely limits versatility and performance. Text supervision uses only image-text pairs for semantic segmentation and falls short in capturing fine spatial detail, which is suboptimal for dense prediction. In addition, text supervision lacks location information, making it difficult for a model to distinguish different instances of the same semantic class. These problems severely limit the versatility and segmentation performance of existing weakly supervised methods.

Therefore, in the prior art, image segmentation based on open vocabulary segmentation requires expensive annotation costs while also limiting the versatility and segmentation performance of image segmentation.
Summary of the Invention

The present invention provides an image segmentation method and device based on open vocabulary segmentation, to solve the prior-art problem that open-vocabulary image segmentation requires expensive annotation costs while limiting the versatility and segmentation performance of image segmentation. The method and device achieve versatile image segmentation and improve segmentation performance without spending substantial manual cost annotating the relationships among images, masks and texts.

An image segmentation method based on open vocabulary segmentation, the method comprising: obtaining a target image and a preset point grid; determining, based on the target image, at least one text embedding vector corresponding to the target image; for each point in the preset point grid, determining a query embedding vector corresponding to the point, and determining, based on the query embedding vector corresponding to the point, a predicted mask region vector corresponding to the point, wherein the query embedding vector corresponding to each point includes a position embedding vector of the point and at least one pixel-level ground-truth mask embedding vector corresponding to the point in the target image; and determining a predicted mask category corresponding to each point based on the predicted mask region vector corresponding to the point, the at least one text embedding vector corresponding to the target image, and a preset mask-text matching algorithm, wherein the predicted mask category corresponding to each point includes a pixel-level category label corresponding to the point in the target image, and the preset mask-text matching algorithm is used to determine, from the at least one text embedding vector corresponding to the target image, the text embedding vector that matches the predicted mask region vector corresponding to each point.
In one embodiment, determining, based on the target image, the at least one text embedding vector corresponding to the target image includes: determining the at least one text embedding vector based on the target image and a preset enhanced image-text extraction model, where the preset enhanced image-text extraction model is used to extract and enhance at least one text embedding vector that matches the description of the target image.

In one embodiment, the preset enhanced image-text extraction model includes a preset visual language model, a preset text language enhancement model, and a preset text encoder. Determining the at least one text embedding vector corresponding to the target image based on the target image and the preset enhanced image-text extraction model includes: inputting the target image into the preset visual language model to determine an initial text embedding vector of the target image, where the initial text embedding vector includes a feature representation of the descriptive text of the target image; inputting the initial text embedding vector of the target image into the preset text language enhancement model to determine at least one enhanced text of the text feature representation of the target image; and inputting each of the at least one enhanced text into the preset text encoder to obtain the at least one text embedding vector corresponding to the target image.

In one embodiment, determining, for each point in the preset point grid, the query embedding vector corresponding to the point includes: for each point, encoding the point, based on a pre-trained visual prompt encoder, into two position embedding vectors and at least one ground-truth mask embedding vector corresponding to the point in the target image; determining, based on the target image and a preset visual encoder, a content embedding vector corresponding to the at least one ground-truth mask embedding vector; and, for each point, concatenating the two position embedding vectors with the at least one ground-truth mask embedding vector and combining the result with the preset query type of the point and the content embedding vector corresponding to the at least one ground-truth mask embedding vector, to obtain the query embedding vector corresponding to the point.

In one embodiment, before determining the predicted mask region vector corresponding to each point based on the query embedding vector corresponding to the point, the method further includes: determining a multi-scale pixel feature vector corresponding to each point based on the target image, the preset visual encoder, and a pre-trained multi-scale pixel decoder. Determining the predicted mask region vector corresponding to each point based on the query embedding vector corresponding to the point then includes: determining the predicted mask region vector corresponding to each point based on the multi-scale pixel feature vector corresponding to the point, the query embedding vector corresponding to the point, and a pre-trained mask decoder.

In one embodiment, determining the multi-scale pixel feature vector corresponding to each point based on the target image, the preset visual encoder, and the pre-trained multi-scale pixel decoder includes: for each point, inputting the image patch corresponding to the point in the target image into the preset visual encoder to obtain an initial pixel feature vector of the point; and inputting the initial pixel feature vector of the point into the pre-trained multi-scale pixel decoder to obtain the multi-scale pixel feature vector of the point.

In one embodiment, determining the predicted mask category corresponding to each point based on the predicted mask region vector corresponding to the point, the at least one text embedding vector corresponding to the target image, and the preset mask-text matching algorithm includes: inputting the initial pixel feature vector of each point and the predicted mask region vector corresponding to the point into a pre-trained multi-scale feature adapter to obtain a pixel feature vector of the point; for each point, determining a cost matrix between the pixel feature vector of the point and each text embedding vector corresponding to the target image, based on the arccosine similarity between the two; determining, based on the cost matrix between the pixel feature vector of each point and all text embedding vectors corresponding to each image and a bipartite graph matching algorithm, the best-matching pixel feature vector and text embedding vector for each point; and determining the best-matching text embedding vector of each point as the predicted mask category of the best-matching pixel feature vector of that point.

In one embodiment, before inputting the initial pixel feature vector of each point, the multi-scale pixel feature vector of each point, and the predicted mask region vector corresponding to each point into the pre-trained multi-scale feature adapter to obtain the pixel feature vector of each point, the method further includes: updating, based on a training sample set and a preset cosine similarity loss function, the multi-scale feature adapter to obtain the pre-trained multi-scale feature adapter, wherein the training sample set includes at least one image and a pixel-level category label corresponding to each image, and the preset cosine similarity loss function is used to determine, from the at least one text embedding vector corresponding to each image, the text embedding vector corresponding to the predicted mask region vector of each point.
The present invention further provides an image segmentation device based on open vocabulary segmentation, the device comprising: an acquisition module for obtaining a target image and a preset point grid; a first determination module for determining, based on the target image, at least one text embedding vector corresponding to the target image; a second determination module for determining, for each point in the preset point grid, a query embedding vector corresponding to the point, and determining, based on the query embedding vector, a predicted mask region vector corresponding to the point, wherein the query embedding vector corresponding to each point includes a position embedding vector of the point and at least one pixel-level ground-truth mask embedding vector corresponding to the point in the target image; and a third determination module for determining a predicted mask category corresponding to each point based on the predicted mask region vector corresponding to the point, the at least one text embedding vector corresponding to the target image, and a preset mask-text matching algorithm, wherein the predicted mask category corresponding to each point includes a pixel-level category label corresponding to the point in the target image, and the preset mask-text matching algorithm is used to determine, from the at least one text embedding vector corresponding to the target image, the text embedding vector that matches the predicted mask region vector corresponding to each point.

The present invention further provides a computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above image segmentation method based on open vocabulary segmentation.

The present invention further provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above image segmentation method based on open vocabulary segmentation.

In the above image segmentation method and device based on open vocabulary segmentation, a preset point grid is provided and the query embedding vector of each point in the grid is learned individually, so that the predicted mask region vector corresponding to each point is determined and a pixel-level mask partition of the target image is achieved. A preset mask-text matching algorithm then matches the predicted mask region vector of each point against the text embedding vectors corresponding to the target image, aligning the masks with the entities in the textual description of the target image, thereby achieving pixel-level segmentation of the target image with masks and texts decoupled. In this way, masks are decoupled from texts, no substantial manual cost is needed to annotate the correspondence among images, masks and texts, and independent image-mask and image-text pairs free the method from a strict mask-text correspondence. Moreover, while decoupling masks from texts, the method achieves not only semantic segmentation of the target image but also instance-level discrimination.
Brief Description of the Drawings

FIG. 1 is a first schematic flow chart of the image segmentation method based on open vocabulary segmentation provided by the present invention;

FIG. 2 is a second schematic flow chart of the image segmentation method based on open vocabulary segmentation provided by the present invention;

FIG. 3 is a third schematic flow chart of the image segmentation method based on open vocabulary segmentation provided by the present invention;

FIG. 4 is a fourth schematic flow chart of the image segmentation method based on open vocabulary segmentation provided by the present invention;

FIG. 5 is a fifth schematic flow chart of the image segmentation method based on open vocabulary segmentation provided by the present invention;

FIG. 6 is a schematic diagram of the framework of the image segmentation method based on open vocabulary segmentation provided by the present invention;

FIG. 7 is a schematic diagram of the structure of the mask decoder layer provided by the present invention;

FIG. 8 is a schematic diagram of the framework of the multi-scale feature adapter provided by the present invention;

FIG. 9 is a comparison of experimental results between the image segmentation method based on open vocabulary segmentation provided by the present invention and other methods;

FIG. 10 is a schematic diagram of the framework of the image segmentation device based on open vocabulary segmentation provided by the present invention.
Detailed Description

To make the purpose, technical solution and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.

It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of the present disclosure shall have the ordinary meaning understood by a person of ordinary skill in the art to which the present disclosure belongs. "First", "second" and similar words used in the embodiments of the present disclosure do not indicate any order, quantity or importance, but are only used to distinguish different components. "Include", "comprise" and similar words mean that the elements or objects appearing before the word cover the elements or objects listed after the word and their equivalents, without excluding other elements or objects. "Connect", "connected" and similar words are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right" and the like are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.

To facilitate understanding, the inventive concept of the present invention is described below.

Currently, image segmentation based on open vocabulary segmentation usually relies on image-mask-text triplets, but this approach requires substantial manual effort to annotate the correspondence between masks and texts, which leads to expensive annotation costs.

Although some weakly supervised methods have been proposed in the prior art, such as text supervision to reduce annotation costs, the incompleteness of the supervision severely limits versatility and performance. Text supervision uses only image-text pairs for semantic segmentation and falls short in capturing fine spatial detail, which is suboptimal for dense prediction. In addition, text supervision lacks location information, making it difficult for a model to distinguish different instances of the same semantic class; that is, this type of supervision is only suitable for ordinary semantic segmentation and is difficult to apply to instance or panoptic segmentation. These problems severely limit the versatility and segmentation performance of existing weakly supervised methods.

Therefore, to address the above problems, the image segmentation method and device based on open vocabulary segmentation provided by the present invention free the strict correspondence between masks and texts by using independent image-mask and image-text pairs; these two types of pairs can easily be collected from different sources. A preset point grid is provided and the query embedding vector of each point in the grid is learned individually, so that the predicted mask region vector corresponding to each point is determined and a pixel-level mask partition of the target image is achieved. A preset mask-text matching algorithm then matches the predicted mask region vector of each point against the text embedding vectors corresponding to the target image, aligning the masks with the entities in the textual description of the target image, thereby achieving pixel-level segmentation of the target image with masks and texts decoupled. In this way, no substantial manual cost is needed to annotate the correspondence among images, masks and texts, and the method achieves not only semantic segmentation of the target image but also instance-level discrimination.

The image segmentation method and device based on open vocabulary segmentation provided by the present invention are described below with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of an image segmentation method based on open vocabulary segmentation provided by the present invention. It can be understood that the method may be performed by an image segmentation device based on open vocabulary segmentation, which may be a computer device.

As shown in FIG. 1, in one embodiment, an image segmentation method based on open vocabulary segmentation is proposed, which may specifically include the following steps:

Step 110: obtain a target image and a preset point grid.

The target image is an image to be semantically or panoptically segmented, for example an image from the field of medical imaging or autonomous driving. The preset point grid serves as a visual prompt for the target image and is used to determine the predicted mask region vectors, mainly to align the masks with the position information of the target image. The preset point grid contains multiple evenly distributed points: it is a grid of size h×w, with height h≤4 and width w≤4, aligned with the pixel centers.
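To make the point-grid prompt concrete, the following is a minimal PyTorch sketch (illustrative only, not part of the patent text) of generating an h×w grid of prompt points aligned with pixel centers; the normalized [0, 1] coordinate convention and the (x, y) tensor layout are assumptions.

```python
import torch

def make_point_grid(h: int, w: int) -> torch.Tensor:
    """Return an (h*w, 2) tensor of normalized (x, y) prompt points.

    Points are evenly spaced and aligned with pixel centers: the k-th
    cell center along an axis of n cells sits at (k + 0.5) / n.
    """
    ys = (torch.arange(h) + 0.5) / h          # row centers in [0, 1]
    xs = (torch.arange(w) + 0.5) / w          # column centers in [0, 1]
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([grid_x, grid_y], dim=-1).reshape(-1, 2)

points = make_point_grid(4, 4)                # 16 prompt points for h = w = 4
```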
It can be understood that, after the target image is acquired, it may also be pre-processed according to preset processing steps to obtain images of different resolutions. For example, a random horizontal flip may be applied to the target image; the image is then randomly rescaled to a resolution in the range of 716×716 to 1075×1075; finally, an 896×896 crop is extracted from the rescaled image as input.
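A minimal sketch of this preprocessing pipeline, assuming torchvision and a PIL input image; the square rescaling, uniform sampling of the target resolution, and padding when the rescaled image is smaller than the crop are assumptions.

```python
import random
from torchvision.transforms import functional as TF

def preprocess(img):
    """Random horizontal flip -> random square rescale in [716, 1075] -> 896x896 crop."""
    if random.random() < 0.5:
        img = TF.hflip(img)
    size = random.randint(716, 1075)          # target resolution for this sample
    img = TF.resize(img, [size, size])
    if size < 896:                            # pad so an 896x896 crop always fits
        img = TF.pad(img, [0, 0, 896 - size, 896 - size])
        size = 896
    top = random.randint(0, size - 896)
    left = random.randint(0, size - 896)
    return TF.crop(img, top, left, 896, 896)
```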
Step 120: based on the target image, determine at least one text embedding vector corresponding to the target image.

The at least one text embedding vector corresponding to the target image is the text embedding vector of at least one descriptive text of the target image. The descriptive texts describe the types of entities in the target image; for example, if the target image contains a person, a cat and a table, the descriptive texts are "person", "cat" and "table". In view of the open vocabulary segmentation described above, it can be understood that, in order to learn more semantic features in the image, the text embeddings corresponding to the target image can be extracted directly with a pre-trained text encoder and used as the label embeddings of the target image, so as to classify the entities involved in the target image.

Specifically, the at least one text embedding vector corresponding to the target image can be extracted by a multimodal model such as CLIP.
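For example, with the open_clip library, per-entity text embeddings could be obtained as sketched below; the model tag, pretrained weights, and prompt template are illustrative assumptions, not the patent's exact configuration.

```python
import open_clip
import torch

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

entities = ["person", "cat", "table"]          # descriptive texts for the target image
with torch.no_grad():
    tokens = tokenizer([f"a photo of a {e}" for e in entities])
    text_emb = model.encode_text(tokens)       # (3, dim) text embedding vectors
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
```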
Step 130: for each point in the preset point grid, determine the query embedding vector corresponding to the point, and determine, based on the query embedding vector, the predicted mask region vector corresponding to the point.

The query embedding vector corresponding to each point includes the position embedding vector of the point and at least one pixel-level ground-truth mask embedding vector corresponding to the point in the target image.

It can be understood that, in order to use independent image-mask and image-text pairs to free the strict correspondence between masks and texts, and at the same time to capture complex spatial details, obtain more position information, and distinguish different instances of the same semantic class in the target image, the present invention sets a preset point grid and learns the query embedding vector of each point in the grid individually, thereby obtaining the pixel-level ground-truth mask embedding vectors corresponding to the target image and achieving a pixel-level mask partition of the target image. As a result, the predicted mask region vector determined for each point from its query embedding vector supports not only semantic segmentation but also instance and panoptic segmentation of the target image, so that this weakly supervised approach reduces manual annotation costs while preserving the versatility and performance of image segmentation.

It can also be understood that, in practice, multiple points may be queried at once; a set of predicted mask region vectors corresponding to the multiple points can then be determined by the above method, and this set may correspond to multiple ground-truth mask embedding vectors. The relationship between predicted mask region vectors and ground-truth mask vectors may be one-to-one, one-to-many, many-to-one or many-to-many.
Step 140: determine the predicted mask category corresponding to each point based on the predicted mask region vector corresponding to the point, the at least one text embedding vector corresponding to the target image, and a preset mask-text matching algorithm.

The predicted mask category corresponding to each point includes the pixel-level category label corresponding to the point in the target image. The preset mask-text matching algorithm is used to determine, from the at least one text embedding vector corresponding to the target image, the text embedding vector that matches the predicted mask region vector corresponding to each point.

It can be understood that the preset mask-text matching algorithm, combined with the predicted mask region vectors determined above, aligns the masks with the entities in the textual description of the target image, thereby achieving pixel-level classification of the entities in the target image with masks and texts decoupled. This not only achieves semantic segmentation of the target image but also distinguishes instances, ensuring the versatility and performance of image segmentation.

In the image segmentation method and device based on open vocabulary segmentation provided by the present invention, a preset point grid is provided and the query embedding vector of each point in the grid is learned individually, so that the predicted mask region vector corresponding to each point is determined and a pixel-level mask partition of the target image is achieved; a preset mask-text matching algorithm then matches the predicted mask region vector of each point against the text embedding vectors of the target image, aligning the masks with the entities in the textual description of the target image and thereby achieving pixel-level segmentation with masks and texts decoupled. In this way, no substantial manual cost is needed to annotate the correspondence among images, masks and texts, and independent image-mask and image-text pairs free the method from a strict mask-text correspondence, while achieving both semantic segmentation and instance-level discrimination of the target image.
In one embodiment, determining, based on the target image, the at least one text embedding vector corresponding to the target image includes: determining the at least one text embedding vector based on the target image and a preset enhanced image-text extraction model, where the preset enhanced image-text extraction model is used to extract and enhance at least one text embedding vector that matches the description of the target image.

In one embodiment, the preset enhanced image-text extraction model includes a preset visual language model, a preset text language enhancement model and a preset text encoder. As shown in FIG. 2, determining the at least one text embedding vector corresponding to the target image based on the target image and the preset enhanced image-text extraction model includes:

Step 210: input the target image into the preset visual language model to determine an initial text embedding vector of the target image; the initial text embedding vector includes a feature representation of the descriptive text of the target image.

The preset visual language model refines the text description and improves its quality. It can be understood that collected image-text pairs usually contain some text that does not match the image, leading to incorrect correspondences between masks and entities; directly aligning these incorrect correspondences results in poor segmentation performance. The present invention therefore sets a corresponding preset visual language model to improve the quality of the text description, so that the initial text embedding vector of the target image is a feature representation of a description that closely matches the target image.

Preferably, the preset visual language model may be a large visual language model such as LLaVA (Large Language and Vision Assistant). It can be understood that other large visual language models, such as ERNIE Bot (Wenxin Yiyan), may also be adopted; the present invention is not limited in this respect.

Step 220: input the initial text embedding vector of the target image into the preset text language enhancement model to determine at least one enhanced text of the text feature representation of the target image.

The preset text language enhancement model is used for accurate entity extraction. Each enhanced text of the text feature representation of the target image corresponds to an entity in the target image, for example the people and scenes involved in the target image, or the clothes worn by the people in it.

Preferably, the preset text language enhancement model may be, for example, a ChatGPT (Chat Generative Pre-trained Transformer) parser. It can be understood that other models may also be adopted; the present invention is not limited in this respect.

It can be understood that the preset text language enhancement model, combined with the preset visual language model, can extract accurate entities from the target image, thereby obtaining accurate image-text pairs and laying the foundation for the subsequent alignment of the image's text with the image's masks.

Step 230: input each of the at least one enhanced text into the preset text encoder to obtain the at least one text embedding vector corresponding to the target image.

The preset text encoder may be, for example, the text encoder of a ConvNeXt-based CLIP model. Specifically, the preset text encoder is constructed as a 16-layer Transformer with 768 units per layer and 12 attention heads.
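A text encoder of this size can be sketched in plain PyTorch as follows; the feed-forward width (3072) and the batch-first layout are assumptions, since the patent only fixes the depth, hidden size and head count.

```python
import torch.nn as nn

# 16-layer Transformer text encoder: 768 hidden units per layer, 12 attention heads
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072, batch_first=True
)
text_encoder = nn.TransformerEncoder(encoder_layer, num_layers=16)
```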
In one embodiment, as shown in FIG. 3, determining, for each point in the preset point grid, the query embedding vector corresponding to the point includes:

Step 310: for each point in the preset point grid, encode the point, based on a pre-trained visual prompt encoder, into two position embedding vectors and at least one ground-truth mask embedding vector corresponding to the point in the target image.

Specifically, the position embedding vector of each point can be determined by the sinusoidal encoding in the pre-trained visual prompt encoder.

The pre-trained visual prompt encoder is used to encode visual prompts (points or bounding boxes). It is trained based on a first preset loss function; the specific training process is described later. In the present invention, each visual prompt (the at least one ground-truth mask embedding vector corresponding to each point in the target image) is coupled with a position embedding vector.

Step 320: determine, based on the target image and a preset visual encoder, the content embedding vector corresponding to the at least one ground-truth mask embedding vector.

The preset visual encoder is used to extract visual features from the target image and may be, for example, the image encoder of a ConvNeXt-based CLIP model. Specifically, the preset visual encoder may be configured as a ConvNeXt-Large model with four stages containing 3, 3, 27 and 3 blocks respectively. Multi-scale features are extracted with the preset visual encoder and represented as feature maps of different widths and scales: a 192-channel map downscaled 4×, a 384-channel map downscaled 8×, a 768-channel map downscaled 16×, and a 1536-channel map downscaled 32×.
Step 330: for each point in the preset point grid, concatenate the two position embedding vectors corresponding to the point with the at least one ground-truth mask embedding vector corresponding to the point in the target image, and combine the result with the preset query type of the point and the content embedding vector corresponding to the at least one ground-truth mask embedding vector, to obtain the query embedding vector corresponding to the point.

Specifically, the query embedding vector corresponding to the i-th point can be expressed as $q_i = \mathrm{concat}(q_{\mathrm{pos}}, q_{\mathrm{msk}})$ combined with $q_{\mathrm{type}}$ and $q_{\mathrm{feat}}$, where $q_{\mathrm{pos}} \in \mathbb{R}^{2 \times dim}$ denotes the two position embedding vectors of the i-th point, $q_{\mathrm{msk}} \in \mathbb{R}^{M \times dim}$ denotes the $M$ ground-truth mask embedding vectors corresponding to the i-th point in the target image, $q_{\mathrm{type}} \in \mathbb{R}^{dim}$ denotes the preset query type of the i-th point (point or box), and $q_{\mathrm{feat}} \in \mathbb{R}^{dim}$ denotes the content embedding vector corresponding to the at least one ground-truth mask embedding vector.
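One plausible reading of this composition, sketched in PyTorch; the exact way the type and content embeddings are combined (broadcast addition here) is an assumption, as the patent only states "concatenate, then combine".

```python
import torch

def build_query(q_pos, q_msk, q_type, q_feat):
    """q_pos: (2, dim), q_msk: (M, dim), q_type: (dim,), q_feat: (dim,)."""
    q = torch.cat([q_pos, q_msk], dim=0)       # concatenate position and mask embeddings
    return q + q_type + q_feat                 # combine with query type and content embedding

dim, M = 256, 3
q_i = build_query(torch.randn(2, dim), torch.randn(M, dim),
                  torch.randn(dim), torch.randn(dim))   # (2 + M, dim)
```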
In one embodiment, before determining the predicted mask region vector corresponding to each point based on the query embedding vector corresponding to the point, the method further includes: determining the multi-scale pixel feature vector corresponding to each point based on the target image, the preset visual encoder and a pre-trained multi-scale pixel decoder.

It can be understood that, in order to effectively associate masks with visual prompts (e.g., points and boxes), a visual prompt encoder, a multi-scale pixel decoder and a mask decoder are used in combination.

The pre-trained multi-scale pixel decoder is used to enhance the visual features extracted from the target image by the preset visual encoder, thereby reducing the inherent noise in the correspondences between masks and entities. It adopts a feature pyramid network (FPN) architecture. First, a 6-layer multi-scale deformable transformer is applied to the multi-scale features to aggregate contextual information. Then, the low-resolution feature map in the multi-scale pixel decoder is upsampled 2× and combined with the corresponding-resolution feature map from the backbone, which has been projected to match the channel dimension; the projection is a 1×1 convolution followed by group normalization (GroupNorm). The merged features are then further processed by an additional 3×3 convolution with GroupNorm and ReLU activation. This process is applied iteratively, starting from the 32×-downscaled feature map, until the final 4×-downscaled feature map is obtained. To generate per-pixel embeddings, a single 1×1 convolution is applied at the end. Throughout the multi-scale pixel decoder, all feature maps maintain a consistent dimension of 256 channels.
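A minimal sketch of the FPN-style top-down path just described, omitting the 6-layer deformable-transformer stage (its outputs are assumed to be the input feature maps); the GroupNorm group count and nearest-neighbor upsampling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelDecoderFPN(nn.Module):
    """Top-down FPN: 1x1 lateral conv + GroupNorm, 2x upsampling, 3x3 fusion conv
    + GroupNorm + ReLU, iterated from stride 32 down to stride 4; all maps keep
    256 channels, and a final 1x1 conv produces the per-pixel embeddings."""

    def __init__(self, in_channels=(192, 384, 768, 1536), dim=256, groups=32):
        super().__init__()
        self.laterals = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, dim, 1), nn.GroupNorm(groups, dim))
            for c in in_channels
        ])
        self.fuse = nn.ModuleList([
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                          nn.GroupNorm(groups, dim), nn.ReLU())
            for _ in in_channels[:-1]
        ])
        self.pixel_embed = nn.Conv2d(dim, dim, 1)   # final per-pixel embedding

    def forward(self, feats):
        # feats: backbone maps at strides 4, 8, 16, 32 (in that order)
        x = self.laterals[-1](feats[-1])            # start from the 1/32 map
        for i in range(len(feats) - 2, -1, -1):     # 1/16 -> 1/8 -> 1/4
            x = F.interpolate(x, scale_factor=2.0, mode="nearest")
            x = self.fuse[i](x + self.laterals[i](feats[i]))
        return self.pixel_embed(x)                  # 1/4 resolution, 256 channels
```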
Accordingly, determining the predicted mask region vector corresponding to each point based on the query embedding vector corresponding to the point includes: determining the predicted mask region vector corresponding to each point based on the multi-scale pixel feature vector corresponding to the point, the query embedding vector corresponding to the point, and a pre-trained mask decoder.

Specifically, the predicted mask region vector corresponding to each point can be obtained by matrix multiplication of the multi-scale pixel feature vector corresponding to the point and the query embedding vector corresponding to the point; the corresponding calculation can be written as $m_i = \mathrm{sigmoid}(q_i \cdot x_p)$, where $q_i$ denotes the query embedding vector of the i-th point, $x_p$ denotes the multi-scale pixel feature vector corresponding to the i-th point, and $\mathrm{sigmoid}(\cdot)$ is the sigmoid function that normalizes the mask values to $[0, 1]$.

It can be understood that, in practice, multiple points can be queried at the same time, so the query embedding vectors of one or more points can be taken as a group of query embeddings, i.e., $Q = \{q_1, q_2, \ldots, q_n\}$; the corresponding predicted mask region vectors are then computed as $m_n = \mathrm{sigmoid}(Q \cdot x_p)$, with $m_n \in \mathbb{R}^{M \times H \times W}$.
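A sketch of this mask prediction step with an einsum; the tensor shapes are assumptions consistent with the formulas above.

```python
import torch

def predict_masks(Q, x_p):
    """Q: (n, dim) query embeddings; x_p: (dim, H, W) per-pixel features.
    Returns (n, H, W) mask probabilities in [0, 1]."""
    logits = torch.einsum("nd,dhw->nhw", Q, x_p)   # matrix product per query
    return logits.sigmoid()

masks = predict_masks(torch.randn(16, 256), torch.randn(256, 224, 224))
```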
In the present invention, a transformer decoder can be used as the mask decoder. Specifically, six mask decoder layers as shown in FIG. 7 are used, and the same loss function is applied after each layer; for the specific loss function, refer to the first preset loss function below.

It should be noted that the visual prompt encoder, mask decoder and multi-scale pixel decoder can all be updated based on the at least one image and the query embedding vector corresponding to each point, in combination with the first preset loss function. Specifically, the first preset loss function combines a binary cross-entropy term and a Dice term over matched mask pairs, and can be written as

$$\mathcal{L}_{\mathrm{mask}} = \lambda_{\mathrm{bce}}\,\mathcal{L}_{\mathrm{bce}}(m_i, m_j) + \lambda_{\mathrm{dice}}\,\mathcal{L}_{\mathrm{dice}}(m_i, m_j),$$

where $\lambda_{\mathrm{bce}}$ and $\lambda_{\mathrm{dice}}$ are two hyperparameters balancing the binary cross-entropy loss and the Dice segmentation loss, $m_i$ is the predicted mask region vector corresponding to each point, and the supervision includes $k$ ground-truth mask region vectors $m_j$ together with the category label $c_j$ corresponding to each ground-truth mask region vector.
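A sketch of such a combined mask loss follows; the Dice formulation, the smoothing constant eps and the example weights (λ_bce = λ_dice = 5.0) are common choices assumed here, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def mask_loss(pred, target, lambda_bce=5.0, lambda_dice=5.0, eps=1.0):
    """pred, target: (N, H, W); pred holds mask probabilities in [0, 1]."""
    bce = F.binary_cross_entropy(pred, target)
    p, t = pred.flatten(1), target.flatten(1)
    dice = 1 - (2 * (p * t).sum(-1) + eps) / (p.sum(-1) + t.sum(-1) + eps)
    return lambda_bce * bce + lambda_dice * dice.mean()
```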
In one embodiment, as shown in FIG. 4, determining the predicted mask category corresponding to each point based on the predicted mask region vector corresponding to the point, the at least one text embedding vector corresponding to the target image, and the preset mask-text matching algorithm includes:

Step 410: input the initial pixel feature vector of each point and the predicted mask region vector corresponding to the point into a pre-trained multi-scale feature adapter to obtain the pixel feature vector of the point.

The pre-trained multi-scale feature adapter is matched with the pre-trained multi-scale pixel decoder and processes the feature maps of different scales output by the decoder separately, refining and enhancing the pixel features at each scale.

Specifically, the initial pixel feature vector of each point and the predicted mask region vector corresponding to the point can be passed through mask pooling and the multi-scale feature adapter in turn to obtain the pixel feature vector of the point. The pixel feature vector of each point can be expressed as $r_i = F_v(P(x_v, m_i))$, where $P(\cdot)$ denotes mask pooling, $F_v(\cdot)$ denotes the multi-scale feature adapter, $x_v$ denotes the initial pixel feature vector of each point, and $m_i$ denotes the predicted mask region vector corresponding to the point.
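A sketch of mask pooling $P$ followed by an adapter $F_v$; the adapter here is a placeholder two-layer MLP, whereas the patent's multi-scale feature adapter (FIG. 8) is more elaborate, so treat this only as an illustration of the data flow.

```python
import torch
import torch.nn as nn

def mask_pool(x_v, m):
    """x_v: (dim, H, W) pixel features; m: (N, H, W) soft masks.
    Returns (N, dim): mask-weighted average feature per predicted region."""
    weighted = torch.einsum("nhw,dhw->nd", m, x_v)
    return weighted / m.flatten(1).sum(-1, keepdim=True).clamp(min=1e-6)

adapter = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
r = adapter(mask_pool(torch.randn(256, 224, 224), torch.rand(16, 224, 224)))
```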
For the specific structure of the multi-scale feature adapter, refer to FIG. 8.
步骤420,针对每个点,基于所述每个点对应的像素特征向量和所述目标图像对应的每个文本嵌入向量的反余弦相似度,确定所述每个点的像素特征向量和所述目标图像对应的每个文本嵌入向量之间的成本矩阵。Step 420, for each point, based on the arc cosine similarity between the pixel feature vector corresponding to each point and each text embedding vector corresponding to the target image, determine a cost matrix between the pixel feature vector of each point and each text embedding vector corresponding to the target image.
其中,每个点对应的预测掩码区域向量和所述每个图像对应的每个文本嵌入向量之间的成本矩阵对应的表达式为:其中,δ'i,j表示第i个点对应的像素特征向量和所述每个图像对应的第j个文本嵌入向量的反余弦相似度,ri表示第i个点的像素特征向量,tj表示每个图像对应的第j个文本嵌入向量。Among them, the expression corresponding to the cost matrix between the predicted mask area vector corresponding to each point and each text embedding vector corresponding to each image is: Wherein, δ'i ,j represents the arc cosine similarity between the pixel feature vector corresponding to the i-th point and the j-th text embedding vector corresponding to each image, ri represents the pixel feature vector of the i-th point, and tj represents the j-th text embedding vector corresponding to each image.
步骤430,基于所述每个点的像素特征向量和所述每个图像对应的所有文本嵌入向量之间的成本矩阵,以及二部图匹配算法,确定每个点对应的最佳匹配的像素特征向量和文本嵌入向量。Step 430, based on the cost matrix between the pixel feature vector of each point and all text embedding vectors corresponding to each image, and a bipartite graph matching algorithm, determine the best matching pixel feature vector and text embedding vector corresponding to each point.
Step 440: determine the best-matching text embedding vector corresponding to each point as the predicted mask category of the best-matching pixel feature vector corresponding to that point.
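The bipartite graph matching in steps 430 and 440 can be realized, for example, with the Hungarian algorithm; the patent does not name a specific matcher, so the use of scipy's linear_sum_assignment below is an illustrative choice:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points_to_texts(cost_matrix):
    """Minimum-cost bipartite matching over an (N points x K texts)
    cost matrix. With N != K, only min(N, K) pairs are returned."""
    rows, cols = linear_sum_assignment(cost_matrix)
    return list(zip(rows.tolist(), cols.tolist()))

# Example: for each matched pair (i, j), the category described by text
# embedding t_j becomes the predicted mask category of point i.
pairs = match_points_to_texts(np.random.rand(5, 3))
```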
In one embodiment, before inputting the initial pixel feature vector of each point, the multi-scale pixel feature vector of each point, and the predicted mask region vector corresponding to each point into the pre-trained multi-scale feature adapter to obtain the pixel feature vector of each point, the method further includes: updating the multi-scale feature adapter into its pre-trained form based on a training sample set and a preset cosine similarity loss function. The training sample set includes at least one image and a pixel-level category label corresponding to each image; the preset cosine similarity loss function is used to determine, from the at least one text embedding vector corresponding to each image, the text embedding vector corresponding to the predicted mask region vector of each point.
Specifically, in one embodiment, as shown in FIG5, the training process of the multi-scale feature adapter includes:
Step 510: for each image among the at least one image in the training sample set, input the image into the preset enhanced image-text extraction model to determine the at least one text embedding vector corresponding to the image.
Step 520: input each image among the at least one image in the training sample set, together with the predicted mask region vector corresponding to each point, into the preset visual encoder and the pre-trained multi-scale pixel decoder, and obtain the pixel feature vector of each point through the multi-scale feature adapter.
Step 530: update the multi-scale feature adapter into its pre-trained form based on the at least one text embedding vector corresponding to each image in the training sample set, the pixel feature vector of each point in the point grid, and the preset cosine similarity loss function.
Specifically, step 530 includes:
Step 5301: for each point, determine the cost matrix between the pixel feature vector of the point and each text embedding vector corresponding to each image, based on the arccosine similarity between the two.
As in step 420, the cost matrix between the pixel feature vector of each point and each text embedding vector corresponding to each image is given by δ'i,j = arccos((ri·tj)/(‖ri‖·‖tj‖)), where δ'i,j denotes the arccosine similarity between the pixel feature vector of the i-th point and the j-th text embedding vector corresponding to each image, ri denotes the pixel feature vector of the i-th point, and tj denotes the j-th text embedding vector corresponding to each image.
Step 5302: based on the cost matrix between the pixel feature vector of each point and all text embedding vectors corresponding to each image, together with the bipartite graph matching algorithm, determine the best-matching pixel feature vector and text embedding vector for each point.
Step 5303: input the best-matching pixel feature vector and text embedding vector of each point into the preset cosine similarity loss function to obtain the loss value corresponding to each point, and update the parameters of the multi-scale feature adapter based on these loss values to obtain the pre-trained multi-scale feature adapter.
其中,所述预设的余弦相似度损失函数为:Among them, the preset cosine similarity loss function is:
其中,ri'表示第i个点对应的最佳匹配的像素特征向量,tk表示第i个点对应的最佳匹配的第k文本嵌入向量。 Among them, ri ' represents the best matching pixel feature vector corresponding to the i-th point, and tk represents the best matching k-th text embedding vector corresponding to the i-th point.
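A minimal sketch of this loss, assuming the standard one-minus-cosine-similarity form reconstructed above:

```python
import torch
import torch.nn.functional as F

def cosine_similarity_loss(r_best, t_best):
    """L = 1 - cos(r'_i, t_k): pulls each matched pixel feature toward
    its matched text embedding.

    r_best: (N, C) best-matching pixel feature vectors r'_i
    t_best: (N, C) best-matching text embedding vectors t_k
    """
    cos = F.cosine_similarity(r_best, t_best, dim=-1)  # (N,)
    return (1.0 - cos).mean()
```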
FIG6 is a schematic framework diagram of the image segmentation method based on open vocabulary segmentation provided by the present invention. For image-mask pairs, a mask-generation branch comprising a visual prompt encoder, a pixel decoder, and a mask decoder is used to predict a set of binary masks for the input image. For image-text pairs, mask-text bipartite graph matching is used to exploit confident pairs between the predicted masks and the entities in the text descriptions; a multi-scale feature adapter is then employed to enhance the visual embeddings of the masks, which are further aligned with the embeddings of the relevant entities based on the confident pairs. Because collected image-text pairs usually contain some text that does not match the image, the correspondence between masks and entities can be incorrect; therefore, a large vision-language model can be employed to improve the quality of the text descriptions, combined with a ChatGPT-based parser for precise entity extraction. Since the present invention allows masks and texts to be unpaired, multi-scale integration is introduced to improve the quality of the visual embeddings and to effectively handle the inherent noise in the mask-entity correspondence, thereby stabilizing the matching process.
As shown in FIG7, each mask decoder layer includes a cross-attention layer. This cross-attention layer is crucial because it ensures that the final pixel feature vectors are enriched with, and have access to, key geometric information such as point coordinates and bounding boxes. The visual prompt embeddings (query embedding vectors) are fed through layer normalization and then processed by a multi-layer perceptron (MLP); the processed query embedding vectors are then dot-multiplied with the pixel feature vectors to generate the predicted masks (predicted mask region vectors). Each mask decoder layer updates the visual prompt embeddings and the pixel features through the cross-attention layer, while a self-attention layer updates the visual prompts. At each attention layer, positional encodings are added to the pixel features, and the entire original visual prompt (including its positional encoding) is added to the updated visual prompt.
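A compact sketch of one such mask decoder layer follows; the embedding dimension, head count, residual placement, and module names are assumptions for illustration rather than the exact architecture of the present invention:

```python
import torch
import torch.nn as nn

class MaskDecoderLayer(nn.Module):
    """Sketch of one decoder layer: cross-attention injects geometric
    pixel context into the prompt queries, self-attention updates the
    queries, and a LayerNorm + MLP head prepares them for the final
    dot product with the pixel features."""
    def __init__(self, dim=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, queries, pixel_feats, query_pos, pixel_pos):
        # Cross-attention: queries attend to position-encoded pixel features.
        q, _ = self.cross_attn(queries + query_pos, pixel_feats + pixel_pos,
                               pixel_feats)
        q = q + queries
        # Self-attention updates the visual prompts (original prompt re-added).
        q2, _ = self.self_attn(q + query_pos, q + query_pos, q)
        q = q + q2
        # LayerNorm + MLP, then dot product with pixel features -> masks.
        q = self.mlp(self.norm(q))
        masks = torch.einsum('bqc,bnc->bqn', q, pixel_feats)
        return q, masks
```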
FIG8 is a schematic framework diagram of the multi-scale feature adapter provided by the present invention. The multi-scale feature adapter shown in FIG8 is designed to refine and enhance pixel features at various scales. The adapter consists of multiple low-rank adapters (LoRA adapters) [27], each tailored to update features at a specific scale. Each LoRA adapter contains two linear layers, denoted A and B, with a nonlinear activation function placed between them: the linear layers A and B apply linear transformations to the input data, while the activation function introduces nonlinearity, allowing more complex relationships in the data to be modeled. Since each adapter is responsible for features at one scale, the features are refined in a hierarchical manner. This modular design allows focused, specialized processing of features according to their resolution and semantic complexity, which is particularly beneficial for dense prediction tasks.
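The described structure — two linear layers A and B with a nonlinearity between them, and one adapter per scale — can be sketched as follows; the rank, the GELU activation, and the residual connection are assumptions of the sketch:

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """One per-scale adapter: linear layers A (down-projection) and
    B (up-projection) with a nonlinearity in between, added residually."""
    def __init__(self, dim, rank=16):
        super().__init__()
        self.A = nn.Linear(dim, rank, bias=False)
        self.act = nn.GELU()
        self.B = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return x + self.B(self.act(self.A(x)))

class MultiScaleFeatureAdapter(nn.Module):
    """One LoRA adapter per feature scale from the pixel decoder."""
    def __init__(self, dim, num_scales=3):
        super().__init__()
        self.adapters = nn.ModuleList(LoRAAdapter(dim) for _ in range(num_scales))

    def forward(self, feats_per_scale):
        # feats_per_scale: list of (..., dim) tensors, one per scale
        return [ad(f) for ad, f in zip(self.adapters, feats_per_scale)]
```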
For ease of understanding, the effects achievable by the present invention are illustrated below in conjunction with experiments. The method is mainly compared with fully supervised and weakly supervised methods; the detailed comparison results are shown in FIG9, where "COCO S.", "COCO P." and "COCO C." denote the COCO dataset, the panoptic dataset, and the caption dataset, respectively, "O365" denotes the Object 365 dataset, and "M.41M" denotes the merged 41M-image dataset. The reported metric is mIoU on all datasets. Specifically, the method of the present invention is comprehensively compared with existing methods on a series of benchmarks. The datasets involved include ADE20K (150- and 847-class variants), PASCAL Context (459- and 59-class variants), PASCAL VOC (20- and 21-class variants), and Cityscapes. The existing fully supervised methods involved are: SimBaseline (A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model), ZegFormer (Decoupling zero-shot semantic segmentation), LSeg+ (Language-driven semantic segmentation), OVSeg (Open-vocabulary semantic segmentation with mask-adapted CLIP), ODISE (Open-vocabulary panoptic segmentation with text-to-image diffusion models), X-Decoder (Generalized decoding for pixel, image, and language), OpenSEED (A simple framework for open-vocabulary segmentation and detection), MaskCLIP (Open-vocabulary universal image segmentation with MaskCLIP), and FC-CLIP (Open-vocabulary segmentation with single frozen convolutional CLIP). The existing weakly supervised methods involved are: GroupViT (Semantic segmentation emerges from text supervision), TCL (Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs), OVSeg (Learning open-vocabulary semantic segmentation models from natural language supervision), SegCLIP (Patch aggregation with learnable centers for open-vocabulary semantic segmentation), CLIPpy (Perceptual grouping in contrastive vision-language models), MixReorg (Cross-modal mixed patch reorganization is a good mask learner for open-world semantic segmentation), and SAM-CLIP (Merging vision foundation models towards semantic and spatial understanding). Compared with the existing weakly supervised methods, the method of the present invention shows significant performance improvements on all evaluated datasets. Specifically, on the more challenging PASCAL Context-459 dataset, the method not only surpasses weakly supervised methods but also outperforms state-of-the-art fully supervised methods such as FC-CLIP, demonstrating a superior ability to classify diverse semantic categories. Furthermore, on the PASCAL VOC benchmarks (20 and 21 classes), the method shows marked gains over the state-of-the-art weakly supervised methods, achieving improvements of 18.3% and 12.2% mIoU respectively, indicating that it captures fine-grained spatial structures. These results raise the practical applicability of weakly supervised open vocabulary segmentation to a new level.
The image segmentation apparatus based on open vocabulary segmentation provided by the present invention is described below; the apparatus described below and the image segmentation method based on open vocabulary segmentation described above may be cross-referenced with each other.
As shown in FIG10, in one embodiment, an image segmentation apparatus based on open vocabulary segmentation is provided, and the apparatus may include:
an acquisition module 1010, configured to acquire a target image and a preset point grid;
a first determination module 1020, configured to determine, based on the target image, at least one text embedding vector corresponding to the target image;
a second determination module 1030, configured to determine, for each point in the preset point grid, the query embedding vector corresponding to the point, and to determine, based on that query embedding vector, the predicted mask region vector corresponding to the point, where the query embedding vector of each point includes the position embedding vector of the point and at least one pixel-level real mask embedding vector corresponding to the point in the target image; and
a third determination module 1040, configured to determine the predicted mask category corresponding to each point based on the predicted mask region vector corresponding to the point, the at least one text embedding vector corresponding to the target image, and the preset mask-text matching algorithm, where the predicted mask category of each point includes the pixel-level category label corresponding to the point in the target image, and the preset mask-text matching algorithm is used to determine, from the at least one text embedding vector corresponding to the target image, the text embedding vector that matches the predicted mask region vector of each point.
With the image segmentation apparatus based on open vocabulary segmentation provided by the present invention, a preset point grid is provided and the query embedding vector of each point in the grid is learned individually, so that the predicted mask region vector corresponding to each point can be determined and a pixel-level mask partition of the target image achieved. Combined with the preset mask-text matching algorithm, the predicted mask region vector of each point is then matched against the text embedding vectors corresponding to the target image, aligning the masks with the entities in the text description of the target image; the target image can therefore be segmented at the pixel level even when masks and texts are decoupled. In this way, mask-text decoupling is achieved without spending substantial manual effort on annotating the correspondence among images, masks, and texts: independent image-mask and image-text pairs free the model from a strict mask-text correspondence. Moreover, while decoupling masks from texts, the apparatus can not only perform semantic segmentation of the target image but also distinguish instances within it.
In addition, the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the image segmentation method based on open vocabulary segmentation provided by the present invention. The method includes: acquiring a target image and a preset point grid; determining, based on the target image, at least one text embedding vector corresponding to the target image; determining, for each point in the preset point grid, the query embedding vector corresponding to the point, and determining, based on that query embedding vector, the predicted mask region vector corresponding to the point, where the query embedding vector of each point includes the position embedding vector of the point and at least one pixel-level real mask embedding vector corresponding to the point in the target image; and determining the predicted mask category corresponding to each point based on the predicted mask region vector corresponding to the point, the at least one text embedding vector corresponding to the target image, and the preset mask-text matching algorithm, where the predicted mask category of each point includes the pixel-level category label corresponding to the point in the target image, and the preset mask-text matching algorithm is used to determine, from the at least one text embedding vector corresponding to the target image, the text embedding vector that matches the predicted mask region vector of each point.
In another aspect, the present invention further provides a computer program product, including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to perform the image segmentation method based on open vocabulary segmentation provided by the present invention, the method including the steps set forth above.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the image segmentation method based on open vocabulary segmentation provided by the present invention, the method including the steps set forth above.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Through the description of the above implementations, those skilled in the art can clearly understand that each implementation may be realized by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution in essence, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
It should be understood that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.