CN118674803A

CN118674803A - Image generation model, training method and device for image generation model

Info

Publication number: CN118674803A
Application number: CN202410812014.7A
Authority: CN
Inventors: 张泳祥
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2024-06-21
Filing date: 2024-06-21
Publication date: 2024-09-20

Abstract

The present application discloses an image generation model, a training method and a device for the image generation model, and belongs to the field of artificial intelligence technology. The image generation model includes: a conditional encoder, which is used to encode at least one original image to obtain at least one image feature information, and is used to encode the description text of each original image to obtain at least one text feature information; at least one conditional feature information is determined based on a control condition, at least one image feature information and at least one text feature information; the control condition indicates the type of the image generation task, and the conditional feature information is determined according to the type of the image generation task; a noise adding module, which is used to add noise to at least one original image based on a Gaussian noise matrix and a mask image of at least one original image to obtain at least one noise image; a diffusion model, which is used to generate at least one derivative image based on at least one conditional feature information and at least one noise image; the conditional encoder and the noise adding module are respectively connected to the diffusion model.

Description

Image generation model, image generation model training method and device

技术领域Technical Field

本申请属于人工智能技术领域，具体涉及一种图像生成模型、图像生成模型的训练方法及装置。The present application belongs to the field of artificial intelligence technology, and specifically relates to an image generation model, and a training method and device for the image generation model.

背景技术Background Art

随着AIGC(Artificial Intelligence Generated Content，生成式人工智能)技术的发展，文生图、文和图生图、图生图、图像外扩、图像局部修改、图像消除等图像生成任务的应用已经越来越广泛。With the development of AIGC (Artificial Intelligence Generated Content) technology, the application of image generation tasks such as text-to-image, text and image-to-image, image-to-image, image expansion, local image modification, and image elimination has become increasingly widespread.

现有技术中，对于文生图任务，需要使用亿级别的训练样本训练得到文生图模型来处理，对于图生图任务，需要在文生图模型的基础上使用百万级别的图生图训练样本训练得到图生图模型来处理，或者，对于图像外扩任务，需要在文生图模型的基础上使用百万级别的图像外扩训练样本训练得到图像外扩模型来处理，或者，对于图像局部修改任务，需要在文生图模型的基础上使用百万级别的图像局部修改训练样本训练得到图像局部修改模型来处理。In the prior art, for text-generated image tasks, it is necessary to use billions of training samples to train a text-generated image model for processing; for image-generated image tasks, it is necessary to use millions of image-generated image training samples to train an image-generated image model based on the text-generated image model; or, for image expansion tasks, it is necessary to use millions of image expansion training samples to train an image expansion model based on the text-generated image model; or, for image local modification tasks, it is necessary to use millions of image local modification training samples to train an image local modification model based on the text-generated image model.

由此可见，电子设备每处理一种图像生成任务，便需要调用一次对应的模型，操作繁琐，降低了电子设备处理图像生成任务的效率。It can be seen that each time the electronic device processes an image generation task, it needs to call the corresponding model once, which is cumbersome and reduces the efficiency of the electronic device in processing image generation tasks.

发明内容Summary of the invention

本申请的目的是提供一种图像生成模型、图像生成模型的训练方法及装置，能够提高电子设备处理图像生成任务的效率。The purpose of this application is to provide an image generation model, a training method and a device for the image generation model, which can improve the efficiency of electronic devices in processing image generation tasks.

第一方面，本申请的一些实施例提供了一种图像生成模型，包括：条件编码器、加噪模块和扩散模型，条件编码器和加噪模块分别与扩散模型连接；In a first aspect, some embodiments of the present application provide an image generation model, including: a conditional encoder, a noise adding module and a diffusion model, wherein the conditional encoder and the noise adding module are respectively connected to the diffusion model;

条件编码器，用于对至少一张原始图像进行编码，得到至少一项图像特征信息，以及用于对每张原始图像的描述文本进行编码，得到至少一项文本特征信息；并基于控制条件、至少一项图像特征信息和至少一项文本特征信息，确定至少一项条件特征信息；控制条件用于指示图像生成任务的类型，条件特征信息是根据图像生成任务的类型确定的；A conditional encoder, used to encode at least one original image to obtain at least one image feature information, and to encode a description text of each original image to obtain at least one text feature information; and to determine at least one conditional feature information based on a control condition, at least one image feature information and at least one text feature information; the control condition is used to indicate a type of an image generation task, and the conditional feature information is determined according to the type of the image generation task;

加噪模块，用于基于高斯噪声矩阵和至少一张原始图像的掩码图像，对至少一张原始图像进行加噪处理，得到至少一张噪声图像；A noise adding module, used for performing noise adding processing on at least one original image based on a Gaussian noise matrix and a mask image of at least one original image, so as to obtain at least one noisy image;

扩散模型，用于基于至少一项条件特征信息和至少一张噪声图像，生成至少一张衍生图像，每张衍生图像对应一项条件特征信息和一张噪声图像。The diffusion model is used to generate at least one derivative image based on at least one conditional feature information and at least one noise image, and each derivative image corresponds to one conditional feature information and one noise image.

第二方面，本申请的一些实施例提供了一种图像生成模型的训练方法，包括：In a second aspect, some embodiments of the present application provide a method for training an image generation model, including:

获取至少两个训练样本组，每个训练样本组包括一张样本图像、每张样本图像的描述文本和掩码图像；Obtain at least two training sample groups, each training sample group including a sample image, description text of each sample image, and a mask image;

对至少两个训练样本组中的至少一张样本图像进行编码，得到至少一项图像特征信息，以及对每张样本图像的描述文本进行编码，得到至少一项文本特征信息；Encoding at least one sample image in at least two training sample groups to obtain at least one item of image feature information, and encoding a description text of each sample image to obtain at least one item of text feature information;

基于控制条件、至少一项图像特征信息和至少一项文本特征信息，确定至少一项条件特征信息；控制条件用于指示图像生成任务的类型，条件特征信息是根据图像生成任务的类型确定的；Determining at least one conditional feature information based on a control condition, at least one image feature information, and at least one text feature information; the control condition is used to indicate the type of the image generation task, and the conditional feature information is determined according to the type of the image generation task;

基于至少一张样本图像、至少一张样本图像的掩码图像和至少一项条件特征信息，对初始模型进行训练，得到图像生成模型。Based on at least one sample image, at least one mask image of the sample image and at least one conditional feature information, an initial model is trained to obtain an image generation model.

第三方面，本申请的一些实施例提供了一种图像生成模型的训练装置，该装置包括：In a third aspect, some embodiments of the present application provide a training device for an image generation model, the device comprising:

获取单元，用于获取至少两个训练样本组，每个训练样本组包括一张样本图像、每张样本图像的描述文本和掩码图像；An acquisition unit, used for acquiring at least two training sample groups, each training sample group including a sample image, description text of each sample image and a mask image;

编码单元，用于对获取单元获取的至少两个训练样本组中的至少一张样本图像进行编码，得到至少一项图像特征信息，以及对每张样本图像的描述文本进行编码，得到至少一项文本特征信息；an encoding unit, configured to encode at least one sample image in the at least two training sample groups acquired by the acquisition unit to obtain at least one item of image feature information, and to encode a description text of each sample image to obtain at least one item of text feature information;

确定单元，用于基于控制条件、编码单元得到的至少一项图像特征信息和至少一项文本特征信息，确定至少一项条件特征信息；控制条件用于指示图像生成任务的类型，条件特征信息是根据图像生成任务的类型确定的；a determining unit, configured to determine at least one conditional feature information based on the control condition, at least one image feature information and at least one text feature information obtained by the encoding unit; the control condition is used to indicate the type of the image generation task, and the conditional feature information is determined according to the type of the image generation task;

训练单元，用于基于获取单元获取的至少一张样本图像、至少一张样本图像的掩码图像和确定单元确定的至少一项条件特征信息，对初始模型进行训练，得到图像生成模型。The training unit is used to train the initial model based on at least one sample image acquired by the acquisition unit, the mask image of at least one sample image and at least one conditional feature information determined by the determination unit to obtain an image generation model.

第四方面，本申请的一些实施例提供了一种电子设备，该电子设备包括处理器和存储器，该存储器存储可在处理器上运行的程序或指令，该程序或指令被处理器执行时实现如第二方面所述的图像生成模型的训练方法的步骤。In a fourth aspect, some embodiments of the present application provide an electronic device comprising a processor and a memory, wherein the memory stores programs or instructions that can be run on the processor, and when the program or instructions are executed by the processor, the steps of the training method of the image generation model as described in the second aspect are implemented.

第五方面，本申请的一些实施例提供了一种可读存储介质，该可读存储介质上存储程序或指令，该程序或指令被处理器执行时实现如第二方面所述的图像生成模型的训练方法的步骤。In a fifth aspect, some embodiments of the present application provide a readable storage medium storing a program or instruction, which, when executed by a processor, implements the steps of the image generation model training method as described in the second aspect.

第六方面，本申请的一些实施例提供了一种芯片，该芯片包括处理器和通信接口，该通信接口和处理器耦合，该处理器用于运行程序或指令，实现如第二方面所述的图像生成模型的训练方法。In a sixth aspect, some embodiments of the present application provide a chip comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the training method of the image generation model as described in the second aspect.

第七方面，本申请的一些实施例提供一种计算机程序产品，该程序产品被存储在存储介质中，该程序产品被至少一个处理器执行以实现如第二方面所述的图像生成模型的训练方法。In a seventh aspect, some embodiments of the present application provide a computer program product, which is stored in a storage medium and executed by at least one processor to implement the training method of the image generation model as described in the second aspect.

本申请的一些实施例提供了的图像生成模型，包括条件编码器、加噪模块和扩散模型，条件编码器和加噪模块分别与扩散模型连接；条件编码器，用于对至少一张原始图像进行编码，得到至少一项图像特征信息，以及用于对每张原始图像的描述文本进行编码，得到至少一项文本特征信息；并基于控制条件、至少一项图像特征信息和至少一项文本特征信息，确定至少一项条件特征信息；控制条件用于指示图像生成任务的类型，条件特征信息是根据图像生成任务的类型确定的；加噪模块，用于基于高斯噪声矩阵和至少一张原始图像的掩码图像，对至少一张原始图像进行加噪处理，得到至少一张噪声图像；扩散模型，用于基于至少一项条件特征信息和至少一张噪声图像，生成至少一张衍生图像，每张衍生图像对应一项条件特征信息和一张噪声图像。在上述图像生成模型中，根据图像生成任务的类型确定条件特征信息，以控制图像生成模型执行不同的图像生成任务，并用高斯噪声矩阵和掩码图像对原始图像进行加噪处理，使得图像生成模型不仅能够具备从噪声中生成图像的能力，也能够具备预测、消除或修改掩码图像对应的掩码区域的图像内容的能力，使得图像生成模型能够适应不同类型的图像生成任务，如文生图、图生图、文加图生图、图像局部修改、图像消除、图像外扩等等。如此，一个图像生成模型具备了处理多种类型的图像生成任务的能力，在处理不同类型的图像生成任务时不需要再调用多个不同的模型，提高了处理图像生成任务的效率。Some embodiments of the present application provide an image generation model, including a conditional encoder, a noise addition module and a diffusion model, wherein the conditional encoder and the noise addition module are respectively connected to the diffusion model; the conditional encoder is used to encode at least one original image to obtain at least one image feature information, and is used to encode the description text of each original image to obtain at least one text feature information; and based on a control condition, at least one image feature information and at least one text feature information, at least one conditional feature information is determined; the control condition is used to indicate the type of image generation task, and the conditional feature information is determined according to the type of image generation task; the noise addition module is used to perform noise processing on at least one original image based on a Gaussian noise matrix and a mask image of at least one original image to obtain at least one noise image; the diffusion model is used to generate at least one derivative image based on at least one conditional feature information and at least one noise image, each derivative image corresponding to one conditional feature information and one noise image. In the above-mentioned image generation model, conditional feature information is determined according to the type of image generation task to control the image generation model to perform different image generation tasks, and the original image is subjected to noise processing using a Gaussian noise matrix and a mask image, so that the image generation model can not only have the ability to generate images from noise, but also have the ability to predict, eliminate or modify the image content of the mask area corresponding to the mask image, so that the image generation model can adapt to different types of image generation tasks, such as text-generated images, image-generated images, text-plus-image-generated images, local image modification, image elimination, image expansion, etc. In this way, an image generation model has the ability to handle multiple types of image generation tasks, and there is no need to call multiple different models when handling different types of image generation tasks, thereby improving the efficiency of processing image generation tasks.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1是本申请的一些实施例提供的图像生成模型的结构示意图；FIG1 is a schematic diagram of the structure of an image generation model provided by some embodiments of the present application;

图2是本申请的一些实施例提供的图像生成模型的结构示意图；FIG2 is a schematic diagram of the structure of an image generation model provided by some embodiments of the present application;

图3是本申请的一些实施例提供的Transformer Block的结构示意图；FIG3 is a schematic diagram of the structure of a Transformer Block provided in some embodiments of the present application;

图4是本申请的一些实施例提供的第一条件编码器的结构示意图；FIG4 is a schematic diagram of the structure of a first conditional encoder provided in some embodiments of the present application;

图5是本申请的一些实施例提供的第二条件编码器的结构示意图；FIG5 is a schematic diagram of the structure of a second conditional encoder provided in some embodiments of the present application;

图6是本申请的一些实施例提供的扩散自注意力模块的结构示意图；FIG6 is a schematic diagram of the structure of a diffuse self-attention module provided in some embodiments of the present application;

图7是本申请的一些实施例提供的扩散模型的结构示意图；FIG7 is a schematic diagram of the structure of a diffusion model provided by some embodiments of the present application;

图8是本申请的一些实施例提供的卷积模块的结构示意图；FIG8 is a schematic diagram of the structure of a convolution module provided in some embodiments of the present application;

图9是本申请的一些实施例提供的图像编码器的结构示意图；FIG9 is a schematic diagram of the structure of an image encoder provided by some embodiments of the present application;

图10是本申请的一些实施例提供的反卷积模块的结构示意图；FIG10 is a schematic diagram of the structure of a deconvolution module provided in some embodiments of the present application;

图11是本申请的一些实施例提供的图像解码器的结构示意图；FIG11 is a schematic diagram of the structure of an image decoder provided in some embodiments of the present application;

图12是本申请的一些实施例提供的图像生成模型的训练方法的流程示意图；FIG12 is a flow chart of a method for training an image generation model provided in some embodiments of the present application;

图13是本申请的一些实施例提供的图像生成模型的训练方法的流程示意图；FIG13 is a flowchart of a method for training an image generation model provided in some embodiments of the present application;

图14是本申请的一些实施例提供的图像生成模型的训练方法的流程示意图；FIG14 is a flowchart of a method for training an image generation model provided in some embodiments of the present application;

图15是本申请的一些实施例提供的掩码图像的示意图；FIG15 is a schematic diagram of a mask image provided by some embodiments of the present application;

图16是本申请的一些实施例提供的两种确定加噪矩阵的示意图；FIG16 is a schematic diagram of two methods for determining noise addition matrices provided in some embodiments of the present application;

图17是本申请的一些实施例提供的损失值计算的示意图；FIG17 is a schematic diagram of loss value calculation provided by some embodiments of the present application;

图18是本申请的一些实施例提供的图像生成模型的训练装置的结构示意图；FIG18 is a schematic diagram of the structure of a training device for an image generation model provided in some embodiments of the present application;

图19是本申请的一些实施例提供的电子设备的结构示意图；FIG19 is a schematic diagram of the structure of an electronic device provided in some embodiments of the present application;

图20是本申请的一些实施例提供的电子设备的硬件结构示意图。FIG. 20 is a schematic diagram of the hardware structure of an electronic device provided in some embodiments of the present application.

具体实施方式DETAILED DESCRIPTION

下面将结合本申请的一些实施例中的附图，对本申请的一些实施例中的技术方案进行清楚地描述，显然，所描述的实施例是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员获得的所有其他实施例，都属于本申请保护的范围。The following will be combined with the drawings in some embodiments of the present application to clearly describe the technical solutions in some embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all the embodiments. All other embodiments obtained by ordinary technicians in this field based on the embodiments in the present application belong to the scope of protection of this application.

本申请的说明书和权利要求书中的术语“第一”、“第二”等是用于区别类似的对象，而不用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换，以便本申请的实施例能够以除了在这里图示或描述的那些以外的顺序实施，且“第一”、“第二”等所区分的对象通常为一类，并不限定对象的个数，例如第一对象可以是一个，也可以是多个。此外，说明书以及权利要求中“和/或”表示所连接对象的至少其中之一，字符“/”，一般表示前后关联对象是一种“或”的关系。The terms "first", "second", etc. in the specification and claims of this application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that the data used in this way can be interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here, and the objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited. For example, the first object can be one or more. In addition, "and/or" in the specification and claims represents at least one of the connected objects, and the character "/" generally indicates that the objects associated with each other are in an "or" relationship.

本申请的说明书和权利要求书中的术语“至少一个(项)”、“至少之一”等指其包含对象中的任意一个、任意两个或两个以上的组合。例如，a、b、c中的至少一个(项)，可以表示：“a”、“b”、“c”、“a和b”、“a和c”、“b和c”以及“a、b和c”，其中a，b，c可以是单个，也可以是多个。同理，“至少两个(项)”是指两个或两个以上，其表达的含义与“至少一个(项)”类似。The terms "at least one (item)", "at least one of" and the like in the specification and claims of the present application refer to any one, any two or a combination of more than two of the objects included therein. For example, at least one (item) of a, b, and c can be represented by: "a", "b", "c", "a and b", "a and c", "b and c" and "a, b and c", where a, b, and c can be single or multiple. Similarly, "at least two (items)" refers to two or more, and its meaning is similar to that of "at least one (item)".

首先，对本申请的说明书中涉及的名词进行解释说明。First, the terms used in the specification of this application are explained.

DiT：英文全称为Diffusion Transformer，是一种结合了Transformer架构的扩散模型，用于图像和视频生成任务，能够高效地捕获数据间的依赖关系并生成高质量的结果。DiT: Diffusion Transformer is a diffusion model that combines the Transformer architecture for image and video generation tasks. It can efficiently capture dependencies between data and generate high-quality results.

VAE：英文全称为Variational Auto encoder，中文名称为变分自编码器，是一种生成模型，主要用于无监督学习。VAE主要包括编码器和解码器两部分，其核心思想是通过编码器将原始数据映射至潜在空间，再通过解码器将潜在空间的向量还原为原始数据。VAE: The full name of VAE is Variational Auto encoder, which is a generative model mainly used for unsupervised learning. VAE mainly consists of two parts: encoder and decoder. The core idea is to map the original data to the latent space through the encoder, and then restore the vector of the latent space to the original data through the decoder.

Tokenizer：用于将文本分解成单词、短语或符号的工具。Tokenizer: A tool used to break text into words, phrases, or tokens.

Self Attention：自注意力算子，用于计算输入数据中各元素的注意力权重，并根据各元素的注意力权重重新计算各元素的特征向量。Self Attention: The self-attention operator is used to calculate the attention weight of each element in the input data and recalculate the feature vector of each element based on the attention weight of each element.

Linear：线性变换算子，用于将输入数据从一个空间映射至另一个空间。Linear: A linear transformation operator used to map input data from one space to another.

Conv：卷积算子，用于对输入数据进行下采样，减小输入数据的尺寸。Conv: Convolution operator, used to downsample the input data and reduce the size of the input data.

Deconv：反卷积算子，用于对输入数据进行上采样，增大输入数据的尺寸。Deconv: Deconvolution operator, used to upsample the input data and increase the size of the input data.

Feed Forward：前向反馈算子，用于对每个元素的向量进行非线性变换,从而引入更多的非线性能力，提高模型的表达能力。Feed Forward: Feedforward operator, used to perform nonlinear transformation on the vector of each element, thereby introducing more nonlinear capabilities and improving the expressiveness of the model.

Add：加法算子，用于进行加法运算。Add: Addition operator, used for addition operations.

Norm：归一化算子，用于对输入数据进行归一化处理。Norm: Normalization operator, used to normalize the input data.

多头注意力计算(Multi Head Self Attention，MHSA)：是一种自注意力机制，用于在不同的自注意力头(heads)之间共享信息。Multi Head Self Attention (MHSA): is a self-attention mechanism used to share information between different self-attention heads.

分割一切模型(Segment Anything Model，SAM)：用于分割图像中一切实体的分割模型。Segment Anything Model (SAM): A segmentation model used to segment all entities in an image.

文生图任务：根据文本生成图像的任务。Text-to-image task: The task of generating images based on text.

图生图任务：根据图像生成图像的任务。Image-to-image task: The task of generating images based on images.

文和图生图任务：根据文本对图像做处理的任务。Text and image generation task: the task of processing images based on text.

图像局部修改任务：对图像中局部区域的图像内容进行修改的任务。Local image modification task: the task of modifying the image content in a local area of an image.

图像消除任务：对图像中局部区域的图像内容进行消除的任务。Image elimination task: the task of eliminating the image content in a local area of an image.

图像外扩任务：将图像中局部区域补充完整的任务。Image expansion task: the task of completing the local area in the image.

下面结合附图，通过具体的实施例及其应用场景对本申请的一些实施例提供的文本处理方法进行详细地说明。The text processing methods provided by some embodiments of the present application are described in detail below through specific embodiments and their application scenarios in conjunction with the accompanying drawings.

本申请的一些实施例提供的图像生成模型应用于处理图像生成任务的场景。例如，图像生成模型应用于处理文生图任务、图生图任务、文和图生图任务、图像消除任务、图像局部修改任务或图像外扩任务等图像生成任务的场景。The image generation model provided in some embodiments of the present application is applied to the scene of processing image generation tasks. For example, the image generation model is applied to the scene of processing image generation tasks such as text-to-image task, image-to-image task, text-to-image-to-image task, image elimination task, image local modification task or image expansion task.

例如，对于处理文生图任务的应用场景：相关技术中，电子设备可以获取大量与文生图场景相关的文生图训练样本，如描述文本和样本图像，根据描述文本和样本图像做模型训练，得到具有文生图功能的模型，再调用上述具有文生图功能的模型处理文生图任务。For example, for the application scenario of processing text image tasks: in the related technology, the electronic device can obtain a large number of text image training samples related to the text image scenario, such as descriptive text and sample images, perform model training based on the descriptive text and sample images, obtain a model with text image function, and then call the above model with text image function to process the text image task.

例如，对于处理图生图任务的应用场景：相关技术中，当用户想要电子设备训练一种能够实现图生图功能的模型时，电子设备先通过上述文生图场景中的方法训练得到一个文生图模型，在该文生图模型的基础上使用图生图训练样本，训练得到具有图生图功能的模型，再调用上述具有图生图功能的模型处理图生图任务。For example, for the application scenario of processing image-generated-image tasks: in the related technology, when a user wants an electronic device to train a model that can realize the image-generated-image function, the electronic device first trains a text-generated-image model through the method in the above-mentioned text-generated-image scenario, and then uses the image-generated-image training samples on the basis of the text-generated-image model to train a model with the image-generated-image function, and then calls the above-mentioned model with the image-generated-image function to process the image-generated-image task.

例如，对于处理图像局部修改任务的应用场景，相关技术中，当用户想要电子设备训练一种具有图像局部修改功能的模型时，电子设备先通过上述文生图场景中的方法训练得到一个文生图模型，在该文生图模型的基础上使用图像局部修改训练样本，训练得到具有图像局部修改功能的模型，再调用上述具有图像局部修改功能的模型处理图像局部修改任务。For example, for application scenarios of processing local image modification tasks, in related technologies, when a user wants an electronic device to train a model with a local image modification function, the electronic device first trains a Vincent graph model using the method in the above-mentioned Vincent graph scenario, and then uses the image local modification training samples on the basis of the Vincent graph model to train a model with the image local modification function, and then calls the above-mentioned model with the image local modification function to process the image local modification task.

本申请的一些实施例提供的图像生成模型，包括条件编码器、加噪模块和扩散模型，条件编码器和加噪模块分别与扩散模型连接；条件编码器，用于对至少一张原始图像进行编码，得到至少一项图像特征信息，以及用于对每张原始图像的描述文本进行编码，得到至少一项文本特征信息；并基于控制条件、至少一项图像特征信息和至少一项文本特征信息，确定至少一项条件特征信息；控制条件用于指示图像生成任务的类型，条件特征信息是根据图像生成任务的类型确定的；加噪模块，用于基于高斯噪声矩阵和至少一张原始图像的掩码图像，对至少一张原始图像进行加噪处理，得到至少一张噪声图像；扩散模型，用于基于至少一项条件特征信息和至少一张噪声图像，生成至少一张衍生图像，每张衍生图像对应一项条件特征信息和一张噪声图像。在上述图像生成模型中，根据图像生成任务的类型确定条件特征信息，以控制图像生成模型执行不同的图像生成任务，并用高斯噪声矩阵和掩码图像对原始图像进行加噪处理，使得图像生成模型不仅能够具备从噪声中生成图像的能力，也能够具备预测、消除或修改掩码图像对应的掩码区域的图像内容的能力，使得图像生成模型能够适应不同类型的图像生成任务，如文生图、图生图、文加图生图、图像局部修改、图像消除、图像外扩等等。如此，一个图像生成模型具备了处理多种类型的图像生成任务的能力，在处理不同类型的图像生成任务时不需要再调用多个不同的模型，提高了处理图像生成任务的效率。Some embodiments of the present application provide an image generation model, including a conditional encoder, a noise addition module and a diffusion model, wherein the conditional encoder and the noise addition module are respectively connected to the diffusion model; the conditional encoder is used to encode at least one original image to obtain at least one image feature information, and is used to encode the description text of each original image to obtain at least one text feature information; and based on a control condition, at least one image feature information and at least one text feature information, at least one conditional feature information is determined; the control condition is used to indicate the type of image generation task, and the conditional feature information is determined according to the type of image generation task; the noise addition module is used to perform noise processing on at least one original image based on a Gaussian noise matrix and a mask image of at least one original image to obtain at least one noise image; the diffusion model is used to generate at least one derivative image based on at least one conditional feature information and at least one noise image, each derivative image corresponding to one conditional feature information and one noise image. In the above-mentioned image generation model, conditional feature information is determined according to the type of image generation task to control the image generation model to perform different image generation tasks, and the original image is subjected to noise processing using a Gaussian noise matrix and a mask image, so that the image generation model can not only have the ability to generate images from noise, but also have the ability to predict, eliminate or modify the image content of the mask area corresponding to the mask image, so that the image generation model can adapt to different types of image generation tasks, such as text-generated images, image-generated images, text-plus-image-generated images, local image modification, image elimination, image expansion, etc. In this way, an image generation model has the ability to handle multiple types of image generation tasks, and there is no need to call multiple different models when handling different types of image generation tasks, thereby improving the efficiency of processing image generation tasks.

图1是本申请的一些实施例提供的图像生成模型的结构示意图，该图像生成模型包括：条件编码器11、加噪模块12和扩散模型13，条件编码器11和加噪模块12分别与扩散模型13连接。1 is a schematic diagram of the structure of an image generation model provided by some embodiments of the present application, wherein the image generation model comprises: a conditional encoder 11, a noise adding module 12 and a diffusion model 13, wherein the conditional encoder 11 and the noise adding module 12 are respectively connected to the diffusion model 13.

本申请的一些实施例中，上述条件编码器11是用于对图像、文本进行编码并根据控制条件和编码得到的特征信息确定条件特征信息的编码器。In some embodiments of the present application, the conditional encoder 11 is an encoder for encoding images and texts and determining conditional feature information according to control conditions and feature information obtained by encoding.

本申请的一些实施例中，条件编码器11，用于对至少一张原始图像进行编码，得到至少一项图像特征信息，以及用于对每张原始图像的描述文本进行编码，得到至少一项文本特征信息；并基于控制条件、至少一项图像特征信息和至少一项文本特征信息，确定至少一项条件特征信息；控制条件用于指示图像生成任务的类型，条件特征信息是根据图像生成任务的类型确定的。In some embodiments of the present application, the conditional encoder 11 is used to encode at least one original image to obtain at least one image feature information, and to encode the description text of each original image to obtain at least one text feature information; and based on the control condition, at least one image feature information and at least one text feature information, at least one conditional feature information is determined; the control condition is used to indicate the type of image generation task, and the conditional feature information is determined according to the type of image generation task.

本申请的一些实施例中，上述至少一张原始图像中的每张原始图像可以是包括至少一个实体的图像。In some embodiments of the present application, each of the at least one original image may be an image including at least one entity.

示例性地，上述原始图像可以是包括植物、动物、人物、山、水、房子、电脑等实体中的至少一个实体的图像。Exemplarily, the original image may be an image of at least one of plants, animals, people, mountains, water, houses, computers, and the like.

本申请的一些实施例中，上述每张原始图像的描述文本可以用于描述每张原始图像的图像主题。In some embodiments of the present application, the description text of each original image may be used to describe the image theme of each original image.

示例性地，一张原始图像的描述文本可以是“一个男人和一个女人站在草坪前，比出点赞的手势”。For example, the description text of an original image may be "a man and a woman standing in front of a lawn, making a thumbs-up gesture."

本申请的一些实施例中，上述每张原始图像的描述文本可以用于描述每张原始图像的图像主题以及对每张原始图像执行的处理操作。In some embodiments of the present application, the description text of each original image may be used to describe the image subject of each original image and the processing operation performed on each original image.

示例性地，一张原始图像的描述文本可以是“一个男人和一个女人站在草坪前，比出点赞的手势。将图中的男人删除”。For example, the description text of an original image may be "A man and a woman stand in front of a lawn, making a thumbs-up gesture. Delete the man in the picture."

本申请的一些实施例中，上述每张原始图像的描述文本可以用于描述每张原始图像的图像主题以及每张原始图像中掩码区域的图像内容。In some embodiments of the present application, the description text of each original image may be used to describe the image subject of each original image and the image content of the mask area in each original image.

示例性地，一张原始图像的描述文本可以是“一个男人和一个女人站在草坪前，比出点赞的手势，图中有男人、女人、气球、海报、帽子”。可以理解的是，该原始图像的掩码区域包括男人、女人、气球、海报和帽子在该原始图像中对应的图像区域。For example, the description text of an original image may be "a man and a woman standing in front of a lawn, making a thumbs-up gesture, and the picture contains a man, a woman, a balloon, a poster, and a hat". It can be understood that the mask area of the original image includes the image areas corresponding to the man, the woman, the balloon, the poster, and the hat in the original image.

本申请的一些实施例中，结合图1，如图2所示，上述条件编码器11包括第一条件编码器111和第二条件编码器112。In some embodiments of the present application, in combination with FIG. 1 , as shown in FIG. 2 , the conditional encoder 11 includes a first conditional encoder 111 and a second conditional encoder 112 .

需要说明的是，上述第一条件编码器111可以是任何能够对图像进行编码的图像编码器，上述第二条件编码器112可以是任何能够对文本进行编码的文本编码器，本申请的一些实施例对此不做限定。It should be noted that the first conditional encoder 111 may be any image encoder capable of encoding images, and the second conditional encoder 112 may be any text encoder capable of encoding text, and some embodiments of the present application do not limit this.

本申请的一些实施例中，上述第一条件编码器111包括图像分割模块、映射层和自注意力模块。其中，图像分割模块和自注意力模块分别与映射层连接。In some embodiments of the present application, the first conditional encoder 111 includes an image segmentation module, a mapping layer and a self-attention module, wherein the image segmentation module and the self-attention module are respectively connected to the mapping layer.

本申请的一些实施例中，上述图像分割模块用于对图像进行分割。上述映射层是用于将输入的图像从原始的像素空间映射到特征表达空间的模型层。上述自注意力模块是用于捕捉图像中像素点之间的关联关系的模块。In some embodiments of the present application, the image segmentation module is used to segment the image. The mapping layer is a model layer for mapping the input image from the original pixel space to the feature expression space. The self-attention module is a module for capturing the correlation between pixels in the image.

本申请的一些实施例中，上述条件编码器11用于对至少一张原始图像进行编码，得到至少一项图像特征信息时，包括：电子设备将至少一张原始图像输入图像分割模块，图像分割模块将每张原始图像分割为N个原始图像块，并将每个原始图像块的像素点重排列为一维向量，得到每张原始图像的N个一维向量，并将每张原始图像的N个一维向量拼接得到每张原始图像的像素重排图像输入映射层；映射层提取图像分割模块输出的每张像素重排图像中每个像素点的特征信息，得到至少一个像素重排特征向量输入自注意力模块；自注意力模块基于每张像素重排图像中像素点之间的关联关系，对映射层输出的每个像素重排特征向量进行特征提取，得到每张原始图像的图像特征信息。In some embodiments of the present application, the above-mentioned conditional encoder 11 is used to encode at least one original image to obtain at least one item of image feature information, including: the electronic device inputs at least one original image into an image segmentation module, the image segmentation module divides each original image into N original image blocks, and rearranges the pixels of each original image block into a one-dimensional vector to obtain N one-dimensional vectors of each original image, and splices the N one-dimensional vectors of each original image to obtain a pixel rearranged image of each original image input into a mapping layer; the mapping layer extracts the feature information of each pixel point in each pixel rearranged image output by the image segmentation module, and obtains at least one pixel rearranged feature vector input into a self-attention module; the self-attention module extracts features from each pixel rearranged feature vector output by the mapping layer based on the correlation between the pixels in each pixel rearranged image, and obtains the image feature information of each original image.

其中，N为大于1的整数。Wherein, N is an integer greater than 1.

本申请的一些实施例中，一维向量是一个有序的数列，该数列包括原始图像中每个像素点。In some embodiments of the present application, a one-dimensional vector is an ordered sequence, which includes each pixel in the original image.

本申请的一些实施例中，对于每张原始图像，上述图像分割模块将原始图像分割为N个原始图像块，再将每个原始图像块的像素点从左到右从上到下重新排列，得到一个一维向量即新的像素排列，进而得到N个一维向量，将N个一维向量拼接得到N维向量，该N维向量组成了像素重排图像。In some embodiments of the present application, for each original image, the above-mentioned image segmentation module divides the original image into N original image blocks, and then rearranges the pixels of each original image block from left to right and from top to bottom to obtain a one-dimensional vector, that is, a new pixel arrangement, and then obtains N one-dimensional vectors, and splices the N one-dimensional vectors to obtain an N-dimensional vector, which constitutes a pixel rearranged image.

示例性地，上述图像分割模块可以是patch模块，以原始图像的分辨率为[3，512，512]为例，电子设备通过图像分割模块将原始图像中每3*32*32的像素区域分割为一个原始图像块，共得到256个原始图像块，然后将每个原始图像块中的像素点按从左到右从上到下的顺序展平，得到1*768的像素排列，最终将原始图像转化为像素排列为256*768的像素重排图像。Exemplarily, the above-mentioned image segmentation module can be a patch module. Taking the resolution of the original image as [3, 512, 512] as an example, the electronic device divides each 3*32*32 pixel area in the original image into an original image block through the image segmentation module, and obtains a total of 256 original image blocks. The pixels in each original image block are then flattened in order from left to right and from top to bottom to obtain a pixel arrangement of 1*768, and finally the original image is converted into a pixel rearranged image with a pixel arrangement of 256*768.

如此，电子设备通过图像分割模块将分辨率较大的原始图像分割为分辨率较小的原始图像块，能够减少后续的计算量，且由于同一个像素区域中像素点的特征比较接近，将每个原始图像块的像素点重新排列，能够保留该原始图像块中像素点的特征信息，便于图像生成模型对原始图像进行特征提取。In this way, the electronic device divides the original image with a larger resolution into original image blocks with a smaller resolution through the image segmentation module, which can reduce the subsequent calculation amount. Moreover, since the features of the pixels in the same pixel area are relatively close, the pixels of each original image block are rearranged, which can retain the feature information of the pixels in the original image block, making it easier for the image generation model to extract features from the original image.

本申请的一些实施例中，上述映射层可以是Project层，该Project层执行线性变换操作。In some embodiments of the present application, the above-mentioned mapping layer may be a Project layer, which performs a linear transformation operation.

示例性地，线性变换操作可以表示为：y＝W*x+b。Exemplarily, the linear transformation operation may be expressed as: y=W*x+b.

其中，x表示像素重排图像，y表示像素重排特征向量，b表示偏置向量，W表示权重矩阵。并且，b和W均是第二编码器的固定参数。Wherein, x represents the pixel rearrangement image, y represents the pixel rearrangement feature vector, b represents the bias vector, and W represents the weight matrix. Moreover, b and W are both fixed parameters of the second encoder.

本申请的一些实施例中，对于每张像素重排图像，电子设备通过映射层对像素重排图像中每个像素点进行上述线性变换操作，得到每个像素点的特征信息，将像素重排图像中多个像素点的特征信息组合，得到像素重排特征向量。In some embodiments of the present application, for each pixel rearranged image, the electronic device performs the above-mentioned linear transformation operation on each pixel in the pixel rearranged image through a mapping layer to obtain feature information of each pixel, and combines the feature information of multiple pixels in the pixel rearranged image to obtain a pixel rearranged feature vector.

如此，电子设备通过映射层将像素重排图像从原始的像素空间映射到特征表达空间，即提取像素重排图像的特征，得到像素重排特征向量，能够更好地表示原始图像的图像内容，使得图像生成模型能够更容易识别原始图像中的实体。In this way, the electronic device maps the pixel-rearranged image from the original pixel space to the feature expression space through the mapping layer, that is, extracts the features of the pixel-rearranged image to obtain a pixel-rearranged feature vector, which can better represent the image content of the original image, so that the image generation model can more easily identify the entities in the original image.

本申请的一些实施例中，各像素点之间的关联关系包括各像素点之间的空间依赖性和语义相关性。其中，空间依赖性指的是各像素点之间的相对位置和距离关系，而语义相关性指的是各像素点之间的内容关联度。其中，内容关联度是指不同像素点在内容、色彩、亮度等方面所表现出的相关程度或关联性。In some embodiments of the present application, the association relationship between each pixel point includes the spatial dependency and semantic correlation between each pixel point. Among them, the spatial dependency refers to the relative position and distance relationship between each pixel point, and the semantic correlation refers to the content correlation between each pixel point. Among them, the content correlation refers to the correlation or correlation between different pixels in terms of content, color, brightness, etc.

示例性地，在原始图像中，相邻的像素点往往具有相似的色彩和亮度值，从而形成连续的视觉感知，这种色彩和亮度的连续性反映了像素点之间的内容关联度。原始图像的纹理和图案是由多个像素点共同构成的，如果像素点之间的纹理和图案保持连续性，那么这些像素点之间就具有较高的内容关联度。在某些情况下，像素点之间的内容关联度也可能体现在它们所表示的语义信息上。例如，在一张人脸图像中，不同像素点可能分别代表眼睛、鼻子、嘴巴等不同的语义单元，这些语义单元之间的相对位置和组合方式体现了像素点之间的内容关联度。因此，各像素点之间的内容关联度包括各像素点之间色彩和亮度的连续性、各像素点之间纹理和图案的连续性、以及各像素点之间的语义关联性。For example, in the original image, adjacent pixels often have similar color and brightness values, thus forming a continuous visual perception. This continuity of color and brightness reflects the content relevance between pixels. The texture and pattern of the original image are composed of multiple pixels. If the texture and pattern between pixels remain continuous, then these pixels have a high content relevance. In some cases, the content relevance between pixels may also be reflected in the semantic information they represent. For example, in a face image, different pixels may represent different semantic units such as eyes, nose, and mouth, and the relative positions and combinations of these semantic units reflect the content relevance between pixels. Therefore, the content relevance between pixels includes the continuity of color and brightness between pixels, the continuity of texture and pattern between pixels, and the semantic relevance between pixels.

本申请的一些实施例中，上述自注意力模块可以包括多层Transformer Block，每层Transformer Block包括自注意力算子(Self Attention)、加法算子(Add)、归一化算子(Norm)和前向反馈算子(Feed Forward)。In some embodiments of the present application, the above-mentioned self-attention module may include multiple layers of Transformer Block, each layer of Transformer Block includes a self-attention operator (Self Attention), an addition operator (Add), a normalization operator (Norm) and a forward feedback operator (Feed Forward).

本申请的一些实施例中，上述自注意力算子用于计算每个像素点的注意力权重，并根据每个像素点的注意力权重重新计算每个像素点的特征向量。上述加法算子用于实现加法运算，上述归一化算子用于对输入进行归一化处理，上述前向反馈算子用于对输入进行非线性变换处理。In some embodiments of the present application, the self-attention operator is used to calculate the attention weight of each pixel, and recalculate the feature vector of each pixel according to the attention weight of each pixel. The addition operator is used to implement addition operations, the normalization operator is used to normalize the input, and the forward feedback operator is used to perform nonlinear transformation on the input.

本申请的一些实施例中，对于每个像素重排特征向量，上述自注意力算子将输入的像素重排特征向量中每个像素点转换为三个向量，分别为查询向量query，键向量key和值向量value。对于像素重排特征向量中的任意一个像素点，记为第一像素点，将该第一像素点的查询向量query与该像素重排特征向量中每个像素点的键向量key分别点乘，得到每个像素点对应的注意力权重，每个像素点对应的注意力权重表示该像素点与第一像素点的关联性，将每个像素点对应的注意力权重与该像素点的值向量value相乘，得到每个像素点的向量，将多个向量相加，得到该第一像素点通过自注意力算子后的输出向量。同理，对像素重排特征向量中的每个像素点均执行上述操作，则可以得到像素重排特征向量中每个像素点通过自注意力算子处理后的输出向量，将该多个输出向量拼接，可以得到该像素重排特征向量通过自注意力算子处理后的特征向量，自注意力算子将该处理后的特征向量输入上述加法算子。In some embodiments of the present application, for each pixel rearrangement feature vector, the self-attention operator converts each pixel point in the input pixel rearrangement feature vector into three vectors, namely, query vector query, key vector key and value vector value. For any pixel point in the pixel rearrangement feature vector, it is recorded as the first pixel point, and the query vector query of the first pixel point is multiplied with the key vector key of each pixel point in the pixel rearrangement feature vector to obtain the attention weight corresponding to each pixel point. The attention weight corresponding to each pixel point represents the correlation between the pixel point and the first pixel point. The attention weight corresponding to each pixel point is multiplied with the value vector value of the pixel point to obtain the vector of each pixel point, and multiple vectors are added to obtain the output vector of the first pixel point after passing through the self-attention operator. Similarly, the above operation is performed on each pixel point in the pixel rearrangement feature vector, and the output vector of each pixel point in the pixel rearrangement feature vector after being processed by the self-attention operator can be obtained. The multiple output vectors are spliced to obtain the feature vector of the pixel rearrangement feature vector after being processed by the self-attention operator, and the self-attention operator inputs the processed feature vector into the above addition operator.

本申请的一些实施例中，上述加法算子将自注意力算子的输出与自注意力算子的输入相加，即将像素重排特征向量与自注意力算子对该像素重排特征向量处理后输出的特征向量相加，将相加得到的特征向量输入上述归一化算子。如此，若其上一层的输出保留的信息比较少，则可以通过加入上一层的输入，保留输入的特征信息，避免丢失太多信息。In some embodiments of the present application, the addition operator adds the output of the self-attention operator to the input of the self-attention operator, that is, adds the pixel rearrangement feature vector to the feature vector output after the self-attention operator processes the pixel rearrangement feature vector, and inputs the added feature vector into the normalization operator. In this way, if the output of the previous layer retains less information, the input feature information can be retained by adding the input of the previous layer to avoid losing too much information.

本申请的一些实施例中，上述归一化算子根据加法算子输出的相加得到的特征向量中每个像素点的多维特征值，计算每个维度的均值和标准差，并使用每个维度的均值和标准差来归一化每个像素点在该维度下的特征值，得到特征矩阵，一个特征矩阵对应一个像素重排特征向量。In some embodiments of the present application, the above-mentioned normalization operator calculates the mean and standard deviation of each dimension according to the multidimensional eigenvalues of each pixel point in the eigenvector obtained by adding the output of the addition operator, and uses the mean and standard deviation of each dimension to normalize the eigenvalue of each pixel point in that dimension to obtain a feature matrix, and one feature matrix corresponds to one pixel rearrangement feature vector.

本申请的一些实施例中，上述前向反馈算子包括一个或多个全连接层和激活函数层。前向反馈算子先通过一个或多个全连接层对归一化算子输出的特征矩阵进行线性计算，再通过激活函数层将全连接层的线性输出转换为非线性输出，即对每个像素点的值进行非线性变换，得到原始图像的图像特征信息。In some embodiments of the present application, the above-mentioned forward feedback operator includes one or more fully connected layers and an activation function layer. The forward feedback operator first performs linear calculations on the feature matrix output by the normalization operator through one or more fully connected layers, and then converts the linear output of the fully connected layer into a nonlinear output through the activation function layer, that is, performs a nonlinear transformation on the value of each pixel point to obtain the image feature information of the original image.

示例性地，如图3所示，是本申请的一些实施例提供的Transformer Block的结构示意图。在图3中，Transformer Block包括自注意力算子(Self Attention)、加法&归一化算子(Add&Norm)和前向反馈算子(Feed Forward)，且自注意力算子和前向反馈算子分别与加法&归一化算子连接。电子设备将像素重排特征向量中各个像素点的查询向量query，键向量key和值向量value输入自注意力算子，自注意力算子处理后得到特征向量并输入加法&归一化算子，加法&归一化算子将自注意力算子输入的特征向量和像素重排特征向量进行向量相加和归一化处理，得到特征矩阵后输入前向反馈算子做非线性变换，前向反馈算子将非线性变换结果再输入加法&归一化算子，加法&归一化算子将前向反馈算子的输入即特征矩阵和输出即非线性变换结果进行向量相加和归一化处理，最终输出原始图像的图像特征信息。Exemplarily, as shown in Figure 3, it is a schematic diagram of the structure of the Transformer Block provided by some embodiments of the present application. In Figure 3, the Transformer Block includes a self-attention operator (Self Attention), an addition & normalization operator (Add&Norm) and a forward feedback operator (Feed Forward), and the self-attention operator and the forward feedback operator are respectively connected to the addition & normalization operator. The electronic device inputs the query vector query, key vector key and value vector value of each pixel point in the pixel rearrangement feature vector into the self-attention operator. After being processed by the self-attention operator, the feature vector is obtained and input into the addition & normalization operator. The addition & normalization operator performs vector addition and normalization processing on the feature vector input by the self-attention operator and the pixel rearrangement feature vector. After obtaining the feature matrix, the feature matrix is input into the forward feedback operator for nonlinear transformation. The forward feedback operator inputs the nonlinear transformation result into the addition & normalization operator. The addition & normalization operator performs vector addition and normalization processing on the input of the forward feedback operator, i.e., the feature matrix, and the output, i.e., the nonlinear transformation result, and finally outputs the image feature information of the original image.

需要说明的是，上述第一条件编码器111的自注意力模块可以包括16层Transformer Block，能够提取到原始图像更深层次更丰富的特征信息。It should be noted that the self-attention module of the first conditional encoder 111 may include a 16-layer Transformer Block, which can extract deeper and richer feature information of the original image.

示例性地，如图4所示，是本申请的一些实施例提供的第一条件编码器的结构示意图。在图4中，第一条件编码器111包括图像分割模块、映射层和自注意力模块，且自注意力模块包括16层的Transformer Block。第一条件编码器111对一个原始图像编码得到一项图像特征信息的具体实现包括：电子设备将分辨率为512×512×3的原始图像输入第一条件编码器的图像分割模块做图像分割处理后，得到分辨率为256×768的像素重排图像输入映射层，映射层对像素重排图像进行特征提取，得到像素排列为256×768的像素重排特征向量输入自注意力模块，自注意力模块对像素重排特征向量处理后得到原始图像的维度为256×768的图像特征信息(Image Embedding)。Exemplarily, as shown in FIG4, it is a schematic diagram of the structure of the first conditional encoder provided by some embodiments of the present application. In FIG4, the first conditional encoder 111 includes an image segmentation module, a mapping layer and a self-attention module, and the self-attention module includes a 16-layer Transformer Block. The specific implementation of the first conditional encoder 111 encoding an original image to obtain an image feature information includes: the electronic device inputs the original image with a resolution of 512×512×3 into the image segmentation module of the first conditional encoder for image segmentation processing, and obtains a pixel rearranged image with a resolution of 256×768 and inputs it into the mapping layer. The mapping layer extracts features from the pixel rearranged image to obtain a pixel rearranged feature vector with a pixel arrangement of 256×768 and inputs it into the self-attention module. After the self-attention module processes the pixel rearranged feature vector, the image feature information (Image Embedding) of the original image with a dimension of 256×768 is obtained.

如此，第一条件编码器对原始图像进行特征提取，得到原始图像的图像特征信息，则图像生成模型在处理图像生成任务时，能够结合原始图像的图像特征信息做处理，提高了处理图像生成任务的准确性。In this way, the first conditional encoder extracts features from the original image to obtain image feature information of the original image. When the image generation model processes the image generation task, it can combine the image feature information of the original image for processing, thereby improving the accuracy of processing the image generation task.

本申请的一些实施例中，上述第二条件编码器112包括分词器和自注意力模块。并且，分词器与自注意力模块连接。In some embodiments of the present application, the second conditional encoder 112 includes a word segmenter and a self-attention module. In addition, the word segmenter is connected to the self-attention module.

本申请的一些实施例中，上述分词器用于对文本进行分词并将分词结果转换为向量。上述自注意力模块是用于捕捉文本中分词之间的关联关系的模块。In some embodiments of the present application, the above-mentioned word segmenter is used to segment text and convert the segmentation result into a vector. The above-mentioned self-attention module is a module for capturing the association relationship between segmentations in a text.

本申请的一些实施例中，上述条件编码器11用于对每张原始图像的描述文本进行编码，得到至少一项文本特征信息时，包括：电子设备将每张原始图像的描述文本输入分词器，分词器将每张原始图像的描述文本分割为至少两个分词，并对每个分词进行编码，得到每个描述文本的文本特征向量输入自注意力模块；自注意力模块基于描述文本中分词之间的关联关系，对分词器处理得到的描述文本的文本特征向量进行特征提取，得到描述文本的文本特征信息。In some embodiments of the present application, the above-mentioned conditional encoder 11 is used to encode the description text of each original image to obtain at least one item of text feature information, including: the electronic device inputs the description text of each original image into a word segmenter, the word segmenter divides the description text of each original image into at least two words, and encodes each word to obtain a text feature vector of each description text and inputs it into a self-attention module; the self-attention module extracts features from the text feature vector of the description text obtained by the word segmenter based on the association relationship between the words in the description text, and obtains text feature information of the description text.

示例性地，分词器先按照分词规则将描述文本分割成一个个有意义的分词(token)如单词或者字符。并且，对于不同的语言和任务，分词规则可能会有所不同。例如，在英文中，分词通常可以通过空格或者标点符号进行；而在中文中，由于单词之间没有明显的分隔符，因此需要使用分词算法或者字词典进行分词。在将描述文本分割为多个分词之后，可以将每个分词映射到一个唯一的整数ID，则描述文本就被转换为了一个整数序列，即得到了描述文本的文本特征向量。Exemplarily, the tokenizer first divides the description text into meaningful tokens such as words or characters according to the tokenization rules. In addition, the tokenization rules may be different for different languages and tasks. For example, in English, tokens can usually be segmented by spaces or punctuation marks; in Chinese, since there are no obvious separators between words, a tokenization algorithm or a dictionary is required for tokenization. After the description text is segmented into multiple tokens, each token can be mapped to a unique integer ID, and the description text is converted into an integer sequence, that is, the text feature vector of the description text is obtained.

示例性地，第二条件编码器112中的分词器可以是Tokenizer，Tokenizer将描述文本分割为多个分词，并映射得到每个分词对应的向量，得到维度为[128，768]的文本特征向量。其中，128表示文本编码器支持的最大分词数，768为每个分词对应的向量长度。Exemplarily, the word segmenter in the second conditional encoder 112 may be a Tokenizer, which segments the description text into multiple words, and maps the vector corresponding to each word to obtain a text feature vector with a dimension of [128, 768]. 128 represents the maximum number of words supported by the text encoder, and 768 is the length of the vector corresponding to each word.

进一步地，在对描述文本进行分词后，还可以将分词中的停用词去除，再映射得到每个分词对应的向量，能够减少模型需处理的数据量的大小，提高处理效率。Furthermore, after the description text is segmented, stop words in the segmentation can be removed, and then the vector corresponding to each segmentation can be mapped, which can reduce the amount of data that the model needs to process and improve processing efficiency.

需要说明的是，上述第二条件编码器112的自注意力模块的结构与上述第一条件编码器111的自注意力模块的结构(如图3所示)相同，第二条件编码器112的自注意力模块对输入的文本特征向量的处理过程也与第一条件编码器111的自注意力模块对输入的像素重排特征向量的处理过程相同，其具体实现可以参见上述相关描述，本申请在此不再赘述。It should be noted that the structure of the self-attention module of the second conditional encoder 112 is the same as the structure of the self-attention module of the first conditional encoder 111 (as shown in FIG3 ), and the processing process of the input text feature vector by the self-attention module of the second conditional encoder 112 is also the same as the processing process of the input pixel rearrangement feature vector by the self-attention module of the first conditional encoder 111. The specific implementation can be found in the above-mentioned relevant description, and the present application will not repeat it here.

需要说明的是，上述第二条件编码器112的自注意力模块可以包括12层Transformer Block，能够提取到描述文本更深层次更丰富的特征信息。It should be noted that the self-attention module of the second conditional encoder 112 may include a 12-layer Transformer Block, which can extract deeper and richer feature information describing the text.

示例性地，如图5所示，是本申请的一些实施例提供的第二条件编码器的结构示意图。在图5中，第二条件编码器112包括分词器和自注意力模块，且自注意力模块包括12层Transformer Block。第二条件编码器112对一个原始图像的描述文本编码得到一项文本特征信息的具体实现包括：电子设备将描述文本“阳光明媚，草地上有一只小狗在奔跑”输入第二条件编码器的分词器，分词器对描述文本处理后得到维度为128×768的文本特征向量并输入自注意力模块，自注意力模块对文本特征向量处理后得到描述文本的维度为128×768的文本特征信息(Text Embedding)。Exemplarily, as shown in FIG5 , it is a schematic diagram of the structure of the second conditional encoder provided by some embodiments of the present application. In FIG5 , the second conditional encoder 112 includes a word segmenter and a self-attention module, and the self-attention module includes a 12-layer Transformer Block. The specific implementation of the second conditional encoder 112 encoding the description text of an original image to obtain a text feature information includes: the electronic device inputs the description text "The sun is shining, and there is a puppy running on the grass" into the word segmenter of the second conditional encoder, and the word segmenter processes the description text to obtain a text feature vector with a dimension of 128×768 and inputs it into the self-attention module. After the self-attention module processes the text feature vector, it obtains text feature information (Text Embedding) with a dimension of 128×768 of the description text.

如此，第二条件编码器对描述文本进行特征提取，得到描述文本的文本特征信息，则图像生成模型在处理图像生成任务时，能够结合描述文本的文本特征信息，生成符合描述文本描述的衍生图像或基于描述文本对原始图像做处理，提高了处理图像生成任务的准确性。In this way, the second conditional encoder extracts features from the description text to obtain text feature information of the description text. When the image generation model processes the image generation task, it can combine the text feature information of the description text to generate a derivative image that conforms to the description of the description text or process the original image based on the description text, thereby improving the accuracy of processing the image generation task.

本申请的一些实施例中，如图2所示，上述第二条件编码器121包括文本编码器small和文本编码器big，文本编码器small和文本编码器big均包括卷积模块和自注意力模块，且二者的卷积模块是相同的，自注意力模块均由Transformer Block组成。其中，文本编码器small的自注意力模块包括12层Transformer Block，文本编码器big的自注意力模块包括36层Transformer Block。In some embodiments of the present application, as shown in FIG2 , the second conditional encoder 121 includes a text encoder small and a text encoder big, both of which include a convolution module and a self-attention module, and the convolution modules of the two are the same, and the self-attention modules are composed of Transformer Blocks. Among them, the self-attention module of the text encoder small includes 12 layers of Transformer Blocks, and the self-attention module of the text encoder big includes 36 layers of Transformer Blocks.

示例性地，在图像生成模型包括文本编码器small和文本编码器big这两个第二条件编码器的情况下，文本编码器small对描述文本编码得到文本特征信息1，文本编码器big对描述文本编码得到文本特征信息2，即第二条件编码器112可以编码得到描述文本的两种文本特征信息。Exemplarily, when the image generation model includes two second conditional encoders, a text encoder small and a text encoder big, the text encoder small encodes the description text to obtain text feature information 1, and the text encoder big encodes the description text to obtain text feature information 2, that is, the second conditional encoder 112 can encode two types of text feature information of the description text.

需要说明的是，文本编码器small是一个较轻量但仍然具有特征表示能力的编码器，适用于算力有限的电子设备，例如移动通信设备、车机设备等。文本编码器big是一个具有更多的参数量、更强的建模能力和表达能力，能够处理更复杂的描述文本的编码器，适用于高性能电子设备，例如云端图形处理单元(Graphics Processing Unit，GPU)、服务器等。It should be noted that the text encoder small is a relatively lightweight encoder that still has feature representation capabilities, and is suitable for electronic devices with limited computing power, such as mobile communication devices, car equipment, etc. The text encoder big is an encoder with more parameters, stronger modeling and expression capabilities, and can handle more complex description texts. It is suitable for high-performance electronic devices, such as cloud graphics processing units (GPUs), servers, etc.

本申请的一些实施例中，控制条件用于指示图像生成任务的类型，且不同类型的图像生成任务对应的控制条件可能相同也可能不同。In some embodiments of the present application, the control condition is used to indicate the type of the image generation task, and the control conditions corresponding to different types of image generation tasks may be the same or different.

示例性地，对于文生图任务，控制条件是不改变文本特征信息且将图像特征信息置为0。对于图像处理任务如图生图、图像局部修改等等，控制条件是不改变文本特征信息和图像特征信息。Exemplarily, for the task of text-to-image generation, the control condition is not to change the text feature information and to set the image feature information to 0. For image processing tasks such as image-to-image generation, local image modification, etc., the control condition is not to change the text feature information and image feature information.

需要说明的是，控制条件是在训练得到图像生成模型的模型训练过程中使用，在使用已训练好的图像生成模型处理图像生成任务时不用控制条件。It should be noted that the control conditions are used in the model training process of training the image generation model, and no control conditions are used when using the trained image generation model to process image generation tasks.

示例性地，电子设备在训练得到图像生成模型的模型训练过程中，可以按照预设概率使用控制条件。例如，在模型训练过程中，有46.5％的概率使用的是指示文生图任务的控制条件，有53.5％的概率使用的是指示图像处理任务的控制条件。Exemplarily, during the model training process of training the image generation model, the electronic device may use the control condition according to a preset probability. For example, during the model training process, there is a 46.5% probability of using the control condition indicating the Vincent figure task, and a 53.5% probability of using the control condition indicating the image processing task.

本申请的一些实施例中，条件编码器11，具体用于：在图像生成任务的类型为文生图任务的情况下，将至少一项文本特征信息确定为至少一项条件特征信息；在图像生成任务的类型为图像处理任务的情况下，将至少一项图像特征信息和至少一项文本特征信息确定为至少一项条件特征信息；其中，图像处理任务包括以下至少一项：图生图任务、文和图生图任务、图像消除任务、图像局部修改任务、图像外扩任务。In some embodiments of the present application, the conditional encoder 11 is specifically used to: when the type of the image generation task is a text-to-image task, determine at least one item of text feature information as at least one item of conditional feature information; when the type of the image generation task is an image processing task, determine at least one item of image feature information and at least one item of text feature information as at least one item of conditional feature information; wherein the image processing task includes at least one of the following: an image-to-image task, a text-to-image task, an image elimination task, an image local modification task, and an image expansion task.

示例性地，在图像生成任务的类型为文生图任务的情况下，控制条件是不改变文本特征信息且将图像特征信息置为0，则条件编码器11将至少一项文本特征信息确定为至少一项条件特征信息。在图像生成任务的类型为图像处理任务的情况下，控制条件是不改变文本特征信息和图像特征信息，则条件编码器11将至少一项文本特征信息和至少一项图像特征信息确定为至少一项条件特征信息。Exemplarily, when the type of the image generation task is a text image task, the control condition is not to change the text feature information and to set the image feature information to 0, then the conditional encoder 11 determines at least one item of text feature information as at least one item of conditional feature information. When the type of the image generation task is an image processing task, the control condition is not to change the text feature information and the image feature information, then the conditional encoder 11 determines at least one item of text feature information and at least one item of image feature information as at least one item of conditional feature information.

示例性地，对于一张原始图像和该原始图像的描述文本，在图像生成任务的类型为文生图任务的情况下，将该张原始图像的描述文本的文本特征信息确定为条件特征信息。Exemplarily, for an original image and a description text of the original image, when the type of the image generation task is a text image task, text feature information of the description text of the original image is determined as the conditional feature information.

示例性地，对于一张原始图像和该原始图像的描述文本，在图像生成任务的类型为图像修改任务的情况下，将该张原始图像的图像特征信息和该张原始图像的描述文本的文本特征信息确定为条件特征信息。Exemplarily, for an original image and a description text of the original image, when the type of the image generation task is an image modification task, image feature information of the original image and text feature information of the description text of the original image are determined as conditional feature information.

如此，通过使用控制条件，图像生成模型可以根据图像生成任务的类型确定条件特征信息并进行后续处理，使得一个图像生成模型能够具备处理多种类型的图像生成任务的能力，提高了图像生成任务的处理效率。In this way, by using control conditions, the image generation model can determine the conditional feature information and perform subsequent processing according to the type of image generation task, so that an image generation model can have the ability to handle multiple types of image generation tasks, thereby improving the processing efficiency of image generation tasks.

进一步地，对于一个原始图像和该原始图像的描述文本，在第二条件编码器112编码得到描述文本的两种文本特征信息的情况下。对于文生图任务，控制条件是不改变两种文本特征信息且将图像特征信息置为0，或者，控制条件是不改变任意一种文本特征信息且将另一种文本特征信息和图像特征信息置为0。对于图像处理任务，控制条件是不改变图像特征信息和两种文本特征信息，或者，控制条件是不改变图像特征信息和任意一种文本特征信息且将另一种文本信息置为0。Further, for an original image and a description text of the original image, in the case where two types of text feature information of the description text are encoded by the second conditional encoder 112. For the text image task, the control condition is not to change the two types of text feature information and set the image feature information to 0, or the control condition is not to change any type of text feature information and set the other type of text feature information and the image feature information to 0. For the image processing task, the control condition is not to change the image feature information and the two types of text feature information, or the control condition is not to change the image feature information and any type of text feature information and set the other type of text information to 0.

本申请的一些实施例中，在图像生成任务的类型为文生图任务的情况下，条件编码器11可以将原始图像的描述文本的两种文本特征信息中任意一种文本特征信息确定为条件特征信息；或者，条件编码器11可以将原始图像的描述文本的两种文本特征信息确定为条件特征信息。在图像生成任务的类型为图像处理任务的情况下，条件编码器11可以将原始图像的图像特征信息和原始图像的描述文本的任意一种文本特征信息确定为条件特征信息；或者，条件编码器11可以将原始图像的图像特征信息和原始图像的描述文本的两种文本特征信息确定为条件特征信息。In some embodiments of the present application, when the type of the image generation task is a text image task, the conditional encoder 11 may determine any one of the two text feature information of the description text of the original image as the conditional feature information; or, the conditional encoder 11 may determine the two text feature information of the description text of the original image as the conditional feature information. When the type of the image generation task is an image processing task, the conditional encoder 11 may determine the image feature information of the original image and any one of the text feature information of the description text of the original image as the conditional feature information; or, the conditional encoder 11 may determine the image feature information of the original image and the two text feature information of the description text of the original image as the conditional feature information.

本申请的一些实施例中，上述图像生成模型的加噪模块12，用于基于高斯噪声矩阵和至少一张原始图像的掩码图像，对至少一张原始图像进行加噪处理，得到至少一张噪声图像。In some embodiments of the present application, the noise adding module 12 of the above-mentioned image generation model is used to perform noise adding processing on at least one original image based on a Gaussian noise matrix and a mask image of at least one original image to obtain at least one noisy image.

本申请的一些实施例中，原始图像的掩码图像用于表示该原始图像中的掩码区域。可以理解的是，获取到原始图像的掩码图像后便可以确定该原始图像中被掩码的掩码区域。In some embodiments of the present application, the mask image of the original image is used to represent the mask region in the original image. It is understandable that after the mask image of the original image is acquired, the mask region in the original image can be determined.

示例性地，掩码图像可以是数组或矩阵，该数组或矩阵的维度与原始图像的维度相同，即该数组或矩阵中元素点的数量与原始图像中像素点的数量相同，该数组或矩阵中元素点的排列方式与原始图像中像素点的排列方式相同，即在该数组或矩阵中，每个元素点对应原始图像中的一个像素点。并且，每个元素点的值表示该元素点对应的像素点是否是被掩码的像素点。例如，电子设备可以设置值为1的元素点对应的是掩码区域的像素点，在本申请的后续过程中需要做处理，值为0的元素点对应的是未掩码区域的像素点，在本申请的后续过程中不用做处理。Exemplarily, the mask image can be an array or a matrix, the dimension of the array or the matrix is the same as the dimension of the original image, that is, the number of element points in the array or the matrix is the same as the number of pixels in the original image, and the arrangement of the element points in the array or the matrix is the same as the arrangement of the pixels in the original image, that is, in the array or the matrix, each element point corresponds to a pixel point in the original image. And the value of each element point indicates whether the pixel point corresponding to the element point is a masked pixel point. For example, the electronic device can set the element point with a value of 1 to correspond to the pixel point of the mask area, which needs to be processed in the subsequent process of this application, and the element point with a value of 0 corresponds to the pixel point of the unmasked area, which does not need to be processed in the subsequent process of this application.

示例性地，电子设备还可以用一个与原始图像的大小相同的二值或布尔图像表示掩码图像。并且，在该二值或布尔图像中，黑色部分表示掩码区域，白色部分表示未掩码区域；或者，白色部分表示掩码区域，黑色部分表示未掩码区域。Exemplarily, the electronic device may also use a binary or Boolean image of the same size as the original image to represent the mask image. In addition, in the binary or Boolean image, the black portion represents the masked area, and the white portion represents the unmasked area; or the white portion represents the masked area, and the black portion represents the unmasked area.

例如，假设原始图像的分辨率为[3，512，512]，表示该原始图像包括512×512个像素点，且每个像素点的值是3维向量，该3维向量表示该像素点的RGB值。以掩码图像是二值图像，且黑色部分表示掩码区域，白色部分表示未掩码区域为例，则掩码图像也可以是分辨率为[3，512，512]的图像，表示该掩码图像包括512×512个像素点，且每个像素点的值是3维向量，该图像的掩码区域中各像素点的3维向量是黑色对应的RGB值，非掩码区域中各像素点的3维向量是白色对应的RGB值。For example, assuming that the resolution of the original image is [3, 512, 512], it means that the original image includes 512×512 pixels, and the value of each pixel is a 3D vector, and the 3D vector represents the RGB value of the pixel. For example, if the mask image is a binary image, and the black part represents the masked area, and the white part represents the unmasked area, then the mask image can also be an image with a resolution of [3, 512, 512], indicating that the mask image includes 512×512 pixels, and the value of each pixel is a 3D vector, the 3D vector of each pixel in the masked area of the image is the RGB value corresponding to black, and the 3D vector of each pixel in the unmasked area is the RGB value corresponding to white.

本申请的一些实施例中，上述高斯噪声矩阵是一种服从正态分布的噪声矩阵，可以通过电子设备随机采样得到。In some embodiments of the present application, the Gaussian noise matrix is a noise matrix that obeys a normal distribution and can be obtained by random sampling of an electronic device.

本申请的一些实施例中，加噪模块12，用于基于高斯噪声矩阵和至少一张原始图像的掩码图像，对至少一张原始图像进行加噪处理，得到至少一张噪声图像，包括：加噪模块12将高斯噪声矩阵和至少一张原始图像的掩码图像进行加权求和，得到至少一个加噪矩阵，将至少一个加噪矩阵和至少一张原始图像进行加权求和，得到至少一张噪声图像。In some embodiments of the present application, the denoising module 12 is used to perform denoising processing on at least one original image based on a Gaussian noise matrix and a mask image of at least one original image to obtain at least one noisy image, including: the denoising module 12 performs weighted summation of the Gaussian noise matrix and the mask image of at least one original image to obtain at least one denoised matrix, and performs weighted summation of at least one denoised matrix and at least one original image to obtain at least one noisy image.

本申请的一些实施例中，每张噪声图像对应一个加噪矩阵和一张原始图像。In some embodiments of the present application, each noisy image corresponds to a noise matrix and an original image.

本申请的一些实施例中，加噪模块12可以通过如下公式(1)确定加噪矩阵：In some embodiments of the present application, the noise adding module 12 may determine the noise adding matrix by the following formula (1):

z'＝Weight*GNM+(1-Weight)*Mask (1)z'＝Weight*GNM+(1-Weight)*Mask (1)

在公式(1)中，z'表示加噪矩阵，Weight表示权重，GNM表示高斯噪声矩阵，Mask表示掩码图像。In formula (1), z' represents the noise matrix, Weight represents the weight, GNM represents the Gaussian noise matrix, and Mask represents the mask image.

需要说明的是，Weight是在模型训练过程中作为模型的参数通过反向传播调整得到的。It should be noted that Weight is obtained by adjusting the model parameters through back propagation during model training.

示例性地，上述权重可以取[0-1]的任何值，包括0和1。若权重取1，则可以理解为通过高斯噪声矩阵对原始图像进行加噪处理；若权重是0，则可以理解为通过掩码图像对原始图像进行加噪处理；若权重大于0且小于1，则可以理解为使用高斯噪声矩阵和掩码图像结合的方式对原始图像进行加噪处理。Exemplarily, the weight can take any value from [0 to 1], including 0 and 1. If the weight is 1, it can be understood as the original image is subjected to noise processing by the Gaussian noise matrix; if the weight is 0, it can be understood as the original image is subjected to noise processing by the mask image; if the weight is greater than 0 and less than 1, it can be understood as the original image is subjected to noise processing by combining the Gaussian noise matrix and the mask image.

需要说明的是，高斯噪声矩阵中各元素的值可以是一维或多维向量，掩码图像中各元素的值也可以是一维或多维向量。在加噪模块12将高斯噪声矩阵和掩码图像加权求和时，若高斯噪声矩阵的维度与掩码图像的维度相同，即高斯噪声矩阵包括的元素点的数量与掩码图像包括的元素点的数量相同，且高斯噪声矩阵包括的元素点的排列方式与掩码图像包括的元素点的排列方式相同，且高斯噪声矩阵中各元素点的值的长度与掩码图像中各元素点的值的长度相同，则加噪模块12直接将高斯噪声矩阵与掩码图像输入上述公式(1)中确定加噪矩阵。但若高斯噪声矩阵的维度与掩码图像的维度不同，则加噪模块12可以对维度更高的一方进行降维处理，以使二者维度相同，再将维度相同的高斯噪声矩阵与掩码图像输入上述公式(1)中确定加噪矩阵。It should be noted that the value of each element in the Gaussian noise matrix can be a one-dimensional or multi-dimensional vector, and the value of each element in the mask image can also be a one-dimensional or multi-dimensional vector. When the noise adding module 12 weights and sums the Gaussian noise matrix and the mask image, if the dimension of the Gaussian noise matrix is the same as the dimension of the mask image, that is, the number of element points included in the Gaussian noise matrix is the same as the number of element points included in the mask image, and the arrangement of the element points included in the Gaussian noise matrix is the same as the arrangement of the element points included in the mask image, and the length of the value of each element point in the Gaussian noise matrix is the same as the length of the value of each element point in the mask image, then the noise adding module 12 directly inputs the Gaussian noise matrix and the mask image into the above formula (1) to determine the noise adding matrix. However, if the dimension of the Gaussian noise matrix is different from the dimension of the mask image, the noise adding module 12 can perform dimensionality reduction processing on the one with higher dimension to make the two dimensions the same, and then input the Gaussian noise matrix and the mask image with the same dimension into the above formula (1) to determine the noise adding matrix.

例如，在高斯噪声矩阵的维度是[16，64，64]的情况下，假设掩码图像的维度是[3，512，512]，则加噪模块12可以对掩码图像进行降维处理，得到维度为[16，64，64]的掩码图像。For example, when the dimension of the Gaussian noise matrix is [16, 64, 64], assuming that the dimension of the mask image is [3, 512, 512], the denoising module 12 can perform dimensionality reduction processing on the mask image to obtain a mask image with a dimension of [16, 64, 64].

本申请的一些实施例中，加噪模块12可以通过如下公式(2)确定噪声图像：In some embodiments of the present application, the noise adding module 12 may determine the noise image by the following formula (2):

在公式(2)中，NoiseLatent表示噪声图像，x₀表示原始图像，z'表示加噪矩阵，是非线性系数，是固定值。In formula (2), NoiseLatent represents the noise image, _x0 represents the original image, and z' represents the noise matrix. is the nonlinear coefficient and is a fixed value.

需要说明的是，加噪矩阵中各元素的值可以是一维或多维向量，原始图像中各像素点的值也可以是一维或多维向量。在加噪模块12将加噪矩阵和原始图像加权求和时，若加噪矩阵的维度与原始图像的维度相同，即加噪矩阵包括的元素点的数量与原始图像包括的像素点的数量相同，且加噪矩阵包括的元素点的排列方式与原始图像包括的像素点的排列方式相同，且加噪矩阵中各元素点的值的长度与原始图像中各像素点的值的长度相同，则加噪模块12直接将加噪矩阵与原始图像输入上述公式(2)中确定噪声图像。若加噪矩阵的维度与原始图像的维度不同，则可以对维度更高的一方进行降维处理，以使二者维度相同，再将维度相同的加噪矩阵与原始图像输入上述公式(2)中确定噪声图像。It should be noted that the value of each element in the noise matrix can be a one-dimensional or multi-dimensional vector, and the value of each pixel in the original image can also be a one-dimensional or multi-dimensional vector. When the noise module 12 weights and sums the noise matrix and the original image, if the dimension of the noise matrix is the same as the dimension of the original image, that is, the number of element points included in the noise matrix is the same as the number of pixel points included in the original image, and the arrangement of the element points included in the noise matrix is the same as the arrangement of the pixel points included in the original image, and the length of the value of each element point in the noise matrix is the same as the length of the value of each pixel point in the original image, then the noise module 12 directly inputs the noise matrix and the original image into the above formula (2) to determine the noise image. If the dimension of the noise matrix is different from the dimension of the original image, the higher-dimensional side can be subjected to dimensionality reduction processing to make the two dimensions the same, and then the noise matrix and the original image with the same dimension are input into the above formula (2) to determine the noise image.

例如，在加噪矩阵的维度是[16，64，64]的情况下，假设原始图像的维度是[3，512，512]，则加噪模块12可以对原始图像进行降维处理，得到维度为[16，64，64]的原始图像。For example, when the dimension of the noise matrix is [16, 64, 64], assuming that the dimension of the original image is [3, 512, 512], the noise adding module 12 can perform dimensionality reduction processing on the original image to obtain an original image with a dimension of [16, 64, 64].

如此，加噪模块12通过仅高斯噪声矩阵或仅掩码图像或高斯噪声矩阵和掩码图像结合的方式对原始图像加噪，则加噪矩阵的形式很多，对原始图像加噪得到的噪声图像的形式也很多，因此，图像生成模型能够学习到从多种噪声中重建图像的能力，提高了模型的性能。In this way, the denoising module 12 adds noise to the original image by using only a Gaussian noise matrix, only a mask image, or a combination of a Gaussian noise matrix and a mask image. There are many forms of the denoising matrix, and there are also many forms of the noisy image obtained by denoising the original image. Therefore, the image generation model can learn the ability to reconstruct images from multiple noises, thereby improving the performance of the model.

需要说明的是，维度越大(即像素点的数量越多或各像素点的值的长度越长)则计算量越大，由于原始图像的维度通常比较大，而高斯噪声矩阵的维度可能会比较小，因此可以先对原始图像进行特征压缩处理，再根据高斯噪声矩阵和掩码图像对特征压缩后的原始图像进行加噪处理，能够减少加噪模块12执行加噪处理的计算量，减少资源占用，提高处理效率。It should be noted that the larger the dimension (that is, the more pixels or the longer the length of the value of each pixel), the greater the amount of calculation. Since the dimension of the original image is usually large, and the dimension of the Gaussian noise matrix may be small, the original image can be feature compressed first, and then the original image after feature compression can be denoised according to the Gaussian noise matrix and the mask image. This can reduce the amount of calculation for the denoising module 12 to perform the denoising process, reduce resource usage, and improve processing efficiency.

本申请的一些实施例中，上述图像生成模型的扩散模型13，用于基于至少一项条件特征信息和至少一张噪声图像，生成至少一张衍生图像，每张衍生图像对应一项条件特征信息和一张噪声图像。In some embodiments of the present application, the diffusion model 13 of the above-mentioned image generation model is used to generate at least one derivative image based on at least one conditional feature information and at least one noise image, and each derivative image corresponds to one conditional feature information and one noise image.

示例性地，上述扩散模型13可以是Diffusion Model。Exemplarily, the diffusion model 13 may be a Diffusion Model.

本申请的一些实施例中，扩散模型13用于基于一项条件特征信息和一张噪声图像，生成一张衍生图像时包括：扩散模型13获取至少一张噪声图像的特征信息，并对至少一项条件特征信息和至少一张噪声图像的特征信息融合处理，得到至少一项融合特征信息，每项融合特征信息对应一项条件特征信息和一张噪声图像的特征信息；扩散模型13基于至少一项融合特征信息和至少一个加噪向量，确定至少一张噪声图像的噪声矩阵，每个加噪向量表示一张噪声图像的噪声等级；扩散模型基于噪声矩阵和噪声图像确定衍生图像。In some embodiments of the present application, the diffusion model 13 is used to generate a derivative image based on a conditional feature information and a noise image, including: the diffusion model 13 obtains the feature information of at least one noise image, and fuses the at least one conditional feature information and the feature information of at least one noise image to obtain at least one fused feature information, each fused feature information corresponds to a conditional feature information and feature information of a noise image; the diffusion model 13 determines the noise matrix of at least one noise image based on the at least one fused feature information and at least one noise vector, each noise vector represents the noise level of a noise image; the diffusion model determines the derivative image based on the noise matrix and the noise image.

本申请的一些实施例中，上述至少一张噪声图像的特征信息可以以特征矩阵的形式表示。In some embodiments of the present application, the feature information of the at least one noise image may be represented in the form of a feature matrix.

本申请的一些实施例中，上述扩散模型13可以包括图像分割模块、线性变换算子、拼接算子、扩散自注意力模块、层归一化算子和卷积算子。In some embodiments of the present application, the above-mentioned diffusion model 13 may include an image segmentation module, a linear transformation operator, a splicing operator, a diffusion self-attention module, a layer normalization operator and a convolution operator.

本申请的一些实施例中，上述扩散模型13获取至少一张噪声图像的特征信息，并对至少一项条件特征信息和至少一张噪声图像的特征信息融合处理，得到至少一项融合特征信息，包括：加噪模块12将至少一张噪声图像输入图像分割模块，图像分割模块将每张噪声图像分割为M个噪声图像块输入线性变换算子；线性变换算子提取图像分割模块处理得到的每个噪声图像块的特征信息，并添加位置编码向量，得到至少一张噪声图像的特征信息输入拼接算子，并且，线性变换算子对至少一项条件特征信息进行线性变换处理，得到至少一项线性变换后的条件特征信息，拼接算子将线性变换算子处理得到的至少一张噪声图像的特征信息和至少一项线性变换后的条件特征信息拼接，得到融合特征信息。In some embodiments of the present application, the diffusion model 13 obtains feature information of at least one noise image, and fuses at least one conditional feature information and feature information of at least one noise image to obtain at least one fused feature information, including: the noise adding module 12 inputs at least one noise image into the image segmentation module, and the image segmentation module divides each noise image into M noise image blocks and inputs them into the linear transformation operator; the linear transformation operator extracts the feature information of each noise image block processed by the image segmentation module, and adds a position coding vector to obtain feature information of at least one noise image and inputs it into the splicing operator, and the linear transformation operator performs linear transformation processing on at least one conditional feature information to obtain at least one conditional feature information after linear transformation, and the splicing operator splices the feature information of at least one noise image processed by the linear transformation operator with at least one conditional feature information after linear transformation to obtain fused feature information.

其中，M为大于1的整数，且M根据实际需求设置。Wherein, M is an integer greater than 1, and M is set according to actual needs.

示例性地，上述图像分割模块可以是patching模块，线性变换算子可以是linear，拼接算子可以是concat。Exemplarily, the image segmentation module may be a patching module, the linear transformation operator may be linear, and the concatenation operator may be concat.

本申请的一些实施例中，对于每个噪声图像，加噪模块12将噪声图像输入图像分割模块，图像分割模块将噪声图像分割为M个噪声图像块后输入线性变换算子，线性变换算子提取图像分割模块处理得到的每个噪声图像块的特征信息，并按照每个噪声图像块在噪声图像中的位置，将M个噪声图像块的特征信息拼接，得到噪声图像的特征信息。In some embodiments of the present application, for each noise image, the noise addition module 12 inputs the noise image into an image segmentation module, the image segmentation module divides the noise image into M noise image blocks and then inputs the blocks into a linear transformation operator, the linear transformation operator extracts the feature information of each noise image block processed by the image segmentation module, and splices the feature information of the M noise image blocks according to the position of each noise image block in the noise image to obtain the feature information of the noise image.

进一步地，上述图像分割模块还可以在将M个噪声图像块的特征信息拼接的过程中添加位置编码向量，以标记每个噪声图像块的位置顺序，得到噪声图像的特征信息。Furthermore, the above-mentioned image segmentation module can also add a position encoding vector in the process of splicing the feature information of the M noise image blocks to mark the position sequence of each noise image block to obtain the feature information of the noise image.

示例性地，以噪声图像是维度为[16，64，64]的特征矩阵为例，图像分割模块将噪声图像分割为多个噪声图像块，再通过上述线性变换算子提取每个噪声图像块的特征信息，并添加位置编码向量(Position Embedding)，得到噪声图像的维度为4096*768的特征信息。For example, taking the noise image as a feature matrix with a dimension of [16, 64, 64] as an example, the image segmentation module divides the noise image into multiple noise image blocks, and then extracts the feature information of each noise image block through the above-mentioned linear transformation operator, and adds a position encoding vector (Position Embedding) to obtain the feature information of the noise image with a dimension of 4096*768.

本申请的一些实施例中，条件编码器11将条件特征信息输入线性变换算子，在条件特征信息包括一项文本特征信息的情况下，线性变换算子对该一项文本特征信息进行线性变换处理，得到线性变换后的条件特征信息。In some embodiments of the present application, the conditional encoder 11 inputs the conditional feature information into a linear transformation operator. When the conditional feature information includes one item of text feature information, the linear transformation operator performs a linear transformation on the one item of text feature information to obtain the conditional feature information after linear transformation.

本申请的一些实施例中，在条件特征信息包括文本特征信息和图像特征信息两项的情况下，线性变换算子将图像特征信息和文本特征信息转换为维度相同的两个特征矩阵，再将两个特征矩阵中相同位置的特征相加，得到线性变换后的条件特征信息。In some embodiments of the present application, when the conditional feature information includes text feature information and image feature information, the linear transformation operator converts the image feature information and the text feature information into two feature matrices of the same dimension, and then adds the features at the same position in the two feature matrices to obtain the conditional feature information after linear transformation.

本申请的一些实施例中，线性变换算子将线性变换后的条件特征信息和噪声图像的特征信息输入上述拼接模块，拼接模型将该线性变换后的条件特征信息与噪声图像的特征信息在维度上拼接，得到融合特征信息(Input Tokens)。In some embodiments of the present application, the linear transformation operator inputs the conditional feature information after the linear transformation and the feature information of the noise image into the above-mentioned splicing module, and the splicing model splices the conditional feature information after the linear transformation with the feature information of the noise image in dimension to obtain fused feature information (Input Tokens).

示例性地，以条件特征信息包括图像特征信息和文本特征信息，且图像特征信息是维度为256*768的特征矩阵，文本特征信息是维度为128*768的特征矩阵，噪声图像的特征信息是维度为4096*768的特征矩阵为例，通过线性变换算子将图像特征信息和文本特征信息均转换为维度为410*768的特征矩阵，再将两个特征矩阵相加，得到维度为410*768的线性变换后的条件特征信息，再通过拼接算子将该维度为410*768的线性变换后的条件特征信息与上述维度为4096*768的噪声图像的特征信息在维度上拼接，得到维度为4506*768的特征矩阵，即融合特征信息。Exemplarily, taking the case where the conditional feature information includes image feature information and text feature information, and the image feature information is a feature matrix with a dimension of 256*768, the text feature information is a feature matrix with a dimension of 128*768, and the feature information of the noise image is a feature matrix with a dimension of 4096*768, the image feature information and the text feature information are converted into a feature matrix with a dimension of 410*768 through a linear transformation operator, and then the two feature matrices are added to obtain the conditional feature information after the linear transformation with a dimension of 410*768, and then the conditional feature information after the linear transformation with a dimension of 410*768 is spliced with the feature information of the noise image with a dimension of 4096*768 through a splicing operator in terms of dimension to obtain a feature matrix with a dimension of 4506*768, i.e., the fused feature information.

本申请的一些实施例中，上述扩散模型13基于至少一项融合特征信息和至少一个加噪向量，确定至少一张噪声图像的噪声矩阵时，包括：拼接算子将融合特征信息输入扩散自注意力模块，扩散自注意力模块对加噪向量进行特征提取，得到一组缩放因子，并基于一组缩放因子对融合特征向量进行特征缩放，得到缩放特征信息输入层归一化算子，层归一化算子对缩放特征信息进行归一化处理，得到噪声特征信息输入卷积算子，卷积算子对噪声特征信息进行降通道数处理，得到噪声图像的噪声矩阵。In some embodiments of the present application, when the above-mentioned diffusion model 13 determines the noise matrix of at least one noisy image based on at least one fused feature information and at least one noise vector, it includes: the splicing operator inputs the fused feature information into the diffusion self-attention module, the diffusion self-attention module performs feature extraction on the noise vector to obtain a set of scaling factors, and performs feature scaling on the fused feature vector based on the set of scaling factors to obtain the scaled feature information input into the layer normalization operator, the layer normalization operator performs normalization processing on the scaled feature information to obtain the noise feature information input into the convolution operator, the convolution operator performs channel reduction processing on the noise feature information to obtain the noise matrix of the noise image.

本申请的一些实施例中，加噪向量可以是预先设置的一组常数向量。In some embodiments of the present application, the noise adding vector may be a set of preset constant vectors.

示例性地，加噪向量可以是0123456789…999。Exemplarily, the noise vector may be 0123456789…999.

示例性地，上述扩散自注意力模块可以是DiT Block或Large DiT Block。Exemplarily, the above-mentioned diffuse self-attention module can be a DiT Block or a Large DiT Block.

示例性地，如图6所示，是本申请的一些实施例提供的扩散自注意力模块的结构示意图。在图6中，扩散自注意力模块包括多层感知机(Multilayer Perceptron，MLP)、层归一化算子(Layer Norm)、多头自注意力机制(Multi Head Self Attention，MHSA)层、前向反馈算子、特征缩放层(scale)和残差计算层(shift)。电子设备将加噪向量(Timestep)和融合特征信息(Attention Embedding)输入扩散自注意力模块后，扩散自注意力模块的处理过程如下：Exemplarily, as shown in FIG6 , it is a schematic diagram of the structure of the diffusion self-attention module provided by some embodiments of the present application. In FIG6 , the diffusion self-attention module includes a multilayer perceptron (Multilayer Perceptron, MLP), a layer normalization operator (Layer Norm), a multi-head self-attention mechanism (Multi Head Self Attention, MHSA) layer, a forward feedback operator, a feature scaling layer (scale) and a residual calculation layer (shift). After the electronic device inputs the noise vector (Timestep) and the fused feature information (Attention Embedding) into the diffusion self-attention module, the processing process of the diffusion self-attention module is as follows:

a.多层感知机对加噪向量做特征提取，得到6个缩放因子分别为α1、α2、β1、β2、γ1和γ2。拼接算子将融合特征信息输入层归一化算子做归一化处理，再经过特征缩放&残差计算层通过缩放因子β1和γ1对归一化处理后的融合特征信息做特征缩放，特征缩放的计算过程为Attention Embedding*γ1+β1；a. The multi-layer perceptron extracts features from the noise vector and obtains six scaling factors, namely α1, α2, β1, β2, γ1 and γ2. The concatenation operator inputs the fused feature information into the normalization operator layer for normalization, and then the feature scaling & residual calculation layer performs feature scaling on the normalized fused feature information using scaling factors β1 and γ1. The feature scaling calculation process is Attention Embedding*γ1+β1;

b.将步骤a的输出输入多头自注意力机制层做自注意力计算，捕捉条件特征信息与噪声图像的特征信息之间的关联关系；b. Input the output of step a into the multi-head self-attention mechanism layer for self-attention calculation to capture the correlation between the conditional feature information and the feature information of the noise image;

c.将步骤b的输出输入特征缩放层通过缩放因子α1做特征缩放，再与融合特征信息做残差计算；c. Input the output of step b into the feature scaling layer by scaling factor α1, and then perform residual calculation with the fused feature information;

d.将步骤c的输出输入层归一化算子做归一化处理，再经过特征缩放&残差计算层通过缩放因子β2和γ2做特征缩放；d. Normalize the output input layer normalization operator of step c, and then perform feature scaling through the feature scaling & residual calculation layer using scaling factors β2 and γ2;

e.将步骤d的输出输入前向反馈算子，通过一个或多个全连接层对输入进行线性计算，再通过激活函数层将全连接层的线性输出进行非线性变换，得到非线性输出，以增加特征的非线性表示能力；e. Input the output of step d into the forward feedback operator, perform linear calculation on the input through one or more fully connected layers, and then perform nonlinear transformation on the linear output of the fully connected layer through the activation function layer to obtain a nonlinear output to increase the nonlinear representation capability of the feature;

f.将步骤e的输出输入特征缩放层通过缩放因子α2做特征缩放，并与步骤c的输出做残差计算，得到缩放特征信息作为扩散自注意力模块的输出。f. Input the output of step e into the feature scaling layer and perform feature scaling by the scaling factor α2, and perform residual calculation with the output of step c to obtain the scaled feature information as the output of the diffuse self-attention module.

需要说明的是，本申请的一些实施例中，上述扩散模型可以包括16层扩散自注意力模块，用于加深网络，加大模型的参数量，使得模型有更多的参数量能够学习到更多知识，增加模型的拟合能力。It should be noted that in some embodiments of the present application, the above-mentioned diffusion model may include a 16-layer diffusion self-attention module, which is used to deepen the network and increase the number of parameters of the model, so that the model has more parameters to learn more knowledge and increase the fitting ability of the model.

本申请的一些实施例中，扩散自注意力模块将缩放特征信息输入层归一化算子，层归一化算子先对缩放特征信息进行维度变换处理，再对维度变换的结果进行归一化处理，得到噪声特征信息。In some embodiments of the present application, the diffuse self-attention module inputs the scaled feature information into a layer normalization operator, and the layer normalization operator first performs a dimensionality transformation on the scaled feature information, and then normalizes the result of the dimensionality transformation to obtain noise feature information.

示例性地，上述扩散模型13的层归一化算子可以是Layer Norm。Exemplarily, the layer normalization operator of the diffusion model 13 may be Layer Norm.

示例性地，维度变换处理可以理解为进行reshape操作，归一化处理可以理解为Layer Norm操作。Exemplarily, the dimensionality transformation process can be understood as a reshape operation, and the normalization process can be understood as a Layer Norm operation.

示例性地，假设缩放特征信息是维度为4506*768的特征矩阵，经过上述层归一化算子做reshape操作和Layer Norm操作后，得到维度为[768，64，64]的特征矩阵，即噪声特征信息。For example, assuming that the scaled feature information is a feature matrix with a dimension of 4506*768, after the above-mentioned layer normalization operator performs reshape operation and Layer Norm operation, a feature matrix with a dimension of [768, 64, 64] is obtained, that is, the noise feature information.

本申请的一些实施例中，层归一化算子将噪声特征信息输入卷积算子，卷积算子在不改变噪声特征信息的大小的情况下，减少噪声特征信息中各元素的通道数，得到通道数更少的特征矩阵，称为噪声图像的噪声矩阵(Predict Latent)。In some embodiments of the present application, the layer normalization operator inputs the noise feature information into the convolution operator, and the convolution operator reduces the number of channels of each element in the noise feature information without changing the size of the noise feature information, thereby obtaining a feature matrix with fewer channels, which is called the noise matrix of the noise image (Predict Latent).

示例性地，上述卷积算子可以是Conv_out。Exemplarily, the above-mentioned convolution operator may be Conv_out.

示例性地，假设噪声特征信息是维度为[768，64，64]的特征矩阵，经过卷积算子降低特征矩阵的通道数之后，得到维度为[32，64，64]的噪声矩阵。Exemplarily, assuming that the noise feature information is a feature matrix with a dimension of [768, 64, 64], after the convolution operator reduces the number of channels of the feature matrix, a noise matrix with a dimension of [32, 64, 64] is obtained.

如此，在扩散自注意力模块中，用加噪向量不断向噪声图像加噪声，直至得到完全是噪声的噪声矩阵，即预测出了噪声图像中的噪声。In this way, in the diffuse self-attention module, the noise vector is used to continuously add noise to the noisy image until a noise matrix that is completely noise is obtained, that is, the noise in the noisy image is predicted.

本申请的一些实施例中，上述扩散模型根据噪声图像得到噪声矩阵的过程是扩散过程，卷积算子输出噪声矩阵后，可以按照逆扩散过程对噪声矩阵做处理，得到衍生图像。其中，逆扩散过程与扩散过程方向相反。In some embodiments of the present application, the process of obtaining the noise matrix from the noise image by the diffusion model is a diffusion process, and after the convolution operator outputs the noise matrix, the noise matrix can be processed according to the inverse diffusion process to obtain a derivative image, wherein the inverse diffusion process is in the opposite direction to the diffusion process.

示例性地，卷积算子得到噪声矩阵后，将噪声矩阵按照与扩散过程相反的方向依次输入扩散模型的卷积算子、层归一化算子和扩散自注意力模块，最终输出衍生图像。Exemplarily, after the convolution operator obtains the noise matrix, the noise matrix is input into the convolution operator, layer normalization operator and diffusion self-attention module of the diffusion model in sequence in the direction opposite to the diffusion process, and finally a derived image is output.

示例性地，如图7所示，是本申请的一些实施例提供的扩散模型的结构示意图，在图7中，扩散模型13包括图像分割(patching)模块、线性变换算子(linear)、拼接算子(concat)、扩散自注意力模块(Large DiT Block)、层归一化算子(Layer Norm)和卷积算子(Conv_out)。Exemplarily, as shown in Figure 7, it is a structural diagram of the diffusion model provided by some embodiments of the present application. In Figure 7, the diffusion model 13 includes an image segmentation (patching) module, a linear transformation operator (linear), a concatenation operator (concat), a diffusion self-attention module (Large DiT Block), a layer normalization operator (Layer Norm) and a convolution operator (Conv_out).

接下来结合图7对扩散模型13生成噪声矩阵的过程进行说明：以条件特征信息包括图像特征信息和文本特征信息为例，条件编码器11将维度为410*768的条件特征信息输入线性变换算子，线性变换算子对图像特征信息和文本特征信息做线性变换和特征组合，得到线性变换后的条件特征信息(Condition Embedding)，维度为410*768；将分辨率为16*64*64的噪声图像(Noise Latent)输入图像分割模块分割为M个图像块，将M个图像块输入线性变换算子做特征提取，得到每个图像块的特征信息，然后添加位置编码向量(PositionEmbedding)，得到噪声图像的特征信息，维度为4096*768。将线性变换后的条件特征信息(Condition Embedding)和噪声图像的特征信息输入拼接算子进行拼接，得到融合特征信息(Attention Embedding)，维度为4506*768，将融合特征信息和维度为1*768的加噪向量(Timestep)输入包括16层Large DiT Block的扩散自注意力模块做处理，得到缩放特征信息，维度为4506*768，将缩放特征信息输入层归一化算子做reshape操作和Layer Norm操作，得到噪声特征信息，维度为[768，64，64]，将噪声特征信息输入卷积算子减少通道数，得到噪声图像的噪声矩阵(Predict Noise)，维度为[32，64，64]。Next, the process of generating a noise matrix by the diffusion model 13 is explained in conjunction with Figure 7: Taking the conditional feature information including image feature information and text feature information as an example, the conditional encoder 11 inputs the conditional feature information with a dimension of 410*768 into the linear transformation operator, and the linear transformation operator performs linear transformation and feature combination on the image feature information and the text feature information to obtain the conditional feature information (Condition Embedding) after linear transformation, with a dimension of 410*768; the noise image (Noise Latent) with a resolution of 16*64*64 is input into the image segmentation module to be segmented into M image blocks, and the M image blocks are input into the linear transformation operator for feature extraction to obtain the feature information of each image block, and then the position encoding vector (PositionEmbedding) is added to obtain the feature information of the noise image, with a dimension of 4096*768. The conditional feature information (Condition Embedding) after linear transformation and the feature information of the noise image are input into the splicing operator for splicing to obtain the fused feature information (Attention Embedding) with a dimension of 4506*768. The fused feature information and the noise vector (Timestep) with a dimension of 1*768 are input into the diffusion self-attention module including 16 layers of Large DiT Block for processing to obtain scaled feature information with a dimension of 4506*768. The scaled feature information is input into the layer normalization operator for reshape operation and Layer Norm operation to obtain the noise feature information with a dimension of [768, 64, 64]. The noise feature information is input into the convolution operator to reduce the number of channels to obtain the noise matrix (Predict Noise) of the noise image with a dimension of [32, 64, 64].

如此，图像生成模型包括条件编码器、加噪模块和扩散模型，条件编码器根据控制条件、文本特征信息和图像特征信息确定条件特征信息，以控制图像生成模型执行不同类型的图像生成任务，并在加噪模块用高斯噪声矩阵和掩码图像对原始图像进行加噪处理，得到噪声图像输入扩散模型生成衍生图像，使得图像生成模型不仅能够具备从噪声中生成图像的能力，也能够具备预测、消除或修改掩码图像对应的掩码区域的图像内容的能力，使得图像生成模型能够适应不同类型的图像生成任务，即一个图像生成模型具备了处理多种类型的图像生成任务的能力，在处理不同类型的图像生成任务时不需要再调用多个不同的模型，提高了处理图像生成任务的效率。In this way, the image generation model includes a conditional encoder, a noise addition module and a diffusion model. The conditional encoder determines the conditional feature information according to the control condition, text feature information and image feature information to control the image generation model to perform different types of image generation tasks, and in the noise addition module, the original image is subjected to noise processing using a Gaussian noise matrix and a mask image to obtain a noise image input to the diffusion model to generate a derivative image, so that the image generation model can not only have the ability to generate images from noise, but also have the ability to predict, eliminate or modify the image content of the mask area corresponding to the mask image, so that the image generation model can adapt to different types of image generation tasks, that is, an image generation model has the ability to handle multiple types of image generation tasks, and there is no need to call multiple different models when handling different types of image generation tasks, thereby improving the efficiency of processing image generation tasks.

本申请的一些实施例中，如图2所示，图像生成模型还包括图像编码器14和图像解码器15，图像编码器14与加噪模块12连接，图像解码器15与扩散模型13连接。In some embodiments of the present application, as shown in FIG. 2 , the image generation model further includes an image encoder 14 and an image decoder 15 , the image encoder 14 is connected to the noise addition module 12 , and the image decoder 15 is connected to the diffusion model 13 .

本申请的一些实施例中，上述图像编码器14，用于对至少一张原始图像进行特征压缩，得到至少一个图像特征矩阵。In some embodiments of the present application, the image encoder 14 is used to perform feature compression on at least one original image to obtain at least one image feature matrix.

本申请的一些实施例中，上述至少一个图像特征矩阵中每个图像特征矩阵对应一张原始图像，并且，每个图像特征矩阵中像素点的数量小于原始图像中像素点的数量。In some embodiments of the present application, each image feature matrix in the at least one image feature matrix corresponds to an original image, and the number of pixels in each image feature matrix is less than the number of pixels in the original image.

示例性地，以原始图像的维度是[3，512，512]为例，图像编码器14对原始图像进行特征压缩后，得到维度为[16，64，64]的图像特征矩阵。Exemplarily, taking the dimension of the original image as [3, 512, 512], the image encoder 14 performs feature compression on the original image to obtain an image feature matrix with a dimension of [16, 64, 64].

本申请的一些实施例中，上述图像编码器14可以是任何能够实现特征压缩功能，即将一个高维的输入映射到一个低维的隐变量上的编码器。In some embodiments of the present application, the image encoder 14 may be any encoder that can implement a feature compression function, that is, map a high-dimensional input to a low-dimensional latent variable.

示例性地，上述图像编码器14可以是VAE编码器(VAE-Encoder)，对原始图像进行特征压缩可以理解为是将原始图像从像素空间压缩到latent空间。Exemplarily, the image encoder 14 may be a VAE encoder (VAE-Encoder), and feature compression of the original image may be understood as compressing the original image from a pixel space to a latent space.

本申请的一些实施例中，上述图像编码器14可以包括卷积模块和自注意力模块，且卷积模块与自注意力模块连接。In some embodiments of the present application, the above-mentioned image encoder 14 may include a convolution module and a self-attention module, and the convolution module is connected to the self-attention module.

本申请的一些实施例中，上述图像编码器14的卷积模块用于对原始图像进行特征压缩，上述图像编码器14的自注意力模块用于在不改变特征维度的基础上，对特征内部不同位置的元素点进行建模，捕捉元素点之间的依赖关系，使得得到的图像特征矩阵能够包括原始图像更多的信息，可以更好地表征原始图像。In some embodiments of the present application, the convolution module of the above-mentioned image encoder 14 is used to perform feature compression on the original image, and the self-attention module of the above-mentioned image encoder 14 is used to model element points at different positions within the feature without changing the feature dimension, so as to capture the dependency between the element points, so that the obtained image feature matrix can include more information of the original image and better characterize the original image.

本申请的一些实施例中，图像编码器14，用于对至少一张原始图像进行特征压缩，得到至少一个图像特征矩阵时，包括：电子设备将至少一张原始图像输入图像编码器14的卷积模块，卷积模块对至少一张原始图像进行初步特征压缩，得到至少一个压缩图像特征矩阵，每个压缩图像特征矩阵的像素点的数量多于每个图像特征矩阵的像素点的数量；卷积模块将至少一个压缩图像特征矩阵输入图像编码器中的自注意力模块，自注意力模块基于每个压缩图像特征矩阵中的像素点之间的关联关系，对每个压缩图像特征矩阵进行特征提取，得到至少一个图像特征矩阵。In some embodiments of the present application, the image encoder 14 is used to perform feature compression on at least one original image to obtain at least one image feature matrix, including: the electronic device inputs at least one original image into the convolution module of the image encoder 14, the convolution module performs preliminary feature compression on at least one original image to obtain at least one compressed image feature matrix, the number of pixels in each compressed image feature matrix is greater than the number of pixels in each image feature matrix; the convolution module inputs at least one compressed image feature matrix into the self-attention module in the image encoder, the self-attention module performs feature extraction on each compressed image feature matrix based on the correlation between the pixels in each compressed image feature matrix to obtain at least one image feature matrix.

本申请的一些实施例中，上述卷积模块可以包括卷积层和非线性函数激活层。对于一个原始图像，该卷积层用于对原始图像进行初步特征压缩，降低原始图像的维度，得到一个压缩图像特征矩阵，该非线性函数激活层用于对该压缩图像特征矩阵中每个像素点进行非线性变换，得到图像特征矩阵。In some embodiments of the present application, the convolution module may include a convolution layer and a nonlinear function activation layer. For an original image, the convolution layer is used to perform preliminary feature compression on the original image, reduce the dimension of the original image, and obtain a compressed image feature matrix, and the nonlinear function activation layer is used to perform nonlinear transformation on each pixel in the compressed image feature matrix to obtain an image feature matrix.

示例性地，上述卷积模块可以是ConvBlock，上述卷积层可以是Conv，上述非线性函数激活层可以是Relu。Exemplarily, the convolution module may be ConvBlock, the convolution layer may be Conv, and the nonlinear function activation layer may be Relu.

需要说明的是，上述卷积层中卷积核的大小、步长和输出通道数是事先定义的。并且，卷积层可以包括多个卷积核，卷积核的数量与输出通道数相同。其中，卷积核的大小包括卷积核的高度和宽度。It should be noted that the size, step size and number of output channels of the convolution kernel in the above convolution layer are predefined. In addition, the convolution layer may include multiple convolution kernels, and the number of convolution kernels is the same as the number of output channels. The size of the convolution kernel includes the height and width of the convolution kernel.

本申请的一些实施例中，原始图像输入上述卷积模块的卷积层后，卷积层中的卷积核按照步长在原始图像上滑动，每次滑动，将卷积核与其覆盖的图像块进行点积运算(即对应位置的元素点的值相乘后求和)，则每次滑动可以得到一个值，该值是当前位置和卷积核的卷积结果，当卷积核滑动完整个原始图像后，可以得到一个二维的输出特征图，即对于每个卷积核都会得到一个二维的输出特征图，将多个卷积核的输出特征图拼接后，可以得到一个压缩图像特征矩阵，该压缩图像特征矩阵的维度是[卷积核的数量，新高度，新宽度]。In some embodiments of the present application, after the original image is input into the convolution layer of the above-mentioned convolution module, the convolution kernel in the convolution layer slides on the original image according to the step size. Each time it slides, the convolution kernel and the image block it covers are dot-producted (i.e., the values of the element points at the corresponding positions are multiplied and then summed). Then, a value can be obtained each time it slides, which is the convolution result of the current position and the convolution kernel. After the convolution kernel slides across the entire original image, a two-dimensional output feature map can be obtained, that is, a two-dimensional output feature map will be obtained for each convolution kernel. After splicing the output feature maps of multiple convolution kernels, a compressed image feature matrix can be obtained. The dimension of the compressed image feature matrix is [number of convolution kernels, new height, new width].

需要说明的是，上述卷积层中卷积核的数量是该卷积层的输出通道数，上述新高度和新宽度取决于原始图像的大小、卷积核的大小和步长。示例性地，新高度和新宽度可以通过以下公式计算：新高度＝(原始图像的高度-卷积核高度)+2/步长+1；新宽度＝(样本图像的宽度-卷积核宽度)+2/步长+1。It should be noted that the number of convolution kernels in the above convolution layer is the number of output channels of the convolution layer, and the above new height and new width depend on the size of the original image, the size of the convolution kernel and the step size. Exemplarily, the new height and new width can be calculated by the following formula: new height = (height of the original image - height of the convolution kernel) + 2/step size + 1; new width = (width of the sample image - width of the convolution kernel) + 2/step size + 1.

本申请的一些实施例中，上述卷积层将压缩图像特征矩阵输入非线性函数激活层，非线性函数激活层对该压缩图像特征矩阵中每个像素点的值进行非线性变换。具体地，对于该压缩图像特征矩阵中的每个像素点，若该像素点的值小于0，则将该像素点的值替换为0，若该像素点的值大于0，则该像素点的值保持不变，最终输出图像特征矩阵。In some embodiments of the present application, the convolution layer inputs the compressed image feature matrix into the nonlinear function activation layer, and the nonlinear function activation layer performs a nonlinear transformation on the value of each pixel in the compressed image feature matrix. Specifically, for each pixel in the compressed image feature matrix, if the value of the pixel is less than 0, the value of the pixel is replaced with 0, and if the value of the pixel is greater than 0, the value of the pixel remains unchanged, and finally the image feature matrix is output.

本申请的一些实施例中，上述卷积模块可以包括多个卷积层和多个非线性函数激活层，且每个卷积层连接一个非线性函数激活层，即卷积层的数量与非线性函数激活层的数量相同。In some embodiments of the present application, the above-mentioned convolution module may include multiple convolution layers and multiple nonlinear function activation layers, and each convolution layer is connected to a nonlinear function activation layer, that is, the number of convolution layers is the same as the number of nonlinear function activation layers.

示例性地，如图8所示，是本申请的一些实施例提供的卷积模块的结构示意图，在图8中，卷积模块包括三个卷积层和三个非线性函数激活层，三个卷积层分别是卷积层1_1(Conv1_1)、卷积层1_2(Conv1_2)和卷积层1_3(Conv1_3)，三个非线性函数激活层分别是非线性函数激活层1_1(Relu1_1)、非线性函数激活层1_2(Relu1_2)和非线性函数激活层1_3(Relu1_3)。卷积层1_1的卷积核大小为3x3，步长为2，输出通道数为4，卷积核的数量也为4，其后连接一个非线性函数激活层1_1，非线性函数激活层1_1后连接一个卷积层1_2，卷积层1_2的卷积核大小为3x3，步长为1，输出通道数为4，卷积核的数量也为4，其后连接一个非线性函数激活层1_2，非线性函数激活层1_2后连接一个卷积层1_3，卷积层1_3的卷积核大小为3x3，步长为1，输出通道数为4，卷积核的数量也为4，其后接一个非线性函数激活层1_3。Exemplarily, as shown in Figure 8, it is a structural diagram of a convolution module provided in some embodiments of the present application. In Figure 8, the convolution module includes three convolution layers and three nonlinear function activation layers. The three convolution layers are convolution layer 1_1 (Conv1_1), convolution layer 1_2 (Conv1_2) and convolution layer 1_3 (Conv1_3), and the three nonlinear function activation layers are nonlinear function activation layer 1_1 (Relu1_1), nonlinear function activation layer 1_2 (Relu1_2) and nonlinear function activation layer 1_3 (Relu1_3). The convolution kernel size of convolution layer 1_1 is 3x3, the stride is 2, the number of output channels is 4, and the number of convolution kernels is also 4. It is then connected to a nonlinear function activation layer 1_1. The nonlinear function activation layer 1_1 is then connected to a convolution layer 1_2. The convolution kernel size of convolution layer 1_2 is 3x3, the stride is 1, the number of output channels is 4, and the number of convolution kernels is also 4. It is then connected to a nonlinear function activation layer 1_2. The nonlinear function activation layer 1_2 is then connected to a convolution layer 1_3. The convolution kernel size of convolution layer 1_3 is 3x3, the stride is 1, the number of output channels is 4, and the number of convolution kernels is also 4. It is then connected to a nonlinear function activation layer 1_3.

需要说明的是，上述卷积层1_1主要用于提取原始图像中低级别的特征，如边缘、颜色、纹理等。上述卷积层1_2主要用于在前一层提取的低级别的特征的基础上，进一步组合特征以提取更复杂的图像结构，如纹理或简单的形状等。上述卷积层1_3主要用于在前面两层提取的特征的基础上，提取更高级别的语义信息，如实体类别、场景、实体的内容等。It should be noted that the above convolution layer 1_1 is mainly used to extract low-level features in the original image, such as edges, colors, textures, etc. The above convolution layer 1_2 is mainly used to further combine features based on the low-level features extracted by the previous layer to extract more complex image structures, such as textures or simple shapes, etc. The above convolution layer 1_3 is mainly used to extract higher-level semantic information, such as entity categories, scenes, entity contents, etc., based on the features extracted by the previous two layers.

示例性地，在图8中，以原始图像的维度是[3，512，512]为例，将维度为[3，512，512]的原始图像输入卷积层1_1做特征提取，可以提取到原始图像的边缘和线条方向等，得到维度为[4，256，256]的特征矩阵并输出，再将该特征矩阵输入非线性函数激活层1_1对每个像素点的值做非线性变换，输出维度为[4，256，256]的特征矩阵，再将该特征矩阵输入卷积层1_2做进一步特征提取，可以提取到原始图像中更大的纹理块或简单的形状，如斑点、条纹、角等等，得到维度为[4，256，256]的特征矩阵并输出，再将该特征矩阵输入非线性函数激活层1_2对每个像素点的值做非线性变换，输出维度为[4，256，256]的特征矩阵，再将该特征矩阵输入卷积层1_3做进一步特征提取，可以提取到原始图像中实体的具体形状、类别等，得到维度为[4，256，256]的特征矩阵并输出，再将该特征矩阵输入非线性函数激活层1_3对每个像素点的值做非线性变换，输出维度为[4，256，256]的特征矩阵。Exemplarily, in FIG8 , taking the dimension of the original image as [3, 512, 512] as an example, the original image with the dimension of [3, 512, 512] is input into the convolution layer 1_1 for feature extraction, and the edges and line directions of the original image can be extracted, and a feature matrix with the dimension of [4, 256, 256] is obtained and output, and then the feature matrix is input into the nonlinear function activation layer 1_1 to perform a nonlinear transformation on the value of each pixel point, and a feature matrix with the dimension of [4, 256, 256] is output, and then the feature matrix is input into the convolution layer 1_2 for further feature extraction, and larger texture blocks or simple shapes such as spots in the original image can be extracted. , stripes, corners, etc., to obtain a feature matrix with a dimension of [4, 256, 256] and output it, then input the feature matrix into the nonlinear function activation layer 1_2 to perform a nonlinear transformation on the value of each pixel, and output a feature matrix with a dimension of [4, 256, 256]. Then input the feature matrix into the convolution layer 1_3 for further feature extraction, and the specific shape and category of the entity in the original image can be extracted, and a feature matrix with a dimension of [4, 256, 256] is obtained and output, then input the feature matrix into the nonlinear function activation layer 1_3 to perform a nonlinear transformation on the value of each pixel, and output a feature matrix with a dimension of [4, 256, 256].

本申请的一些实施例中，上述图像编码器14的自注意力模块包括4层TansformerBlock。如此，可以在每一层提取到压缩图像特征矩阵中像素点之间不同层次的关联关系，使得关联关系的提取更加丰富，得到特征信息更加丰富的图像特征矩阵。In some embodiments of the present application, the self-attention module of the above-mentioned image encoder 14 includes 4 layers of TansformerBlock. In this way, different levels of association relationships between pixels in the compressed image feature matrix can be extracted at each layer, so that the extraction of association relationships is richer, and an image feature matrix with richer feature information is obtained.

需要说明的是，Tansformer Block的结构示意图参见上述图3，上述图像编码器14的自注意力模块对至少一个压缩图像特征矩阵做处理的过程与上述第一条件编码器111的自注意力模块对像素重排特征向量做处理的过程相同，具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that, for the structural diagram of Tansformer Block, please refer to Figure 3 above. The process by which the self-attention module of the above-mentioned image encoder 14 processes at least one compressed image feature matrix is the same as the process by which the self-attention module of the above-mentioned first conditional encoder 111 processes the pixel rearrangement feature vector. The specific implementation can be found in the relevant description of the above-mentioned embodiment, and this application will not go into details here.

需要说明的是，上述图像编码器14可以包括多个卷积模块和多个自注意力模块，且卷积模块的数量比自注意力模块的数量多一个，在不同的卷积模块对特征矩阵进行不同程度的特征压缩，而在不同的自注意力模块，对不同维度的特征矩阵中像素点之间的关联关系进行提取。It should be noted that the above-mentioned image encoder 14 may include multiple convolution modules and multiple self-attention modules, and the number of convolution modules is one more than the number of self-attention modules. Different convolution modules perform different degrees of feature compression on the feature matrix, and different self-attention modules extract the correlation between pixel points in feature matrices of different dimensions.

示例性地，如图9所示，是本申请的一些实施例提供的图像编码器的结构示意图，在图9中，图像编码器包括三个卷积模块和两个自注意力模块，该三个卷积模块分别是卷积模块1(ConvBlock1)、卷积模块2(ConvBlock2)和卷积模块3(ConvBlock3)，两个自注意力模块分别是包括4层Transformer Block的自注意力模块1和包括6层Transformer Block的自注意力模块2。并且，卷积模块1、卷积模块2和卷积模块3的作用是降低特征矩阵的维度，卷积核的通道数不同，卷积模块1的通道数为4，卷积模块2的通道数为8，卷积模块3的通道数为16。Exemplarily, as shown in FIG9, it is a schematic diagram of the structure of an image encoder provided by some embodiments of the present application. In FIG9, the image encoder includes three convolution modules and two self-attention modules. The three convolution modules are convolution module 1 (ConvBlock1), convolution module 2 (ConvBlock2) and convolution module 3 (ConvBlock3). The two self-attention modules are self-attention module 1 including 4 layers of Transformer Block and self-attention module 2 including 6 layers of Transformer Block. In addition, the functions of convolution module 1, convolution module 2 and convolution module 3 are to reduce the dimension of the feature matrix. The number of channels of the convolution kernel is different. The number of channels of convolution module 1 is 4, the number of channels of convolution module 2 is 8, and the number of channels of convolution module 3 is 16.

示例性地，在图9中，电子设备将维度为[3，512，512]的原始图像输入卷积模块1进行特征压缩，得到维度为[4，256，256]的特征矩阵，再将该维度为[4，256，256]的压缩图像特征矩阵输入自注意力模块1，得到维度为[4，256，256]的特征矩阵，再将该维度为[4，256，256]的特征矩阵输入卷积模块2进行进一步特征压缩，得到维度为[8，128，128]的压缩图像特征矩阵，再将该维度为[8，128，128]的特征矩阵输入自注意力模块12，得到维度为[8，128，128]的特征矩阵，再将该维度为[8，128，128]的特征矩阵输入卷积模块3进行进一步特征压缩，得到维度为[16，64，64]的图像特征矩阵。Exemplarily, in Figure 9, the electronic device inputs the original image with a dimension of [3, 512, 512] into the convolution module 1 for feature compression to obtain a feature matrix with a dimension of [4, 256, 256], and then inputs the compressed image feature matrix with a dimension of [4, 256, 256] into the self-attention module 1 to obtain a feature matrix with a dimension of [4, 256, 256], and then inputs the feature matrix with a dimension of [4, 256, 256] into the convolution module 2 for further feature compression to obtain a compressed image feature matrix with a dimension of [8, 128, 128], and then inputs the feature matrix with a dimension of [8, 128, 128] into the self-attention module 12 to obtain a feature matrix with a dimension of [8, 128, 128], and then inputs the feature matrix with a dimension of [8, 128, 128] into the convolution module 3 for further feature compression to obtain an image feature matrix with a dimension of [16, 64, 64].

如此，在加噪模块12对原始图像进行加噪处理之前，通过图像编码器14对原始图像进行特征压缩，再对特征压缩后得到的图像特征矩阵做加噪处理，能够减少加噪模块12执行加噪处理的数据量，提高处理效率。In this way, before the denoising module 12 performs denoising on the original image, the image encoder 14 performs feature compression on the original image, and then performs denoising on the image feature matrix obtained after feature compression, which can reduce the amount of data that the denoising module 12 performs denoising on and improve processing efficiency.

本申请的一些实施例中，加噪模块12，具体用于基于高斯噪声矩阵和至少一张原始图像的掩码图像，对至少一个图像特征矩阵进行加噪处理，得到至少一张噪声图像。In some embodiments of the present application, the noise adding module 12 is specifically used to perform noise adding processing on at least one image feature matrix based on a Gaussian noise matrix and a mask image of at least one original image to obtain at least one noise image.

本申请的一些实施例中，一张噪声图像对应一张原始图像的掩码图像和一个图像特征矩阵。In some embodiments of the present application, a noise image corresponds to a mask image of an original image and an image feature matrix.

本申请的一些实施例中，图像编码器14将至少一个图像特征矩阵输入加噪模块12，加噪模块12基于高斯噪声矩阵和至少一张原始图像的掩码图像，通过上述公式(1)和公式(2)，对至少一个图像特征矩阵进行加噪处理，得到至少一张噪声图像。其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。In some embodiments of the present application, the image encoder 14 inputs at least one image feature matrix into the noise adding module 12. The noise adding module 12 performs noise adding processing on the at least one image feature matrix based on the Gaussian noise matrix and the mask image of at least one original image through the above formula (1) and formula (2) to obtain at least one noise image. The specific implementation thereof can refer to the relevant description of the above embodiment, and the present application will not repeat it here.

本申请的一些实施例中，上述图像解码器15，用于对至少一张衍生图像进行解码，得到分辨率更高的解码后的衍生图像。In some embodiments of the present application, the image decoder 15 is used to decode at least one derivative image to obtain a decoded derivative image with a higher resolution.

示例性地，上述图像解码器15可以是任何能够实现解码功能，即将一个低维的隐向量映射到一个高维的输出的解码器。Exemplarily, the above-mentioned image decoder 15 can be any decoder that can realize the decoding function, that is, map a low-dimensional latent vector to a high-dimensional output.

例如，上述图像解码器15可以是VAE解码器(VAE-Decoder)。For example, the image decoder 15 may be a VAE decoder (VAE-Decoder).

示例性地，以衍生图像的维度是[16，64，64]为例，通过图像解码器15对衍生图像解码后，可以得到维度为[3，512，512]的解码后的衍生图像。Exemplarily, taking the dimension of the derived image as [16, 64, 64], after the derived image is decoded by the image decoder 15, a decoded derived image with a dimension of [3, 512, 512] can be obtained.

本申请的一些实施例中，上述图像解码器15可以包括反卷积模块和自注意力模块，将衍生图像输入反卷积模块，增大特征的分辨率，得到衍生图像特征矩阵，将衍生图像特征矩阵输入自注意力模块，基于衍生图像特征矩阵中各像素点之间的关联关系，对衍生图像特征矩阵进行特征提取，得到解码后的衍生图像。In some embodiments of the present application, the above-mentioned image decoder 15 may include a deconvolution module and a self-attention module. The derived image is input into the deconvolution module to increase the resolution of the features and obtain a derived image feature matrix. The derived image feature matrix is input into the self-attention module. Based on the correlation between each pixel point in the derived image feature matrix, the feature of the derived image feature matrix is extracted to obtain a decoded derivative image.

本申请的一些实施例中，上述反卷积模块可以包括反卷积层、非线性函数激活层和卷积层。In some embodiments of the present application, the deconvolution module may include a deconvolution layer, a nonlinear function activation layer and a convolution layer.

需要说明的是，上述反卷积层用于增大输入的特征矩阵的分辨率。上述非线性函数激活层用于对输入的特征矩阵进行非线性变换。上述卷积层用于对输入的特征矩阵进行特征提取。It should be noted that the above deconvolution layer is used to increase the resolution of the input feature matrix. The above nonlinear function activation layer is used to perform nonlinear transformation on the input feature matrix. The above convolution layer is used to extract features from the input feature matrix.

本申请的一些实施例中，反卷积层和卷积层中卷积核的大小、步长和输出通道数均是事先定义的。In some embodiments of the present application, the size, step size and number of output channels of the convolution kernel in the deconvolution layer and the convolution layer are all defined in advance.

示例性地，如图10所示，是本申请的一些实施例提供的反卷积模块的结构示意图，在图10中，反卷积模块包括一个反卷积层、两个卷积层和三个非线性函数激活层，一个反卷积层是反卷积层2_1(Deconv2_1)，两个卷积层分别是卷积层2_2(Conv2_2)和卷积层2_3(Conv2_3)，三个非线性激活函数层分别是非线性函数激活层2_1(Relu2_1)、非线性函数激活层2_2(Relu2_2)和非线性函数激活层2_3(Relu2_3)。并且，反卷积层2_1的卷积核大小为3x3，步长为2，输出通道数为16，其后连接一个非线性函数激活层2_1层，非线性函数激活层2_1后连接一个卷积层2_2，卷积层2_2的卷积核大小为3x3，步长为1，输出通道数为8，其后连接一个非线性函数激活层2_2，非线性函数激活层2_2后连接一个卷积层2_3，卷积层2_3的卷积核大小为3x3，步长为1，输出通道数为8，其后连接一个非线性函数激活层2_3。Exemplarily, as shown in Figure 10, it is a structural diagram of a deconvolution module provided by some embodiments of the present application. In Figure 10, the deconvolution module includes a deconvolution layer, two convolution layers and three nonlinear function activation layers. One deconvolution layer is a deconvolution layer 2_1 (Deconv2_1), the two convolution layers are convolution layer 2_2 (Conv2_2) and convolution layer 2_3 (Conv2_3), and the three nonlinear activation function layers are nonlinear function activation layer 2_1 (Relu2_1), nonlinear function activation layer 2_2 (Relu2_2) and nonlinear function activation layer 2_3 (Relu2_3). In addition, the convolution kernel size of the deconvolution layer 2_1 is 3x3, the stride is 2, and the number of output channels is 16. It is then connected to a nonlinear function activation layer 2_1. The nonlinear function activation layer 2_1 is then connected to a convolution layer 2_2. The convolution kernel size of the convolution layer 2_2 is 3x3, the stride is 1, and the number of output channels is 8. It is then connected to a nonlinear function activation layer 2_2. The nonlinear function activation layer 2_2 is then connected to a convolution layer 2_3. The convolution kernel size of the convolution layer 2_3 is 3x3, the stride is 1, and the number of output channels is 8. It is then connected to a nonlinear function activation layer 2_3.

示例性地，在图10中，以衍生图像的维度是[16，64，64]为例，将维度为[16，64，64]的衍生图像输入反卷积层2_1做反卷积操作，得到维度为[8，128，128]的特征矩阵并输出，再将该特征矩阵输入非线性函数激活层2_1层对每个像素点的值做非线性变换，输出维度为[8，128，128]的特征矩阵，再将该特征矩阵输入卷积层2_2做特征提取，得到维度为[8，128，128]的特征矩阵并输出，再将该特征矩阵输入非线性函数激活层2_2对每个像素点的值做非线性变换，输出维度为[8，128，128]的特征矩阵，再将该特征矩阵输入卷积层2_3做进一步特征提取，得到维度为[8，128，128]的特征矩阵并输出，再将该特征矩阵输入非线性函数激活层2_3对每个像素点的值做非线性变换，输出维度为[8，128，128]的特征矩阵。Exemplarily, in FIG10 , taking the derivative image whose dimension is [16, 64, 64] as an example, the derivative image whose dimension is [16, 64, 64] is input into the deconvolution layer 2_1 for deconvolution operation, and a feature matrix whose dimension is [8, 128, 128] is obtained and output, and then the feature matrix is input into the nonlinear function activation layer 2_1 to perform nonlinear transformation on the value of each pixel point, and a feature matrix whose dimension is [8, 128, 128] is output, and then the feature matrix is input into the convolution layer 2_2 for feature extraction, and a feature matrix whose dimension is [8, 128, 128] and output it, then input the feature matrix into the nonlinear function activation layer 2_2 to perform nonlinear transformation on the value of each pixel point, and output a feature matrix with a dimension of [8, 128, 128], then input the feature matrix into the convolution layer 2_3 for further feature extraction, and obtain a feature matrix with a dimension of [8, 128, 128] and output it, then input the feature matrix into the nonlinear function activation layer 2_3 to perform nonlinear transformation on the value of each pixel point, and output a feature matrix with a dimension of [8, 128, 128].

本申请的一些实施例中，上述图像解码器可以包括多个反卷积模块和多个自注意力模块，且反卷积模块的数量比自注意力模块的数量多一个，在不同的反卷积模块对特征矩阵进行不同程度的维度增大，而在不同的自注意力模块，对不同维度的特征矩阵中像素点之间的关联关系进行提取。In some embodiments of the present application, the above-mentioned image decoder may include multiple deconvolution modules and multiple self-attention modules, and the number of deconvolution modules is one more than the number of self-attention modules. In different deconvolution modules, the dimension of the feature matrix is increased to different degrees, and in different self-attention modules, the correlation relationship between pixel points in feature matrices of different dimensions is extracted.

示例性地，如图11所示，是本申请的一些实施例提供的图像解码器的结构示意图，在图11中，图像解码器包括三个反卷积模块和两个自注意力模块，该三个反卷积模块分别是反卷积模块1(DeConvBlock1)、反卷积模块2(DeConvBlock2)和反卷积模块3(DeConvBlock3)，该两个自注意力模块分别是包括6层Transformer Block的自注意力模块3和包括4层Transformer Block的的自注意力模块4。并且，反卷积模块1、反卷积模块2和反卷积模块3的作用是增加特征分辨率，只有卷积核的通道数上有差异，反卷积模块1的通道数为16，反卷积模块2的通道数为8，反卷积模块3的通道数为4。Exemplarily, as shown in FIG. 11, it is a schematic diagram of the structure of an image decoder provided by some embodiments of the present application. In FIG. 11, the image decoder includes three deconvolution modules and two self-attention modules. The three deconvolution modules are respectively deconvolution module 1 (DeConvBlock1), deconvolution module 2 (DeConvBlock2) and deconvolution module 3 (DeConvBlock3). The two self-attention modules are respectively self-attention module 3 including 6 layers of Transformer Block and self-attention module 4 including 4 layers of Transformer Block. Moreover, the function of deconvolution module 1, deconvolution module 2 and deconvolution module 3 is to increase the feature resolution. There is only a difference in the number of channels of the convolution kernel. The number of channels of deconvolution module 1 is 16, the number of channels of deconvolution module 2 is 8, and the number of channels of deconvolution module 3 is 4.

示例性地，在图11中，上述图像解码器对衍生图像解码得到解码后的衍生图像的实现过程如下：将维度为[16，64，64]的衍生图像输入反卷积模块1增加特征分辨率，得到维度为[8，128，128]的特征矩阵，再将该维度为[8，128，128]的特征矩阵输入自注意力模块3，得到维度为[8，128，128]的特征矩阵，再将该维度为[8，128，128]的特征矩阵输入反卷积模块2进一步增加特征分辨率，得到维度为[4，256，256]的特征矩阵，再将该维度为[4，256，256]的特征矩阵输入自注意力模块4，得到维度为[4，256，256]的特征矩阵，再将该维度为[4，256，256]的特征矩阵输入反卷积模块3进一步增加特征分辨率，得到维度为[3，512，512]的解码后的衍生图像。Exemplarily, in Figure 11, the implementation process of the above-mentioned image decoder decoding the derivative image to obtain the decoded derivative image is as follows: the derivative image with a dimension of [16, 64, 64] is input into the deconvolution module 1 to increase the feature resolution, and a feature matrix with a dimension of [8, 128, 128] is obtained, and then the feature matrix with a dimension of [8, 128, 128] is input into the self-attention module 3 to obtain a feature matrix with a dimension of [8, 128, 128], and then the feature matrix with a dimension of [8, 128, 128] is input into the deconvolution module 2 to further increase the feature resolution, and a feature matrix with a dimension of [4, 256, 256] is obtained, and then the feature matrix with a dimension of [4, 256, 256] is input into the self-attention module 4 to obtain a feature matrix with a dimension of [4, 256, 256], and then the feature matrix with a dimension of [4, 256, 256] is input into the deconvolution module 3 to further increase the feature resolution, and a decoded derivative image with a dimension of [3, 512, 512] is obtained.

需要说明的是，上述图像解码器15的自注意力模块包括多个Tansformer Block，Tansformer Block的结构示意图参见上述图3。上述图像解码器15的自注意力模块特征矩阵做处理的过程与上述第一条件编码器111的自注意力模块对像素重排特征向量做处理的过程相同，具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the self-attention module of the above-mentioned image decoder 15 includes multiple Tansformer Blocks, and the structural diagram of Tansformer Block is shown in the above-mentioned Figure 3. The process of processing the feature matrix of the self-attention module of the above-mentioned image decoder 15 is the same as the process of processing the pixel rearrangement feature vector by the self-attention module of the above-mentioned first conditional encoder 111. The specific implementation can refer to the relevant description of the above-mentioned embodiment, and this application will not repeat it here.

如此，图像生成模型包括图像编码器和图像解码器，在加噪模块做加噪处理之前通过图像编码器对原始图像进行特征压缩，减少了加噪处理过程的计算量和后续生成衍生图像过程中的数据处理量，提高了处理效率，并且在得到衍生图像后通过图像解码器解码，得到分辨率更高的解码后的衍生图像。In this way, the image generation model includes an image encoder and an image decoder. The image encoder is used to compress the features of the original image before the denoising module performs denoising processing, thereby reducing the amount of calculation in the denoising process and the amount of data processing in the subsequent generation of derivative images, thereby improving processing efficiency. After the derivative image is obtained, it is decoded by the image decoder to obtain a decoded derivative image with higher resolution.

示例性地，如图2所示，图像生成模型包括条件编码器11、图像编码器14、加噪模块12、扩散模型13和图像解码器15，条件编码器11包括第一条件编码器111和第二条件编码器112，第二条件编码器112包括文本编码器small和文本编码器big。以一个原始图像、一个原始图像的描述文本和一个原始图像的掩码图像为例，对图像生成模型的数据处理过程进行描述：电子设备将原始图像输入第一条件编码器111进行编码，得到一项图像特征信息，将描述文本输入文本编码器small和文本编码器big分别编码，得到两种文本特征信息，统称为一项文本特征信息，再根据控制条件、一项图像特征信息和一项文本特征信息，确定条件特征信息，条件编码器11将条件特征信息输入扩散模型13。电子设备将原始图像输入图像编码器14进行特征压缩，得到图像特征矩阵输入加噪模块12，电子设备将掩码图像输入加噪模块12，加噪模块12基于高斯噪声矩阵和掩码图像对图像特征矩阵进行加噪处理，得到噪声图像，加噪模块12将噪声图像输入扩散模型13。扩散模型13基于条件编码器11得到的条件特征信息、加噪模块12得到的噪声图像和加噪向量，生成衍生图像，扩散模型13将衍生图像输入图像解码器15，图像解码器15对衍生图像进行解码，得到解码后的衍生图像。Exemplarily, as shown in FIG2 , the image generation model includes a conditional encoder 11, an image encoder 14, a noise adding module 12, a diffusion model 13 and an image decoder 15. The conditional encoder 11 includes a first conditional encoder 111 and a second conditional encoder 112. The second conditional encoder 112 includes a text encoder small and a text encoder big. Taking an original image, a description text of the original image and a mask image of the original image as an example, the data processing process of the image generation model is described: the electronic device inputs the original image into the first conditional encoder 111 for encoding to obtain an image feature information, inputs the description text into the text encoder small and the text encoder big for encoding respectively, and obtains two types of text feature information, collectively referred to as a text feature information, and then determines the conditional feature information according to the control condition, an image feature information and a text feature information, and the conditional encoder 11 inputs the conditional feature information into the diffusion model 13. The electronic device inputs the original image into the image encoder 14 for feature compression, obtains the image feature matrix and inputs it into the denoising module 12, the electronic device inputs the mask image into the denoising module 12, the denoising module 12 performs denoising processing on the image feature matrix based on the Gaussian noise matrix and the mask image to obtain a noise image, and the denoising module 12 inputs the noise image into the diffusion model 13. The diffusion model 13 generates a derivative image based on the conditional feature information obtained by the conditional encoder 11, the noise image obtained by the denoising module 12, and the noise vector, the diffusion model 13 inputs the derivative image into the image decoder 15, and the image decoder 15 decodes the derivative image to obtain a decoded derivative image.

本申请的一些实施例提供的图像生成模型，根据图像生成任务的类型确定条件特征信息，以控制图像生成模型执行不同的图像生成任务，并用高斯噪声矩阵和掩码图像对原始图像进行加噪处理，使得图像生成模型不仅能够具备从噪声中生成图像的能力，也能够具备预测、消除或修改掩码图像对应的掩码区域的图像内容的能力，使得图像生成模型能够适应不同类型的图像生成任务，如文生图、图生图、文加图生图、图像局部修改、图像消除、图像外扩等等。如此，一个图像生成模型具备了处理多种类型的图像生成任务的能力，在处理不同类型的图像生成任务时不需要再调用多个不同的模型，提高了处理图像生成任务的效率。The image generation model provided by some embodiments of the present application determines conditional feature information according to the type of image generation task to control the image generation model to perform different image generation tasks, and uses a Gaussian noise matrix and a mask image to perform noise processing on the original image, so that the image generation model can not only have the ability to generate images from noise, but also have the ability to predict, eliminate or modify the image content of the mask area corresponding to the mask image, so that the image generation model can adapt to different types of image generation tasks, such as text-generated images, image-generated images, text-plus-image-generated images, local image modification, image elimination, image expansion, etc. In this way, an image generation model has the ability to handle multiple types of image generation tasks, and there is no need to call multiple different models when handling different types of image generation tasks, thereby improving the efficiency of processing image generation tasks.

本申请的一些实施例提供的图像生成模型的训练方法，执行主体可以为图像生成模型的训练装置。示例性地，该图像生成模型的训练装置可以为电子设备，也可以为该电子设备中的部件，例如集成电路或芯片，具体的可以根据实际使用需求确定，本申请的一些实施例不作限定。以下以图像生成模型的训练装置为电子设备，以电子设备执行图像生成模型的训练方法为例，对本申请的一些实施例提供的图像生成模型的训练方法进行示例性说明。Some embodiments of the present application provide a training method for an image generation model, and the executing entity may be a training device for the image generation model. Exemplarily, the training device for the image generation model may be an electronic device, or a component in the electronic device, such as an integrated circuit or a chip. The specific one may be determined based on actual usage requirements, and some embodiments of the present application do not limit this. The following takes the training device for the image generation model as an electronic device, and takes the electronic device executing the training method for the image generation model as an example, to exemplify the training method for the image generation model provided in some embodiments of the present application.

需要说明的是，本申请的一些实施例提供的图像生成模型的训练方法训练得到的图像生成模型是上述实施例所述的图像生成模型。It should be noted that the image generation model trained by the image generation model training method provided in some embodiments of the present application is the image generation model described in the above embodiments.

图12是本申请的一些实施例提供的图像生成模型的训练方法的流程示意图，如图12所示，本申请的一些实施例提供的图像生成模型的训练方法可以包括以下步骤101至步骤104。Figure 12 is a flow chart of the training method of the image generation model provided by some embodiments of the present application. As shown in Figure 12, the training method of the image generation model provided by some embodiments of the present application may include the following steps 101 to 104.

步骤101、电子设备获取至少两个训练样本组。Step 101: The electronic device obtains at least two training sample groups.

本申请的一些实施例中，上述至少两个训练样本组中每个训练样本组包括一张样本图像、每张样本图像的描述文本和掩码图像。In some embodiments of the present application, each of the at least two training sample groups includes a sample image, description text of each sample image, and a mask image.

本申请的一些实施例中，上述样本图像可以是包括至少一个实体的图像，且样本图像的数量可以是至少一个。In some embodiments of the present application, the sample image may be an image including at least one entity, and the number of the sample images may be at least one.

示例性地，样本图像可以包括植物、动物、人物、山、水、房子、电脑等实体中的至少一个。Exemplarily, the sample image may include at least one of entities such as plants, animals, people, mountains, water, houses, computers, etc.

本申请的一些实施例中，上述样本图像的描述文本可以用于描述该样本图像的图像主题。In some embodiments of the present application, the description text of the sample image may be used to describe the image theme of the sample image.

示例性地，样本图像的描述文本可以是“一个男人和一个女人站在草坪前，比出点赞的手势”。For example, the description text of the sample image may be “a man and a woman standing in front of a lawn, making a thumbs-up gesture”.

本申请的一些实施例中，上述样本图像的描述文本可以用于描述该样本图像的图像主题以及对该样本图像执行的处理操作。In some embodiments of the present application, the description text of the sample image may be used to describe the image theme of the sample image and the processing operation performed on the sample image.

示例性地，样本图像的描述文本可以是“一个男人和一个女人站在草坪前，比出点赞的手势。将图中的男人删除”。For example, the description text of the sample image may be "A man and a woman stand in front of a lawn, making a thumbs-up gesture. Delete the man in the picture."

本申请的一些实施例中，上述样本图像的描述文本还可以用于描述该样本图像的图像主题以及该样本图像中掩码区域的图像内容。In some embodiments of the present application, the description text of the sample image may also be used to describe the image theme of the sample image and the image content of the mask area in the sample image.

示例性地，样本图像的描述文本可以是“一个男人和一个女人站在草坪前，比出点赞的手势，图中有男人、女人、气球、海报、帽子”。可以理解的是，该样本图像的掩码区域包括男人、女人、气球、海报和帽子在该样本图像中对应的图像区域。For example, the description text of the sample image may be “a man and a woman standing in front of a lawn, making a thumbs-up gesture, and the picture contains a man, a woman, a balloon, a poster, and a hat.” It is understandable that the mask area of the sample image includes the image areas corresponding to the man, the woman, the balloon, the poster, and the hat in the sample image.

需要说明的是，本申请的一些实施例中样本图像的掩码图像用于表示该样本图像中的掩码区域。可以理解的是，获取到样本图像的掩码图像后便可以确定该样本图像中被掩码的掩码区域。It should be noted that in some embodiments of the present application, the mask image of the sample image is used to represent the mask area in the sample image. It is understandable that after the mask image of the sample image is acquired, the mask area in the sample image can be determined.

需要说明的是，上述样本图像的掩码图像与上述实施例中原始图像的掩码图像的形式相同，关于样本图像的掩码图像的描述可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the mask image of the sample image is in the same form as the mask image of the original image in the above embodiment. For the description of the mask image of the sample image, please refer to the relevant description of the above embodiment, and this application will not repeat it here.

本申请的一些实施例中，电子设备可以使用按分割实体构造和按尺寸信息构造两种方式来构建训练样本组。其中，实体可以是样本图像中实际存在的对象，如人、动物、植物、桌子、电脑等等，也可以是样本图像中实际存在的感兴趣对象，如样本图像的主体。In some embodiments of the present application, the electronic device can construct a training sample group using two methods: constructing by segmented entities and constructing by size information. The entity can be an object actually existing in the sample image, such as a person, an animal, a plant, a table, a computer, etc., or an object of interest actually existing in the sample image, such as the subject of the sample image.

需要说明的是，电子设备按分割实体构造可以理解为对样本图像进行实体识别并分割，将样本图像分割得到的每个实体的图像区域作为一个掩码区域，并基于掩码区域生成该样本图像的掩码图像，进而构建训练样本组。It should be noted that the electronic device constructed by segmenting entities can be understood as entity recognition and segmentation of the sample image, using the image area of each entity obtained by segmenting the sample image as a mask area, and generating a mask image of the sample image based on the mask area, thereby constructing a training sample group.

需要说明的是，电子设备按照尺寸信息构造可以理解为不考虑样本图像中有哪些实体，而是按照尺寸信息在样本图像中随机确定掩码区域，并根据确定的掩码区域生成该样本图像的掩码图像，进而构建训练样本组。It should be noted that the electronic device constructed according to the size information can be understood as not considering which entities are in the sample image, but randomly determining the mask area in the sample image according to the size information, and generating a mask image of the sample image based on the determined mask area, thereby constructing a training sample group.

本申请的一些实施例中，电子设备可以使用按分割实体构造的方式构建训练样本组。In some embodiments of the present application, the electronic device may construct a training sample group in a manner of constructing by segmenting entities.

示例性地，结合图12，如图13所示，在上述步骤101之前，本申请的一些实施例提供的图像生成模型的训练方法还可以包括以下步骤201至步骤204。Exemplarily, in combination with FIG. 12 , as shown in FIG. 13 , before the above-mentioned step 101 , the training method of the image generation model provided by some embodiments of the present application may further include the following steps 201 to 204 .

步骤201、电子设备获取至少两张样本图像、每张样本图像的图像描述信息和每张样本图像的至少两个掩码区域的标注信息。Step 201: The electronic device obtains at least two sample images, image description information of each sample image, and annotation information of at least two mask regions of each sample image.

本申请的一些实施例中，上述样本图像的图像描述信息用于描述该样本图像的图像主题。In some embodiments of the present application, the image description information of the sample image is used to describe the image theme of the sample image.

本申请的一些实施例中，每张样本图像的至少两个掩码区域为通过分割一切模型对每张样本图像进行掩码区域的预测、并基于预测结果对每张样本图像分割得到的。In some embodiments of the present application, at least two mask regions of each sample image are obtained by predicting the mask regions of each sample image using a segmentation model and segmenting each sample image based on the prediction results.

本申请的一些实施例中，开源的分割数据集包括数千万的图像以及数十亿的分割的掩码区域，实体丰富且信息密度高。因此，电子设备可以从开源的分割数据集中获取样本图像、样本图像的图像描述信息和多个掩码区域的标注信息。在该种情况下，分割数据集中样本图像对应的多个掩码区域是通过将样本图像输入分割一切模型(Segment AnythingModel，SAM)后，由SAM做实体分割得到的，样本图像的图像描述信息是样本图像携带的。In some embodiments of the present application, the open source segmentation dataset includes tens of millions of images and billions of segmented mask areas, which are rich in entities and have high information density. Therefore, the electronic device can obtain sample images, image description information of sample images, and annotation information of multiple mask areas from the open source segmentation dataset. In this case, the multiple mask areas corresponding to the sample images in the segmentation dataset are obtained by inputting the sample images into the Segment Anything Model (SAM) and then performing entity segmentation by SAM, and the image description information of the sample images is carried by the sample images.

本申请的一些实施例中，每个掩码区域的标注信息包括以下至少一项：每个掩码区域的面积、每个掩码区域的预测准确度、每个掩码区域的预测稳定度。In some embodiments of the present application, the annotation information of each mask region includes at least one of the following: the area of each mask region, the prediction accuracy of each mask region, and the prediction stability of each mask region.

本申请的一些实施例中，预测准确度表征通过分割一切模型对掩码区域的预测准确性，预测稳定度表征通过分割一切模型对掩码区域的预测稳定性。In some embodiments of the present application, the prediction accuracy represents the prediction accuracy of the mask region by the segmentation model, and the prediction stability represents the prediction stability of the mask region by the segmentation model.

示例性地，掩码区域的预测准确度是该掩码区域的预测框与真实框的重合程度，表示SAM预测该掩码区域的准确性；掩码区域的预测稳定度是SAM预测该掩码区域的置信度，表示SAM预测该掩码区域的稳定性。Exemplarily, the prediction accuracy of the mask area is the degree of overlap between the predicted box of the mask area and the true box, which indicates the accuracy of SAM in predicting the mask area; the prediction stability of the mask area is the confidence of SAM in predicting the mask area, which indicates the stability of SAM in predicting the mask area.

步骤202、电子设备基于每张样本图像的至少两个掩码区域的标注信息，从每张样本图像的至少两个掩码区域中，确定标注信息满足过滤条件的第一掩码区域。Step 202: The electronic device determines, based on the annotation information of at least two mask areas of each sample image, a first mask area whose annotation information satisfies a filtering condition from among the at least two mask areas of each sample image.

本申请的一些实施例中，上述过滤条件包括以下至少一项：掩码区域的面积在预设范围内、掩码区域的预测准确度大于准确度阈值、掩码区域的预测稳定度大于稳定度阈值。In some embodiments of the present application, the above-mentioned filtering conditions include at least one of the following: the area of the mask area is within a preset range, the prediction accuracy of the mask area is greater than the accuracy threshold, and the prediction stability of the mask area is greater than the stability threshold.

需要说明的是，上述预设范围、准确度阈值和稳定度阈值均可以根据实际需求设置或调整，例如，上述预设范围可以是5％-45％，上述准确度阈值可以是0.8，上述稳定度阈值可以是0.9，本申请的一些实施例对此不做限定。It should be noted that the above-mentioned preset range, accuracy threshold and stability threshold can be set or adjusted according to actual needs. For example, the above-mentioned preset range can be 5%-45%, the above-mentioned accuracy threshold can be 0.8, and the above-mentioned stability threshold can be 0.9. Some embodiments of the present application do not limit this.

本申请的一些实施例中，上述掩码区域的面积在预设范围内是指掩码区域的面积占样本图像的面积的比例在预设范围内。In some embodiments of the present application, the area of the mask region being within a preset range means that the ratio of the area of the mask region to the area of the sample image is within a preset range.

示例性地，以至少两个掩码区域的标注信息包括每个掩码区域的面积、每个掩码区域的预测准确度和每个掩码区域的预测稳定度为例，假设过滤条件包括掩码区域的面积在5％-45％范围内、掩码区域的预测准确度大于0.8、掩码区域的预测稳定度大于0.9，则电子设备可以在至少两个掩码区域中选择面积占样本图像的面积的5％-45％、且预测准确度大于0.8、且预测稳定度大于0.9的掩码区域作为第一掩码区域。其中，该第一掩码区域的数量可以是至少一个。Exemplarily, taking the case where the annotation information of at least two mask regions includes the area of each mask region, the prediction accuracy of each mask region, and the prediction stability of each mask region as an example, assuming that the filtering conditions include that the area of the mask region is within the range of 5%-45%, the prediction accuracy of the mask region is greater than 0.8, and the prediction stability of the mask region is greater than 0.9, the electronic device can select a mask region whose area accounts for 5%-45% of the area of the sample image, and whose prediction accuracy is greater than 0.8, and whose prediction stability is greater than 0.9 from the at least two mask regions as the first mask region. The number of the first mask region can be at least one.

如此，电子设备将较小的掩码区域过滤掉，减少了对模型训练帮助不大的训练数据。将预测稳定度过低和预测准确度过低的掩码区域过滤掉，能够减少低质量的训练样本，则模型训练过程中的预测结果更加稳定，降低模型的训练难度，加快模型收敛，加快模型的训练效率。In this way, the electronic device filters out the smaller mask area, reducing the training data that is not helpful for model training. Filtering out the mask area with too low prediction stability and too low prediction accuracy can reduce low-quality training samples, so that the prediction results in the model training process are more stable, reducing the difficulty of model training, accelerating model convergence, and improving model training efficiency.

步骤203、电子设备在每张样本图像的图像描述信息中，添加每张样本图像的第一掩码区域的描述信息，得到每张样本图像的描述文本。Step 203: The electronic device adds the description information of the first mask area of each sample image to the image description information of each sample image to obtain a description text of each sample image.

本申请的一些实施例中，上述第一掩码区域的描述信息用于描述该第一掩码区域在样本图像中对应的图像内容。In some embodiments of the present application, the description information of the first mask area is used to describe the image content corresponding to the first mask area in the sample image.

示例性地，第一掩码区域的描述信息可以包括实体的类型，例如男人、女人、桌子、电脑等等，还可以包括实体的外形，例如白猫、黑狗、方桌、圆桌等等。Exemplarily, the description information of the first mask area may include the type of entity, such as man, woman, table, computer, etc., and may also include the appearance of the entity, such as white cat, black dog, square table, round table, etc.

本申请的一些实施例中，上述至少两个掩码区域的标注信息还可以包括每个掩码区域的描述信息，电子设备可以从第一掩码区域的标注信息中获取描述信息，将该描述信息添加至图像描述信息中，得到样本图像的描述文本，则该描述文本可以用于描述该样本图像的图像主题以及样本图像对应的第一掩码区域的图像内容。In some embodiments of the present application, the annotation information of the above-mentioned at least two mask areas may also include description information of each mask area. The electronic device can obtain the description information from the annotation information of the first mask area, add the description information to the image description information, and obtain the description text of the sample image. The description text can then be used to describe the image theme of the sample image and the image content of the first mask area corresponding to the sample image.

步骤204、电子设备将每张样本图像、每张样本图像的描述文本和每张样本图像的掩码图像构建为一个训练样本组，得到至少两个训练样本组。Step 204: The electronic device constructs each sample image, the description text of each sample image, and the mask image of each sample image into a training sample group, thereby obtaining at least two training sample groups.

本申请的一些实施例中，每张样本图像的掩码图像为每张样本图像中与第一掩码区域对应的掩码图像。In some embodiments of the present application, the mask image of each sample image is a mask image corresponding to the first mask area in each sample image.

本申请的一些实施例中，电子设备在确定第一掩码区域后，可以生成第一矩阵，该第一矩阵中元素点的数量与样本图像的像素点数量相同，且该第一矩阵中元素点的排列方式与样本图像中像素点的排列方式相同，确定第一掩码区域在该第一矩阵中对应的元素点，将这些元素点的值标记为1，并将第一矩阵中其他元素点的值标记为0，最终得到的第一矩阵为该样本图像的掩码图像。In some embodiments of the present application, after determining the first mask area, the electronic device can generate a first matrix, the number of element points in the first matrix is the same as the number of pixel points in the sample image, and the arrangement of the element points in the first matrix is the same as the arrangement of the pixel points in the sample image, determine the element points corresponding to the first mask area in the first matrix, mark the values of these element points as 1, and mark the values of other element points in the first matrix as 0, and the final first matrix is the mask image of the sample image.

如此，电子设备通过对样本图像进行实体分割来确定样本图像中的至少两个掩码区域，并将至少两个掩码区域中对模型训练意义不大的掩码区域，如面积较小、预测不准或预测不稳定的掩码区域过滤掉，保留第一掩码区域，减少了训练数据的噪声，降低了模型的训练难度，进而提高了模型的训练效率，并且使得模型训练过程中的预测结果更加稳定，使得模型收敛更加稳定，并且，将第一掩码区域的描述信息添加在图像描述信息中得到样本图像的描述文本，以便模型能够学习到按照描述文本预测第一掩码区域的图像内容，提高了模型性能。In this way, the electronic device determines at least two mask areas in the sample image by performing entity segmentation on the sample image, and filters out the mask areas of the at least two mask areas that are not very meaningful for model training, such as mask areas with small areas, inaccurate predictions, or unstable predictions, and retains the first mask area, thereby reducing the noise of the training data, reducing the difficulty of model training, and thereby improving the training efficiency of the model, and making the prediction results in the model training process more stable, making the model convergence more stable, and adding the description information of the first mask area to the image description information to obtain the description text of the sample image, so that the model can learn to predict the image content of the first mask area according to the description text, thereby improving the model performance.

进一步地，对于一张样本图像，在第一掩码区域的数量大于数量阈值的情况下，可以随机从第一掩码区域中选择X个第三掩码区域，再在样本图像的图像描述信息中，添加样本图像的第三掩码区域的描述信息，得到样本图像的描述文本，并生成样本图像的掩码图像，该掩码图像与样本图像的X个第三掩码区域对应，将样本图像、样本图像的描述文本和样本图像的掩码图像构建为一个训练样本组。Furthermore, for a sample image, when the number of first mask areas is greater than a quantity threshold, X third mask areas can be randomly selected from the first mask areas, and then the description information of the third mask areas of the sample image is added to the image description information of the sample image to obtain a description text of the sample image, and generate a mask image of the sample image, which corresponds to the X third mask areas of the sample image, and the sample image, the description text of the sample image and the mask image of the sample image are constructed as a training sample group.

需要说明的是，X是大于或等于1的整数，例如，X是5。It should be noted that X is an integer greater than or equal to 1, for example, X is 5.

需要说明的是，上述数量阈值可以由电子设备默认设置，也可以由用户根据实际需求设置，本申请对此不做限定。例如，数量阈值可以是20。It should be noted that the above quantity threshold may be set by the electronic device by default, or may be set by the user according to actual needs, and this application does not limit this. For example, the quantity threshold may be 20.

如此，电子设备使用掩码图像对样本图像进行加噪处理后，得到的加噪图像中包括的原始图像信息较多，使得模型能够学习到样本图像更多的图像信息，提升模型的学习效果。In this way, after the electronic device uses the mask image to perform noise processing on the sample image, the obtained noised image includes more original image information, so that the model can learn more image information of the sample image and improve the learning effect of the model.

值得注意的是，在第一掩码区域的数量大于数量阈值的情况下，电子设备可以在每次迭代训练之前，从第一掩码区域中随机选择X个第三掩码区域，通过上述步骤生成训练样本组。并且，不同次迭代训练选择的X个第三掩码区域可以不同，如此，模型能够学习到在多种掩码组合的情况下预测掩码区域的图像内容的能力。It is worth noting that when the number of the first mask regions is greater than the number threshold, the electronic device can randomly select X third mask regions from the first mask regions before each iterative training, and generate a training sample group through the above steps. In addition, the X third mask regions selected in different iterative trainings can be different, so that the model can learn the ability to predict the image content of the mask region under multiple mask combinations.

本申请的一些实施例中，电子设备可以使用按尺寸信息构造的方式构建训练样本组。In some embodiments of the present application, the electronic device may construct a training sample group in a manner constructed according to size information.

示例性地，结合图12，如图14所示，在上述步骤101之前，本申请的一些实施例提供的图像生成模型的训练方法还包括以下步骤301至步骤303。Exemplarily, in combination with FIG. 12 , as shown in FIG. 14 , before the above-mentioned step 101 , the training method of the image generation model provided by some embodiments of the present application also includes the following steps 301 to 303 .

步骤301、电子设备获取至少两张样本图像和每张样本图像的描述文本。Step 301: The electronic device obtains at least two sample images and a description text of each sample image.

本申请的一些实施例中，电子设备可以从开源的图像集中获取样本图像和样本图像的描述文本。In some embodiments of the present application, the electronic device may obtain sample images and description texts of the sample images from an open source image collection.

步骤302、电子设备根据尺寸范围和每张样本图像的尺寸信息，在每张样本图像中生成第二掩码区域。Step 302: The electronic device generates a second mask area in each sample image according to the size range and the size information of each sample image.

本申请的一些实施例中，上述尺寸范围可以包括第二掩码区域的高度范围和宽度范围。上述高度范围是第二掩码区域的区域高度占样本图像的高度的范围，上述宽度范围是第二掩码区域的区域宽度占样本图像的宽度的范围。In some embodiments of the present application, the size range may include a height range and a width range of the second mask area. The height range is the range of the area height of the second mask area to the height of the sample image, and the width range is the range of the area width of the second mask area to the width of the sample image.

示例性地，上述高度范围可以是5％-15％，上述宽度范围可以是10％-20％。Exemplarily, the height may range from 5% to 15%, and the width may range from 10% to 20%.

本申请的一些实施例中，上述样本图像的尺寸信息可以包括样本图像的高度和宽度。In some embodiments of the present application, the size information of the sample image may include the height and width of the sample image.

本申请的一些实施例中，电子设备在获取到样本图像后，可以在高度范围中任选一个百分比，将该百分比与样本图像的高度相乘，得到第二掩码区域的区域高度，在宽度范围中任选一个百分比，将该百分比与样本图像的宽度相乘，得到第二掩码区域的区域宽度。然后，电子设备可以在样本图像的任意位置确定高度为该区域高度且宽度为该区域宽度的区域作为第二掩码区域。In some embodiments of the present application, after acquiring the sample image, the electronic device can select any percentage in the height range, multiply the percentage by the height of the sample image, and obtain the region height of the second mask region, and select any percentage in the width range, multiply the percentage by the width of the sample image, and obtain the region width of the second mask region. Then, the electronic device can determine a region with a height equal to the region height and a width equal to the region width at any position of the sample image as the second mask region.

示例性地，电子设备可以从样本图像的上边缘开始在样本图像中确定宽度为样本图像的宽度且高度为区域高度的第一区域，并从样本图像的下边缘开始在样本图像中确定宽度为样本图像的宽度且高度为区域高度的第二区域，并从样本图像的左边缘开始在样本图像中确定高度为样本图像的高度且宽度为区域宽度的第三区域，并从样本图像的右边缘开始在样本图像中确定高度为样本图像的高度且宽度为区域宽度的第四区域，将第一区域、第二区域、第三区域与第四区域组合，得到第二掩码区域。Exemplarily, the electronic device may determine, in the sample image, a first region whose width is the width of the sample image and whose height is the region height starting from the upper edge of the sample image, and determine, in the sample image, a second region whose width is the width of the sample image and whose height is the region height starting from the lower edge of the sample image, and determine, in the sample image, a third region whose height is the height of the sample image and whose width is the region width starting from the left edge of the sample image, and determine, in the sample image, a fourth region whose height is the height of the sample image and whose width is the region width starting from the right edge of the sample image, and combine the first region, the second region, the third region and the fourth region to obtain a second mask region.

例如，以样本图像的高度和宽度均是512，宽度范围是10％～20％、且高度范围是5％～15％为例，假设电子设备选择的高度范围是10％，宽度范围是15％，则可以确定区域高度是51、区域宽度是76。电子设备可以从样本图像的上边缘开始在样本图像中确定宽度为512且高度为51的第一区域，并从样本图像的下边缘开始在样本图像中确定宽度为512且高度为51的第二区域，并从样本图像的左边缘开始在样本图像中确定高度为512且宽度为76的第三区域，并从样本图像的右边缘开始在样本图像中确定高度为512且宽度为76的第四区域，将第一区域、第二区域、第三区域与第四区域组合，得到第二掩码区域。For example, taking the case where both the height and width of the sample image are 512, the width range is 10% to 20%, and the height range is 5% to 15%, assuming that the height range selected by the electronic device is 10% and the width range is 15%, it can be determined that the region height is 51 and the region width is 76. The electronic device can determine a first region with a width of 512 and a height of 51 in the sample image starting from the upper edge of the sample image, and a second region with a width of 512 and a height of 51 in the sample image starting from the lower edge of the sample image, and a third region with a height of 512 and a width of 76 in the sample image starting from the left edge of the sample image, and a fourth region with a height of 512 and a width of 76 in the sample image starting from the right edge of the sample image, and the first region, the second region, the third region and the fourth region are combined to obtain a second mask region.

步骤303、电子设备将每张样本图像、每张样本图像的描述文本、每张样本图像的掩码图像构建为一个训练样本组，得到至少两个训练样本组。Step 303: The electronic device constructs each sample image, the description text of each sample image, and the mask image of each sample image into a training sample group, thereby obtaining at least two training sample groups.

本申请的一些实施例中，每张样本图像的掩码图像为每张样本图像中第二掩码区域的掩码图像。In some embodiments of the present application, the mask image of each sample image is a mask image of the second mask area in each sample image.

本申请的一些实施例中，电子设备在确定第二掩码区域后，可以生成第二矩阵，该第二矩阵中元素点的数量与样本图像中像素点的数量相同，且该第二矩阵中元素点的排列方式与样本图像中像素点的排列方式相同，确定第二掩码区域在该第二矩阵中对应的元素点，将这些元素点的值标记为1，并将该第二矩阵中其他元素点的值标记为0，最终得到的第二矩阵为该样本图像的掩码图像。In some embodiments of the present application, after determining the second mask area, the electronic device can generate a second matrix, the number of element points in the second matrix is the same as the number of pixel points in the sample image, and the arrangement of the element points in the second matrix is the same as the arrangement of the pixel points in the sample image, determine the element points corresponding to the second mask area in the second matrix, mark the values of these element points as 1, and mark the values of other element points in the second matrix as 0, and the final second matrix is the mask image of the sample image.

示例性地，如图15所示，是本申请的一些实施例提供的掩码图像的示意图，在图15中，掩码图像以图像的形式表示，其中黑色部分表示第二掩码区域，白色部分表示未掩码区域，M_h表示第二掩码区域的高度，M_w表示第二掩码区域的宽度。Exemplarily, as shown in FIG15 , it is a schematic diagram of a mask image provided by some embodiments of the present application. In FIG15 , the mask image is represented in the form of an image, wherein the black portion represents the second mask area, the white portion represents the unmasked area, M_h represents the height of the second mask area, and M_w represents the width of the second mask area.

如此，电子设备在不考虑实体的情况下，按照尺寸信息在样本图像中生成第二掩码区域，并生成样本图像的掩码图像，则确定的样本图像的掩码图像的形式更加丰富，即能够得到更加丰富的训练样本，以便模型学习到在各种掩码情况下预测图像的能力，提高模型的性能。In this way, the electronic device generates a second mask area in the sample image according to the size information without considering the entity, and generates a mask image of the sample image. The form of the mask image of the determined sample image is richer, that is, richer training samples can be obtained, so that the model can learn the ability to predict images under various mask situations, thereby improving the performance of the model.

本申请的一些实施例中，上述提到的两种构建训练样本组的方式可以单独使用或结合使用，例如在模型训练过程中50％的训练样本组是使用按分割实体构造的方式构建的，50％的训练样本组是使用按尺寸信息构造的方式构建的，如此能够得到更加丰富的训练数据，使得模型学习到在各种情况下生成图像或处理图像的能力，提高模型性能。In some embodiments of the present application, the two methods of constructing training sample groups mentioned above can be used separately or in combination. For example, during the model training process, 50% of the training sample groups are constructed by segmenting entities, and 50% of the training sample groups are constructed by size information. In this way, richer training data can be obtained, so that the model can learn the ability to generate or process images in various situations, thereby improving model performance.

步骤102、电子设备对至少两个训练样本组中的至少一张样本图像进行编码，得到至少一项图像特征信息，以及对每张样本图像的描述文本进行编码，得到至少一项文本特征信息。Step 102: The electronic device encodes at least one sample image in at least two training sample groups to obtain at least one item of image feature information, and encodes the description text of each sample image to obtain at least one item of text feature information.

本申请的一些实施例中，上述至少一项图像特征信息中一项图像特征信息用于表征一张样本图像。上述至少一项文本像特征信息中一项文本特征信息用于表征一张样本图像的描述文本。In some embodiments of the present application, one item of image feature information in the at least one item of image feature information is used to characterize a sample image. One item of text feature information in the at least one item of text feature information is used to characterize a description text of a sample image.

本申请的一些实施例中，上述步骤102具体可以通过以下步骤401至步骤403实现。In some embodiments of the present application, the above step 102 can be specifically implemented by following the steps 401 to 403.

步骤401、电子设备通过第一条件编码器中的图像分割模块，将每张样本图像分割为N个样本图像块，并将每个样本图像块的像素点重排列为一维向量，得到每张样本图像的N个一维向量，并将每张样本图像的N个一维向量拼接得到每张样本图像的像素重排图像。Step 401: The electronic device divides each sample image into N sample image blocks through the image segmentation module in the first conditional encoder, and rearranges the pixels of each sample image block into a one-dimensional vector to obtain N one-dimensional vectors of each sample image, and concatenates the N one-dimensional vectors of each sample image to obtain a pixel rearranged image of each sample image.

步骤402、电子设备通过第一条件编码器中的映射层，提取每张像素重排图像中每个像素点的特征信息，得到至少一个像素重排特征向量。Step 402: The electronic device extracts feature information of each pixel in each pixel rearrangement image through a mapping layer in the first conditional encoder to obtain at least one pixel rearrangement feature vector.

步骤403、电子设备通过第一条件编码器中的自注意力模块，基于每个像素重排图像中像素点之间的关联关系，对每个像素重排特征向量进行特征提取，得到每张样本图像的图像特征信息。Step 403: The electronic device uses the self-attention module in the first conditional encoder to extract features from each pixel rearrangement feature vector based on the correlation between pixels in each pixel rearrangement image, so as to obtain image feature information of each sample image.

需要说明的是，步骤401至步骤403中的第一条件编码器与上述实施例中的第一条件编码器111的结构(如图4所示)相同，对图像编码的过程也相同。因此，步骤401至步骤403的具体实现过程可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the first conditional encoder in steps 401 to 403 has the same structure as the first conditional encoder 111 in the above embodiment (as shown in FIG. 4 ), and the process of encoding the image is also the same. Therefore, the specific implementation process of steps 401 to 403 can refer to the relevant description of the above embodiment, and this application will not repeat it here.

示例性地，如图4所示，第一条件编码器包括图像分割模块、映射层和自注意力模块。第一条件编码器对一个样本图像编码得到一项图像特征信息的具体实现包括：电子设备将分辨率为512×512×3的样本图像输入第一条件编码器的图像分割模块，通过图像分割模块将样本图像分割为N个样本图像块，并将每个样本图像块的像素点重排列为一维向量，得到样本图像的N个一维向量，并将样本图像的N个一维向量拼接，得到样本图像的分辨率为256×768的像素重排图像输入映射层，通过映射层对像素重排图像中每个像素点进行线性变换处理，得到每个像素点的特征信息，将像素重排图像中多个像素点的特征信息拼接组合，得到维度为256×768的像素重排特征向量输入自注意力模块，在自注意力模块中依次经过自注意力算子、加法&归一化算子、前向反馈算子、加法&归一化算子对像素重排特征向量进行处理，得到样本图像的维度为256×768的图像特征信息。Exemplarily, as shown in FIG4 , the first conditional encoder includes an image segmentation module, a mapping layer, and a self-attention module. The specific implementation of the first conditional encoder encoding a sample image to obtain an image feature information includes: the electronic device inputs a sample image with a resolution of 512×512×3 into the image segmentation module of the first conditional encoder, divides the sample image into N sample image blocks through the image segmentation module, and rearranges the pixels of each sample image block into a one-dimensional vector to obtain N one-dimensional vectors of the sample image, and splices the N one-dimensional vectors of the sample image to obtain a pixel rearranged image with a resolution of 256×768 for the sample image, inputs the mapping layer, performs linear transformation processing on each pixel in the pixel rearranged image through the mapping layer to obtain feature information of each pixel, splices and combines feature information of multiple pixels in the pixel rearranged image to obtain a pixel rearranged feature vector with a dimension of 256×768, inputs the pixel rearranged feature vector into the self-attention module, and processes the pixel rearranged feature vector in the self-attention module in sequence through the self-attention operator, the addition & normalization operator, the forward feedback operator, and the addition & normalization operator to obtain image feature information of the sample image with a dimension of 256×768.

如此，电子设备通过第一条件编码器对样本图像进行特征提取，得到样本图像的图像特征信息，并在后续模型训练中使用，使得模型能够学习到图像的图像特征信息，具备图像处理的能力。In this way, the electronic device extracts features of the sample image through the first conditional encoder to obtain image feature information of the sample image, and uses it in subsequent model training, so that the model can learn the image feature information of the image and have the ability to process images.

本申请的一些实施例中，上述步骤102具体可以通过以下步骤501至步骤502实现。In some embodiments of the present application, the above step 102 can be specifically implemented by following the steps 501 to 502.

步骤501、电子设备通过第二条件编码器中的分词器，将每个描述文本分割为至少两个分词，并对每个分词进行编码，得到每个描述文本的文本特征向量。Step 501: The electronic device divides each description text into at least two words through a word segmenter in a second conditional encoder, and encodes each word segment to obtain a text feature vector of each description text.

步骤502、电子设备通过第二条件编码器中的自注意力模块，基于每个描述文本中分词之间的关联关系，对每个描述文本的文本特征向量进行特征提取，得到每个描述文本的文本特征信息。Step 502: The electronic device uses the self-attention module in the second conditional encoder to extract features from the text feature vector of each description text based on the association between the word segments in each description text, so as to obtain text feature information of each description text.

需要说明的是，步骤501至步骤502中的第二条件编码器与上述实施例中的第二条件编码器112的结构(如图5所示)相同，对文本编码的过程也相同。因此，步骤501至步骤502的具体实现过程可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the second conditional encoder in step 501 to step 502 has the same structure as the second conditional encoder 112 in the above embodiment (as shown in FIG. 5 ), and the process of encoding the text is also the same. Therefore, the specific implementation process of step 501 to step 502 can refer to the relevant description of the above embodiment, and this application will not repeat it here.

示例性地，如图5所示，第二条件编码器包括分词器和自注意力模块。第二条件编码器对一个样本图像的描述文本编码得到一项文本特征信息的具体实现包括：电子设备将描述文本输入第二条件编码器的分词器，通过分词器将描述文本分割为至少两个分词，并对每个分词进行编码，得到描述文本的维度为128×768的文本特征向量输入自注意力模块，在自注意力模块中依次经过自注意力算子、加法&归一化算子、前向反馈算子、加法&归一化算子对文本特征向量进行处理，得到描述文本的维度为128×768的文本特征信息。Exemplarily, as shown in FIG5 , the second conditional encoder includes a word segmenter and a self-attention module. The specific implementation of the second conditional encoder encoding the description text of a sample image to obtain a text feature information includes: the electronic device inputs the description text into the word segmenter of the second conditional encoder, divides the description text into at least two word segments through the word segmenter, and encodes each word segment to obtain a text feature vector of the description text with a dimension of 128×768, which is input into the self-attention module, and the text feature vector is processed by the self-attention operator, the addition & normalization operator, the forward feedback operator, and the addition & normalization operator in the self-attention module in turn to obtain text feature information of the description text with a dimension of 128×768.

如此，通过第二条件编码器对描述文本进行特征提取，得到描述文本的文本特征信息，并在后续模型训练中使用，使得模型能够学习到文本特征信息，便于模型学习到根据描述文本更加准确处理图像生成任务。In this way, the description text is feature extracted through the second conditional encoder to obtain text feature information of the description text, and used in subsequent model training, so that the model can learn the text feature information, which facilitates the model to learn to more accurately process the image generation task based on the description text.

步骤103、电子设备基于控制条件、至少一项图像特征信息和至少一项文本特征信息，确定至少一项条件特征信息。Step 103: The electronic device determines at least one item of conditional feature information based on the control condition, at least one item of image feature information and at least one item of text feature information.

本申请的一些实施例中，上述控制条件用于指示图像生成任务的类型，上述条件特征信息是根据图像生成任务的类型确定的。In some embodiments of the present application, the above control conditions are used to indicate the type of image generation task, and the above conditional feature information is determined according to the type of image generation task.

本申请的一些实施例中，上述步骤103具体可以通过以下步骤601实现。In some embodiments of the present application, the above step 103 can be specifically implemented through the following step 601.

步骤601、电子设备在控制条件指示的图像生成任务的类型为文生图任务的情况下，将至少一项文本特征信息确定为至少一项条件特征信息；电子设备在控制条件指示的图像生成任务的类型为图像处理任务的情况下，将至少一项图像特征信息和至少一项文本特征信息确定为至少一项条件特征信息。Step 601: When the type of image generation task indicated by the control condition is a text-image task, the electronic device determines at least one item of text feature information as at least one item of conditional feature information; when the type of image generation task indicated by the control condition is an image processing task, the electronic device determines at least one item of image feature information and at least one item of text feature information as at least one item of conditional feature information.

本申请的一些实施例中，图像处理任务包括以下至少一项：图生图任务、文和图生图任务、图像消除任务、图像局部修改任务、图像外扩任务。In some embodiments of the present application, the image processing task includes at least one of the following: an image-to-image task, a text-to-image task, an image elimination task, an image local modification task, and an image expansion task.

需要说明的是，步骤601确定至少一项条件特征信息的具体实现过程与上述实施例的实现过程相同，其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the specific implementation process of determining at least one conditional feature information in step 601 is the same as the implementation process of the above embodiment. The specific implementation can refer to the relevant description of the above embodiment, and this application will not repeat it here.

如此，通过使用控制条件，图像生成模型可以根据图像生成任务的类型确定条件特征信息并进行后续处理，使得训练得到的一个图像生成模型能够具备处理多种类型的图像生成任务的能力，提高了图像生成模型的训练效率以及图像生成任务的处理效率。In this way, by using control conditions, the image generation model can determine the conditional feature information and perform subsequent processing according to the type of image generation task, so that a trained image generation model can have the ability to handle multiple types of image generation tasks, thereby improving the training efficiency of the image generation model and the processing efficiency of the image generation task.

步骤104、电子设备基于至少一张样本图像、至少一张样本图像的掩码图像和至少一项条件特征信息，对初始模型进行训练，得到图像生成模型。Step 104: The electronic device trains the initial model based on at least one sample image, at least one mask image of the sample image, and at least one conditional feature information to obtain an image generation model.

本申请的一些实施例中，初始模型是未训练的图像生成模型，其结构与上述图像生成模型的结构相同，但模型参数是初始参数，需要经过训练后才能具备处理多种类型的图像生成任务的能力。In some embodiments of the present application, the initial model is an untrained image generation model whose structure is the same as that of the above-mentioned image generation model, but the model parameters are initial parameters and need to be trained before it can have the ability to handle various types of image generation tasks.

本申请的一些实施例中，上述步骤104具体可以通过以下步骤701至步骤702实现。In some embodiments of the present application, the above step 104 can be specifically implemented by following the steps 701 to 702.

步骤701、电子设备基于高斯噪声矩阵和至少一张样本图像的掩码图像，对至少一张样本图像进行加噪处理，得到至少一张噪声图像。Step 701: The electronic device performs noise addition processing on at least one sample image based on a Gaussian noise matrix and a mask image of at least one sample image to obtain at least one noise image.

本申请的一些实施例中，高斯噪声矩阵是一种服从正态分布的噪声矩阵，可以通过电子设备随机采样得到。In some embodiments of the present application, the Gaussian noise matrix is a noise matrix that obeys a normal distribution and can be obtained by random sampling of an electronic device.

本申请的一些实施例中，一张噪声图像对应一张样本图像的掩码图像和一张样本图像。In some embodiments of the present application, a noise image corresponds to a mask image of a sample image and a sample image.

本申请的一些实施例中，上述步骤701具体可以通过以下步骤701a至步骤701b实现。In some embodiments of the present application, the above step 701 can be specifically implemented by following the steps 701a to 701b.

步骤701a、电子设备将高斯噪声矩阵和至少一张样本图像的掩码图像进行加权求和，得到至少一个加噪矩阵。Step 701a: The electronic device performs weighted summation on the Gaussian noise matrix and the mask image of at least one sample image to obtain at least one noise matrix.

示例性地，电子设备可以通过上述公式(1)计算得到加噪矩阵。Exemplarily, the electronic device can calculate the noise matrix using the above formula (1).

示例性地，在确定加噪矩阵时使用的掩码图像可以通过上述两种不同的方式随机确定，则生成的加噪矩阵也可以包括两种不同形式。如图16所示，是本申请的一些实施例提供的两种确定加噪矩阵的示意图，在图16中，电子设备将维度为[16，64，64]的高斯噪声矩阵(GNM)与Weight相乘，并将按照分割实体构造的方式确定的维度为[16，64，64]的掩码图像(Mask)与(1-Weight)相乘，再将两次相乘的结果求和得到维度为[16，64，64]的加噪矩阵z'；电子设备将维度为[16，64，64]的高斯噪声矩阵(GNM)与Weight相乘，并将按照尺寸信息构造的方式确定的维度为[16，64，64]的掩码图像(Mask)与(1-Weight)相乘，再将两次相乘的结果求和得到维度为[16，64，64]的加噪矩阵z'。Exemplarily, the mask image used when determining the noise matrix can be randomly determined in the above two different ways, and the generated noise matrix can also include two different forms. As shown in Figure 16, it is a schematic diagram of two methods for determining noise matrices provided by some embodiments of the present application. In Figure 16, the electronic device multiplies the Gaussian noise matrix (GNM) with a dimension of [16, 64, 64] by Weight, and multiplies the mask image (Mask) with a dimension of [16, 64, 64] determined by the segmentation entity construction method by (1-Weight), and then sums the results of the two multiplications to obtain a noise matrix z' with a dimension of [16, 64, 64]; the electronic device multiplies the Gaussian noise matrix (GNM) with a dimension of [16, 64, 64] by Weight, and multiplies the mask image (Mask) with a dimension of [16, 64, 64] determined by the size information construction method by (1-Weight), and then sums the results of the two multiplications to obtain a noise matrix z' with a dimension of [16, 64, 64].

如此，能够组合得到多种加噪方式，进而得到多种加噪矩阵，则在模型训练的过程中，模型能够学习到从多种不同的噪声中重建图像，提高模型的性能。In this way, multiple noise adding methods can be combined to obtain multiple noise adding matrices. In the process of model training, the model can learn to reconstruct images from a variety of different noises, thereby improving the performance of the model.

步骤701b、电子设备将至少一个加噪矩阵和至少一张样本图像进行加权求和，得到至少一张噪声图像。Step 701b: The electronic device performs weighted summation on at least one noise matrix and at least one sample image to obtain at least one noise image.

本申请的一些实施例中，每张噪声图像对应一个加噪矩阵和一张样本图像。In some embodiments of the present application, each noisy image corresponds to a noise matrix and a sample image.

示例性地，电子设备可以通过上述公式(2)计算得到噪声图像。Exemplarily, the electronic device can calculate the noise image using the above formula (2).

需要说明的是，上述步骤701a至步骤701b的具体实现与上述实施例中对原始图像加噪得到噪声图像的实现过程相同，其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the specific implementation of the above steps 701a to 701b is the same as the implementation process of adding noise to the original image to obtain a noisy image in the above embodiment. The specific implementation can be found in the relevant description of the above embodiment, and this application will not repeat it here.

如此，电子设备通过仅高斯噪声矩阵或仅掩码图像或高斯噪声矩阵和掩码图像结合的方式对样本图像进行加噪处理，且掩码图像通过上述按照分割实体构造和按照尺寸信息构造两种方式得到，则加噪矩阵的形式很多，对样本图像加噪得到的噪声图像的形式也很多，因此，模型不仅能够学习到从多种噪声中重建图像的能力，也能够学习到在各种掩码方式下预测掩码区域的图像内容的能力，提高了模型的性能。In this way, the electronic device performs noise processing on the sample image by using only a Gaussian noise matrix or only a mask image or a combination of a Gaussian noise matrix and a mask image, and the mask image is obtained by the above-mentioned two methods of constructing according to the segmentation entity and constructing according to the size information. There are many forms of the noise matrix, and there are also many forms of the noise image obtained by noisy the sample image. Therefore, the model can not only learn the ability to reconstruct images from various noises, but also learn the ability to predict the image content of the mask area under various masking methods, thereby improving the performance of the model.

步骤702、电子设备基于至少一项条件特征信息和至少一张噪声图像，对初始模型进行训练，得到图像生成模型。Step 702: The electronic device trains an initial model based on at least one conditional feature information and at least one noise image to obtain an image generation model.

本申请的一些实施例中，每张噪声图像对应一项条件特征信息。In some embodiments of the present application, each noise image corresponds to one item of conditional feature information.

本申请的一些实施例中，上述图像生成模型用于处理图像生成任务。In some embodiments of the present application, the above-mentioned image generation model is used to process image generation tasks.

本申请的一些实施例中，上述图像生成任务包括文生图任务和图像处理任务。上述图像处理任务包括以下至少一项：图像局部修改、图像局部消除和图像外扩等。In some embodiments of the present application, the image generation task includes a text image task and an image processing task. The image processing task includes at least one of the following: local image modification, local image elimination, and image expansion.

本申请的一些实施例中，上述步骤702具体可以通过以下步骤801至步骤804实现。In some embodiments of the present application, the above step 702 can be specifically implemented by following the steps 801 to 804.

步骤801、电子设备通过扩散模型，获取至少一张噪声图像的特征信息，并对至少一项条件特征信息和至少一张噪声图像的特征信息融合处理，得到至少一项融合特征信息。Step 801: The electronic device obtains feature information of at least one noise image through a diffusion model, and fuses at least one conditional feature information and feature information of at least one noise image to obtain at least one fused feature information.

本申请的一些实施例中，每项融合特征信息对应一项条件特征信息和一张噪声图像的特征信息。In some embodiments of the present application, each fusion feature information corresponds to one conditional feature information and one feature information of a noise image.

需要说明的是，上述步骤801中的扩散模型与上述实施例中的扩散模型103的结构相同，具体结构可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the diffusion model in the above step 801 has the same structure as the diffusion model 103 in the above embodiment. For the specific structure, reference may be made to the relevant description of the above embodiment, which will not be described in detail in this application.

本申请的一些实施例中，上述步骤801之前，本申请的一些实施例提供的图像生成模型的训练方法还包括以下步骤901至步骤902。In some embodiments of the present application, before the above step 801, the training method of the image generation model provided by some embodiments of the present application also includes the following steps 901 to 902.

步骤901、电子设备通过扩散模型中的图像分割模块，将每张噪声图像分割为M个噪声图像块，M为大于1的整数。Step 901: The electronic device divides each noise image into M noise image blocks through the image segmentation module in the diffusion model, where M is an integer greater than 1.

步骤902、电子设备通过扩散模型中的线性变换算子，提取每个噪声图像块的特征信息，并添加位置编码向量，得到每张噪声图像的特征信息。Step 902: The electronic device extracts feature information of each noise image block through a linear transformation operator in a diffusion model, and adds a position coding vector to obtain feature information of each noise image.

示例性地，以一张噪声图像为例，电子设备将噪声图像输入扩散模型中的图像分割模块，通过图像分割模块将噪声图像分割为M个噪声图像块后输入线性变换算子，电子设备通过线性变换算子提取M个噪声图像块中每个噪声图像块的特征信息，并生成每个噪声图像块的位置编码向量，按照每个噪声图像块在噪声图像中的位置，将M个噪声图像块的特征信息拼接，并添加位置编码向量，得到噪声图像的特征信息。Exemplarily, taking a noise image as an example, the electronic device inputs the noise image into the image segmentation module in the diffusion model, divides the noise image into M noise image blocks through the image segmentation module, and then inputs the linear transformation operator. The electronic device extracts the feature information of each of the M noise image blocks through the linear transformation operator, and generates a position coding vector for each noise image block. According to the position of each noise image block in the noise image, the feature information of the M noise image blocks is spliced, and the position coding vector is added to obtain the feature information of the noise image.

需要说明的是，上述步骤901至步骤902的具体实现与上述实施例中提取噪声图像的特征信息的实现过程相同，其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the specific implementation of the above steps 901 to 902 is the same as the implementation process of extracting the feature information of the noise image in the above embodiment. The specific implementation can be found in the relevant description of the above embodiment, and this application will not repeat it here.

如此，电子设备通过图像分割模块和线性变换算子提取噪声图像的特征信息，则扩散模型结合噪声图像的特征信息和条件特征信息预测噪声矩阵，提高预测准确性。In this way, the electronic device extracts the characteristic information of the noise image through the image segmentation module and the linear transformation operator, and the diffusion model combines the characteristic information of the noise image and the conditional characteristic information to predict the noise matrix, thereby improving the prediction accuracy.

本申请的一些实施例中，上述步骤801具体可以通过以下步骤801a至步骤801b实现。In some embodiments of the present application, the above step 801 can be specifically implemented by following the steps 801a to 801b.

步骤801a、电子设备通过扩散模型中的线性变换算子，对至少一项条件特征信息进行线性变换处理，得到至少一项线性变换后的条件特征信息。Step 801a: The electronic device performs linear transformation processing on at least one item of conditional feature information by using a linear transformation operator in a diffusion model to obtain at least one item of conditional feature information after linear transformation.

步骤801b、电子设备通过扩散模型中的拼接算子，将至少一项线性变换后的条件特征信息和至少一张噪声图像的特征信息拼接，得到至少一项融合特征信息。Step 801b: The electronic device splices at least one item of conditional feature information after linear transformation and feature information of at least one noise image through a splicing operator in the diffusion model to obtain at least one item of fused feature information.

本申请的一些实施例中，上述至少一项融合特征信息中的一项融合特征信息对应一项线性变换后的条件特征信息和一张噪声图像的特征信息。In some embodiments of the present application, one fused feature information in the at least one fused feature information corresponds to a conditional feature information after linear transformation and feature information of a noise image.

示例性地，电子设备通过上述线性变换算子对每项条件特征信息进行线性变换处理，得到线性变换后的条件特征信息输入上述拼接算子，通过上述拼接算子将线性变换后的条件特征信息和噪声图像的特征信息在维度上进行拼接，得到融合特征信息。Exemplarily, the electronic device performs linear transformation processing on each conditional feature information through the above-mentioned linear transformation operator, obtains the conditional feature information after linear transformation and inputs it into the above-mentioned splicing operator, and splices the conditional feature information after linear transformation and the feature information of the noise image in dimension through the above-mentioned splicing operator to obtain fused feature information.

需要说明的是，上述步骤801a至步骤801b的具体实现与上述实施例中得到融合特征信息的实现过程相同，其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the specific implementation of the above steps 801a to 801b is the same as the implementation process of obtaining the fusion feature information in the above embodiment. The specific implementation can be found in the relevant description of the above embodiment, and this application will not repeat it here.

如此，电子设备通过线性变换算子将条件特征信息映射至与噪声图像的特征信息相同的特征空间，再通过拼接算子将线性变换后的条件特征信息和噪声图像的特征信息拼接，得到融合特征信息，为扩散模型的后续处理提供数据基础。In this way, the electronic device maps the conditional feature information to the same feature space as the feature information of the noise image through a linear transformation operator, and then splices the conditional feature information after the linear transformation with the feature information of the noise image through a splicing operator to obtain fused feature information, providing a data basis for subsequent processing of the diffusion model.

步骤802、电子设备通过扩散模型，基于至少一项融合特征信息和至少一个加噪向量，确定至少一张噪声图像的噪声矩阵，每个加噪向量表示一张噪声图像的噪声等级。Step 802: The electronic device determines a noise matrix of at least one noise image by using a diffusion model based on at least one fusion feature information and at least one noise vector, where each noise vector represents a noise level of a noise image.

本申请的一些实施例中，上述步骤802具体可以通过以下步骤802a至步骤802c实现。In some embodiments of the present application, the above step 802 can be specifically implemented by following the steps 802a to 802c.

步骤802a、电子设备通过扩散模型中的扩散自注意力模块，对至少一个加噪向量进行特征提取，得到至少一组缩放因子，并基于至少一组缩放因子，对至少一项融合特征信息进行特征缩放，得到至少一项缩放特征信息，每项缩放特征信息对应一组缩放因子和一项融合特征信息。Step 802a, the electronic device performs feature extraction on at least one noisy vector through the diffusion self-attention module in the diffusion model to obtain at least one set of scaling factors, and based on the at least one set of scaling factors, performs feature scaling on at least one item of fused feature information to obtain at least one item of scaled feature information, each item of scaled feature information corresponds to a set of scaling factors and one item of fused feature information.

需要说明的是，上述步骤802a中的扩散自注意力模块与上述实施例中的扩散自注意力模块(如图6所示)的结构相同，对数据的处理过程也相同，扩散自注意力模块的结构与步骤802a的具体实现过程均可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the diffuse self-attention module in the above step 802a has the same structure as the diffuse self-attention module in the above embodiment (as shown in Figure 6), and the data processing process is also the same. The structure of the diffuse self-attention module and the specific implementation process of step 802a can be found in the relevant description of the above embodiment, and this application will not repeat them here.

步骤802b、电子设备通过扩散模型中的层归一化算子，对至少一项缩放特征信息归一化处理，得到至少一项噪声特征信息。Step 802b: The electronic device normalizes at least one item of scaling feature information using a layer normalization operator in a diffusion model to obtain at least one item of noise feature information.

步骤802c、电子设备通过扩散模型中的卷积算子，对至少一项噪声特征信息进行降通道数处理，得到至少一张噪声图像的噪声矩阵。Step 802c: The electronic device performs channel reduction processing on at least one item of noise feature information by using a convolution operator in a diffusion model to obtain a noise matrix of at least one noise image.

需要说明的是，上述步骤802a至步骤802c的具体实现与上述实施例中确定噪声图像的噪声矩阵的实现过程相同，其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the specific implementation of the above steps 802a to 802c is the same as the implementation process of determining the noise matrix of the noise image in the above embodiment. The specific implementation can refer to the relevant description of the above embodiment, and this application will not repeat it here.

示例性地，如图7所示，电子设备通过扩散模型确定噪声矩阵的实现过程包括：电子设备通过线性变换算子对条件特征信息做线性变换处理，得到线性变换后的条件特征信息。电子设备通过图像分割模块将噪声图像分割为M个图像块输入线性变换算子，通过线性变换算子对M个图像块做特征偷取，得到每个图像块的特征信息，然后添加位置编码向量，得到噪声图像的特征信息。电子设备将线性变换后的条件特征信息和噪声图像的特征信息输入拼接算子进行拼接，得到融合特征信息输入扩散自注意力模块，通过扩散自注意力模块对加噪向量做特征提取，得到一组缩放因子，并根据该组缩放因子对融合特征信息做缩放处理，得到缩放特征信息输入至层归一化算子，通过层归一化算子对缩放特征信息做reshape操作和Layer Norm操作，得到噪声特征信息输入卷积算子，通过卷积算子减少噪声特征信息的通道数，得到噪声图像的噪声矩阵。Exemplarily, as shown in FIG7, the implementation process of the electronic device determining the noise matrix through the diffusion model includes: the electronic device performs linear transformation processing on the conditional feature information through the linear transformation operator to obtain the conditional feature information after the linear transformation. The electronic device divides the noise image into M image blocks through the image segmentation module and inputs the linear transformation operator, performs feature stealing on the M image blocks through the linear transformation operator to obtain the feature information of each image block, and then adds the position encoding vector to obtain the feature information of the noise image. The electronic device inputs the conditional feature information after the linear transformation and the feature information of the noise image into the splicing operator for splicing, obtains the fused feature information, inputs the diffusion self-attention module, extracts the features of the noise vector through the diffusion self-attention module, obtains a set of scaling factors, and performs scaling processing on the fused feature information according to the set of scaling factors to obtain the scaled feature information input to the layer normalization operator, performs reshape operation and Layer Norm operation on the scaled feature information through the layer normalization operator, obtains the noise feature information input to the convolution operator, reduces the number of channels of the noise feature information through the convolution operator, and obtains the noise matrix of the noise image.

如此，电子设备通过扩散模型，将条件特征信息和噪声图像的特征信息融合，再用加噪向量向融合特征信息加噪，直至得到完全是噪声的噪声矩阵，且由于增加了条件特征信息，使得噪声矩阵的预测结合了文本特征信息或文本特征信息和图像特征信息，使得模型能够学习到在仅有文本或有文本和图像的情况下预测噪声矩阵的能力。In this way, the electronic device fuses the conditional feature information and the feature information of the noise image through a diffusion model, and then adds noise to the fused feature information with a noise vector until a noise matrix that is completely noise is obtained. Moreover, due to the addition of the conditional feature information, the prediction of the noise matrix combines the text feature information or the text feature information and the image feature information, so that the model can learn the ability to predict the noise matrix when there is only text or text and image.

步骤803、电子设备基于至少一张噪声图像的噪声矩阵、高斯噪声矩阵和至少一张样本图像的掩码图像，确定至少一张样本图像的噪声损失值。Step 803: The electronic device determines a noise loss value of at least one sample image based on a noise matrix of at least one noise image, a Gaussian noise matrix, and a mask image of at least one sample image.

本申请的一些实施例中，上述步骤803具体可以通过以下步骤803a至步骤803b实现。In some embodiments of the present application, the above step 803 can be specifically implemented by following the steps 803a to 803b.

步骤803a、电子设备基于高斯噪声矩阵和至少一张噪声图像的噪声矩阵，确定至少一个第一损失值，并基于至少一张样本图像的掩码图像和至少一张噪声图像的噪声矩阵，确定至少一个第二损失值。Step 803a: The electronic device determines at least one first loss value based on the Gaussian noise matrix and the noise matrix of at least one noise image, and determines at least one second loss value based on the mask image of at least one sample image and the noise matrix of at least one noise image.

步骤803b、电子设备将至少一个第一损失值和至少一个第二损失值加权求和，得到至少一张样本图像的噪声损失值。Step 803b: The electronic device performs weighted summation of at least one first loss value and at least one second loss value to obtain a noise loss value of at least one sample image.

本申请的一些实施例中，电子设备可以通过如下公式(3)确定噪声损失值：In some embodiments of the present application, the electronic device may determine the noise loss value by the following formula (3):

Loss＝α*MSE(GNM，PN₁[16,:,:])+(1-α)*MSE(Mask，PN₂[16,:,:]) (3)Loss＝α*MSE(GNM, PN ₁ [16,:,:])+(1-α)*MSE(Mask, PN ₂ [16,:,:]) (3)

在公式(3)中，Loss表示噪声损失值，GNM表示高斯噪声矩阵，Mask表示掩码图像，PN₁[16,:,:]表示噪声矩阵的前16通道组成的矩阵，PN₂[16,:,:]表示噪声矩阵的后16通道组成的矩阵，α表示权重，MSE(GNM，PN₁[16,:,:])表示第一损失值，MSE(Mask，PN₂[16,:,:])表示第二损失值。In formula (3), Loss represents the noise loss value, GNM represents the Gaussian noise matrix, Mask represents the mask image, PN ₁ [16,:,:] represents the matrix composed of the first 16 channels of the noise matrix, PN ₂ [16,:,:] represents the matrix composed of the last 16 channels of the noise matrix, α represents the weight, MSE(GNM, PN ₁ [16,:,:]) represents the first loss value, and MSE(Mask, PN ₂ [16,:,:]) represents the second loss value.

需要说明的是，上述第一损失值让模型能够从高斯噪声矩阵中重建图像，上述第二损失值增加了监督信号，让模型可以预测、消除或修改掩码区域的图像内容，使得模型具有对图像进行局部修改、局部消除及图像外扩的能力。It should be noted that the above-mentioned first loss value enables the model to reconstruct the image from the Gaussian noise matrix, and the above-mentioned second loss value increases the supervision signal, allowing the model to predict, eliminate or modify the image content of the mask area, so that the model has the ability to locally modify, locally eliminate and expand the image.

示例性地，如图17所示，是本申请的一些实施例提供的损失值计算的示意图，在图17中，噪声矩阵的维度是[32，64，64]，将其分割为两个维度为[16，64，64]的矩阵，这两个矩阵分别是噪声矩阵前16通道组成的矩阵和后16通道组成的矩阵，根据前16通道组成的维度为[16，64，64]的矩阵和维度为[16，64，64]的高斯噪声矩阵确定第一损失值L1，根据后16通道组成的维度为[16，64，64]矩阵和维度为[16，64，64]的掩码图像确定第二损失值L2。Exemplarily, as shown in Figure 17, it is a schematic diagram of loss value calculation provided by some embodiments of the present application. In Figure 17, the dimension of the noise matrix is [32, 64, 64], which is divided into two matrices with a dimension of [16, 64, 64]. The two matrices are respectively a matrix composed of the first 16 channels of the noise matrix and a matrix composed of the last 16 channels. The first loss value L1 is determined based on the matrix with a dimension of [16, 64, 64] composed of the first 16 channels and the Gaussian noise matrix with a dimension of [16, 64, 64]. The second loss value L2 is determined based on the matrix with a dimension of [16, 64, 64] composed of the last 16 channels and the mask image with a dimension of [16, 64, 64].

如此，电子设备通过重建损失函数，并基于损失函数调整模型参数，使得模型不仅能够学习到从噪声中重建图像的能力，还能够学习到对图像进行局部修改、局部消除或图像外扩的能力。In this way, the electronic device reconstructs the loss function and adjusts the model parameters based on the loss function, so that the model can not only learn the ability to reconstruct images from noise, but also learn the ability to locally modify, locally eliminate or expand the image.

步骤804、电子设备基于至少一张样本图像的噪声损失值，调整初始模型的模型参数，得到图像生成模型。Step 804: The electronic device adjusts the model parameters of the initial model based on the noise loss value of at least one sample image to obtain an image generation model.

本申请的一些实施例中，初始模型的模型参数可以包括初始模型中各个模块及各个模块中各个层的参数。In some embodiments of the present application, the model parameters of the initial model may include parameters of each module in the initial model and each layer in each module.

本申请的一些实施例中，电子设备基于噪声损失值调整初始模型的模型参数后，继续使用训练样本组对初始模型进行迭代训练，直至迭代次数大于次数阈值，或某次迭代后计算的损失值小于损失值阈值，可以停止训练，得到图像生成模型。In some embodiments of the present application, after the electronic device adjusts the model parameters of the initial model based on the noise loss value, it continues to use the training sample group to iteratively train the initial model until the number of iterations is greater than the number threshold, or the loss value calculated after a certain iteration is less than the loss value threshold. The training can be stopped to obtain the image generation model.

需要说明的是，上述次数阈值和损失值阈值均可以预先设置或根据实际需求设置，本申请对此不做限定。例如，次数阈值可以是100，损失值阈值可以是0.5。It should be noted that the above number threshold and loss value threshold can be preset or set according to actual needs, and this application does not limit this. For example, the number threshold can be 100, and the loss value threshold can be 0.5.

如此，电子设备基于条件特征信息和噪声图像对初始模型进行训练，由于噪声图像中增加了高斯噪声矩阵和掩码图像，则在执行模型训练的过程中，模型不仅能够学习到从噪声中生成图像，也能够学习到预测、消除或修改掩码图像对应的图像区域的图像内容，并且，由于条件特征信息根据控制条件、图像特征信息和文本特征信息生成，则将条件特征信息与噪声图像结合，使得模型还能够学习到根据文本生成图像、根据图像生成图像或处理图像、根据文本和图像生成图像或处理图像，以及根据文本预测、消除或修改掩码区域的图像内容，进而使得训练得到的图像生成模型具有处理各种类型的图像生成任务的能力。In this way, the electronic device trains the initial model based on the conditional feature information and the noise image. Since the Gaussian noise matrix and the mask image are added to the noise image, during the model training process, the model can not only learn to generate images from noise, but also learn to predict, eliminate or modify the image content of the image area corresponding to the mask image. Moreover, since the conditional feature information is generated according to the control conditions, image feature information and text feature information, the conditional feature information is combined with the noise image, so that the model can also learn to generate images according to text, generate images according to images or process images, generate images according to text and images or process images, and predict, eliminate or modify the image content of the mask area according to text, thereby enabling the trained image generation model to have the ability to handle various types of image generation tasks.

如此，电子设备通过使用高斯噪声矩阵和至少一张样本图像的掩码图像，对至少一张样本图像进行加噪处理，得到了至少一张噪声图像，则每张噪声图像不仅包括样本图像的图像信息，还包括噪声信息和掩码图像的图像信息，因此，使用至少一张噪声图像和至少一项条件特征信息对初始模型进行训练，使得最终得到的一个初始模型不仅能够学习到从噪声中生成图像，也能够学习到处理各种图像生成任务的能力，则不用针对不同类型的图像生成任务训练不同的模型，减少了电子设备的处理负担，提高了模型训练的效率。In this way, the electronic device performs noise processing on at least one sample image by using a Gaussian noise matrix and a mask image of at least one sample image to obtain at least one noise image. Each noise image includes not only the image information of the sample image, but also the noise information and the image information of the mask image. Therefore, the initial model is trained using at least one noise image and at least one conditional feature information, so that the final initial model can not only learn to generate images from noise, but also learn the ability to handle various image generation tasks. There is no need to train different models for different types of image generation tasks, which reduces the processing burden of the electronic device and improves the efficiency of model training.

本申请的一些实施例中，维度越大则计算量越大，由于样本图像的维度通常比较大，而高斯噪声矩阵的维度可能会比较小，因此可以先对样本图像进行特征压缩处理，再根据高斯噪声矩阵和掩码图像对特征压缩后的图像进行加噪处理，能够减少电子设备执行加噪处理的计算量，减少资源占用，提高处理效率。In some embodiments of the present application, the larger the dimension, the greater the amount of calculation. Since the dimension of the sample image is usually relatively large, while the dimension of the Gaussian noise matrix may be relatively small, the sample image can be feature compressed first, and then the feature compressed image can be noised according to the Gaussian noise matrix and the mask image. This can reduce the amount of calculation required for the electronic device to perform noise addition processing, reduce resource usage, and improve processing efficiency.

本申请的一些实施例中，上述步骤104具体可以通过以下步骤901至步骤903实现。In some embodiments of the present application, the above step 104 can be specifically implemented by following steps 901 to 903.

步骤901、电子设备通过图像编码器，对至少一张样本图像进行特征压缩，得到至少一个图像特征矩阵。Step 901: The electronic device performs feature compression on at least one sample image through an image encoder to obtain at least one image feature matrix.

本申请的一些实施例中，一个图像特征矩阵对应一张样本图像。In some embodiments of the present application, one image feature matrix corresponds to one sample image.

本申请的一些实施例中，上述一个图像特征矩阵中像素点的数量小于一张样本图像中像素点的数量。In some embodiments of the present application, the number of pixels in the above-mentioned image feature matrix is less than the number of pixels in a sample image.

示例性地，以一张样本图像的维度是[3，512，512]为例，电子设备可以对样本图像进行特征压缩，得到一个维度为[16，64，64]的图像特征矩阵。For example, taking a sample image whose dimension is [3, 512, 512] as an example, the electronic device may perform feature compression on the sample image to obtain an image feature matrix whose dimension is [16, 64, 64].

本申请的一些实施例中，上述步骤901具体可以通过以下步骤901a至步骤901b实现。In some embodiments of the present application, the above step 901 can be specifically implemented by following the steps 901a to 901b.

步骤901a、电子设备通过图像编码器中的卷积模块，对至少一张样本图像进行特征压缩，得到至少一个压缩图像特征矩阵。Step 901a: The electronic device performs feature compression on at least one sample image through a convolution module in an image encoder to obtain at least one compressed image feature matrix.

本申请的一些实施例中，每个压缩图像特征矩阵的像素点的数量多于每个图像特征矩阵的像素点的数量。In some embodiments of the present application, the number of pixels in each compressed image feature matrix is greater than the number of pixels in each image feature matrix.

步骤901b、电子设备通过图像编码器中的自注意力模块，基于每个压缩图像特征矩阵中的像素点之间的关联关系，对每个压缩图像特征矩阵进行特征提取，得到至少一个图像特征矩阵。Step 901b: The electronic device uses a self-attention module in an image encoder to extract features from each compressed image feature matrix based on the correlation between pixels in each compressed image feature matrix to obtain at least one image feature matrix.

示例性地，如图9所示，上述图像编码器包括三个卷积模块和两个自注意力模块，该三个卷积模块分别是卷积模块1、卷积模块2和卷积模块3，两个自注意力模块分别是包括4层Transformer Block的自注意力模块1和包括6层Transformer Block的自注意力模块2。电子设备通过图像编码器对一张样本图像做特征压缩，得到一个图像特征矩阵的具体实现包括：电子设备将维度为[3，512，512]的样本图像输入卷积模块1进行特征压缩，得到维度为[4，256，256]的特征矩阵，再将该维度为[4，256，256]的压缩图像特征矩阵输入自注意力模块1，得到维度为[4，256，256]的特征矩阵，再将该维度为[4，256，256]的特征矩阵输入卷积模块2进行进一步特征压缩，得到维度为[8，128，128]的压缩图像特征矩阵，再将该维度为[8，128，128]的特征矩阵输入自注意力模块12，得到维度为[8，128，128]的特征矩阵，再将该维度为[8，128，128]的特征矩阵输入卷积模块3进行进一步特征压缩，得到样本图像的维度为[16，64，64]的图像特征矩阵。Exemplarily, as shown in Figure 9, the above-mentioned image encoder includes three convolution modules and two self-attention modules, the three convolution modules are convolution module 1, convolution module 2 and convolution module 3, and the two self-attention modules are self-attention module 1 including 4 layers of Transformer Block and self-attention module 2 including 6 layers of Transformer Block. The electronic device performs feature compression on a sample image through an image encoder to obtain an image feature matrix. The specific implementation includes: the electronic device inputs the sample image with a dimension of [3, 512, 512] into the convolution module 1 for feature compression to obtain a feature matrix with a dimension of [4, 256, 256], then inputs the compressed image feature matrix with a dimension of [4, 256, 256] into the self-attention module 1 to obtain a feature matrix with a dimension of [4, 256, 256], then inputs the feature matrix with a dimension of [4, 256, 256] into the convolution module 2 for further feature compression to obtain a compressed image feature matrix with a dimension of [8, 128, 128], then inputs the feature matrix with a dimension of [8, 128, 128] into the self-attention module 12 to obtain a feature matrix with a dimension of [8, 128, 128], then inputs the feature matrix with a dimension of [8, 128, 128] into the convolution module 3 for further feature compression to obtain an image feature matrix with a dimension of [16, 64, 64] of the sample image.

需要说明的是，上述步骤901a至步骤901b的具体实现与上述实施例中确定原始图像的图像特征矩阵的实现过程相同，其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the specific implementation of the above steps 901a to 901b is the same as the implementation process of determining the image feature matrix of the original image in the above embodiment. The specific implementation can be found in the relevant description of the above embodiment, and this application will not repeat it here.

如此，电子设备在对样本图像进行加噪处理之前，通过图像编码器中的卷积模块对样本图像进行特征压缩，并通过图像编码器中的自注意力模块根据像素点之间的关联关系做进一步特征提取，使得最终得到的图像特征矩阵不仅数据量小，而且能够表征样本图像中不同像素点之间的关联关系，能够更好地表征样本图像，即在不破坏样本图像的图像信息的情况减少了加噪处理的数据量，提高电子设备的处理效率。In this way, before the electronic device performs noise processing on the sample image, it compresses the features of the sample image through the convolution module in the image encoder, and further extracts features according to the correlation between the pixels through the self-attention module in the image encoder, so that the final image feature matrix not only has a small amount of data, but also can characterize the correlation between different pixels in the sample image, and can better characterize the sample image, that is, the amount of data for noise processing is reduced without destroying the image information of the sample image, thereby improving the processing efficiency of the electronic device.

步骤902、电子设备基于高斯噪声矩阵和所述至少一张样本图像的掩码图像，对所述至少一个图像特征矩阵进行加噪处理，得到至少一张噪声图像。Step 902: The electronic device performs noise addition processing on the at least one image feature matrix based on the Gaussian noise matrix and the mask image of the at least one sample image to obtain at least one noise image.

本申请的一些实施例中，电子设备基于高斯噪声矩阵和至少一张样本图像的掩码图像，通过上述公式(1)和公式(2)，对至少一个图像特征矩阵进行加噪处理，得到至少一张噪声图像。其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。In some embodiments of the present application, the electronic device performs noise processing on at least one image feature matrix based on the Gaussian noise matrix and the mask image of at least one sample image through the above formula (1) and formula (2) to obtain at least one noise image. The specific implementation thereof can be referred to the relevant description of the above embodiment, and the present application will not repeat it here.

步骤903、电子设备基于所述至少一项条件特征信息和所述至少一张噪声图像，对初始模型进行训练，得到图像生成模型。Step 903: The electronic device trains the initial model based on the at least one conditional feature information and the at least one noise image to obtain an image generation model.

需要说明的是，上述步骤903的具体实现与上述步骤701的实现过程相同，上述步骤903的具体实现可以参见上述步骤701的相关描述，本申请在此不再赘述。It should be noted that the specific implementation of the above step 903 is the same as the implementation process of the above step 701. The specific implementation of the above step 903 can refer to the relevant description of the above step 701, and this application will not repeat it here.

如此，电子设备在对样本图像进行加噪处理之前，先对样本图像进行特征压缩，以减少样本图像中像素点的数量，减少对样本图像加噪时的计算量，得到数据量较小的噪声图像，进而减少后续模型训练过程中的数据量，减少电子设备的资源占用，提高电子设备的处理效率。In this way, before the electronic device performs noise processing on the sample image, it first performs feature compression on the sample image to reduce the number of pixels in the sample image, reduce the amount of calculation when adding noise to the sample image, and obtain a noise image with a smaller data volume, thereby reducing the amount of data in the subsequent model training process, reducing the resource usage of the electronic device, and improving the processing efficiency of the electronic device.

如此，电子设备使用上述图像生成模型的训练方法，根据图像生成任务的类型确定条件特征信息，以使用不同类型的图像生成任务训练图像生成模型，并用高斯噪声矩阵和掩码图像对样本图像进行加噪处理，使得图像生成模型不仅能够具备从噪声中生成图像的能力，也能够具备预测、消除或修改掩码图像对应的掩码区域的图像内容的能力，使得图像生成模型能够学习到处理不同类型的图像生成任务的能力。如此，不需要针对每种图像生成任务训练得到对应的模型，通过上述方法训练得到一个图像生成模型便具备了处理多种类型的图像生成任务的能力，不仅提高了模型训练效率，而且提高了处理图像生成任务的效率。In this way, the electronic device uses the above-mentioned training method of the image generation model to determine the conditional feature information according to the type of image generation task, so as to train the image generation model using different types of image generation tasks, and use the Gaussian noise matrix and the mask image to perform noise processing on the sample image, so that the image generation model can not only have the ability to generate images from noise, but also have the ability to predict, eliminate or modify the image content of the mask area corresponding to the mask image, so that the image generation model can learn the ability to handle different types of image generation tasks. In this way, there is no need to train a corresponding model for each image generation task. An image generation model trained by the above method has the ability to handle multiple types of image generation tasks, which not only improves the efficiency of model training, but also improves the efficiency of processing image generation tasks.

值得注意的是，上述描述均是图像生成模型的训练过程，接下来描述在训练得到图像生成模型之后，使用该图像生成模型处理不同的图像生成任务的具体实现过程。It is worth noting that the above descriptions are all about the training process of the image generation model. The following describes the specific implementation process of using the image generation model to handle different image generation tasks after the image generation model is trained.

本申请的一些实施例中，在上述步骤104之后，本申请的一些实施例提供的图像生成模型的训练方法还可以包括以下步骤1001。In some embodiments of the present application, after the above step 104, the training method of the image generation model provided by some embodiments of the present application may further include the following step 1001.

步骤1001、电子设备将原始信息输入图像生成模型，输出衍生图像，或者将原始信息输入图像生成模型，执行图像处理，输出衍生图像。Step 1001: The electronic device inputs original information into an image generation model and outputs a derived image, or inputs original information into an image generation model, performs image processing, and outputs a derived image.

本申请的一些实施例中，上述原始信息包括以下至少一项：待处理图像、描述文本、掩码图像。In some embodiments of the present application, the above-mentioned original information includes at least one of the following: an image to be processed, a description text, and a mask image.

本申请的一些实施例中，在上述原始信息包括描述文本的情况下，电子设备将描述文本输入图像生成模型，图像生成模型执行文生图任务，生成描述文本描述的衍生图像并输出。In some embodiments of the present application, when the above-mentioned original information includes descriptive text, the electronic device inputs the descriptive text into the image generation model, and the image generation model performs a text-to-image task to generate and output a derivative image described by the descriptive text.

本申请的一些实施例中，在上述原始信息包括描述文本和待处理图像的情况下，电子设备将描述文本和待处理图像输入图像生成模型，图像生成模型执行文和图生图任务，根据描述文本对待处理图像进行图像处理，得到衍生图像并输出。In some embodiments of the present application, when the above-mentioned original information includes descriptive text and an image to be processed, the electronic device inputs the descriptive text and the image to be processed into an image generation model, the image generation model executes the text-to-image generation task, performs image processing on the image to be processed according to the descriptive text, obtains a derivative image and outputs it.

示例性地，以上述文和图生图任务是改变待处理图像的风格为例，上述描述文本包括修改后的图像风格。电子设备将描述文本和待处理图像输入图像生成模型，图像生成模型根据描述文本中修改后的图像风格对待处理图像的风格进行修改，得到衍生图像并输出。For example, in the above text and image generation task of changing the style of the image to be processed, the above description text includes the modified image style. The electronic device inputs the description text and the image to be processed into the image generation model, and the image generation model modifies the style of the image to be processed according to the modified image style in the description text, obtains a derivative image and outputs it.

本申请的一些实施例中，在上述原始信息包括描述文本、待处理图像和掩码图像的情况下，电子设备将描述文本、待处理图像和掩码图像输入图像生成模型，图像生成模型执行图像局部修改任务，根据描述文本将待处理图像中掩码图像对应的掩码区域的图像内容进行修改，得到衍生图像并输出。In some embodiments of the present application, when the above-mentioned original information includes descriptive text, an image to be processed and a mask image, the electronic device inputs the descriptive text, the image to be processed and the mask image into an image generation model, and the image generation model performs a local image modification task, modifies the image content of the mask area corresponding to the mask image in the image to be processed according to the descriptive text, obtains a derivative image and outputs it.

本申请的一些实施例中，在上述原始信息包括待处理图像和掩码图像的情况下，电子设备将待处理图像和掩码图像输入图像生成模型，图像生成模型执行图像消除任务，将待处理图像中掩码图像对应的掩码区域的图像内容消除，得到衍生图像并输出。In some embodiments of the present application, when the above-mentioned original information includes an image to be processed and a mask image, the electronic device inputs the image to be processed and the mask image into an image generation model, and the image generation model performs an image elimination task, eliminates the image content of the mask area corresponding to the mask image in the image to be processed, obtains a derivative image and outputs it.

本申请的一些实施例中，在上述原始信息包括描述文本、待处理图像和掩码图像的情况下，电子设备将描述文本、待处理图像和掩码图像输入图像生成模型，图像生成模型执行图像消除任务，将待处理图像中掩码图像对应的掩码区域的图像内容消除，得到衍生图像并输出。In some embodiments of the present application, when the above-mentioned original information includes descriptive text, an image to be processed and a mask image, the electronic device inputs the descriptive text, the image to be processed and the mask image into an image generation model, and the image generation model performs an image elimination task, eliminates the image content of the mask area corresponding to the mask image in the image to be processed, obtains a derivative image and outputs it.

本申请的一些实施例中，在上述原始信息包括描述文本、待处理图像和掩码图像的情况下，电子设备将描述文本、待处理图像和掩码图像输入图像生成模型，图像生成模型执行图像外扩任务，根据描述文本对待处理图像中掩码图像对应的掩码区域的图像内容进行预测，得到衍生图像并输出。In some embodiments of the present application, when the above-mentioned original information includes descriptive text, an image to be processed and a mask image, the electronic device inputs the descriptive text, the image to be processed and the mask image into an image generation model, the image generation model performs an image expansion task, predicts the image content of the mask area corresponding to the mask image in the image to be processed based on the descriptive text, obtains a derived image and outputs it.

如此，电子设备使用训练好的图像生成模型，能够处理各种类型的图像生成任务，提高了图像生成任务的处理效率。In this way, the electronic device can process various types of image generation tasks using the trained image generation model, thereby improving the processing efficiency of the image generation tasks.

本申请的一些实施例中，在原始信息包括待处理图像、描述文本和掩码图像的情况下，上述步骤1001具体可以通过以下步骤1101至步骤1102实现。In some embodiments of the present application, when the original information includes an image to be processed, description text, and a mask image, the above step 1001 can be specifically implemented by the following steps 1101 to 1102.

步骤1101、电子设备通过图像生成模型中的第一条件编码器，对待处理图像进行编码，得到待处理图像的图像特征信息，并通过图像生成模型的第二条件编码器，对描述文本进行编码，得到描述文本的文本特征信息，并通过图像生成模型中的加噪模块，基于高斯噪声矩阵和掩码图像，对待处理图像进行加噪处理，得到噪声图像。Step 1101: The electronic device encodes the image to be processed through the first conditional encoder in the image generation model to obtain image feature information of the image to be processed, and encodes the description text through the second conditional encoder in the image generation model to obtain text feature information of the description text, and performs noise processing on the image to be processed based on the Gaussian noise matrix and the mask image through the noise addition module in the image generation model to obtain a noise image.

步骤1102、电子设备通过图像生成模型中的扩散模型，基于图像特征信息、文本特征信息和噪声图像做图像处理，得到衍生图像。Step 1102: The electronic device performs image processing based on the image feature information, the text feature information and the noise image through the diffusion model in the image generation model to obtain a derivative image.

需要说明的是，上述步骤1101至步骤1102的具体实现与上述实施例中生成衍生图像的实现过程相同，其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the specific implementation of the above steps 1101 to 1102 is the same as the implementation process of generating a derivative image in the above embodiment. The specific implementation can be found in the relevant description of the above embodiment, and this application will not repeat it here.

如此，由于图像生成模型在模型训练过程中通过控制条件学习到了处理各种类型的图像生成任务的能力，因此，电子设备使用训练好的图像生成模型，在输入待处理图像、描述文本和掩码图像的情况下，能够对待处理图像进行图像处理，得到衍生图像，提高了图像生成任务的处理效率。In this way, since the image generation model has learned the ability to handle various types of image generation tasks by controlling conditions during the model training process, the electronic device uses the trained image generation model to perform image processing on the image to be processed, when the image to be processed, descriptive text and mask image are input, to obtain a derivative image, thereby improving the processing efficiency of the image generation task.

本申请的一些实施例中，上述步骤1101之前，本申请的一些实施例提供的图像生成模型的训练方法还包括以下步骤1201。In some embodiments of the present application, before the above step 1101, the training method of the image generation model provided by some embodiments of the present application also includes the following step 1201.

步骤1201、电子设备通过图像生成模型中的图像编码器，对待处理图像进行特征压缩，得到图像特征矩阵。Step 1201: The electronic device performs feature compression on the image to be processed through the image encoder in the image generation model to obtain an image feature matrix.

本申请的一些实施例中，上述步骤1101具体可以通过以下步骤1301实现。In some embodiments of the present application, the above step 1101 can be specifically implemented through the following step 1301.

步骤1301、电子设备通过图像生成模型中的加噪模块，基于高斯噪声矩阵和掩码图像，对图像特征矩阵进行加噪处理，得到噪声图像。Step 1301: The electronic device performs noise processing on the image feature matrix based on the Gaussian noise matrix and the mask image through the noise addition module in the image generation model to obtain a noise image.

本申请的一些实施例中，上述步骤1102之后，本申请的一些实施例提供的图像生成模型的训练方法还包括以下步骤1401。In some embodiments of the present application, after the above step 1102, the training method of the image generation model provided by some embodiments of the present application also includes the following step 1401.

步骤1401、电子设备通过图像生成模型中的图像解码器，对衍生图像解码，输出解码后的衍生图像。Step 1401: The electronic device decodes the derived image through the image decoder in the image generation model and outputs the decoded derivative image.

需要说明的是，上述步骤1201、步骤1301和步骤1401的具体实现与上述实施例中使用图像编码器、加噪模块和图像解码器的过程相同，其具体实现可以参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the specific implementation of the above steps 1201, 1301 and 1401 is the same as the process of using the image encoder, noise addition module and image decoder in the above embodiments. The specific implementation can be found in the relevant description of the above embodiments, and this application will not repeat them here.

如此，电子设备在做加噪处理之前通过图像编码器对待处理图像进行特征压缩，减少了加噪处理过程的计算量和后续生成衍生图像过程中的数据处理量，提高了电子设备的处理效率，并且在得到衍生图像后通过图像解码器解码，得到分辨率更高的解码后的衍生图像。In this way, the electronic device uses an image encoder to perform feature compression on the image to be processed before performing noise addition processing, thereby reducing the amount of calculation in the noise addition process and the amount of data processing in the subsequent process of generating a derivative image, thereby improving the processing efficiency of the electronic device, and after obtaining the derivative image, decoding it through an image decoder to obtain a decoded derivative image with a higher resolution.

示例性地，在原始信息包括描述文本的情况下，电子设备通过图像生成模型中的第二条件编码器提取描述文本的文本特征信息，再将文本特征信息和高斯噪声矩阵输入图像生成模型中的扩散模型处理后得到衍生图像，该衍生图像是描述文本描述的图像，完成了文生图任务。Exemplarily, when the original information includes descriptive text, the electronic device extracts text feature information of the descriptive text through the second conditional encoder in the image generation model, and then inputs the text feature information and the Gaussian noise matrix into the diffusion model in the image generation model for processing to obtain a derived image. The derived image is an image describing the text, thereby completing the text-generated image task.

示例性地，在原始信息包括描述文本和待处理图像的情况下，该描述文本包括待处理图像的图像描述信息和待处理图像修改后的图像风格。电子设备通过图像生成模型中的第一条件编码器提取待处理图像的图像特征信息，通过图像生成模型中的第二条件编码器提取描述文本的文本特征信息，电子设备通过图像生成模型中的图像编码器对待处理图像进行特征压缩，得到图像特征矩阵，电子设备通过图像生成模型中的加噪模块基于高斯噪声矩阵对图像特征矩阵进行加噪处理得到噪声图像，电子设备将文本特征信息、图像特征信息和噪声图像输入图像生成模型中的扩散模型处理后得到衍生图像，再通过图像解码器对衍生图像解码得到解码后的衍生图像，该解码后的衍生图像是对待处理图像进行风格转换后得到的图像，完成了文和图生图任务。Exemplarily, in the case where the original information includes a description text and an image to be processed, the description text includes image description information of the image to be processed and the modified image style of the image to be processed. The electronic device extracts image feature information of the image to be processed through a first conditional encoder in the image generation model, extracts text feature information of the description text through a second conditional encoder in the image generation model, performs feature compression on the image to be processed through an image encoder in the image generation model to obtain an image feature matrix, performs noise processing on the image feature matrix based on a Gaussian noise matrix through a noise addition module in the image generation model to obtain a noise image, inputs text feature information, image feature information and noise image into a diffusion model in the image generation model for processing to obtain a derivative image, and then decodes the derivative image through an image decoder to obtain a decoded derivative image, the decoded derivative image is an image obtained after style conversion of the image to be processed, and the text and image generation task is completed.

示例性地，在原始信息包括描述文本、待处理图像和掩码图像的情况下，该掩码图像是待处理图像中掩码区域的掩码图像，该描述文本包括待处理图像的图像描述信息。电子设备通过图像生成模型中的第一条件编码器提取待处理图像的图像特征信息，通过图像生成模型中的第二条件编码器提取描述文本的文本特征信息，电子设备通过图像生成模型中的图像编码器对待处理图像进行特征压缩，得到图像特征矩阵，电子设备通过图像生成模型中的加噪模块基于高斯噪声矩阵和掩码图像对图像特征矩阵进行加噪处理得到噪声图像，电子设备将文本特征信息、图像特征信息和噪声图像输入图像生成模型中的扩散模型处理后得到衍生图像，再通过图像解码器对衍生图像解码得到衍生图像，该解码后的衍生图像是将待处理图像中掩码区域的图像内容消除后得到的图像，完成了图像消除任务。Exemplarily, in the case where the original information includes description text, an image to be processed and a mask image, the mask image is a mask image of a mask area in the image to be processed, and the description text includes image description information of the image to be processed. The electronic device extracts image feature information of the image to be processed through a first conditional encoder in the image generation model, extracts text feature information of the description text through a second conditional encoder in the image generation model, and performs feature compression on the image to be processed through an image encoder in the image generation model to obtain an image feature matrix. The electronic device performs noise processing on the image feature matrix based on a Gaussian noise matrix and a mask image through a noise adding module in the image generation model to obtain a noise image. The electronic device inputs the text feature information, the image feature information and the noise image into a diffusion model in the image generation model for processing to obtain a derivative image, and then decodes the derivative image through an image decoder to obtain a derivative image. The decoded derivative image is an image obtained after eliminating the image content of the mask area in the image to be processed, thereby completing the image elimination task.

示例性地，在原始信息包括描述文本、待处理图像和掩码图像，该掩码图像是待处理图像中掩码区域的掩码图像，该描述文本包括待处理图像的图像描述信息和掩码区域修改后的图像内容。电子设备通过图像生成模型中的第一条件编码器提取待处理图像的图像特征信息，通过图像生成模型中的第二条件编码器提取描述文本的文本特征信息，电子设备通过图像生成模型中的图像编码器对待处理图像进行特征压缩，得到图像特征矩阵，电子设备通过图像生成模型中的加噪模块基于高斯噪声矩阵和掩码图像对图像特征矩阵进行加噪处理得到噪声图像，电子设备将文本特征信息、图像特征信息和噪声图像输入图像生成模型中的扩散模型处理后得到衍生图像，再通过图像解码器对衍生图像解码得到衍生图像，该解码后的衍生图像是按照描述文本将待处理图像中掩码区域的图像内容修改后得到的图像，完成了图像局部修改任务。Exemplarily, the original information includes description text, an image to be processed and a mask image, the mask image is a mask image of a mask area in the image to be processed, and the description text includes image description information of the image to be processed and the modified image content of the mask area. The electronic device extracts image feature information of the image to be processed through a first conditional encoder in the image generation model, extracts text feature information of the description text through a second conditional encoder in the image generation model, and performs feature compression on the image to be processed through an image encoder in the image generation model to obtain an image feature matrix. The electronic device performs noise processing on the image feature matrix based on a Gaussian noise matrix and a mask image through a noise adding module in the image generation model to obtain a noise image. The electronic device inputs the text feature information, the image feature information and the noise image into a diffusion model in the image generation model for processing to obtain a derivative image, and then decodes the derivative image through an image decoder to obtain a derivative image. The decoded derivative image is an image obtained by modifying the image content of the mask area in the image to be processed according to the description text, thereby completing the image local modification task.

示例性地，在原始信息包括描述文本、待处理图像和掩码图像，该掩码图像是待处理图像中掩码区域的掩码图像，该描述文本包括待处理图像的图像描述信息。电子设备通过图像生成模型中的第一条件编码器提取待处理图像的图像特征信息，通过图像生成模型中的第二条件编码器提取描述文本的文本特征信息，电子设备通过图像生成模型中的图像编码器对待处理图像进行特征压缩，得到图像特征矩阵，电子设备通过图像生成模型中的加噪模块基于高斯噪声矩阵和掩码图像对图像特征矩阵进行加噪处理得到噪声图像，电子设备将文本特征信息、图像特征信息和噪声图像输入图像生成模型中的扩散模型处理后得到衍生图像，再通过图像解码器对衍生图像解码得到衍生图像，该解码后的衍生图像是按照描述文本预测出待处理图像中掩码区域的图像内容后得到的图像，完成了图像外扩任务。Exemplarily, the original information includes description text, an image to be processed and a mask image, the mask image is a mask image of a mask area in the image to be processed, and the description text includes image description information of the image to be processed. The electronic device extracts image feature information of the image to be processed through a first conditional encoder in the image generation model, extracts text feature information of the description text through a second conditional encoder in the image generation model, and performs feature compression on the image to be processed through an image encoder in the image generation model to obtain an image feature matrix. The electronic device performs noise processing on the image feature matrix based on a Gaussian noise matrix and a mask image through a noise addition module in the image generation model to obtain a noise image. The electronic device inputs the text feature information, the image feature information and the noise image into a diffusion model in the image generation model for processing to obtain a derivative image, and then decodes the derivative image through an image decoder to obtain a derivative image. The decoded derivative image is an image obtained after predicting the image content of the mask area in the image to be processed according to the description text, thereby completing the image expansion task.

需要说明的是，上述通过图像生成模型执行文生图、文和图生图、图像消除、图像局部修改和图像外扩任务时，具体的实现过程均可以参见上述图像生成模型的训练方法中的相关描述，本申请在此不再赘述。It should be noted that when the above-mentioned image generation model is used to perform text-to-image, text and image-to-image, image elimination, local image modification and image expansion tasks, the specific implementation process can be referred to the relevant description in the training method of the above-mentioned image generation model, and this application will not go into details here.

本申请的一些实施例中，在电子设备中配置图像生成模型时，可以根据电子设备的算力和执行的图像生成任务的类型在电子设备中配置条件编码器。In some embodiments of the present application, when configuring an image generation model in an electronic device, a conditional encoder may be configured in the electronic device based on the computing power of the electronic device and the type of image generation task being performed.

示例性地，对于算力有限的电子设备，可以在条件编码器中配置轻量级的文本编码器，即参数量较少的文本编码器。对于算力较大的电子设备，可以在条件编码器中仅配置重量级的文本编码器即参数量较多的文本编码器，或者，可以在条件编码器中配置轻量级的文本编码器和重量级的文本编码器组合，以得到更好的文本特征信息。For example, for electronic devices with limited computing power, a lightweight text encoder, i.e., a text encoder with a small number of parameters, can be configured in the conditional encoder. For electronic devices with greater computing power, only a heavyweight text encoder, i.e., a text encoder with a large number of parameters, can be configured in the conditional encoder, or a combination of a lightweight text encoder and a heavyweight text encoder can be configured in the conditional encoder to obtain better text feature information.

例如，对于移动通信设备、车机设备等可以配置轻量级的文本编码器。对于云端服务器可以配置轻量级的文本编码器和重量级的文本编码器。For example, a lightweight text encoder can be configured for mobile communication devices, vehicle equipment, etc. A lightweight text encoder and a heavyweight text encoder can be configured for a cloud server.

示例性地，若电子设备配置图像生成模型仅处理文生图任务，由于输入没有图像，则可以在条件编码器中不配置图像编码器，能够节省电子设备的存储空间。Exemplarily, if the electronic device is configured with an image generation model to process only text-based image tasks, since there is no image in the input, the image encoder may not be configured in the conditional encoder, thereby saving storage space of the electronic device.

本申请的一些实施例提供了一种多功能于一体的图像生成模型的框架，并配合本申请的一些实施例提出的图像生成模型的训练方法，电子设备使用训练样本组做模型训练时，通过控制条件控制图像生成模型执行不同类型的图像生成任务，使得文生图、图生图、图像消除、图像局部修改、图像外扩等图像生成任务的处理集成于一个图像生成模型之中，不用训练多个模型，减少了电子设备处理的数据量，进而减少了消耗的时间和占用的资源，减少了电子设备的资源浪费，提高了模型训练的效率，有效地降低了开发和部署成本。Some embodiments of the present application provide a framework of a multifunctional image generation model, and in conjunction with the image generation model training method proposed in some embodiments of the present application, when an electronic device uses a training sample group for model training, the image generation model is controlled by controlling conditions to perform different types of image generation tasks, so that the processing of image generation tasks such as text-generated images, image-generated images, image elimination, local image modification, and image expansion are integrated into one image generation model. There is no need to train multiple models, which reduces the amount of data processed by the electronic device, thereby reducing the time consumed and the resources occupied, reducing the waste of resources of the electronic device, improving the efficiency of model training, and effectively reducing the development and deployment costs.

本申请的一些实施例提供的图像生成模型的训练方法具体可以包括以下步骤2501-步骤2517。The training method of the image generation model provided in some embodiments of the present application may specifically include the following steps 2501-2517.

步骤2501、电子设备获取至少两个训练样本组，每个训练样本组包括一张样本图像、每张样本图像的描述文本和掩码图像。Step 2501: The electronic device obtains at least two training sample groups, each training sample group including a sample image, description text of each sample image, and a mask image.

步骤2502、电子设备通过VAE编码器中的卷积模块，对至少一张样本图像进行特征压缩，得到至少一个压缩图像特征矩阵。Step 2502: The electronic device performs feature compression on at least one sample image through a convolution module in a VAE encoder to obtain at least one compressed image feature matrix.

步骤2503、电子设备通过VAE编码器中的自注意力模块，基于每个压缩图像特征矩阵中的像素点之间的关联关系，对每个压缩图像特征矩阵进行特征提取，得到至少一个图像特征矩阵。Step 2503: The electronic device uses the self-attention module in the VAE encoder to extract features from each compressed image feature matrix based on the correlation between the pixels in each compressed image feature matrix to obtain at least one image feature matrix.

步骤2504、电子设备将高斯噪声矩阵和掩码图像加权求和，得到加噪矩阵。Step 2504: The electronic device performs weighted summation of the Gaussian noise matrix and the mask image to obtain a noisy matrix.

步骤2505、电子设备将加噪矩阵和样本图像加权求和，得到噪声图像。Step 2505: The electronic device performs weighted summation of the noise matrix and the sample image to obtain a noise image.

步骤2506、电子设备通过第一条件编码器中的图像分割模块，将每张样本图像分割为N个样本图像块，并将每个样本图像块的像素点重排列为一维向量，得到每张样本图像的N个一维向量，并将每张样本图像的所述N个一维向量拼接得到每张样本图像的像素重排图像。Step 2506: The electronic device divides each sample image into N sample image blocks through the image segmentation module in the first conditional encoder, and rearranges the pixels of each sample image block into a one-dimensional vector to obtain N one-dimensional vectors for each sample image, and splices the N one-dimensional vectors for each sample image to obtain a pixel rearranged image of each sample image.

其中，N为大于1的整数。Wherein, N is an integer greater than 1.

步骤2507、电子设备通过所述第一条件编码器中的映射层，提取每张像素重排图像中每个像素点的特征信息，得到至少一个像素重排特征向量。Step 2507: The electronic device extracts feature information of each pixel in each pixel rearrangement image through the mapping layer in the first conditional encoder to obtain at least one pixel rearrangement feature vector.

步骤2508、电子设备通过所述第一条件编码器中的自注意力模块，基于每个像素重排图像中像素点之间的关联关系，对每个像素重排特征向量进行特征提取，得到每张样本图像的图像特征信息。Step 2508: The electronic device uses the self-attention module in the first conditional encoder to extract features from each pixel rearrangement feature vector based on the correlation between pixels in each pixel rearrangement image, so as to obtain image feature information of each sample image.

需要说明的是，上述步骤2506-步骤2508可以在步骤2502之前且步骤2501之后执行，也可以在步骤2505之后执行，本申请对此不做限定。It should be noted that the above steps 2506 to 2508 can be performed before step 2502 and after step 2501, or can be performed after step 2505, and this application does not limit this.

步骤2509、电子设备通过第二条件编码器中的分词器，将每个描述文本分割为至少两个分词，并对每个分词进行编码，得到每个描述文本的文本特征向量。Step 2509: The electronic device divides each description text into at least two words through the word segmenter in the second conditional encoder, and encodes each word segment to obtain a text feature vector of each description text.

步骤2510、电子设备通过所述第二条件编码器中的自注意力模块，基于每个描述文本中分词之间的关联关系，对每个描述文本的文本特征向量进行特征提取，得到每个描述文本的文本特征信息。Step 2510: The electronic device uses the self-attention module in the second conditional encoder to extract features from the text feature vector of each description text based on the association between the word segments in each description text, so as to obtain text feature information of each description text.

需要说明的是，上述步骤2509-步骤2510可以在步骤2502之前且步骤2501之后执行，也可以在步骤2505之后执行，本实施例对此不做限定。It should be noted that the above steps 2509 to 2510 may be performed before step 2502 and after step 2501, or may be performed after step 2505, and this embodiment does not limit this.

另外，上述步骤2506-步骤2508与步骤2509-步骤2510之间没有先后执行顺序。In addition, there is no sequential execution order between the above steps 2506-2508 and steps 2509-2510.

步骤2511、电子设备基于控制条件、所述至少一项图像特征信息和所述至少一项文本特征信息，确定至少一项条件特征信息。Step 2511: The electronic device determines at least one item of conditional feature information based on the control condition, the at least one item of image feature information, and the at least one item of text feature information.

步骤2512、电子设备通过扩散模型中的图像分割模块，将每张噪声图像分割为M个噪声图像块。Step 2512: The electronic device divides each noise image into M noise image blocks through the image segmentation module in the diffusion model.

其中，M为大于1的整数；Wherein, M is an integer greater than 1;

步骤2513、电子设备通过扩散模型中的线性变换算子，提取每个噪声图像块的特征信息，并添加位置编码向量，得到每张噪声图像的特征信息。Step 2513: The electronic device extracts feature information of each noise image block through a linear transformation operator in the diffusion model, and adds a position encoding vector to obtain feature information of each noise image.

步骤2514、电子设备通过所述扩散模型中的线性变换算子，对所述至少一项条件特征信息进行线性变换处理，得到至少一项线性变换后的条件特征信息。Step 2514: The electronic device performs linear transformation processing on the at least one item of conditional feature information through the linear transformation operator in the diffusion model to obtain at least one item of conditional feature information after linear transformation.

步骤2515、电子设备通过所述扩散模型中的拼接算子，将所述至少一项线性变换后的条件特征信息和所述至少一张噪声图像的特征信息拼接，得到至少一项融合特征信息。Step 2515: The electronic device splices the at least one linearly transformed conditional feature information and the feature information of the at least one noise image through the splicing operator in the diffusion model to obtain at least one fused feature information.

步骤2516、电子设备通过所述扩散模型中的扩散自注意力模块，对所述至少一个加噪向量进行特征提取，得到至少一组缩放因子，并基于至少一组缩放因子，对所述至少一项融合特征信息进行特征缩放，得到至少一项缩放特征信息。Step 2516: The electronic device performs feature extraction on the at least one noisy vector through the diffusion self-attention module in the diffusion model to obtain at least one set of scaling factors, and performs feature scaling on the at least one fused feature information based on the at least one set of scaling factors to obtain at least one scaled feature information.

其中，每项缩放特征信息对应一组缩放因子和一项融合特征信息。Each item of scaling feature information corresponds to a set of scaling factors and an item of fusion feature information.

步骤2517、电子设备通过所述扩散模型中的层归一化算子，对所述至少一项缩放特征信息归一化处理，得到至少一项噪声特征信息。Step 2517: The electronic device normalizes the at least one item of scaling feature information using a layer normalization operator in the diffusion model to obtain at least one item of noise feature information.

步骤2518、电子设备通过所述扩散模型中的卷积算子，对所述至少一项噪声特征信息进行降通道数处理，得到所述至少一张噪声图像的噪声矩阵。Step 2518: The electronic device performs channel reduction processing on the at least one item of noise feature information through the convolution operator in the diffusion model to obtain a noise matrix of the at least one noise image.

步骤2519、电子设备基于所述高斯噪声矩阵和所述至少一张噪声图像的噪声矩阵，确定至少一个第一损失值，并基于所述至少一张样本图像的掩码图像和所述至少一张噪声图像的噪声矩阵，确定至少一个第二损失值。Step 2519: The electronic device determines at least one first loss value based on the Gaussian noise matrix and the noise matrix of the at least one noise image, and determines at least one second loss value based on the mask image of the at least one sample image and the noise matrix of the at least one noise image.

步骤2520、电子设备将所述至少一个第一损失值和所述至少一个第二损失值加权求和，得到所述至少一张样本图像的噪声损失值。Step 2520: The electronic device performs weighted summation of the at least one first loss value and the at least one second loss value to obtain a noise loss value of the at least one sample image.

步骤2521、电子设备基于所述至少一张样本图像的噪声损失值，调整所述初始模型的模型参数，得到图像生成模型。Step 2521: The electronic device adjusts the model parameters of the initial model based on the noise loss value of the at least one sample image to obtain an image generation model.

步骤2522、电子设备获取待处理图像、待处理图像的描述文本和掩码图像。Step 2522: The electronic device obtains the image to be processed, the description text of the image to be processed, and the mask image.

本申请的一些实施例中，上述掩码图像是待处理图像中掩码区域的掩码图像，上述描述文本包括待处理图像的图像描述信息和对掩码区域的图像内容修改后的图像内容。In some embodiments of the present application, the mask image is a mask image of a mask area in the image to be processed, and the description text includes image description information of the image to be processed and image content after modifying the image content of the mask area.

步骤2523、电子设备通过所述图像生成模型中的第一条件编码器，对所述待处理图像进行编码，得到所述待处理图像的图像特征信息，并通过所述图像生成模型的第二条件编码器，对所述描述文本进行编码，得到所述描述文本的文本特征信息。Step 2523: The electronic device encodes the image to be processed through the first conditional encoder in the image generation model to obtain image feature information of the image to be processed, and encodes the description text through the second conditional encoder in the image generation model to obtain text feature information of the description text.

步骤2524、电子设备通过所述图像生成模型中的图像编码器，对所述待处理图像进行特征压缩，得到图像特征矩阵。Step 2524: The electronic device performs feature compression on the image to be processed through the image encoder in the image generation model to obtain an image feature matrix.

步骤2525、电子设备通过所述图像生成模型中的加噪模块，基于高斯噪声矩阵和所述掩码图像，对所述图像特征矩阵进行加噪处理，得到所述噪声图像。Step 2525: The electronic device performs noise processing on the image feature matrix based on the Gaussian noise matrix and the mask image through the noise adding module in the image generation model to obtain the noise image.

电子设备通过所述图像生成模型中的扩散模型，基于所述图像特征信息、所述文本特征信息和所述噪声图像做图像处理，得到衍生图像。The electronic device performs image processing based on the image feature information, the text feature information and the noise image through the diffusion model in the image generation model to obtain a derivative image.

步骤2526、电子设备通过图像解码器对衍生图像解码，得到解码后的衍生图像。Step 2526: The electronic device decodes the derivative image through an image decoder to obtain a decoded derivative image.

本申请的一些实施例中，上述解码后的衍生图像是将待处理图像中掩码区域的图像内容修改后得到的图像。In some embodiments of the present application, the decoded derivative image is an image obtained by modifying the image content of the mask area in the image to be processed.

需要说明的是，上述步骤2522-步骤2526是以实现图像局部修改为例说明图像生成模型的使用过程，并不对图像生成模型的使用形成限定。It should be noted that the above steps 2522 to 2526 are used to illustrate the use of the image generation model by taking the implementation of local image modification as an example, and do not limit the use of the image generation model.

需要说明的是，上述步骤2501-步骤2526的实现过程与上述图像生成模型的训练方法的实施例中的实现过程相同，本实施例中未详细描述的部分，均可参见上述实施例的相关描述，本申请在此不再赘述。It should be noted that the implementation process of the above steps 2501 to 2526 is the same as the implementation process in the embodiment of the training method of the above-mentioned image generation model. For the parts not described in detail in this embodiment, please refer to the relevant description of the above embodiment, and this application will not repeat them here.

需要说明的是，上述各个方法实施例，或者各个方法实施例中的各种可能的实现方式均可以单独执行，也可以任意两个或两个以上相互结合执行，具体可以根据实际使用需求确定，本申请的一些实施例对此不做限制。It should be noted that the above-mentioned method embodiments, or various possible implementation methods in each method embodiment, can be executed separately, or any two or more can be executed in combination with each other. The specific implementation method can be determined according to actual usage requirements, and some embodiments of the present application do not limit this.

或者，or,

需要说明的是，上述各个方法实施例，或者各个方法实施例中的各种可能的实现方式可以单独执行，或者，在不存在矛盾的前提下，也可以相互结合执行，具体可以根据实际使用需求确定，本申请的一些实施例对此不做限制。It should be noted that the above-mentioned method embodiments, or various possible implementation methods in each method embodiment, can be executed separately, or, under the premise that there is no contradiction, can also be executed in combination with each other. The specific implementation can be determined according to actual usage requirements, and some embodiments of the present application do not limit this.

本申请的一些实施例提供的图像生成模型的训练方法，执行主体可以为图像生成模型的训练装置。本申请的一些实施例中以图像生成模型的训练装置执行图像生成模型的训练方法为例，说明本申请的一些实施例提供的图像生成模型的训练装置。In some embodiments of the present application, the training method of the image generation model provided by the training device of the image generation model may be executed by the training device of the image generation model. In some embodiments of the present application, the training device of the image generation model provided by some embodiments of the present application is illustrated by taking the training method of the image generation model performed by the training device of the image generation model as an example.

图18是本申请的一些实施例提供的图像生成模型的训练装置的结构示意图，装置包括：FIG18 is a schematic diagram of the structure of a training device for an image generation model provided in some embodiments of the present application, the device comprising:

获取单元1701，用于获取至少两个训练样本组，每个训练样本组包括一张样本图像、每张样本图像的描述文本和掩码图像；An acquisition unit 1701 is used to acquire at least two training sample groups, each training sample group includes a sample image, description text of each sample image, and a mask image;

编码单元1702，用于对获取单元获取的至少两个训练样本组中的至少一张样本图像进行编码，得到至少一项图像特征信息，以及对每张样本图像的描述文本进行编码，得到至少一项文本特征信息；The encoding unit 1702 is used to encode at least one sample image in the at least two training sample groups acquired by the acquisition unit to obtain at least one image feature information, and to encode the description text of each sample image to obtain at least one text feature information;

确定单元1703，用于基于控制条件、编码单元得到的至少一项图像特征信息和至少一项文本特征信息，确定至少一项条件特征信息；控制条件用于指示图像生成任务的类型，条件特征信息是根据图像生成任务的类型确定的；A determining unit 1703 is used to determine at least one conditional feature information based on the control condition, at least one image feature information and at least one text feature information obtained by the encoding unit; the control condition is used to indicate the type of the image generation task, and the conditional feature information is determined according to the type of the image generation task;

训练单元1704，用于基于获取单元获取的至少一张样本图像、至少一张样本图像的掩码图像和确定单元确定的至少一项条件特征信息，对初始模型进行训练，得到图像生成模型。The training unit 1704 is used to train the initial model based on at least one sample image acquired by the acquisition unit, the mask image of at least one sample image and at least one conditional feature information determined by the determination unit to obtain an image generation model.

本申请的一些实施例中，确定单元1703，具体用于：In some embodiments of the present application, the determining unit 1703 is specifically configured to:

在控制条件指示的图像生成任务的类型为文生图任务的情况下，将至少一项文本特征信息确定为至少一项条件特征信息；In a case where the type of the image generation task indicated by the control condition is a text image task, determining at least one item of text feature information as at least one item of condition feature information;

在控制条件指示的图像生成任务的类型为图像处理任务的情况下，将至少一项图像特征信息和至少一项文本特征信息确定为至少一项条件特征信息；In a case where the type of the image generation task indicated by the control condition is an image processing task, determining at least one item of image feature information and at least one item of text feature information as at least one item of conditional feature information;

其中，图像处理任务包括以下至少一项：图生图任务、文和图生图任务、图像消除任务、图像局部修改任务、图像外扩任务。Among them, the image processing task includes at least one of the following: an image-to-image task, a text-to-image task, an image elimination task, an image local modification task, and an image expansion task.

本申请的一些实施例中，装置还包括：In some embodiments of the present application, the device further includes:

处理单元，用于基于高斯噪声矩阵和获取单元获取的至少一张样本图像的掩码图像，对至少一张样本图像进行加噪处理，得到至少一张噪声图像；a processing unit, configured to perform noise addition processing on at least one sample image based on the Gaussian noise matrix and the mask image of at least one sample image acquired by the acquisition unit, so as to obtain at least one noise image;

训练单元，用于基于确定单元确定的至少一项条件特征信息和处理单元处理得到的至少一张噪声图像，对初始模型进行训练，得到图像生成模型；A training unit, configured to train the initial model based on at least one condition feature information determined by the determination unit and at least one noise image processed by the processing unit to obtain an image generation model;

其中，每张噪声图像对应一项条件特征信息。Among them, each noise image corresponds to a conditional feature information.

本申请的一些实施例中，处理单元，具体用于：In some embodiments of the present application, the processing unit is specifically used to:

将高斯噪声矩阵和获取单元获取的至少一张样本图像的掩码图像进行加权求和，得到至少一个加噪矩阵；以及，将至少一个加噪矩阵和获取单元获取的至少一张样本图像进行加权求和，得到至少一张噪声图像；Performing a weighted summation of the Gaussian noise matrix and the mask image of at least one sample image acquired by the acquisition unit to obtain at least one noise matrix; and performing a weighted summation of the at least one noise matrix and at least one sample image acquired by the acquisition unit to obtain at least one noise image;

其中，每张噪声图像对应一个加噪矩阵和一张样本图像。Among them, each noisy image corresponds to a noise matrix and a sample image.

压缩单元，用于通过图像编码器，对获取单元获取的至少一张样本图像进行特征压缩，得到至少一个图像特征矩阵；A compression unit, configured to perform feature compression on at least one sample image acquired by the acquisition unit through an image encoder to obtain at least one image feature matrix;

处理单元，用于基于高斯噪声矩阵和获取单元获取的至少一张样本图像的掩码图像，对至少一个图像特征矩阵进行加噪处理，得到至少一张噪声图像；A processing unit, configured to perform noise addition processing on at least one image feature matrix based on the Gaussian noise matrix and the mask image of at least one sample image acquired by the acquisition unit, so as to obtain at least one noise image;

训练单元，用于基于确定单元确定的至少一项条件特征信息和获取单元获取的至少一张噪声图像，对初始模型进行训练，得到图像生成模型。The training unit is used to train the initial model based on at least one condition feature information determined by the determination unit and at least one noise image acquired by the acquisition unit to obtain an image generation model.

本申请的一些实施例中，压缩单元，具体用于：In some embodiments of the present application, the compression unit is specifically used for:

通过图像编码器中的卷积模块，对获取单元获取的至少一张样本图像进行特征压缩，得到至少一个压缩图像特征矩阵，每个压缩图像特征矩阵的像素点的数量多于每个图像特征矩阵的像素点的数量；Performing feature compression on at least one sample image acquired by the acquisition unit through a convolution module in the image encoder to obtain at least one compressed image feature matrix, wherein the number of pixel points in each compressed image feature matrix is greater than the number of pixel points in each image feature matrix;

通过图像编码器中的自注意力模块，基于每个压缩图像特征矩阵中的像素点之间的关联关系，对每个压缩图像特征矩阵进行特征提取，得到至少一个图像特征矩阵。Through the self-attention module in the image encoder, feature extraction is performed on each compressed image feature matrix based on the correlation relationship between the pixels in each compressed image feature matrix to obtain at least one image feature matrix.

本申请的一些实施例中，获取单元1701，还用于：In some embodiments of the present application, the acquisition unit 1701 is further configured to:

获取至少两张样本图像、每张样本图像的图像描述信息和每张样本图像的至少两个掩码区域的标注信息，每张样本图像的至少两个掩码区域为通过分割一切模型对每张样本图像进行掩码区域的预测、并基于预测结果对每张样本图像分割得到的，每个掩码区域的标注信息包括以下至少一项：每个掩码区域的面积、每个掩码区域的预测准确度、每个掩码区域的预测稳定度；预测准确度表征通过分割一切模型对掩码区域的预测准确性，预测稳定度表征通过分割一切模型对掩码区域的预测稳定性；Acquire at least two sample images, image description information of each sample image, and annotation information of at least two mask areas of each sample image, wherein the at least two mask areas of each sample image are obtained by predicting the mask areas of each sample image using the segmentation model and segmenting each sample image based on the prediction results, and the annotation information of each mask area includes at least one of the following: the area of each mask area, the prediction accuracy of each mask area, and the prediction stability of each mask area; the prediction accuracy represents the prediction accuracy of the mask area using the segmentation model, and the prediction stability represents the prediction stability of the mask area using the segmentation model;

基于每张样本图像的至少两个掩码区域的标注信息，从每张样本图像的至少两个掩码区域中，确定标注信息满足过滤条件的第一掩码区域，过滤条件包括以下至少一项：掩码区域的面积在预设范围内、掩码区域的预测准确度大于准确度阈值、掩码区域的预测稳定度大于稳定度阈值；Based on the annotation information of at least two mask areas of each sample image, determining a first mask area whose annotation information satisfies a filtering condition from the at least two mask areas of each sample image, where the filtering condition includes at least one of the following: an area of the mask area is within a preset range, a prediction accuracy of the mask area is greater than an accuracy threshold, and a prediction stability of the mask area is greater than a stability threshold;

在每张样本图像的图像描述信息中，添加每张样本图像的第一掩码区域的描述信息，得到每张样本图像的描述文本；Adding the description information of the first mask area of each sample image to the image description information of each sample image to obtain a description text of each sample image;

将每张样本图像、每张样本图像的描述文本和每张样本图像的掩码图像构建为一个训练样本组，得到至少两个训练样本组，每张样本图像的掩码图像为每张样本图像中与第一掩码区域对应的掩码图像。Each sample image, the description text of each sample image and the mask image of each sample image are constructed into a training sample group to obtain at least two training sample groups, wherein the mask image of each sample image is the mask image corresponding to the first mask area in each sample image.

获取至少两张样本图像和每张样本图像的描述文本；Obtain at least two sample images and description text for each sample image;

根据尺寸范围和每张样本图像的尺寸信息，在每张样本图像中生成第二掩码区域；generating a second mask region in each sample image according to the size range and size information of each sample image;

将每张样本图像、每张样本图像的描述文本、每张样本图像的掩码图像构建为一个训练样本组，得到至少两个训练样本组，每张样本图像的掩码图像为每张样本图像中第二掩码区域的掩码图像。Each sample image, the description text of each sample image, and the mask image of each sample image are constructed into a training sample group to obtain at least two training sample groups, and the mask image of each sample image is the mask image of the second mask area in each sample image.

本申请的一些实施例中，编码单元1702，具体用于：In some embodiments of the present application, the encoding unit 1702 is specifically used to:

通过第一条件编码器中的图像分割模块，将获取单元获取的每张样本图像分割为N个样本图像块，并将每个样本图像块的像素点重排列为一维向量，得到每张样本图像的N个一维向量，并将每张样本图像的N个一维向量拼接得到每张样本图像的像素重排图像，N为大于1的整数；By using the image segmentation module in the first conditional encoder, each sample image acquired by the acquisition unit is segmented into N sample image blocks, and the pixels of each sample image block are rearranged into a one-dimensional vector to obtain N one-dimensional vectors of each sample image, and the N one-dimensional vectors of each sample image are spliced to obtain a pixel rearranged image of each sample image, where N is an integer greater than 1;

通过第一条件编码器中的映射层，提取每张像素重排图像中每个像素点的特征信息，得到至少一个像素重排特征向量；Extracting feature information of each pixel in each pixel rearrangement image through a mapping layer in the first conditional encoder to obtain at least one pixel rearrangement feature vector;

通过第一条件编码器中的自注意力模块，基于每个像素重排图像中像素点之间的关联关系，对每个像素重排特征向量进行特征提取，得到每张样本图像的图像特征信息。Through the self-attention module in the first conditional encoder, based on the correlation between pixels in each pixel rearrangement image, feature extraction is performed on each pixel rearrangement feature vector to obtain image feature information of each sample image.

通过第二条件编码器中的分词器，将获取单元获取的每个描述文本分割为至少两个分词，并对每个分词进行编码，得到每个描述文本的文本特征向量；Each description text acquired by the acquisition unit is divided into at least two words by a word segmenter in the second conditional encoder, and each word is encoded to obtain a text feature vector of each description text;

通过第二条件编码器中的自注意力模块，基于每个描述文本中分词之间的关联关系，对每个描述文本的文本特征向量进行特征提取，得到每个描述文本的文本特征信息。Through the self-attention module in the second conditional encoder, based on the association relationship between the segmented words in each description text, feature extraction is performed on the text feature vector of each description text to obtain the text feature information of each description text.

本申请的一些实施例中，训练单元1704，具体用于：In some embodiments of the present application, the training unit 1704 is specifically used to:

通过扩散模型，获取至少一张噪声图像的特征信息，并对至少一项条件特征信息和至少一张噪声图像的特征信息融合处理，得到至少一项融合特征信息，每项融合特征信息对应一项条件特征信息和一张噪声图像的特征信息；Acquire feature information of at least one noise image through a diffusion model, and fuse at least one condition feature information and at least one noise image feature information to obtain at least one fused feature information, where each fused feature information corresponds to one condition feature information and one noise image feature information;

通过扩散模型，基于至少一项融合特征信息和至少一个加噪向量，确定至少一张噪声图像的噪声矩阵，每个加噪向量表示一张噪声图像的噪声等级；Determining a noise matrix of at least one noisy image based on at least one fused feature information and at least one noisy vector by a diffusion model, each noisy vector representing a noise level of a noisy image;

基于至少一张噪声图像的噪声矩阵、高斯噪声矩阵和获取单元获取的至少一张样本图像的掩码图像，确定至少一张样本图像的噪声损失值；Determine a noise loss value of at least one sample image based on a noise matrix of at least one noise image, a Gaussian noise matrix, and a mask image of at least one sample image acquired by an acquisition unit;

基于至少一张样本图像的噪声损失值，调整初始模型的模型参数，得到图像生成模型。Based on the noise loss value of at least one sample image, the model parameters of the initial model are adjusted to obtain an image generation model.

提取单元，用于在通过扩散模型，获取至少一张噪声图像的特征信息，并对至少一项条件特征信息和至少一张噪声图像的特征信息融合处理，得到至少一项融合特征信息之前，通过扩散模型中的图像分割模块，将每张噪声图像分割为M个噪声图像块，M为大于1的整数；以及，通过扩散模型中的线性变换算子，提取每个噪声图像块的特征信息，并添加位置编码向量，得到每张噪声图像的特征信息。The extraction unit is used to obtain feature information of at least one noise image through a diffusion model, and fuse at least one conditional feature information and feature information of at least one noise image to obtain at least one fused feature information, divide each noise image into M noise image blocks through an image segmentation module in the diffusion model, where M is an integer greater than 1; and extract feature information of each noise image block through a linear transformation operator in the diffusion model, and add a position encoding vector to obtain feature information of each noise image.

通过扩散模型中的线性变换算子，对至少一项条件特征信息进行线性变换处理，得到至少一项线性变换后的条件特征信息；Performing linear transformation processing on at least one item of conditional feature information through a linear transformation operator in a diffusion model to obtain at least one item of conditional feature information after linear transformation;

通过扩散模型中的拼接算子，将至少一项线性变换后的条件特征信息和至少一张噪声图像的特征信息拼接，得到至少一项融合特征信息。At least one item of conditional feature information after linear transformation and at least one item of feature information of a noise image are spliced by a splicing operator in a diffusion model to obtain at least one item of fused feature information.

通过扩散模型中的扩散自注意力模块，对至少一个加噪向量进行特征提取，得到至少一组缩放因子，并基于至少一组缩放因子，对至少一项融合特征信息进行特征缩放，得到至少一项缩放特征信息，每项缩放特征信息对应一组缩放因子和一项融合特征信息；Performing feature extraction on at least one noisy vector through a diffusion self-attention module in a diffusion model to obtain at least one set of scaling factors, and performing feature scaling on at least one fused feature information based on the at least one set of scaling factors to obtain at least one scaled feature information, wherein each scaled feature information corresponds to a set of scaling factors and one fused feature information;

通过扩散模型中的层归一化算子，对至少一项缩放特征信息归一化处理，得到至少一项噪声特征信息；Normalizing at least one item of scaling feature information by a layer normalization operator in a diffusion model to obtain at least one item of noise feature information;

通过扩散模型中的卷积算子，对至少一项噪声特征信息进行降通道数处理，得到至少一张噪声图像的噪声矩阵。Through the convolution operator in the diffusion model, at least one noise feature information is subjected to channel reduction processing to obtain a noise matrix of at least one noise image.

基于高斯噪声矩阵和至少一张噪声图像的噪声矩阵，确定至少一个第一损失值，并基于至少一张样本图像的掩码图像和至少一张噪声图像的噪声矩阵，确定至少一个第二损失值；Determining at least one first loss value based on the Gaussian noise matrix and the noise matrix of the at least one noise image, and determining at least one second loss value based on the mask image of the at least one sample image and the noise matrix of the at least one noise image;

将至少一个第一损失值和至少一个第二损失值加权求和，得到至少一张样本图像的噪声损失值。The at least one first loss value and the at least one second loss value are weightedly summed to obtain a noise loss value of at least one sample image.

执行单元，用于在基于获取单元获取的至少一张样本图像、至少一张样本图像的掩码图像和确定单元确定的至少一项条件特征信息，对初始模型进行训练，得到图像生成模型之后，将原始信息输入图像生成模型，输出衍生图像，或者将原始信息输入图像生成模型，执行图像处理，输出衍生图像；an execution unit, configured to, after training the initial model based on at least one sample image acquired by the acquisition unit, the mask image of at least one sample image, and at least one conditional feature information determined by the determination unit to obtain an image generation model, input the original information into the image generation model and output a derived image, or input the original information into the image generation model, perform image processing, and output a derived image;

其中，原始信息包括以下至少一项：待处理图像、描述文本、掩码图像。The original information includes at least one of the following: an image to be processed, a description text, and a mask image.

本申请的一些实施例中，执行单元，具体用于在原始信息包括待处理图像、描述文本和掩码图像的情况下，通过图像生成模型中的第一条件编码器，对待处理图像进行编码，得到待处理图像的图像特征信息，并通过图像生成模型的第二条件编码器，对描述文本进行编码，得到描述文本的文本特征信息，并通过图像生成模型中的加噪模块，基于高斯噪声矩阵和掩码图像，对待处理图像进行加噪处理，得到噪声图像；以及，通过图像生成模型中的扩散模型，基于图像特征信息、文本特征信息和噪声图像做图像处理，得到衍生图像。In some embodiments of the present application, the execution unit is specifically used to, when the original information includes an image to be processed, a description text and a mask image, encode the image to be processed through a first conditional encoder in an image generation model to obtain image feature information of the image to be processed, and encode the description text through a second conditional encoder in the image generation model to obtain text feature information of the description text, and perform noise processing on the image to be processed based on a Gaussian noise matrix and a mask image through a noise addition module in the image generation model to obtain a noise image; and perform image processing based on the image feature information, text feature information and the noise image through a diffusion model in the image generation model to obtain a derivative image.

压缩单元，用于在通过图像生成模型中的加噪模块，基于高斯噪声矩阵和掩码图像，对待处理图像进行加噪处理，得到噪声图像之前，通过图像生成模型中的图像编码器，对待处理图像进行特征压缩，得到图像特征矩阵；A compression unit is used for performing feature compression on the image to be processed based on the Gaussian noise matrix and the mask image through the noise adding module in the image generation model to obtain the image feature matrix before obtaining the noise image;

处理单元，用于通过图像生成模型中的加噪模块，基于高斯噪声矩阵和掩码图像，对图像特征矩阵进行加噪处理，得到噪声图像；A processing unit, configured to perform noise processing on the image feature matrix based on the Gaussian noise matrix and the mask image through a noise adding module in the image generation model to obtain a noise image;

解码单元，用于通过图像生成模型中的图像解码器，对执行单元得到的衍生图像解码，输出解码后的衍生图像。The decoding unit is used to decode the derivative image obtained by the execution unit through the image decoder in the image generation model, and output the decoded derivative image.

本申请的一些实施例提供的图像生成模型的训练装置，电子设备使用训练样本组做模型训练时，通过控制条件控制图像生成模型执行不同类型的图像生成任务，使得文生图、图生图、图像消除、图像局部修改、图像外扩等图像生成任务的处理集成于一个图像生成模型之中，不用训练多个模型，减少了电子设备处理的数据量，进而减少了消耗的时间和占用的资源，减少了电子设备的资源浪费，提高了模型训练的效率，有效地降低了开发和部署成本。Some embodiments of the present application provide a training device for an image generation model. When an electronic device uses a training sample group for model training, the image generation model is controlled by controlling conditions to perform different types of image generation tasks, so that the processing of image generation tasks such as text-generated images, image-generated images, image elimination, local image modification, and image expansion are integrated into one image generation model. There is no need to train multiple models, which reduces the amount of data processed by the electronic device, thereby reducing the time consumed and the resources occupied, reducing the waste of resources of the electronic device, improving the efficiency of model training, and effectively reducing the development and deployment costs.

本申请的一些实施例中的图像生成模型的训练装置可以是电子设备，也可以是电子设备中的部件，例如集成电路或芯片。该电子设备可以是终端，也可以为除终端之外的其他设备。示例性的，电子设备可以为手机、平板电脑、笔记本电脑、掌上电脑、车载电子设备、移动上网装置(Mobile Internet Device，MID)、增强现实(augmented reality，AR)/虚拟现实(virtual reality，VR)设备、机器人、可穿戴设备、超级移动个人计算机(ultra-mobile personal computer，UMPC)、上网本或者个人数字助理(personal digitalassistant，PDA)等，还可以为服务器、网络附属存储器(Network Attached Storage，NAS)、个人计算机(personal computer，PC)、电视机(television，TV)、柜员机或者自助机等，本申请的一些实施例不作具体限定。In some embodiments of the present application, the training device of the image generation model can be an electronic device or a component in the electronic device, such as an integrated circuit or a chip. The electronic device can be a terminal or other devices other than a terminal. Exemplarily, the electronic device can be a mobile phone, a tablet computer, a laptop computer, a PDA, a vehicle-mounted electronic device, a mobile Internet device (Mobile Internet Device, MID), an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, a robot, a wearable device, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook or a personal digital assistant (personal digital assistant, PDA), etc. It can also be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a television (television, TV), a teller machine or a self-service machine, etc., and some embodiments of the present application are not specifically limited.

本申请的一些实施例中的图像生成模型的训练装置可以为具有操作系统的装置。该操作系统可以为安卓(Android)操作系统，可以为ios操作系统，还可以为其他可能的操作系统，本申请的一些实施例不作具体限定。In some embodiments of the present application, the training device of the image generation model may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, which are not specifically limited in some embodiments of the present application.

本申请的一些实施例提供的图像生成模型的训练装置能够实现上述图像生成模型的训练方法的各个实施例实现的各个过程，为避免重复，这里不再赘述。The image generation model training device provided in some embodiments of the present application is capable of implementing the various processes implemented in the various embodiments of the above-mentioned image generation model training method. To avoid repetition, they will not be described again here.

可选地，如图19所示，本申请的一些实施例还提供一种电子设备2700，包括处理器2701和存储器2702，存储器2702上存储有可在处理器2701上运行的程序或指令，该程序或指令被处理器2701执行时实现上述图像生成模型的训练方法的各个实施例的各个步骤，且能达到相同的技术效果，为避免重复，这里不再赘述。Optionally, as shown in Figure 19, some embodiments of the present application also provide an electronic device 2700, including a processor 2701 and a memory 2702, and the memory 2702 stores programs or instructions that can be executed on the processor 2701. When the program or instructions are executed by the processor 2701, the various steps of the various embodiments of the training method of the above-mentioned image generation model are implemented, and the same technical effect can be achieved. To avoid repetition, they are not repeated here.

需要说明的是，本申请的一些实施例中的电子设备包括上述所述的移动电子设备和非移动电子设备。It should be noted that the electronic devices in some embodiments of the present application include the mobile electronic devices and non-mobile electronic devices mentioned above.

图20为实现本申请的一些实施例的一种电子设备的硬件结构示意图。FIG20 is a schematic diagram of the hardware structure of an electronic device implementing some embodiments of the present application.

该电子设备2800包括但不限于：射频单元2801、网络模块2802、音频输出单元2803、输入单元2804、传感器2805、显示单元2806、用户输入单元2807、接口单元2808、存储器2809、以及处理器2810等部件。The electronic device 2800 includes but is not limited to: a radio frequency unit 2801, a network module 2802, an audio output unit 2803, an input unit 2804, a sensor 2805, a display unit 2806, a user input unit 2807, an interface unit 2808, a memory 2809, and a processor 2810 and other components.

本领域技术人员可以理解，电子设备2800还可以包括给各个部件供电的电源(比如电池)，电源可以通过电源管理系统与处理器2810逻辑相连，从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。图20中示出的电子设备结构并不构成对电子设备的限定，电子设备可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置，在此不再赘述。Those skilled in the art will appreciate that the electronic device 2800 may also include a power source (such as a battery) for supplying power to each component, and the power source may be logically connected to the processor 2810 through a power management system, so that the power management system can manage charging, discharging, and power consumption management. The electronic device structure shown in FIG20 does not constitute a limitation on the electronic device, and the electronic device may include more or fewer components than shown, or combine certain components, or arrange components differently, which will not be described in detail here.

其中，处理器2810，用于：The processor 2810 is used for:

可选地，处理器2810，具体用于在控制条件指示的图像生成任务的类型为文生图任务的情况下，将至少一项文本特征信息确定为至少一项条件特征信息；在控制条件指示的图像生成任务的类型为图像处理任务的情况下，将至少一项图像特征信息和至少一项文本特征信息确定为至少一项条件特征信息；Optionally, the processor 2810 is specifically configured to, when the type of the image generation task indicated by the control condition is a text image task, determine at least one item of text feature information as at least one item of condition feature information; when the type of the image generation task indicated by the control condition is an image processing task, determine at least one item of image feature information and at least one item of text feature information as at least one item of condition feature information;

可选地，处理器2810，具体用于：Optionally, the processor 2810 is specifically configured to:

基于高斯噪声矩阵和至少一张样本图像的掩码图像，对至少一张样本图像进行加噪处理，得到至少一张噪声图像；Based on the Gaussian noise matrix and the mask image of the at least one sample image, performing noise processing on the at least one sample image to obtain at least one noise image;

基于至少一项条件特征信息和至少一张噪声图像，对初始模型进行训练，得到图像生成模型；Based on at least one conditional feature information and at least one noise image, an initial model is trained to obtain an image generation model;

将高斯噪声矩阵和至少一张样本图像的掩码图像进行加权求和，得到至少一个加噪矩阵；Performing weighted summation on the Gaussian noise matrix and the mask image of at least one sample image to obtain at least one noise matrix;

将至少一个加噪矩阵和至少一张样本图像进行加权求和，得到至少一张噪声图像；Performing weighted summation on at least one noise matrix and at least one sample image to obtain at least one noise image;

通过图像编码器，对至少一张样本图像进行特征压缩，得到至少一个图像特征矩阵；Performing feature compression on at least one sample image by using an image encoder to obtain at least one image feature matrix;

基于高斯噪声矩阵和至少一张样本图像的掩码图像，对至少一个图像特征矩阵进行加噪处理，得到至少一张噪声图像；Based on the Gaussian noise matrix and the mask image of at least one sample image, performing noise processing on at least one image feature matrix to obtain at least one noise image;

基于至少一项条件特征信息和至少一张噪声图像，对初始模型进行训练，得到图像生成模型。Based on at least one conditional feature information and at least one noise image, the initial model is trained to obtain an image generation model.

通过图像编码器中的卷积模块，对至少一张样本图像进行特征压缩，得到至少一个压缩图像特征矩阵，每个压缩图像特征矩阵的像素点的数量多于每个图像特征矩阵的像素点的数量；Performing feature compression on at least one sample image through a convolution module in an image encoder to obtain at least one compressed image feature matrix, wherein the number of pixel points in each compressed image feature matrix is greater than the number of pixel points in each image feature matrix;

可选地，在获取至少两个训练样本组之前，处理器2810，还用于：Optionally, before acquiring at least two training sample groups, the processor 2810 is further configured to:

通过第一条件编码器中的图像分割模块，将每张样本图像分割为N个样本图像块，并将每个样本图像块的像素点重排列为一维向量，得到每张样本图像的N个一维向量，并将每张样本图像的N个一维向量拼接得到每张样本图像的像素重排图像，N为大于1的整数；By using the image segmentation module in the first conditional encoder, each sample image is segmented into N sample image blocks, and the pixels of each sample image block are rearranged into a one-dimensional vector to obtain N one-dimensional vectors of each sample image, and the N one-dimensional vectors of each sample image are spliced to obtain a pixel rearranged image of each sample image, where N is an integer greater than 1;

通过第二条件编码器中的分词器，将每个描述文本分割为至少两个分词，并对每个分词进行编码，得到每个描述文本的文本特征向量；Each description text is divided into at least two words by a word segmenter in the second conditional encoder, and each word is encoded to obtain a text feature vector of each description text;

基于至少一张噪声图像的噪声矩阵、高斯噪声矩阵和至少一张样本图像的掩码图像，确定至少一张样本图像的噪声损失值；Determining a noise loss value of at least one sample image based on a noise matrix of at least one noise image, a Gaussian noise matrix, and a mask image of at least one sample image;

可选地，在通过扩散模型，获取至少一张噪声图像的特征信息，并对至少一项条件特征信息和至少一张噪声图像的特征信息融合处理，得到至少一项融合特征信息之前，处理器2810，还用于：Optionally, before acquiring feature information of at least one noise image through a diffusion model and fusing at least one condition feature information with feature information of at least one noise image to obtain at least one fused feature information, the processor 2810 is further configured to:

通过扩散模型中的图像分割模块，将每张噪声图像分割为M个噪声图像块，M为大于1的整数；Each noise image is segmented into M noise image blocks through the image segmentation module in the diffusion model, where M is an integer greater than 1;

通过扩散模型中的线性变换算子，提取每个噪声图像块的特征信息，并添加位置编码向量，得到每张噪声图像的特征信息。The feature information of each noise image block is extracted through the linear transformation operator in the diffusion model, and the position encoding vector is added to obtain the feature information of each noise image.

可选地，在基于至少一张样本图像、至少一张样本图像的掩码图像和至少一项条件特征信息，对初始模型进行训练，得到图像生成模型之后，处理器2810，还用于：Optionally, after the initial model is trained based on at least one sample image, at least one mask image of the sample image and at least one conditional feature information to obtain the image generation model, the processor 2810 is further configured to:

将原始信息输入图像生成模型，输出衍生图像，或者将原始信息输入图像生成模型，执行图像处理，输出衍生图像；Inputting original information into an image generation model and outputting a derived image, or inputting original information into an image generation model, performing image processing, and outputting a derived image;

通过图像生成模型中的第一条件编码器，对待处理图像进行编码，得到待处理图像的图像特征信息，并通过图像生成模型的第二条件编码器，对描述文本进行编码，得到描述文本的文本特征信息，并通过图像生成模型中的加噪模块，基于高斯噪声矩阵和掩码图像，对待处理图像进行加噪处理，得到噪声图像；The image to be processed is encoded by the first conditional encoder in the image generation model to obtain image feature information of the image to be processed, and the description text is encoded by the second conditional encoder in the image generation model to obtain text feature information of the description text, and the image to be processed is subjected to noise processing based on the Gaussian noise matrix and the mask image by the noise addition module in the image generation model to obtain a noise image;

通过图像生成模型中的扩散模型，基于图像特征信息、文本特征信息和噪声图像做图像处理，得到衍生图像。Through the diffusion model in the image generation model, image processing is performed based on image feature information, text feature information and noise image to obtain a derived image.

可选地，在通过图像生成模型中的加噪模块，基于高斯噪声矩阵和掩码图像，对待处理图像进行加噪处理，得到噪声图像之前，处理器2810，还用于：Optionally, before performing noise processing on the image to be processed based on the Gaussian noise matrix and the mask image through the noise adding module in the image generation model to obtain the noisy image, the processor 2810 is further used to:

通过图像生成模型中的图像编码器，对待处理图像进行特征压缩，得到图像特征矩阵；Through the image encoder in the image generation model, the image to be processed is feature compressed to obtain an image feature matrix;

通过图像生成模型中的加噪模块，基于高斯噪声矩阵和掩码图像，对图像特征矩阵进行加噪处理，得到噪声图像；Through the noise adding module in the image generation model, based on the Gaussian noise matrix and the mask image, the image feature matrix is subjected to noise adding processing to obtain a noise image;

可选地，在通过图像生成模型中的扩散模型，基于图像特征信息、文本特征信息和噪声图像做图像处理，得到衍生图像之后，处理器2810，还用于：Optionally, after performing image processing based on the image feature information, the text feature information and the noise image through the diffusion model in the image generation model to obtain a derived image, the processor 2810 is further configured to:

通过图像生成模型中的图像解码器，对衍生图像解码，输出解码后的衍生图像。The derived image is decoded by the image decoder in the image generation model, and the decoded derivative image is output.

本申请的一些实施例提供的图像生成模型的训练方法，电子设备使用训练样本组做模型训练时，通过控制条件控制图像生成模型执行不同类型的图像生成任务，使得文生图、图生图、图像消除、图像局部修改、图像外扩等图像生成任务的处理集成于一个图像生成模型之中，不用训练多个模型，减少了电子设备处理的数据量，进而减少了消耗的时间和占用的资源，减少了电子设备的资源浪费，提高了模型训练的效率，有效地降低了开发和部署成本。Some embodiments of the present application provide a training method for an image generation model. When an electronic device uses a training sample group for model training, the image generation model is controlled by controlling conditions to perform different types of image generation tasks, so that the processing of image generation tasks such as text-generated images, image-generated images, image elimination, local image modification, and image expansion are integrated into one image generation model. There is no need to train multiple models, which reduces the amount of data processed by the electronic device, thereby reducing the time consumed and the resources occupied, reducing the waste of resources of the electronic device, improving the efficiency of model training, and effectively reducing the development and deployment costs.

应理解的是，本申请的一些实施例中，输入单元2804可以包括图形处理器(Graphics Processing Unit，GPU)28041和麦克风28042，图形处理器28041对在视频捕获模式或图像捕获模式中由图像捕获装置(如摄像头)获得的静态图片或视频的图像数据进行处理。显示单元2806可包括显示面板28061，可以采用液晶显示器、有机发光二极管等形式来配置显示面板28061。用户输入单元2807包括触控面板28071以及其他输入设备28072中的至少一种。触控面板28071，也称为触摸屏。触控面板28071可包括触摸检测装置和触摸控制器两个部分。其他输入设备28072可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆，在此不再赘述。It should be understood that in some embodiments of the present application, the input unit 2804 may include a graphics processor (GPU) 28041 and a microphone 28042, and the graphics processor 28041 processes the image data of a static picture or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The display unit 2806 may include a display panel 28061, and the display panel 28061 may be configured in the form of a liquid crystal display, an organic light emitting diode, etc. The user input unit 2807 includes a touch panel 28071 and at least one of other input devices 28072. The touch panel 28071 is also called a touch screen. The touch panel 28071 may include two parts: a touch detection device and a touch controller. Other input devices 28072 may include, but are not limited to, a physical keyboard, function keys (such as a volume control button, a switch button, etc.), a trackball, a mouse, and a joystick, which will not be repeated here.

存储器2809可用于存储软件程序以及各种数据。存储器2809可主要包括存储程序或指令的第一存储区和存储数据的第二存储区，其中，第一存储区可存储操作系统、至少一个功能所需的应用程序或指令(比如声音播放功能、图像播放功能等)等。此外，存储器2809可以包括易失性存储器或非易失性存储器，或者，存储器2809可以包括易失性和非易失性存储器两者。其中，非易失性存储器可以是只读存储器(Read-Only Memory，ROM)、可编程只读存储器(Programmable ROM，PROM)、可擦除可编程只读存储器(Erasable PROM，EPROM)、电可擦除可编程只读存储器(Electrically EPROM，EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory，RAM)，静态随机存取存储器(Static RAM，SRAM)、动态随机存取存储器(Dynamic RAM，DRAM)、同步动态随机存取存储器(Synchronous DRAM，SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM，DDRSDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM，ESDRAM)、同步连接动态随机存取存储器(Synch link DRAM，SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM，DRRAM)。本申请的一些实施例中的存储器2809包括但不限于这些和任意其它适合类型的存储器。The memory 2809 can be used to store software programs and various data. The memory 2809 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, wherein the first storage area may store an operating system, an application program or instructions required for at least one function (such as a sound playback function, an image playback function, etc.), etc. In addition, the memory 2809 may include a volatile memory or a non-volatile memory, or the memory 2809 may include both volatile and non-volatile memories. Among them, the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDRSDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchronous link dynamic random access memory (SLDRAM) and a direct memory bus random access memory (DRRAM). The memory 2809 in some embodiments of the present application includes but is not limited to these and any other suitable types of memory.

处理器2810可包括一个或多个处理单元；可选的，处理器2810集成应用处理器和调制解调处理器，其中，应用处理器主要处理涉及操作系统、用户界面和应用程序等的操作，调制解调处理器主要处理无线通信信号，如基带处理器。可以理解的是，上述调制解调处理器也可以不集成到处理器2810中。The processor 2810 may include one or more processing units; optionally, the processor 2810 integrates an application processor and a modem processor, wherein the application processor mainly processes operations related to an operating system, a user interface, and application programs, and the modem processor mainly processes wireless communication signals, such as a baseband processor. It is understandable that the modem processor may not be integrated into the processor 2810.

本申请的一些实施例还提供一种可读存储介质，所述可读存储介质上存储有程序或指令，该程序或指令被处理器执行时实现上述图像生成模型的训练方法实施例的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。Some embodiments of the present application also provide a readable storage medium, on which a program or instruction is stored. When the program or instruction is executed by a processor, the various processes of the training method embodiment of the above-mentioned image generation model are implemented, and the same technical effect can be achieved. To avoid repetition, it will not be repeated here.

其中，所述处理器为上述实施例中所述的电子设备中的处理器。所述可读存储介质，包括计算机可读存储介质，如计算机只读存储器ROM、随机存取存储器RAM、磁碟或者光盘等。The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a computer read-only memory ROM, a random access memory RAM, a magnetic disk or an optical disk.

本申请的一些实施例另提供了一种芯片，所述芯片包括处理器和通信接口，所述通信接口和所述处理器耦合，所述处理器用于运行程序或指令，实现上述图像生成模型的训练方法实施例的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。Some embodiments of the present application further provide a chip, which includes a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the various processes of the above-mentioned image generation model training method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.

应理解，本申请的一些实施例提到的芯片还可以称为系统级芯片、系统芯片、芯片系统或片上系统芯片等。It should be understood that the chip mentioned in some embodiments of the present application can also be called a system-level chip, a system chip, a chip system or a system-on-chip chip, etc.

本申请的一些实施例提供一种计算机程序产品，该程序产品被存储在存储介质中，该程序产品被至少一个处理器执行以实现如上述图像生成模型的训练方法实施例的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。Some embodiments of the present application provide a computer program product, which is stored in a storage medium. The program product is executed by at least one processor to implement the various processes of the training method embodiment of the image generation model as described above, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.

需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。此外，需要指出的是，本申请实施方式中的方法和装置的范围不限按示出或讨论的顺序来执行功能，还可包括根据所涉及的功能按基本同时的方式或按相反的顺序来执行功能，例如，可以按不同于所描述的次序来执行所描述的方法，并且还可以添加、省去、或组合各种步骤。另外，参照某些示例所描述的特征可在其他示例中被组合。It should be noted that, in this article, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, article or device. In the absence of further restrictions, an element defined by the sentence "comprises one..." does not exclude the presence of other identical elements in the process, method, article or device including the element. In addition, it should be noted that the scope of the method and device in the embodiment of the present application is not limited to performing functions in the order shown or discussed, and may also include performing functions in a substantially simultaneous manner or in reverse order according to the functions involved, for example, the described method may be performed in an order different from that described, and various steps may also be added, omitted, or combined. In addition, the features described with reference to certain examples may be combined in other examples.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以计算机软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端(可以是手机，计算机，服务器，或者网络设备等)执行本申请各个实施例所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that the above-mentioned embodiment methods can be implemented by means of software plus a necessary general hardware platform, and of course by hardware, but in many cases the former is a better implementation method. Based on such an understanding, the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in each embodiment of the present application.

上面结合附图对本申请的实施例进行了描述，但是本申请并不局限于上述的具体实施方式，上述的具体实施方式仅仅是示意性的，而不是限制性的，本领域的普通技术人员在本申请的启示下，在不脱离本申请宗旨和权利要求所保护的范围情况下，还可做出很多形式，均属于本申请的保护之内。The embodiments of the present application are described above in conjunction with the accompanying drawings, but the present application is not limited to the above-mentioned specific implementation methods. The above-mentioned specific implementation methods are merely illustrative and not restrictive. Under the guidance of the present application, ordinary technicians in this field can also make many forms without departing from the purpose of the present application and the scope of protection of the claims, all of which are within the protection of the present application.

Claims

1. An image generation model, characterized in that it comprises: a conditional encoder, a noise adding module and a diffusion model, wherein the conditional encoder and the noise adding module are respectively connected to the diffusion model;

The conditional encoder is used to encode at least one original image to obtain at least one image feature information, and to encode the description text of each original image to obtain at least one text feature information; and to determine at least one conditional feature information based on a control condition, the at least one image feature information and the at least one text feature information; the control condition is used to indicate a type of an image generation task, and the conditional feature information is determined according to the type of the image generation task;

The noise adding module is used to perform noise adding processing on the at least one original image based on the Gaussian noise matrix and the mask image of the at least one original image to obtain at least one noisy image;

The diffusion model is used to generate at least one derivative image based on the at least one conditional feature information and the at least one noise image, and each derivative image corresponds to one conditional feature information and one noise image.

2. The model according to claim 1, characterized in that the conditional encoder is specifically used for:

In a case where the type of the image generation task is a text image task, determining the at least one item of text feature information as the at least one item of conditional feature information;

When the type of the image generation task is an image processing task, the at least one item of image feature information and the at least one item of text feature information are determined as the at least one item of conditional feature information; wherein the image processing task includes at least one of the following: an image-to-image task, a text-to-image task, an image elimination task, an image local modification task, and an image expansion task.

3. The model according to claim 1 or 2, characterized in that the image generation model further comprises: an image encoder and an image decoder, the image encoder is connected to the noise addition module, and the image decoder is connected to the diffusion model;

The image encoder is used to perform feature compression on the at least one original image to obtain at least one image feature matrix;

The noise adding module is specifically used to perform noise adding processing on the at least one image feature matrix based on the Gaussian noise matrix and the mask image of the at least one original image to obtain at least one noise image;

The image decoder is used to decode the at least one derived image.

4. A training method for an image generation model, comprising:

Obtain at least two training sample groups, each training sample group including a sample image, description text of each sample image, and a mask image;

Encoding at least one sample image in the at least two training sample groups to obtain at least one item of image feature information, and encoding a description text of each sample image to obtain at least one item of text feature information;

Determine at least one item of conditional feature information based on a control condition, the at least one item of image feature information and the at least one item of text feature information; the control condition is used to indicate a type of an image generation task, and the conditional feature information is determined according to the type of the image generation task;

Based on the at least one sample image, the mask image of the at least one sample image and the at least one conditional feature information, the initial model is trained to obtain an image generation model.

5. The method according to claim 4, characterized in that the determining at least one conditional feature information based on the control condition, the at least one image feature information and the at least one text feature information comprises:

In a case where the type of the image generation task indicated by the control condition is a text image task, determining the at least one item of text feature information as the at least one item of condition feature information;

In a case where the type of the image generation task indicated by the control condition is an image processing task, determining the at least one item of image feature information and the at least one item of text feature information as the at least one item of conditional feature information;

The image processing task includes at least one of the following: an image-to-image task, a text-to-image task, an image elimination task, an image local modification task, and an image expansion task.

6. The method according to claim 4, characterized in that the training of the initial model based on the at least one sample image, the mask image of the at least one sample image and the at least one conditional feature information to obtain the image generation model comprises:

Based on the Gaussian noise matrix and the mask image of the at least one sample image, performing noise processing on the at least one sample image to obtain at least one noise image;

Based on the at least one condition feature information and the at least one noise image, training the initial model to obtain the image generation model;

Among them, each noise image corresponds to a conditional feature information.

7. The method according to claim 6, characterized in that the step of performing noise processing on the at least one sample image based on the Gaussian noise matrix and the mask image of the at least one sample image to obtain the at least one noise image comprises:

Performing weighted summation on the Gaussian noise matrix and the mask image of the at least one sample image to obtain at least one noise matrix;

Performing weighted summation on the at least one noise matrix and the at least one sample image to obtain at least one noise image;

Among them, each noisy image corresponds to a noise matrix and a sample image.

8. The method according to claim 4, characterized in that the training of the initial model based on the at least one sample image, the mask image of the at least one sample image and the at least one conditional feature information to obtain the image generation model comprises:

Performing feature compression on the at least one sample image by an image encoder to obtain at least one image feature matrix;

Based on the Gaussian noise matrix and the mask image of the at least one sample image, performing noise processing on the at least one image feature matrix to obtain at least one noise image;

Based on the at least one conditional feature information and the at least one noise image, the initial model is trained to obtain an image generation model.

9. The method according to claim 8, characterized in that the step of performing feature compression on the at least one sample image by using an image encoder to obtain at least one image feature matrix comprises:

Performing feature compression on the at least one sample image through a convolution module in the image encoder to obtain at least one compressed image feature matrix, wherein the number of pixel points in each compressed image feature matrix is greater than the number of pixel points in each image feature matrix;

Through the self-attention module in the image encoder, feature extraction is performed on each compressed image feature matrix based on the correlation relationship between the pixels in each compressed image feature matrix to obtain at least one image feature matrix.

10. The method according to claim 4, characterized in that before obtaining at least two training sample groups, the method further comprises:

Acquire at least two sample images, image description information of each sample image, and annotation information of at least two mask areas of each sample image, wherein the at least two mask areas of each sample image are obtained by predicting the mask areas of each sample image using a segmentation model and segmenting each sample image based on the prediction results, and the annotation information of each mask area includes at least one of the following: the area of each mask area, the prediction accuracy of each mask area, and the prediction stability of each mask area; the prediction accuracy represents the prediction accuracy of the mask area using the segmentation model, and the prediction stability represents the prediction stability of the mask area using the segmentation model;

Based on the annotation information of at least two mask areas of each sample image, determining a first mask area whose annotation information satisfies a filtering condition from the at least two mask areas of each sample image, wherein the filtering condition includes at least one of the following: an area of the mask area is within a preset range, a prediction accuracy of the mask area is greater than an accuracy threshold, and a prediction stability of the mask area is greater than a stability threshold;

Adding the description information of the first mask area of each sample image to the image description information of each sample image to obtain a description text of each sample image;

Each sample image, the description text of each sample image and the mask image of each sample image are constructed into a training sample group to obtain at least two training sample groups, wherein the mask image of each sample image is the mask image corresponding to the first mask area in each sample image.

11. The method according to claim 4, characterized in that before obtaining at least two training sample groups, the method further comprises:

Obtain at least two sample images and description text for each sample image;

generating a second mask region in each sample image according to the size range and size information of each sample image;

Each sample image, the description text of each sample image, and the mask image of each sample image are constructed into a training sample group to obtain the at least two training sample groups, wherein the mask image of each sample image is the mask image of the second mask area in each sample image.

12. The method according to claim 4, characterized in that encoding at least one sample image in the at least two training sample groups to obtain at least one item of image feature information comprises:

By using the image segmentation module in the first conditional encoder, each sample image is segmented into N sample image blocks, and the pixels of each sample image block are rearranged into a one-dimensional vector to obtain N one-dimensional vectors of each sample image, and the N one-dimensional vectors of each sample image are spliced to obtain a pixel rearranged image of each sample image, where N is an integer greater than 1;

Extracting feature information of each pixel in each pixel rearrangement image through a mapping layer in the first conditional encoder to obtain at least one pixel rearrangement feature vector;

Through the self-attention module in the first conditional encoder, based on the correlation relationship between pixels in each pixel rearrangement image, feature extraction is performed on each pixel rearrangement feature vector to obtain image feature information of each sample image.

13. The method according to claim 4, characterized in that the encoding of the description text of each sample image to obtain at least one text feature information comprises:

Each description text is divided into at least two words by a word segmenter in the second conditional encoder, and each word is encoded to obtain a text feature vector of each description text;

Through the self-attention module in the second conditional encoder, based on the association relationship between the segmented words in each description text, feature extraction is performed on the text feature vector of each description text to obtain text feature information of each description text.

14. The method according to claim 6 or 8, characterized in that the training of the initial model based on the at least one conditional feature information and the at least one noise image to obtain the image generation model comprises:

Acquire feature information of the at least one noise image through a diffusion model, and fuse the at least one condition feature information with the feature information of the at least one noise image to obtain at least one fused feature information, where each fused feature information corresponds to one condition feature information and one piece of feature information of the noise image;

Determining a noise matrix of the at least one noisy image based on the at least one fused feature information and at least one noise vector by means of the diffusion model, each noise vector representing a noise level of a noisy image;

Determining a noise loss value of the at least one sample image based on a noise matrix of the at least one noise image, the Gaussian noise matrix, and a mask image of the at least one sample image;

Based on the noise loss value of the at least one sample image, the model parameters of the initial model are adjusted to obtain an image generation model.

15. The method according to claim 14, characterized in that before obtaining the feature information of the at least one noise image through the diffusion model and fusing the at least one conditional feature information with the feature information of the at least one noise image to obtain at least one fused feature information, the method further comprises:

Each noise image is divided into M noise image blocks by the image segmentation module in the diffusion model, where M is an integer greater than 1;

The feature information of each noise image block is extracted by the linear transformation operator in the diffusion model, and the position encoding vector is added to obtain the feature information of each noise image.

16. The method according to claim 14, characterized in that the step of fusing the at least one conditional feature information and the feature information of the at least one noise image through a diffusion model to obtain at least one fused feature information comprises:

Performing linear transformation processing on the at least one item of conditional feature information by using a linear transformation operator in the diffusion model to obtain at least one item of conditional feature information after linear transformation;

The at least one linearly transformed conditional feature information and the feature information of the at least one noise image are spliced by a splicing operator in the diffusion model to obtain at least one fused feature information.

17. The method according to claim 14, characterized in that the step of determining the noise matrix of the at least one noise image based on the at least one fused feature information and the at least one noise vector through the diffusion model comprises:

By using the diffusion self-attention module in the diffusion model, feature extraction is performed on the at least one noise-added vector to obtain at least one set of scaling factors, and based on the at least one set of scaling factors, feature scaling is performed on the at least one fused feature information to obtain at least one scaled feature information, where each scaled feature information corresponds to a set of scaling factors and one fused feature information;

Normalizing the at least one item of scaling feature information by a layer normalization operator in the diffusion model to obtain at least one item of noise feature information;

The at least one noise feature information is subjected to channel number reduction processing by using a convolution operator in the diffusion model to obtain a noise matrix of the at least one noise image.

18. The method according to claim 14, characterized in that the determining the noise loss value of the at least one sample image based on the noise matrix of the at least one noise image, the Gaussian noise matrix and the mask image of the at least one sample image comprises:

Determining at least one first loss value based on the Gaussian noise matrix and the noise matrix of the at least one noise image, and determining at least one second loss value based on the mask image of the at least one sample image and the noise matrix of the at least one noise image;

The at least one first loss value and the at least one second loss value are weightedly summed to obtain a noise loss value of the at least one sample image.

19. The method according to claim 4, characterized in that after the initial model is trained based on the at least one sample image, the mask image of the at least one sample image and the at least one conditional feature information to obtain the image generation model, the method further comprises:

Inputting original information into the image generation model and outputting a derived image, or inputting original information into the image generation model, performing image processing, and outputting a derived image;

The original information includes at least one of the following: an image to be processed, a description text, and a mask image.

20. The method according to claim 19, characterized in that, when the original information includes the image to be processed, the description text and the mask image, the inputting the original information into the image generation model, generating an image or performing image processing, and outputting a derived image comprises:

The image to be processed is encoded by a first conditional encoder in the image generation model to obtain image feature information of the image to be processed, and the description text is encoded by a second conditional encoder in the image generation model to obtain text feature information of the description text, and the image to be processed is subjected to noise processing based on a Gaussian noise matrix and the mask image by a noise addition module in the image generation model to obtain a noise image;

By using the diffusion model in the image generation model, image processing is performed based on the image feature information, the text feature information and the noise image to obtain a derivative image.

21. The method according to claim 20, characterized in that, before the image to be processed is subjected to noise processing based on a Gaussian noise matrix and the mask image by means of a noise adding module in the image generation model to obtain a noise image, the method further comprises:

Performing feature compression on the image to be processed by an image encoder in the image generation model to obtain an image feature matrix;

The step of performing noise processing on the image to be processed based on a Gaussian noise matrix and the mask image by using a noise adding module in the image generation model to obtain a noise image includes:

By using the noise adding module in the image generation model, based on the Gaussian noise matrix and the mask image, the image feature matrix is subjected to noise adding processing to obtain the noise image;

After performing image processing based on the image feature information, the text feature information and the noise image through the diffusion model in the image generation model to obtain a derived image, the method further includes:

The derivative image is decoded by an image decoder in the image generation model, and a decoded derivative image is output.

22. A training device for an image generation model, characterized in that the device comprises:

An acquisition unit, used for acquiring at least two training sample groups, each training sample group including a sample image, description text of each sample image and a mask image;

an encoding unit, configured to encode at least one sample image in the at least two training sample groups acquired by the acquisition unit to obtain at least one item of image feature information, and to encode a description text of each sample image to obtain at least one item of text feature information;

a determining unit, configured to determine at least one conditional feature information based on a control condition, the at least one image feature information obtained by the encoding unit, and the at least one text feature information; the control condition is used to indicate a type of an image generation task, and the conditional feature information is determined according to the type of the image generation task;

A training unit is used to train the initial model based on the at least one sample image acquired by the acquisition unit, the mask image of the at least one sample image and the at least one conditional feature information determined by the determination unit to obtain an image generation model.

23. The device according to claim 22, characterized in that the determining unit is specifically configured to:

24. The device according to claim 22, characterized in that the device further comprises:

a processing unit, configured to perform noise addition processing on the at least one sample image based on a Gaussian noise matrix and the mask image of the at least one sample image acquired by the acquisition unit, so as to obtain at least one noise image;

The training unit is configured to train the initial model based on the at least one condition feature information determined by the determining unit and the at least one noise image processed by the processing unit to obtain the image generation model;

Among them, each noise image corresponds to a conditional feature information.

25. The device according to claim 24, characterized in that the processing unit is specifically used for:

Performing a weighted summation on the Gaussian noise matrix and the mask image of the at least one sample image acquired by the acquisition unit to obtain at least one noise matrix; and performing a weighted summation on the at least one noise matrix and the at least one sample image acquired by the acquisition unit to obtain at least one noise image;

Among them, each noisy image corresponds to a noise matrix and a sample image.

26. The device according to claim 22, characterized in that the device further comprises:

A compression unit, configured to perform feature compression on the at least one sample image acquired by the acquisition unit through an image encoder to obtain at least one image feature matrix;

a processing unit, configured to perform noise addition processing on the at least one image feature matrix based on the Gaussian noise matrix and the mask image of the at least one sample image acquired by the acquisition unit, so as to obtain at least one noise image;

The training unit is used to train the initial model based on the at least one condition feature information determined by the determination unit and the at least one noise image acquired by the acquisition unit to obtain an image generation model.

27. The device according to claim 26, characterized in that the compression unit is specifically used for:

Performing feature compression on the at least one sample image acquired by the acquisition unit through a convolution module in the image encoder to obtain at least one compressed image feature matrix, wherein the number of pixel points in each compressed image feature matrix is greater than the number of pixel points in each image feature matrix;

28. The device according to claim 22, characterized in that the acquisition unit is further used for:

29. The device according to claim 22, characterized in that the acquisition unit is further used to:

Obtain at least two sample images and description text of each sample image;

30. The device according to claim 22, characterized in that the encoding unit is specifically used to:

By using the image segmentation module in the first conditional encoder, each sample image acquired by the acquisition unit is segmented into N sample image blocks, and the pixel points of each sample image block are rearranged into a one-dimensional vector to obtain N one-dimensional vectors of each sample image, and the N one-dimensional vectors of each sample image are spliced to obtain a pixel rearranged image of each sample image, where N is an integer greater than 1;

31. The device according to claim 22, characterized in that the encoding unit is specifically used to:

By means of a word segmenter in a second conditional encoder, each description text acquired by the acquisition unit is segmented into at least two word segments, and each word segment is encoded to obtain a text feature vector of each description text;

32. The device according to claim 24 or 26, characterized in that the training unit is specifically used for:

Determine a noise loss value of the at least one sample image based on a noise matrix of the at least one noise image, the Gaussian noise matrix, and the mask image of the at least one sample image acquired by the acquisition unit;

33. The device according to claim 32, characterized in that the device further comprises:

The extraction unit is used to obtain the feature information of the at least one noise image through the diffusion model, and fuse the at least one conditional feature information with the feature information of the at least one noise image to obtain at least one fused feature information, divide each noise image into M noise image blocks through the image segmentation module in the diffusion model, where M is an integer greater than 1; and extract the feature information of each noise image block through the linear transformation operator in the diffusion model, and add the position encoding vector to obtain the feature information of each noise image.

34. The device according to claim 32, characterized in that the training unit is specifically used for:

35. The device according to claim 32, characterized in that the training unit is specifically used for:

36. The device according to claim 32, characterized in that the training unit is specifically used for:

37. The device according to claim 22, characterized in that the device further comprises:

an execution unit, configured to, after training an initial model based on the at least one sample image acquired by the acquisition unit, the mask image of the at least one sample image, and the at least one conditional feature information determined by the determination unit to obtain an image generation model, input the original information into the image generation model and output a derived image, or input the original information into the image generation model, perform image processing, and output a derived image;

38. The device according to claim 37 is characterized in that the execution unit is specifically used to, when the original information includes the image to be processed, the description text and the mask image, encode the image to be processed through the first conditional encoder in the image generation model to obtain image feature information of the image to be processed, and encode the description text through the second conditional encoder in the image generation model to obtain text feature information of the description text, and perform noise processing on the image to be processed based on the Gaussian noise matrix and the mask image through the noise addition module in the image generation model to obtain a noise image; and perform image processing based on the image feature information, the text feature information and the noise image through the diffusion model in the image generation model to obtain a derivative image.

39. The device according to claim 38, characterized in that the device further comprises:

A compression unit, configured to perform feature compression on the image to be processed by using an image encoder in the image generation model to obtain an image feature matrix before performing noise processing on the image to be processed by using a noise adding module in the image generation model based on a Gaussian noise matrix and the mask image to obtain a noise image;

A processing unit, configured to perform noise addition processing on the image feature matrix based on a Gaussian noise matrix and the mask image through a noise addition module in the image generation model to obtain the noise image;

A decoding unit is used to decode the derivative image obtained by the execution unit through an image decoder in the image generation model, and output the decoded derivative image.

40. An electronic device, characterized in that it comprises a processor and a memory, wherein the memory stores programs or instructions that can be run on the processor, and when the program or instructions are executed by the processor, the steps of the training method of the image generation model as described in any one of claims 4-21 are implemented.