
CN118747726B - Training method, related device and medium for image generation model - Google Patents

Training method, related device and medium for image generation model

Info

Publication number
CN118747726B
CN118747726B (application CN202411081525.2A)
Authority
CN
China
Prior art keywords
image
sample
background
information
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411081525.2A
Other languages
Chinese (zh)
Other versions
CN118747726A (en)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202411081525.2A priority Critical patent/CN118747726B/en
Publication of CN118747726A publication Critical patent/CN118747726A/en
Application granted Critical
Publication of CN118747726B publication Critical patent/CN118747726B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The disclosure provides a training method, a related device and a medium for an image generation model. The method comprises the following steps: generating denoising control information based on first description information of a sample background image, second description information of a sample object image and sample image description information of a sample reference image in a graphic sample pair; generating background fine adjustment control information based on the sample background image, and generating object fine adjustment control information based on the sample object image; denoising the sample hidden space features corresponding to the sample reference image through the image generation model based on the background fine tuning control information, the object fine tuning control information and the denoising control information to obtain predicted image features; and training an image generation model based on the predicted image features of the plurality of image-text sample pairs and the sample reference image to obtain a trained image generation model. The present disclosure can improve the accuracy of the replacement process.

Description

Training method, related device and medium for image generation model
Technical Field
The disclosure relates to the technical field of big data, in particular to a training method, a related device and a medium of an image generation model.
Background
Currently, in various business scenarios such as video production and virtual exhibition, it is often necessary to replace background images and object images to create personalized images. For example, to produce a personalized image of a person A, the related art often mattes the person image of person A with a neural network model, pastes the matted person image into a selected background image B as an overlay, and performs edge restoration on the resulting composite image, so that person A is shown in the designated background image B.
However, in practical model training it is often difficult to collect enough training data that meets expectations, which degrades the training effect and causes incoherence between the object (foreground) and the background in the target image generated by the model (for example, the object and the background cannot be kept consistent in terms of illumination, filters, etc.), so the accuracy of the target image generated by the model is not high.
Disclosure of Invention
The embodiment of the disclosure provides a training method, a related device and a medium of an image generation model, which can improve the coordination between an object and a background in image processing of replacing the object in a background image, thereby improving the accuracy of replacement processing.
According to an aspect of the present disclosure, there is provided a training method of an image generation model, the training method including:
Obtaining a plurality of graphic sample pairs, wherein each graphic sample pair comprises a sample reference image, a sample background image, a sample object image of a sample object, first descriptive information of the sample background image, second descriptive information of the sample object image and sample image descriptive information of the sample reference image, and the sample reference image is obtained by adding noise to an expected result of replacing a reference object in the sample background image with the sample object;
generating denoising control information based on the first description information, the second description information, and the sample image description information;
generating background fine adjustment control information based on the sample background image, and generating object fine adjustment control information based on a sample object image;
Denoising the sample hidden space features corresponding to the sample reference image through the image generation model based on the background fine tuning control information, the object fine tuning control information and the denoising control information to obtain predicted image features;
And training the image generation model based on the predicted image features and the sample reference images of the image-text sample pairs to obtain a trained image generation model, wherein the trained image generation model is used for generating images according to a target object image, a target background image, object description information of the target object image, background description information of the target background image and target image description information.
According to an aspect of the present disclosure, there is provided an image generation method including:
acquiring a target object image of a target object, a target background image, object description information of the target object image, background description information of the target background image, and target image description information, wherein the target background image contains a reference object to be replaced by a target object in the target object image, and the target image description information is used for describing replacement from the reference object to the target object;
Generating denoising prompt information based on the background description information, the object description information and the target image description information;
Generating background fine adjustment prompt information based on the target background image, and generating object fine adjustment prompt information based on the target object image;
Performing image generation through a trained image generation model based on a preset noise image, the denoising prompt information, the object fine adjustment prompt information and the background fine adjustment prompt information to obtain a target image, wherein the trained image generation model is generated according to the training method of the image generation model; the target image is used to indicate a result of replacing the reference object in the target background image with the target object of the target object image.
According to an aspect of the present disclosure, there is provided a training apparatus of an image generation model, the training apparatus including:
A first obtaining unit configured to obtain a plurality of pairs of graphic samples, wherein each of the pairs of graphic samples includes a sample reference image, a sample background image, a sample object image of a sample object, first description information of the sample background image, second description information of the sample object image, and sample image description information of the sample reference image, the sample reference image being obtained by adding noise to an expected result of replacing a reference object in the sample background image with the sample object;
A first generation unit configured to generate denoising control information based on the first description information, the second description information, and the sample image description information;
a second generation unit configured to generate background fine adjustment control information based on the sample background image, and generate object fine adjustment control information based on a sample object image;
The denoising unit is used for denoising the sample hidden space features corresponding to the sample reference image through the image generation model based on the background fine adjustment control information, the object fine adjustment control information and the denoising control information to obtain predicted image features;
The training unit is used for training the image generation model based on the predicted image features of the image-text sample pairs and the sample reference image to obtain a trained image generation model, and the trained image generation model is used for generating images according to a target object image, a target background image, object description information of the target object image, background description information of the target background image and target image description information.
Optionally, the image generation model includes a denoising network; the denoising network comprises an upsampling attention network and a downsampling attention network; the background fine adjustment control information comprises first background fine adjustment information and second background fine adjustment information; the object fine adjustment control information comprises first object fine adjustment information and second object fine adjustment information;
the denoising unit includes:
The downsampling subunit is configured to perform downsampling processing on the sample hidden space feature through the downsampling attention network based on the first object fine tuning information, the first background fine tuning information and the denoising control information, so as to obtain a sample downsampling result;
And the up-sampling subunit is used for up-sampling the sample down-sampling result through the up-sampling attention network based on the second object fine tuning information, the second background fine tuning information and the denoising control information to obtain the predicted image characteristic.
Optionally, the first background fine tuning information includes a first background query feature, a first background key feature, and a first background value feature; the first object fine tuning information comprises a first object query feature, a first object key feature and a first object value feature; the denoising control information comprises a reference image description embedded feature, a sample image description embedded feature and a background image description embedded feature;
The downsampling subunit comprises:
The first attention module is used for carrying out first attention calculation through the downsampling attention network based on the sample hidden space feature, the first background query feature, the first background key feature and the background image description embedded feature to obtain a background image attention result;
the second attention module is used for carrying out second attention calculation through the downsampling attention network based on the sample hidden space features, the first object query features, the first object key features and the sample image description embedding features to obtain a sample object image attention result;
The replacing module is used for carrying out feature replacement on the background image attention result based on the sample object image attention result to obtain a replaced background image attention result;
And the third attention module is used for carrying out third attention calculation through the downsampling attention network based on the replaced background image attention result, the first background value characteristic, the first object value characteristic and the reference image description embedded characteristic to obtain the sample downsampling result.
Optionally, the first attention module is configured to:
Weighting the sample hidden space features and the first background query features to obtain first weighted image features;
weighting the background image description embedded feature and the first background key feature to obtain a second weighted image feature;
and carrying out matrix multiplication on the first weighted image feature and the second weighted image feature to obtain the background image attention result.
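For illustration, the following is a minimal sketch of this first attention calculation in Python. It assumes that the "weighting" operations are linear projections and that the usual scaled softmax of attention is applied; the tensor shapes and function name are assumptions, not part of the disclosure.

```python
import torch

def background_attention(hidden, bg_query_w, bg_key_w, bg_text_embed):
    """First attention calculation (sketch).

    hidden:        (n_tokens, d)      sample hidden space features
    bg_query_w:    (d, d_attn)        first background query feature, treated as a projection
    bg_key_w:      (d_text, d_attn)   first background key feature, treated as a projection
    bg_text_embed: (n_words, d_text)  background image description embedded feature
    """
    q = hidden @ bg_query_w                      # first weighted image feature
    k = bg_text_embed @ bg_key_w                 # second weighted image feature
    # Matrix multiplication of the two weighted features; the scaling and softmax
    # follow common attention practice and are added assumptions here.
    return torch.softmax(q @ k.transpose(0, 1) / (q.shape[-1] ** 0.5), dim=-1)
```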
Optionally, the replacing module is configured to:
Determining a first word number of the reference object based on the first description information, and determining a second word number of the sample object based on the second description information;
determining a word weight between the reference object and the sample object based on the first word number and the second word number;
And replacing the reference object characteristic in the background image attention result with the sample object characteristic in the sample object image attention result based on the word weight to obtain the replaced background image attention result.
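A minimal sketch of this feature replacement follows, assuming the attention results are indexed by (query position, text word) and that the word weight is the ratio of the two word counts; the exact formula is not specified in the disclosure.

```python
import torch

def replace_background_attention(bg_attn, obj_attn, ref_word_idx, obj_word_idx):
    """Replace reference-object features in the background image attention result
    with word-weighted sample-object features (sketch).

    bg_attn, obj_attn: (n_queries, n_words) attention results
    ref_word_idx:      word positions describing the reference object
    obj_word_idx:      word positions describing the sample object
    """
    first_word_number = len(ref_word_idx)
    second_word_number = len(obj_word_idx)
    word_weight = first_word_number / second_word_number   # assumed definition

    replaced = bg_attn.clone()
    # Pool the sample-object columns, weight them, and overwrite the reference-object columns.
    obj_feature = word_weight * obj_attn[:, obj_word_idx].mean(dim=1, keepdim=True)
    replaced[:, ref_word_idx] = obj_feature.expand(-1, first_word_number)
    return replaced
```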
Optionally, the third attention module is configured to:
weighting the first background value characteristic, the first object value characteristic and the reference image description embedded characteristic to obtain a sample weighted image characteristic;
and carrying out matrix multiplication on the sample weighted image characteristics and the replaced background image attention result to obtain the sample downsampling result.
Optionally, the first generating unit is configured to:
word embedding processing is carried out on the first description information, so that background image description embedded characteristics are obtained;
word embedding processing is carried out on the second description information, so that sample image description embedded characteristics are obtained;
word embedding processing is carried out on the sample image description information, so that reference image description embedding characteristics are obtained;
integrating the reference image description embedding feature, the sample image description embedding feature and the background image description embedding feature into the denoising control information.
Optionally, the second generating unit is configured to:
Activating a preset fine tuning network based on the first description information;
Performing feature extraction on the sample background image based on the activated fine tuning network to obtain a first background query feature, a first background key feature and a first background value feature;
Integrating the first background query feature, the first background key feature, and the first background value feature into the background fine adjustment control information.
Optionally, the second generating unit is configured to:
activating a preset fine tuning network based on the second description information;
performing feature extraction on the sample object image based on the activated fine tuning network to obtain a first object query feature, a first object key feature and a first object value feature;
and integrating the first object query feature, the first object key feature and the first object value feature into the object fine adjustment control information.
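For illustration only, a possible shape of such a preset fine tuning network is sketched below: a shared image encoder followed by three linear heads that produce the query, key, and value features. The architecture is an assumption; the disclosure does not fix it.

```python
import torch
import torch.nn as nn

class FineTuningAdapter(nn.Module):
    """Sketch of a preset fine tuning network that extracts query/key/value
    features from an image (assumed architecture for illustration)."""

    def __init__(self, image_encoder: nn.Module, feat_dim: int, attn_dim: int):
        super().__init__()
        self.image_encoder = image_encoder
        self.to_q = nn.Linear(feat_dim, attn_dim)
        self.to_k = nn.Linear(feat_dim, attn_dim)
        self.to_v = nn.Linear(feat_dim, attn_dim)

    def forward(self, image: torch.Tensor) -> dict:
        feats = self.image_encoder(image)        # feature extraction from the image
        return {                                 # fine adjustment control information
            "query": self.to_q(feats),
            "key": self.to_k(feats),
            "value": self.to_v(feats),
        }
```

The same adapter, activated with the first or second description information respectively, would yield the background or object fine adjustment control information under this assumed design.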
Optionally, the sample hidden space feature corresponding to the sample reference image is generated by:
Performing coding processing on the sample reference image to obtain sample image coding characteristics;
And based on a preset time step, performing diffusion processing on the sample image coding features through a diffusion network of the image generation model to obtain the sample hidden space features.
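The diffusion processing over a preset time step can be sketched as the standard closed-form forward noising used by diffusion models; this concrete formula is an assumption, since the disclosure only states that diffusion is applied.

```python
import torch

def to_sample_hidden_space(ref_image_code: torch.Tensor,
                           t: int,
                           alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Diffuse the sample image coding feature to time step t (DDPM-style sketch)."""
    noise = torch.randn_like(ref_image_code)
    a_bar = alphas_cumprod[t]
    # x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise
    return a_bar.sqrt() * ref_image_code + (1.0 - a_bar).sqrt() * noise
```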
Optionally, the sample reference image is generated by:
Acquiring an original reference image, wherein the original reference image is used for indicating an expected result of replacing a reference object in the sample background image with a sample object;
generating random numbers obeying Gaussian distribution based on a predetermined random number generation model;
And adding the random number to the pixel value of each pixel point in the original reference image to obtain the sample reference image.
Optionally, the training unit comprises:
the acquisition module is used for acquiring reference noise in the sample reference image aiming at each image-text sample pair, and carrying out noise prediction based on the predicted image characteristics to obtain predicted noise;
the calculating module is used for calculating a sub-loss function based on the comparison of the reference noise and the prediction noise;
A determining module, configured to determine a total loss function based on the sub-loss functions of each of the image-text sample pairs;
And the training module is used for training the image generation model based on the total loss function.
Optionally, the prediction noise is derived by prediction of a plurality of prediction time steps;
the computing module is used for:
performing difference calculation on the reference noise and the prediction noise to obtain a noise difference;
and performing a regularization-term (norm) calculation on the noise difference to obtain the sub-loss function.
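A minimal sketch of this loss follows, assuming the regularization term is the squared L2 norm of the noise difference and that sub-losses are averaged into the total loss; neither choice is fixed by the disclosure.

```python
import torch

def sub_loss(reference_noise: torch.Tensor, predicted_noise: torch.Tensor) -> torch.Tensor:
    """Sub-loss for one image-text sample pair: norm of the noise difference (sketch)."""
    noise_difference = reference_noise - predicted_noise
    return noise_difference.pow(2).mean()

def total_loss(sub_losses: list) -> torch.Tensor:
    """Total loss over all image-text sample pairs (averaging is an assumption)."""
    return torch.stack(sub_losses).mean()
```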
Optionally, the sample object image is generated by:
Acquiring a sample image of the sample object;
performing image segmentation on the sample image based on a preset object segmentation model to obtain a sample segmentation image with the sample object;
and carrying out image enhancement on the sample segmentation image to obtain the sample object image.
According to an aspect of the present disclosure, there is provided an image generating apparatus including:
A second acquisition unit configured to acquire a target object image of a target object, a target background image, object description information of the target object image, background description information of the target background image, and target image description information, wherein the target background image contains a reference object to be replaced by a target object in the target object image, the target image description information describing replacement from the reference object to the target object;
a third generating unit, configured to generate denoising prompt information based on the background description information, the object description information, and the target image description information;
A fourth generating unit, configured to generate background fine adjustment prompt information based on the target background image, and generate object fine adjustment prompt information based on the target object image;
The image generation unit is used for generating an image through a trained image generation model based on a preset noise image, the denoising prompt information, the object fine adjustment prompt information and the background fine adjustment prompt information to obtain a target image, wherein the trained image generation model is generated according to the training method of the image generation model; the target image is used to indicate a result of replacing the reference object in the target background image with the target object of the target object image.
Optionally, the trained image generation model includes a diffusion network, a denoising network, and a decoding network;
the image generation unit is used for:
performing diffusion processing on the coded image features of the preset noise image based on the diffusion network to obtain target hidden space features;
Denoising the target hidden space features through the denoising network based on the denoising prompt information, the object fine tuning prompt information and the background fine tuning prompt information to obtain a target denoising result;
and performing feature decoding on the target denoising result based on the decoding network to obtain the target image.
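The diffusion-denoising-decoding flow at inference time can be sketched as follows; the method names on `model` (diffuse, denoise, decode) are placeholders rather than a real API, and the step count is an assumption.

```python
import torch

@torch.no_grad()
def generate_target_image(model, noise_image: torch.Tensor,
                          denoise_prompt, obj_prompt, bg_prompt,
                          num_steps: int = 50) -> torch.Tensor:
    """Sketch of the diffusion -> denoising -> decoding flow (placeholder API)."""
    # 1. Diffusion network: diffuse the encoded noise image into target hidden space features.
    latent = model.diffuse(noise_image)
    # 2. Denoising network: iteratively denoise under the three kinds of prompt information.
    for t in reversed(range(num_steps)):
        latent = model.denoise(latent, t, denoise_prompt, obj_prompt, bg_prompt)
    # 3. Decoding network: decode the target denoising result into the target image.
    return model.decode(latent)
```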
According to an aspect of the present disclosure, there is provided an electronic device including a memory storing a computer program and a processor implementing a training method or an image generation method of an image generation model as described above when the computer program is executed.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the training method or the image generation method of the image generation model as described above.
According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program that is read and executed by a processor of an electronic device, causing the electronic device to perform the training method or the image generation method of the image generation model as described above.
In an embodiment of the present disclosure, denoising control information is generated based on first description information of a sample background image, second description information of a sample object image, and sample image description information of a sample reference image. Because the first description information focuses on describing the background from the word aspect, the second description information focuses on describing the replaced object from the word aspect, and the sample image description information focuses on describing the whole image after the object is replaced from the word aspect, the denoising control information comprehensively expresses all aspects of characteristics from the word aspect in the object replacing process. The background fine adjustment control information focuses on the background information obtained from the content reflected by the image itself, and the object fine adjustment control information focuses on the object information obtained from the content reflected by the image itself, so that the background fine adjustment control information, the object fine adjustment control information and the denoising control information are integrated, the denoising processing is performed on the sample hidden space features corresponding to the sample reference image, the denoising processing consideration factors are relatively comprehensive, the coordination between the object and the background in the image processing for object replacement is fully considered, the training accuracy of the image generation model is improved, and the replacing processing accuracy is also improved.
Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the disclosure. The objectives and other advantages of the disclosure will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain, without limitation, the disclosed embodiments.
FIG. 1 is an architecture diagram of a system to which the training method of an image generation model and the image generation method are applied, according to an embodiment of the present disclosure;
FIGS. 2A-2D are schematic diagrams of an image generation method applied in an item display scenario according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a training method of an image generation model according to one embodiment of the present disclosure;
FIG. 4 is a flow chart of determining a sample object image according to one embodiment of the present disclosure;
FIG. 5 is a flow chart of determining a sample reference image according to one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an implementation of determining a sample reference image in accordance with one embodiment of the present disclosure;
FIG. 7 is a flow chart of generating denoising control information according to one embodiment of the present disclosure;
FIG. 8 is a schematic diagram of an implementation process of generating denoising control information according to one embodiment of the present disclosure;
FIG. 9 is a flow chart of generating background trim control information according to one embodiment of the present disclosure;
FIG. 10 is a flow chart of generating object fine tuning control information according to one embodiment of the present disclosure;
FIG. 11 is a flow chart of generating sample hidden space features according to one embodiment of the present disclosure;
FIG. 12 is a flow chart of generating predicted image features according to one embodiment of the present disclosure;
FIGS. 13A-13B are schematic diagrams of an overall implementation of generating predicted image features through a denoising network, according to one embodiment of the present disclosure;
FIG. 14 is a flow chart of generating sample downsampling results according to an embodiment of the present disclosure;
FIG. 15 is a flow chart of generating post-replacement background image attention results according to one embodiment of the present disclosure;
FIG. 16 is a schematic diagram of an implementation process of generating sample downsampling results according to an embodiment of the present disclosure;
FIG. 17 is a flow chart of a training image generation model according to one embodiment of the present disclosure;
FIG. 18 is a flow chart of determining a sub-loss function according to an embodiment of the present disclosure;
FIG. 19 is a schematic illustration of implementation details of a training image generation model according to one embodiment of the present disclosure;
FIG. 20 is a flow chart of an image generation method according to one embodiment of the present disclosure;
FIG. 21 is a flow chart of generating a target image according to one embodiment of the present disclosure;
FIG. 22 is a schematic illustration of an implementation of an image generation method according to one embodiment of the present disclosure;
FIG. 23 is a schematic illustration of implementation details of an image generation method according to one embodiment of the present disclosure;
FIG. 24 is a block diagram of a training apparatus for image generation models according to one embodiment of the present disclosure;
FIG. 25 is a block diagram of an image generation apparatus according to one embodiment of the present disclosure;
FIG. 26 is a terminal block diagram of a training method of an image generation model according to one embodiment of the present disclosure;
FIG. 27 is a server block diagram of a training method of an image generation model according to one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.
The system architecture and scenario to which the embodiments of the present disclosure apply are described below.
Fig. 1 is an architecture diagram of a system to which the training method of an image generation model and the image generation method according to an embodiment of the present disclosure are applied. The system includes an object terminal 140, the internet 130, a gateway 120, an image processing server 110, an image database 150, and the like.
The object terminal 140 may take various forms, such as a desktop computer, a laptop computer, a PDA (personal digital assistant), a tablet computer, a cellular phone, a vehicle-mounted terminal, a home theater terminal, a smart television, or a dedicated terminal; it can be a single device or a set of a plurality of devices. The object terminal 140 may communicate with the internet 130 in a wired or wireless manner to exchange data. The object terminal 140 includes an image processing system for receiving a background image selected by an object, the object image, description information of the background image, and description information of the object image, and for submitting them to the image processing server, so that the image processing server generates a target image with the target object in the object image as the foreground and the background image as the background.
The image processing server 110 refers to a computer system capable of providing services to the object terminal 140. The image processing server 110 is required to be higher in stability, security, performance, and the like than the general object terminal 140. The image processing server 110 may be one high-performance computer in a network platform, a cluster of multiple high-performance computers, a portion of one high-performance computer (e.g., a virtual machine), a combination of portions of multiple high-performance computers (e.g., virtual machines), or a cloud server. The image processing server 110 provides various services, and the implementation of an individual service is often associated with some intermediate database or storage medium. The image processing server 110 is configured to use the trained image generation model to generate, from the background image, the object image, the description information of the background image, and the description information of the object image submitted by the object, a target image with the target object in the object image as the foreground and the background image as the background. The image database 150 is configured to store various images such as the target image, the object image, and the background image; it may be provided separately or integrated on the image processing server 110 or another electronic device.
Gateway 120 is also known as an inter-network connector or protocol converter. The gateway implements network interconnection above the transport layer and is a computer system or device that acts as a translator between two systems that use different communication protocols, data formats or languages, or even completely different architectures. At the same time, the gateway may also provide filtering and security functions. A message transmitted from the object terminal 140 to the image processing server 110 is forwarded to the corresponding server through the gateway 120. A message transmitted from the image processing server 110 to the object terminal 140 is also forwarded to the corresponding object terminal 140 through the gateway 120.
The embodiments of the present disclosure may be applied in a variety of scenarios, such as the item display scenario shown in fig. 2A-2D, and the like.
As shown in fig. 2A, when the object wants to display a plurality of items against the same background, the object logs into the image processing system on the object terminal and enters the image synthesis process. At this time, a hint field "please select a background image, upload a target object image, and input the related image descriptions" is displayed on the page, and edit areas are provided for selecting a background image, selecting a target object image, inputting an image description, inputting an object image description, and inputting a background image description. Based on this, the object selects "background image 5" in the editing area for selecting the background image; selects "D:\object image\image 1" in the editing area for selecting the target object image; inputs "replace the plastic bottle in background image 5 with a glass bottle" in the editing area for the image description, "a glass bottle" in the editing area for the object image description, and "a plastic bottle on a desk in front of a windowsill" in the editing area for the background image description; and clicks the "OK" button.
As shown in fig. 2B, after the object clicks the "OK" button, a prompt window is displayed on the page with the prompt field "Generating the background control information, object control information, denoising control information, and random noise map for image composition, please wait."
Fig. 2C is a schematic diagram of the image processing system performing image composition. A prompt field "The image generation model is being invoked, and image composition is being performed according to the background control information, object control information, denoising control information, and random noise map, please wait" is displayed in a prompt window on the page, to illustrate the image composition process of the image generation model to the object.
As shown in fig. 2D, when the image generation model finishes execution, a prompt field "The plastic bottle in background image 5 has been replaced with a glass bottle. The generated image has been saved to the path M:\composite image folder\item presentation" is displayed on the page, together with a side-by-side display of background image 5 and the generated target image. The background image 5 shows a plastic bottle placed on a table, and the target image shows a glass bottle placed on the same table: the image generation model in the image processing system has replaced the plastic bottle in background image 5 with the glass bottle while keeping the background unchanged.
It should be noted that, the image generating method of the embodiment of the present disclosure may be applied to various application scenarios, such as a video clip scenario in self-media, a personalized image production scenario in social media, and the like, besides the above-mentioned item display scenario.
The training method of the image generation model according to the embodiment of the present disclosure is generally described below.
According to one embodiment of the present disclosure, a training method of an image generation model is provided.
The training method of the image generation model is generally applied to business scenarios, such as the item display scenario shown in figs. 2A-2D, in which a certain object (a person, an animal, an item, etc.) is to be embedded into a certain background image and there is a high requirement for coordination between the embedded object and the background. The scheme of training the model based on the image description, the background image and its description, and the object image and its description can improve the coordination between the object and the background in the image processing of replacing the object in the background image, thereby improving the accuracy of the replacement processing.
As shown in fig. 3, a training method of an image generation model according to an embodiment of the present disclosure may be performed by an electronic device, which may be the image processing server or the object terminal shown in fig. 1. A training method of an image generation model according to one embodiment of the present disclosure may include:
Step 310, obtaining a plurality of image-text sample pairs;
Step 320, generating denoising control information based on the first description information, the second description information, and the sample image description information;
Step 330, generating background fine adjustment control information based on the sample background image, and generating object fine adjustment control information based on the sample object image;
Step 340, denoising the sample hidden space features corresponding to the sample reference image through the image generation model based on the background fine tuning control information, the object fine tuning control information and the denoising control information to obtain predicted image features;
Step 350, training an image generation model based on the predicted image features of the plurality of image-text sample pairs and the sample reference image to obtain a trained image generation model.
Steps 310-350 are described in detail below.
In step 310, a plurality of image-text sample pairs are acquired.
In the disclosed embodiment, one image-text sample pair is used as one item of training data. Each image-text sample pair comprises a sample reference image, a sample background image, a sample object image of a sample object, first descriptive information of the sample background image, second descriptive information of the sample object image, and sample image descriptive information of the sample reference image.
The sample background image refers to an image to which a sample object is to be added. The sample background image is often a scenic image, a scene image, etc. containing a series of things.
The first description information of the sample background image is used to describe an image scene, image content, etc. of the sample background image. For example, for a sample background image representing a park, the first descriptive information is "the scene of the sample background image is a park".
Sample object images of sample objects refer to image data containing sample objects, which often reflect object characteristics of the sample objects, such as object pose, object motion, object appearance, and the like.
The second description information of the sample object image is used to describe an object class, an object feature, etc. of the sample object in the sample object image. For example, for a sample object image representing a white puppy, the second descriptive information is "the object class of the sample object is a dog".
The sample reference image is obtained by adding noise to the expected result of replacing the reference object in the sample background image with the sample object.
Sample image description information of the sample reference image is used to indicate a conditional constraint on the denoising process.
For example, the sample image description information may be "replace kittens in parks with a puppy". At this time, the sample background image is an image with a park as a scene, which contains a kitten. The sample object image is an image containing one puppy.
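For illustration, one image-text sample pair can be organized as a simple record such as the sketch below; the field names are assumptions and are not prescribed by the disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TextImageSamplePair:
    """One image-text sample pair used as a single item of training data (sketch)."""
    sample_reference_image: np.ndarray   # expected replacement result with noise added
    sample_background_image: np.ndarray  # scene that contains the reference object
    sample_object_image: np.ndarray      # segmented image of the sample object
    first_description: str               # text describing the sample background image
    second_description: str              # text describing the sample object image
    sample_image_description: str        # text describing the replacement, e.g.
                                         # "replace kittens in parks with a puppy"
```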
In step 320, denoising control information is generated based on the first description information, the second description information, and the sample image description information.
The denoising control information serves as a text condition constraint that assists the image generation model in image denoising during image generation, so that the image denoising effect fits the real requirement.
For the sake of brevity, the specific process of generating the denoising control information according to the embodiments of the present disclosure based on the first description information, the second description information, and the sample image description information will be described in detail below, and will not be described in detail here.
In step 330, background fine adjustment control information is generated based on the sample background image and object fine adjustment control information is generated based on the sample object image.
The background fine adjustment control information is used for guiding and controlling the attention and learning of the image generation model to the background characteristic information in the denoising process and is used for indicating the fine adjustment degree of the background characteristic in the denoising process.
The object fine adjustment control information is used for guiding and controlling the attention and learning of the image generation model to the object characteristic information in the denoising process and is used for indicating the fine adjustment degree of the object characteristic in the denoising process.
For the sake of brevity, a detailed description of a specific process of generating the background fine adjustment control information based on the sample background image and a specific process of generating the object fine adjustment control information based on the sample object image according to the embodiments of the present disclosure will be described in detail below, and will not be repeated here.
In step 340, denoising the sample hidden space features corresponding to the sample reference image by the image generation model based on the background fine tuning control information, the object fine tuning control information, and the denoising control information, to obtain predicted image features.
The image generation model of the embodiment of the present disclosure is a neural network model constructed based on a diffusion model (SD). The image generation model of the embodiment of the disclosure often takes a background image, an object image, a process description of replacing an object of the background image with an object of the object image, a text description of the background image, a text description of the object image, and the like as inputs, and outputs a composite image taking a scene of the background image as a background and an object of the object image as a foreground, and the image generation model can realize embedding the object into one background image and can meet the image generation requirements under various scenes.
The sample hidden space feature corresponding to the sample reference image refers to the representation of the pure noise image corresponding to the sample reference image without the image feature. The form of the sample hidden space features can be vector form or matrix form, and is not limited.
The predicted image features refer to features of the image generation model that are predicted to be capable of representing image content of the sample reference image in the potential vector space under the constraint of the background fine adjustment control information, the object fine adjustment control information, and the denoising control information.
For the sake of saving space, a specific process of denoising the sample hidden space features corresponding to the sample reference image by the image generation model in the embodiment of the present disclosure will be described in detail below, and will not be described here.
In step 350, the image generation model is trained based on the predicted image features and the sample reference images of the plurality of image-text sample pairs to obtain a trained image generation model.
The trained image generation model is used for generating images according to the target object image, the target background image, the object description information of the target object image, the background description information of the target background image and the target image description information.
For the sake of brevity, a specific process of generating a model based on the predicted image features and the sample reference image training image of the plurality of graphic sample pairs according to the embodiments of the present disclosure will be described in detail below, and will not be described herein.
Through the above steps 310-350, in the embodiment of the present disclosure, the denoising control information is generated based on the first description information of the sample background image, the second description information of the sample object image, and the sample image description information of the sample reference image. Because the first description information focuses on describing the background from the word aspect, the second description information focuses on describing the replaced object from the word aspect, and the sample image description information focuses on describing the whole image after the object is replaced from the word aspect, the denoising control information comprehensively expresses all aspects of characteristics from the word aspect in the object replacing process. The background fine adjustment control information focuses on the background information obtained from the content reflected by the image itself, and the object fine adjustment control information focuses on the object information obtained from the content reflected by the image itself, so that the background fine adjustment control information, the object fine adjustment control information and the denoising control information are combined, the denoising processing is performed on the sample hidden space features corresponding to the sample reference image, the denoising processing consideration factors are comprehensive, the model is focused on the study of the coordination of the object and the background in the image processing for object replacement during model training, the accuracy of image generation model training is improved, and the accuracy of replacement processing is also improved.
The above is a general description of steps 310-350. The detailed description will be developed below for specific implementations of steps 310, 320, 330, 340 and 350.
Step 310 is described in detail below.
In step 310, a plurality of pairs of graphic samples are acquired, wherein each pair of graphic samples includes a sample reference image, a sample background image, a sample object image of a sample object, first description information of the sample background image, second description information of the sample object image, and sample image description information of the sample reference image, the sample reference image being obtained by adding noise to an expected result of replacing a reference object in the sample background image with the sample object.
Referring to fig. 4, in one embodiment, a sample object image of a sample object is determined by:
Step 410, obtaining a sample image of a sample object;
step 420, performing image segmentation on the sample image based on a preset object segmentation model to obtain a sample segmentation image with a sample object;
and step 430, performing image enhancement on the sample segmentation image to obtain a sample object image.
Steps 410-430 are described in detail below.
In step 410, the sample image is an image containing a sample object.
In a specific implementation of this embodiment, with authorization, sample images of the plurality of sample objects can be extracted from an existing image database; alternatively, video data containing the sample objects can be extracted from an existing video database, the video data is split into video frames, and each video frame containing the sample objects is taken as a sample image.
In step 420, the sample segmentation image refers to a local image with a sample object segmented from the sample image, the sample segmentation image being part of the sample image.
The object segmentation model may be a lightweight semantic segmentation model such as BiSeNet-V2 or PP-LiteSeg.
Taking the case where the object segmentation model is the semantic segmentation model PP-LiteSeg as an example, the object segmentation model comprises an encoding module, a pyramid pooling module, a decoding module, and an attention fusion module. Specifically, first, the sample image is input to the encoding module of the object segmentation model, which encodes the sample image at multiple scales and sequentially obtains a first sample image feature whose size is one quarter of the sample image, a second sample image feature whose size is one eighth of the sample image, a third sample image feature whose size is one sixteenth of the sample image, and a fourth sample image feature whose size is one thirty-second of the sample image. Then, the fourth sample image feature is pooled by the pyramid pooling module to obtain a sample image pooled feature. Further, the sample image pooled feature and the third sample image feature are fused by the attention fusion module to obtain a first image fusion feature. Then, the first image fusion feature and the second sample image feature are fused by the attention fusion module to obtain a second image fusion feature. Finally, the second image fusion feature is decoded by the decoding module to obtain a sample segmentation image with the sample object.
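The flow above can be summarized by the following sketch, in which the encoder, pyramid pooling module, attention fusion module, and decoder are placeholder callables; it mirrors the described order of operations only.

```python
def pp_liteseg_like_forward(encoder, ppm, fusion, decoder, sample_image):
    """Sketch of the segmentation flow: multi-scale encoding -> pyramid pooling
    -> two attention fusions -> decoding (placeholder modules)."""
    # Multi-scale encoding: features at 1/4, 1/8, 1/16 and 1/32 of the input size.
    f4, f8, f16, f32 = encoder(sample_image)
    pooled = ppm(f32)                    # pyramid pooling on the deepest feature
    fused1 = fusion(pooled, f16)         # first attention fusion
    fused2 = fusion(fused1, f8)          # second attention fusion
    return decoder(fused2)               # sample segmentation image with the sample object
```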
In step 430, image enhancement of the sample segmentation image includes, but is not limited to, contrast enhancement, brightness enhancement, sharpening, noise reduction, and the like. Specifically, first, linear stretching or a logarithmic transformation is used to expand the pixel value distribution of the sample segmentation image so that its bright and dark areas are more distinct, achieving contrast enhancement. Then, after contrast enhancement, the brightness values of all pixels of the sample segmentation image are adjusted to improve its overall brightness. Further, after brightness adjustment, a Gaussian filter or a median filter is used to smooth the sample segmentation image and reduce its noise. Finally, after noise reduction, edge enhancement is performed on the sample segmentation image via the Laplacian operator to achieve sharpening, and the sample object image is obtained.
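A minimal sketch of this enhancement chain using OpenCV is given below; the stretch range, brightness offset, kernel sizes, and sharpening weight are illustrative assumptions.

```python
import cv2
import numpy as np

def enhance_sample_segmentation(seg: np.ndarray) -> np.ndarray:
    """Contrast stretch -> brighten -> Gaussian denoise -> Laplacian sharpen (sketch)."""
    # 1. Contrast enhancement: linearly stretch pixel values over the full range.
    stretched = cv2.normalize(seg, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # 2. Brightness enhancement: raise the brightness of every pixel (offset assumed).
    brightened = cv2.convertScaleAbs(stretched, alpha=1.0, beta=20)
    # 3. Noise reduction: smooth with a Gaussian filter.
    denoised = cv2.GaussianBlur(brightened, (3, 3), 0)
    # 4. Sharpening: subtract a fraction of the Laplacian to emphasize edges.
    laplacian = cv2.Laplacian(denoised, cv2.CV_64F, ksize=3)
    sharpened = denoised.astype(np.float64) - 0.5 * laplacian
    return np.clip(sharpened, 0, 255).astype(np.uint8)
```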
The embodiment has the advantages that the image segmentation is carried out on the sample image of the sample object, the local image with the sample object is segmented from the sample image, the interference of irrelevant image information can be better eliminated, and the image quality of the sample segmented image for training is improved. Further, various image enhancement processes such as contrast enhancement, brightness enhancement, sharpening, noise reduction and the like are adopted for the sample segmentation image, so that the image quality of the sample object image can be further improved.
Referring to fig. 5, in one embodiment, a sample reference image is determined by:
Step 510, acquiring an original reference image;
step 520, generating random numbers obeying Gaussian distribution based on a predetermined random number generation model;
and 530, adding random numbers to pixel values of the pixel points for each pixel point in the original reference image to obtain a sample reference image.
Steps 510-530 are described in detail below.
In step 510, the original reference image is used to indicate the expected result of replacing the reference object in the sample background image with the sample object.
In a specific implementation of this embodiment, image editing software may be used to replace the reference object of the sample background image with the sample object, and the resulting replaced image is treated as the expected result of replacing the reference object in the sample background image with the sample object. The image editing software can be software such as Adobe Photoshop.
In step 520, the predetermined random number generation model refers to a random number generator, the random number being a randomly generated number, wherein the random number generated by embodiments of the present disclosure that obeys a gaussian distribution may be repeated.
In a specific implementation of this embodiment, the predetermined random number generation model includes a library function (the numpy.random.normal() function). Specifically, random numbers are generated by this library function following a Gaussian distribution with a mean of 0 and a standard deviation of 1. The random numbers obeying the Gaussian distribution can be expressed in the form of a random noise map having the same image size as the original reference image.
In step 530, for each pixel in the original reference image, first, a pixel value of the pixel in the original reference image is determined, and a noise value of the pixel in the random noise map corresponding to the random number is determined. And then, adding the pixel value of the pixel point and the noise value to obtain the noise added pixel value of the pixel point. And finally, obtaining a sample reference image according to the noise adding pixel value of each pixel point.
As shown in fig. 6, which is a schematic diagram of a specific implementation process of superimposing random noise on the original reference image, the original reference image is an 8 x 8 image with 64 pixel points, and the random noise map corresponding to the Gaussian-distributed random numbers is also an 8 x 8 image with 64 pixel points. Specifically, for each pixel point of the original reference image, its pixel value is determined; the pixel values are 1, 2, 3, 4, 5, or 6. Next, for each pixel point, its noise value in the random noise map is determined, where the noise values are 0, 1, 2, 3, 4, 5, or 6. Further, for each pixel point, the pixel value and the noise value are added to obtain the noised pixel value. For example, the first pixel point of the first line in the original reference image has a pixel value of 1 and a noise value of 5, so its noised pixel value is 1+5=6. The second pixel point has a pixel value of 1 and a noise value of 1, so its noised pixel value is 1+1=2. The third pixel point has a pixel value of 1 and a noise value of 4, so its noised pixel value is 1+4=5; and so on, until the noised pixel value of the last pixel point of the last row is determined, and the sample reference image is generated.
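A minimal sketch of this noise superposition in numpy follows, with a mean of 0 and a standard deviation of 1 as described above; clipping to the valid pixel range is an added assumption not stated in the disclosure.

```python
import numpy as np

def make_sample_reference_image(original_reference: np.ndarray,
                                mean: float = 0.0,
                                std: float = 1.0,
                                seed=None) -> np.ndarray:
    """Add Gaussian random noise to the original reference image, pixel by pixel (sketch)."""
    rng = np.random.default_rng(seed)
    # Random noise map with the same shape as the original reference image.
    noise = rng.normal(loc=mean, scale=std, size=original_reference.shape)
    noised = original_reference.astype(np.float64) + noise
    return np.clip(noised, 0, 255)
```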
The embodiment has the advantages that random noise conforming to Gaussian distribution is added to the expected result of replacing the reference object in the sample background image with the sample object, and environmental noise is introduced into the idealized reference image, so that the finally obtained expected result is more fit with the actual situation, the authenticity and the accuracy of the sample reference image are improved, and the sample reference image has better referenceability.
Step 320 is described in detail below.
In step 320, denoising control information is generated based on the first description information, the second description information, and the sample image description information.
Referring to FIG. 7, in one embodiment, step 320 includes, but is not limited to, steps 710-740 including:
step 710, performing word embedding processing on the first description information to obtain background image description embedded features;
step 720, performing word embedding processing on the second description information to obtain sample image description embedded features;
step 730, performing word embedding processing on the sample image description information to obtain a reference image description embedding feature;
step 740, integrating the reference image description embedded feature, the sample image description embedded feature and the background image description embedded feature into denoising control information.
Steps 710-740 are described in detail below.
In step 710, the background image description embedding feature is used to image scene description of the sample background image in the form of a vector feature.
When the embodiment is specifically implemented, word embedding processing can be performed on the first description information based on a preset text encoder, so as to obtain background image description embedding characteristics.
In an embodiment of the present disclosure, a text encoder includes a word segmentation engine (tokenizer), an embedding layer (embedding), and a text attention computation module (text transformer).
Specifically, first, the first description information is input into a word segmentation device, and words are obtained by segmenting the first description information according to spaces, punctuations or separators based on the word segmentation device. Further, removing the stop words in the separated words to obtain a plurality of description words, searching indexes corresponding to the description words in a preset index table, and representing the description words by the indexes. The preset index table is used for indicating the corresponding relation between the words and the indexes. And then, inputting the index of each descriptor into an embedding layer for word embedding processing to obtain the descriptor embedding characteristics corresponding to each descriptor. And finally, inputting the descriptor embedding characteristics of each descriptor to a text attention calculating module for attention calculation, and outputting the background image description embedding characteristics corresponding to the first description information by the text attention calculating module.
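As an illustration of the text-encoding pipeline described above, the following sketch mimics the word segmentation, index lookup, embedding layer and text attention computation with a toy encoder. The vocabulary, stop-word list and model dimensions are assumptions for illustration; any real text encoder differs in detail.

```python
import re
import torch
import torch.nn as nn

# Illustrative only: the stop-word list, index table and model sizes are assumptions.
STOP_WORDS = {"the", "a", "an", "of", "in", "on"}
INDEX_TABLE = {"cat": 1, "lying": 2, "grass": 3}   # preset index table: word -> index

class SimpleTextEncoder(nn.Module):
    def __init__(self, vocab_size=49408, embed_dim=768, num_layers=4, num_heads=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)            # embedding layer
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.text_transformer = nn.TransformerEncoder(layer, num_layers)  # text attention module

    def tokenize(self, text: str):
        # Split on spaces and punctuation, drop stop words, map words to indices.
        words = [w for w in re.split(r"[\s,.;!?]+", text.lower()) if w]
        words = [w for w in words if w not in STOP_WORDS]
        return [INDEX_TABLE.get(w, 0) for w in words]                   # 0 = unknown word

    def forward(self, text: str) -> torch.Tensor:
        indices = torch.tensor([self.tokenize(text)])                   # (1, seq_len)
        descriptor_embeddings = self.embedding(indices)                 # word embedding processing
        return self.text_transformer(descriptor_embeddings)            # attention computation

# e.g. background_description_embedding = SimpleTextEncoder()("cat lying in grass")
```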
In step 720, the sample image description embedding feature is used to represent the object class description of the sample object image in the form of a vector feature.
In the specific implementation of this embodiment, the specific process of step 720 is similar to the specific process of step 710 described above. For the sake of space saving, the description is omitted.
In step 730, the reference image description embedding feature is used to describe the foreground and background of the sample reference image in the form of vector features.
In the implementation of this embodiment, the specific process of step 730 is similar to the specific process of step 710 described above. For the sake of space saving, the description is omitted.
In step 740, the reference image description embedded feature, the sample image description embedded feature, and the background image description embedded feature are incorporated into the same set, and all feature information within the set is determined as denoising control information.
As shown in fig. 8, a specific illustration of generating the reference image description embedding feature based on the sample image description information is shown. Specifically, the input image instance is an alarm clock image. In order to enhance the learning ability of the model, the sample image description information for the input image instance is set to "a Photo of S*", where the specific symbol "S*" stands for "clock". Based on this, first, the sample image description information is input into the word segmentation device of the text encoder, which segments the sample image description information into a plurality of words, and the index corresponding to each word is found in the preset index table: the index corresponding to the word "a" is 508, the index corresponding to the word "Photo" is 701, the index corresponding to the word "of" is 73, and the index corresponding to the symbol "S*" is x. Further, word embedding processing is performed on the index of each word through the embedding layer, and the index corresponding to each word is mapped from the numerical space to the vector space according to a preset feature word list (the feature word list is used for indicating the vector feature corresponding to each index), so as to obtain the descriptor embedding feature corresponding to each word, that is, the vector features corresponding to indexes 508, 701, 73 and x respectively. Further, the descriptor embedding features of the words are input into the text attention calculation module for attention calculation, and the reference image description embedding feature is output by the text attention calculation module. In addition, in the embodiment of the disclosure, the sample image description embedding feature and the background image description embedding feature are obtained in the same way, and the reference image description embedding feature, the sample image description embedding feature and the background image description embedding feature are used together as the condition constraints of the denoising process of the image generator, so that the predicted image instance output by the image generator fits the content represented by the sample image description information. In addition, the predicted image instance is obtained by performing 3 diffusion processes (3 noise adding operations) on the input image instance to obtain a noise image instance, and then performing 3 denoising processes on the noise image instance by the image generator according to the reference image description embedding feature, the sample image description embedding feature, and the background image description embedding feature.
The embodiment has the advantages that word embedding processing is performed on the first description information, the second description information and the sample image description information respectively by using a preset text encoder, so as to obtain the background image description embedding feature corresponding to the first description information, the sample image description embedding feature corresponding to the second description information and the reference image description embedding feature corresponding to the sample image description information. In this way, the textual descriptions of the background, the target object and the whole image after object replacement are vectorized, and the three kinds of feature information are integrated into a whole to serve jointly as the conditional constraints of the denoising process, which improves the content comprehensiveness and content effectiveness of the denoising control information.
Step 330 is described in detail below.
In step 330, background fine adjustment control information is generated based on the sample background image and object fine adjustment control information is generated based on the sample object image.
Referring to FIG. 9, in one embodiment, the specific process of generating background fine tuning control information includes, but is not limited to, steps 910-930 of:
step 910, activating a preset fine tuning network based on the first description information;
Step 920, performing feature extraction on a sample background image based on an activated fine tuning network to obtain a first background query feature, a first background key feature and a first background value feature;
Step 930, integrating the first background query feature, the first background key feature, and the first background value feature into background fine tuning control information.
Steps 910-930 are described in detail below.
In step 910, the preset fine-tuning network is a neural network generated based on a fine-tuning technique (LoRA techniques) of the deep learning model.
In particular implementations of this embodiment, since the feature extraction involves a plurality of different description information, the fine-tuning networks applied to different description information are often different. Based on this, first, a preset candidate fine-tuning network corresponding to the first description information is screened out from a plurality of preset candidate fine-tuning networks based on the first description information. Then, the screened candidate fine-tuning network is activated as the selected fine-tuning network, so that the feature extraction associated with the first description information is performed using the activated fine-tuning network.
In step 920, the first background query feature is used to indicate image feature information of the sample background image on the query channel.
The first background key feature is used to indicate image feature information of a sample background image on the key channel.
The first background value feature is used to indicate image feature information of the sample background image on the value channel.
In a specific implementation of this embodiment, first, a sample background image is input into an activated fine-tuning network. Then, the activated fine tuning network carries out linear projection on the sample background image on different channels to obtain a key vector of the sample background image on the key channel, a value vector of the sample background image on the value channel and a query vector of the sample background image on the query channel. Finally, the key vector on the key channel is determined to be a first background key feature (LoRA K1), the value vector on the value channel is determined to be a first background value feature (LoRA V1), and the query vector on the query channel is determined to be a first background query feature (LoRA Q1).
In step 930, the first background value feature (LoRA V1), the first background key feature (LoRA K1), and the first background query feature (LoRA Q1) are integrated into a whole to obtain the background fine-tuning control information (LoRA Q1 K1 V1).
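As an illustration of the above procedure, the following sketch shows how low-rank (LoRA-style) linear projections can produce query, key and value features from the sample background image. It assumes the image has already been turned into a sequence of token features; the feature dimension and rank are assumptions rather than details fixed by this embodiment.

```python
import torch
import torch.nn as nn

class LoRAProjector(nn.Module):
    """Illustrative fine-tuning (LoRA-style) projector producing Q/K/V features."""
    def __init__(self, feature_dim=768, rank=4):
        super().__init__()
        # Each projection is a low-rank pair: a down-projection and an up-projection.
        self.q_down, self.q_up = nn.Linear(feature_dim, rank, bias=False), nn.Linear(rank, feature_dim, bias=False)
        self.k_down, self.k_up = nn.Linear(feature_dim, rank, bias=False), nn.Linear(rank, feature_dim, bias=False)
        self.v_down, self.v_up = nn.Linear(feature_dim, rank, bias=False), nn.Linear(rank, feature_dim, bias=False)

    def forward(self, image_features: torch.Tensor):
        # image_features: (seq_len, feature_dim) token features of the sample background image.
        q = self.q_up(self.q_down(image_features))   # first background query feature (LoRA Q1)
        k = self.k_up(self.k_down(image_features))   # first background key feature   (LoRA K1)
        v = self.v_up(self.v_down(image_features))   # first background value feature (LoRA V1)
        return {"query": q, "key": k, "value": v}    # background fine-tuning control information
```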
The embodiment has the advantage that the fine tuning network is utilized to conduct feature extraction on the sample background image in a plurality of channel dimensions, and a first background query feature, a first background key feature and a first background value feature are obtained. Therefore, based on the series of background features, the background information obtained from the content reflected by the image can be used for realizing multi-dimensional background fine adjustment in the denoising process, so that the information comprehensiveness of the background fine adjustment control information is improved, and the accuracy of the background fine adjustment in the denoising process is further improved.
Referring to FIG. 10, in one embodiment, the specific process of generating the object fine tuning control information includes, but is not limited to, the following steps 1010-1030:
Step 1010, activating a preset fine tuning network based on the second description information;
Step 1020, extracting features of the sample object image based on the activated fine tuning network to obtain a first object query feature, a first object key feature and a first object value feature;
Step 1030, integrating the first object query feature, the first object key feature, and the first object value feature into object fine tuning control information.
Steps 1010-1030 are described in detail below.
The first object query feature is used to indicate image feature information of the sample object image on the query channel.
The first object key feature is used to indicate image feature information of the sample object image on the key channel.
The first object value characteristic is used to indicate image characteristic information of the sample object image on the value channel.
In particular implementations of this embodiment, the particular process of steps 1010-1030 is similar to the particular process of steps 910-930 described above. For the sake of space saving, the description is omitted.
The advantage of this embodiment is that the feature extraction of the sample object image is performed on the plurality of channel dimensions using the fine tuning network, and the first object query feature, the first object key feature and the first object value feature are obtained. Therefore, based on the series of object features, object information obtained from the content reflected by the image can be used for realizing multi-dimensional object fine adjustment in the denoising process, so that the information comprehensiveness of object fine adjustment control information is improved, and the accuracy of object fine adjustment in the denoising process is further improved.
It should be noted that, since the denoising network in the embodiment of the present disclosure often includes an upsampling attention network and a downsampling attention network, in order to implement background fine adjustment and object fine adjustment for the denoising overall process, the background fine adjustment control information often includes multiple sets of background LoRA parameters, and each set of background LoRA parameters includes a background value feature, a background key feature, and a background query feature. Likewise, object fine tuning control information often has multiple sets of object LoRA parameters, each set of object LoRA parameters containing an object value feature, an object key feature, and an object query feature. Wherein the network parameters of the trim network used to generate the different sets of background LoRA parameters (or object LoRA parameters) are different.
For example, when both the up-sampling attention network and the down-sampling attention network are composed of one attention module, the background fine-tuning control information has two sets of background LoRA parameters: the background LoRA parameters for the down-sampling attention network (which may also be referred to as the first background fine-tuning information), composed of a first background value feature, a first background key feature and a first background query feature; and the background LoRA parameters for the up-sampling attention network (which may also be referred to as the second background fine-tuning information), composed of a second background value feature, a second background key feature and a second background query feature. Likewise, the object fine-tuning control information has two sets of object LoRA parameters: the object LoRA parameters for the down-sampling attention network (which may also be referred to as the first object fine-tuning information), composed of a first object value feature, a first object key feature and a first object query feature; and the object LoRA parameters for the up-sampling attention network (which may also be referred to as the second object fine-tuning information), composed of a second object value feature, a second object key feature and a second object query feature.
Further, when the up-sampling attention network and the down-sampling attention network are each composed of a plurality of attention modules, the number of the background LoRA parameters in the background fine-tuning control information and the number of the object LoRA parameters in the object fine-tuning control information are the same as the number of the modules of the attention modules, so that fine tuning is performed by using a set of the background LoRA parameters and a set of the object LoRA parameters in each of the attention modules, thereby improving fine tuning accuracy and fine tuning comprehensiveness in the denoising process.
The sample hidden space features that generate the sample reference image are described in detail below.
Referring to FIG. 11, in one embodiment, a specific process for generating a sample latent spatial feature of a sample reference image includes, but is not limited to, the following steps 1110-1120:
step 1110, performing coding processing on the sample reference image to obtain sample image coding characteristics;
and 1120, performing diffusion processing on the sample image coding features through a diffusion network of the image generation model based on a preset time step to obtain sample hidden space features.
Steps 1110-1120 are described in detail below.
In step 1110, the sample image coding features are used to indicate the image features of the sample reference image in vector space (latent space).
In a specific implementation of this embodiment, first, a preset image encoder is invoked, subject to the appropriate authorization. Then, the sample reference image is input into the image encoder, which encodes the sample reference image and maps it from pixel space to the vector space (latent space), thereby obtaining the sample image encoding features.
In step 1120, a preset time step is used to indicate the number of times of noise addition to the sample image encoding features of the sample reference image.
In an embodiment of the present disclosure, the image generation model includes a diffusion network.
The diffusion network is used for realizing gradual noise adding processing on the coding features of the sample image until the input sample reference image features approach pure noise.
When the embodiment is specifically implemented, noise is added to the sample image encoding features step by step through the diffusion network according to the preset time step, so that the sample image encoding features gradually lose their original features. When the number of noise addition operations reaches the preset time step, the sample image encoding features become hidden space features that retain essentially no original features. At this time, the obtained hidden space features are determined as the sample hidden space features.
For example, the preset time step is set to be T, T is a positive integer, the diffusion network is utilized to perform T times of noise addition on the sample image coding features, and the result after the T times of noise addition is used as the sample hidden space features.
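A minimal sketch of the step-by-step noise addition is given below. The document does not fix a concrete noise schedule, so the DDPM-style per-step variances `betas` are an assumption made purely for illustration.

```python
import torch

def diffuse(sample_image_encoding: torch.Tensor, T: int, betas: torch.Tensor) -> torch.Tensor:
    """Add noise to the sample image encoding features step by step for T steps.

    Sketch assuming a DDPM-style schedule; `betas` holds per-step noise variances.
    """
    z = sample_image_encoding
    for t in range(T):
        noise = torch.randn_like(z)                       # Gaussian noise for this step
        z = torch.sqrt(1.0 - betas[t]) * z + torch.sqrt(betas[t]) * noise
    return z                                              # sample hidden space features z_T

# e.g. z_T = diffuse(z, T=1000, betas=torch.linspace(1e-4, 0.02, 1000))
```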
The advantage of this embodiment is that mapping the sample reference image from pixel space to the latent space turns it into sample image encoding features that meet the input requirements of the image generation model, which improves the usability of the sample reference image. Furthermore, the diffusion network of the image generation model adds noise to the sample image encoding features multiple times until they approach pure noise, which effectively strips the meaningful image features from the sample image encoding features; the model then restores the image features according to the various control information in the subsequent denoising process, thereby improving the processing accuracy of the model on image processing tasks involving object replacement.
Step 340 is described in detail below.
In step 340, denoising the sample hidden space features corresponding to the sample reference image by the image generation model based on the background fine tuning control information, the object fine tuning control information, and the denoising control information, to obtain predicted image features.
In the embodiment of the disclosure, the image generation model not only includes the diffusion network, but also includes a denoising network, where the denoising network is configured to perform denoising processing on the sample hidden space features output by the diffusion network step by step until a predicted image feature corresponding to the sample reference image is generated.
In the disclosed embodiments, the denoising network may be a U-network structure (U-net network) that includes an upsampling attention network, and a downsampling attention network.
The downsampling attention network is used to downsample the sample hidden space feature vector to obtain lower-dimensional features.
The upsampling attention network is used to upsample the output of the downsampling attention network, restoring it to denoised image features with the same feature dimension as the sample hidden space feature vector.
In an embodiment of the present disclosure, the background trim control information includes first background trim information and second background trim information. The object trimming control information includes first object trimming information and second object trimming information.
The first background fine-tuning information is used to guide the background fine-tuning during the downsampling attention of the denoising network.
The second background trim information is used to guide the background trim during the up-sampling attention of the de-noising network.
The first object fine tuning information is used to guide object fine tuning during downsampling attention of the denoising network.
The second object refinement information is used to guide object refinement during upsampling attention of the denoising network.
Referring to FIG. 12, in one embodiment, step 340 includes, but is not limited to, steps 1210-1220, including:
Step 1210, performing downsampling processing on the hidden space features of the sample through a downsampling attention network based on the first object fine tuning information, the first background fine tuning information and the denoising control information to obtain a sample downsampling result;
Step 1220, performing upsampling processing on the sample downsampling result through the upsampling attention network based on the second object trimming information, the second background trimming information, and the denoising control information, to obtain a predicted image characteristic.
Steps 1210-1220 are described in detail below.
In step 1210, the sample downsampling result is used to indicate image features generated by denoising the sample hidden space features by way of downsampling.
In a specific implementation of this embodiment, in order to improve the learning effect of the model on the relationship between the background feature and the object feature during the downsampling, a cross-attention mechanism may be introduced during the downsampling. Specifically, first, cross attention calculation is performed on the sample hidden space features based on the background image description embedded features and the first background fine adjustment information in the denoising control information, so as to realize background fine adjustment on the sample hidden space features. And meanwhile, performing cross attention calculation on the sample hidden space features based on the sample image description embedded features and the first object fine adjustment information in the denoising control information so as to realize object fine adjustment on the sample hidden space features. And then, carrying out attention fusion on the sample hidden space features subjected to object fine adjustment and the sample hidden space features subjected to background fine adjustment based on the reference image description embedded features in the denoising control information, and obtaining a sample downsampling result.
The specific process of performing downsampling processing on the sample hidden space features through the downsampling attention network to obtain the sample downsampling result is described in detail below, and is therefore not expanded here.
In step 1220, the specific process of step 1220 is similar to the specific process of step 1210 described above. The difference is that step 1210 is for downsampling of the sample hidden space features; while step 1220 is directed to upsampling the sample downsampling result, the specific process of both are symmetrical. In addition, the feature dimension of the predicted image feature obtained in step 1220 is the same as the feature dimension of the sample hidden space feature input in step 1220.
As shown in fig. 13A, a specific schematic diagram of the denoising network performing a denoising operation (when the predetermined time step is t=1) according to the object fine-tuning control information, the background fine-tuning control information and the denoising control information. Specifically, the denoising network includes a downsampling attention network and an upsampling attention network. The downsampling attention network includes an attention downsampling module 1 having an attention downsampling layer and a residual block structure, and an attention downsampling module 2 having an attention downsampling layer and a residual block structure. The upsampling attention network includes an attention upsampling module 1 having an attention upsampling layer and a residual block structure, and an attention upsampling module 2 having an attention upsampling layer and a residual block structure. Since both the upsampling and downsampling attention networks have 2 attention modules, the background fine-tuning control information has four sets of background LoRA parameters (first background fine-tuning information 1, first background fine-tuning information 2, second background fine-tuning information 1, and second background fine-tuning information 2), and the object fine-tuning control information has four sets of object LoRA parameters (first object fine-tuning information 1, first object fine-tuning information 2, second object fine-tuning information 1, and second object fine-tuning information 2). Based on this, first, the sample hidden space features and the denoising control information are input to the denoising network. Then, the attention downsampling layer in the attention downsampling module 1 performs attention computation on the sample hidden space features based on the denoising control information, the first background fine-tuning information 1 and the first object fine-tuning information 1 to obtain a first attention computation result, and the residual block structure in the attention downsampling module 1 performs residual processing on the first attention computation result to obtain a first downsampling result. Further, the attention downsampling layer in the attention downsampling module 2 performs attention computation on the first downsampling result based on the denoising control information, the first background fine-tuning information 2 and the first object fine-tuning information 2 to obtain a second attention computation result, and the residual block structure in the attention downsampling module 2 performs residual processing on the second attention computation result to obtain the sample downsampling result. Then, the spatial transformation structure performs spatial transformation on the sample downsampling result to obtain a spatially transformed sample downsampling result. Further, the attention upsampling layer in the attention upsampling module 2 performs attention computation on the spatially transformed sample downsampling result based on the denoising control information, the second background fine-tuning information 1 and the second object fine-tuning information 1 to obtain a third attention computation result, and the residual block structure in the attention upsampling module 2 performs residual processing on the third attention computation result to obtain a first upsampling result.
Finally, the attention upsampling layer in the attention upsampling module 1 performs attention computation on the first upsampling result based on the denoising control information, the second background fine-tuning information 2 and the second object fine-tuning information 2 to obtain a fourth attention computation result, and the residual block structure in the attention upsampling module 1 performs residual processing on the fourth attention computation result to obtain the predicted image features.
As shown in fig. 13B, a specific procedure of the diffusion process and the denoising process of the image generation model in the model training is described. Specifically, the sample reference image is encoded into sample image encoding features prior to being input into the image generation model. And then, carrying out T times of noise adding operation on the sample image coding features through a diffusion network, and converting the sample image coding features into image features close to pure noise to obtain sample hidden space features, wherein the image information of the sample reference image cannot be reflected in the sample hidden space features. Further, through the denoising network, based on given denoising control information, background fine tuning control information and object fine tuning control information, denoising operation is carried out on the sample hidden space features for T times, noise carried on the sample hidden space features is eliminated, the sample hidden space features are restored into image features, and predicted image features are obtained. The predicted image features may reflect image information included in the sample reference image to some extent.
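The overall denoising loop of fig. 13B can be sketched as follows. The signature of `denoising_network` is schematic, standing in for the U-net described above; in a real implementation the network also consumes scheduling terms that are omitted here.

```python
import torch

def generate_predicted_image_features(z_T, T, denoising_network,
                                      denoising_control, background_lora, object_lora):
    """Run the denoising network for T steps to turn the sample hidden space
    features back into predicted image features (schematic sketch only)."""
    z = z_T
    for t in reversed(range(T)):
        # Each step removes part of the noise under the joint guidance of the
        # denoising control information and the two groups of LoRA parameters.
        z = denoising_network(z, t, denoising_control, background_lora, object_lora)
    return z   # predicted image features Z'
```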
The advantage of this embodiment is that the de-noising process is divided into a down-sampling phase and an up-sampling phase, and the up-sampling phase and the down-sampling phase depend on different background trim information and different object trim information, which can improve the trim effectiveness. Specifically, in the downsampling stage, the downsampling attention network performs feature downsampling according to the first object trimming information, the first background trimming information and the denoising control information, and in the upsampling stage, the upsampling attention network performs feature upsampling according to the second object trimming information, the second background trimming information and the denoising control information. In this way, the three control information of background fine adjustment information, object fine adjustment information and denoising control information are combined to conduct guiding and restraining in the down sampling stage and the up sampling stage, and accurate fine adjustment and flexible control of image denoising can be achieved. The mode can enable the model to carry out image denoising under the condition constraint of various aspects, improves the image denoising capability of the model, and is beneficial to learning of the model on the coordination of the background and the foreground, thereby improving the replacement processing accuracy and the image generating accuracy of the model.
The following describes in detail the case where the downsampling attention network in the denoising network of the image generation model has one attention downsampling module.
In an embodiment of the present disclosure, the first background fine-tuning information includes a first background query feature, a first background key feature, and a first background value feature. The first object fine-tuning information includes a first object query feature, a first object key feature, and a first object value feature. The denoising control information includes the reference image description embedding feature, the sample image description embedding feature, and the background image description embedding feature.
Referring to FIG. 14, in one embodiment, step 1210 includes, but is not limited to, steps 1410-1440 including:
Step 1410, performing a first attention calculation through a downsampling attention network based on the sample hidden space feature, the first background query feature, the first background key feature, and the background image description embedding feature, to obtain a background image attention result;
Step 1420, performing a second attention calculation through a downsampling attention network based on the sample hidden space feature, the first object query feature, the first object key feature, and the sample image description embedding feature, to obtain a sample object image attention result;
step 1430, performing feature replacement on the background image attention result based on the sample object image attention result to obtain a replaced background image attention result;
Step 1440, performing a third attention calculation through the downsampling attention network based on the replaced background image attention result, the first background value feature, the first object value feature and the reference image description embedding feature, to obtain a sample downsampling result.
Steps 1410-1440 are described in detail below.
In step 1410, the background image attention result is used to indicate the result of fine-tuning the background feature of the sample hidden space feature according to the first background fine-tuning information and the first description information.
In particular implementations of this embodiment, step 1410 includes, but is not limited to, including the steps of:
weighting the sample hidden space features and the first background query features to obtain first weighted image features;
carrying out weighting treatment on the background image description embedded feature and the first background key feature to obtain a second weighted image feature;
And performing matrix multiplication on the first weighted image feature and the second weighted image feature to obtain a background image attention result.
Specifically, first, the sample hidden space features and the first background query feature are weighted based on a preset first weight and a preset second weight, and the product of the first weight and the sample hidden space features is added to the product of the second weight and the first background query feature to obtain the first weighted image feature. Then, the background image description embedding feature and the first background key feature are weighted based on a preset third weight and a preset fourth weight, and the product of the third weight and the background image description embedding feature is added to the product of the fourth weight and the first background key feature to obtain the second weighted image feature. Finally, matrix multiplication is performed on the first weighted image feature and the second weighted image feature to obtain the background image attention result.
Note that, the sum of the first weight and the second weight, and the sum of the third weight and the fourth weight may be 1, or may not be 1, and are not limited.
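A sketch of the first attention calculation is given below. The weight values, and the assumption that the weighted tensors already share compatible shapes, are illustrative rather than fixed by this embodiment.

```python
import torch

def background_image_attention(sample_hidden, lora_q1, bg_description_embed, lora_k1,
                               w1=0.5, w2=0.5, w3=0.5, w4=0.5):
    """First attention calculation (step 1410), sketched with illustrative weights.

    Assumes sample_hidden/lora_q1 and bg_description_embed/lora_k1 share the same
    shapes, and that the last dimension of both weighted features matches.
    """
    # First weighted image feature: weighted sum of the sample hidden space
    # features and the first background query feature.
    first_weighted = w1 * sample_hidden + w2 * lora_q1
    # Second weighted image feature: weighted sum of the background image
    # description embedding feature and the first background key feature.
    second_weighted = w3 * bg_description_embed + w4 * lora_k1
    # Matrix multiplication of the two weighted features gives the
    # background image attention result.
    return first_weighted @ second_weighted.transpose(-1, -2)
```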
In step 1420, the sample object image attention result is used to indicate the result of fine-tuning the object features of the sample hidden space features according to the first object fine-tuning information and the second description information.
In the specific implementation of this embodiment, the specific process of step 1420 is similar to the specific process of step 1410 described above. The difference is that step 1410 is attention calculation for background features, while step 1420 is attention calculation for sample objects, which are different from each other but are processed in a substantially identical manner. For the sake of space saving, the description is omitted.
In step 1430, the post-replacement background image attention results are used to indicate the result of adjusting the reference object features in the sample hidden space features to target object features.
In a specific implementation of this embodiment, first, a sample object feature corresponding to a sample object is determined in a sample object image attention result, and a reference object feature corresponding to a reference object is determined in a background image attention result. Next, the reference object features in the background image attention result are replaced with the sample object features, and the background image attention result after feature replacement is determined as a replaced background image attention result.
In step 1440, step 1440 specifically includes, but is not limited to, the steps of:
Weighting the first background value characteristic, the first object value characteristic and the reference image description embedded characteristic to obtain a sample weighted image characteristic;
and carrying out matrix multiplication on the sample weighted image characteristics and the replaced background image attention results to obtain sample downsampling results.
Specifically, first, the first background value feature, the first object value feature and the reference image description embedding feature are weighted based on a preset fifth weight, a preset sixth weight and a preset seventh weight, and the product of the fifth weight and the first background value feature, the product of the sixth weight and the first object value feature and the product of the seventh weight and the reference image description embedding feature are added to obtain the sample weighted image feature. Then, matrix multiplication is performed on the sample weighted image feature and the replaced background image attention result to obtain the sample downsampling result.
Note that, the sum of the fifth weight, the sixth weight, and the seventh weight may be 1, or may not be 1, and is not limited.
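Similarly, the third attention calculation can be sketched as follows; the weight values are illustrative, and the tensors are assumed to have shapes that make the weighted sum and the matrix product well defined.

```python
import torch

def sample_downsampling_result(replaced_attention, lora_v1, lora_v2, ref_description_embed,
                               w5=0.4, w6=0.3, w7=0.3):
    """Third attention calculation (step 1440), sketched with illustrative weights."""
    # Sample weighted image feature: weighted sum of the first background value
    # feature, the first object value feature and the reference image
    # description embedding feature.
    sample_weighted = w5 * lora_v1 + w6 * lora_v2 + w7 * ref_description_embed
    # Matrix multiplication with the replaced background image attention result
    # yields the sample downsampling result.
    return replaced_attention @ sample_weighted
```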
A benefit of this embodiment is that the cross-attention mechanism is improved and optimized when introduced in the down-sampling process. Specifically, when the background fine adjustment is performed on the hidden space features of the sample based on the background image description embedded features corresponding to the first description information, the image features (the first background key features and the first background query features) of the background image in the key channel and the query channel are fused. Meanwhile, when object fine adjustment is carried out on the hidden space features of the sample based on the object image description embedded features corresponding to the second description information, the image features (the first object key features and the first object query features) of the sample object image in the key channel and the query channel are fused, the object fine adjustment and the background fine adjustment in detail are realized in the mode, the down-sampling accuracy can be improved, and the object features and the background features of the sample down-sampling result have good coordination. In addition, the embodiment also introduces a feature replacement mode in the downsampling, and adjusts the reference object features in the sample hidden space features into target object features, so that the accuracy of image feature expression can be enhanced.
In order to reduce adverse effects of the object replacement process on background features, embodiments of the present disclosure provide a feature replacement scheme based on equalization weights, which can maintain good weight consistency between generated predicted image features and respective features of a sample reference image.
Referring to FIG. 15, in one embodiment, step 1430 includes, but is not limited to, steps 1510-1530 including:
step 1510, determining a first word number of the reference object based on the first description information, and determining a second word number of the sample object based on the second description information;
step 1520, determining a word weight between the reference object and the sample object based on the first word number and the second word number;
step 1530, replacing the reference object feature in the background image attention result with the sample object feature in the sample object image attention result based on the word weight, thereby obtaining a replaced background image attention result.
Steps 1510-1530 are described in detail below.
In step 1510, the first number of words is used to indicate the number of words used to describe the reference object in the first description information.
The second word number is used to indicate the number of words used to describe the sample object in the second description information.
For example, when the first description information is "cat lying in grass", the reference object is "cat", and the first word number (token length) is 1; when the second description information is "Mally Cat lying in grass", the sample object is "Mally Cat", and the second word number (token length) is 2.
In the embodiment, first, word extraction is performed on the first description information to obtain words for describing the reference object, and the first word number of the reference object is determined according to the total number of the extracted words. Then, word extraction is performed on the second description information to obtain words for describing the sample object, and the number of the second words of the sample object is determined according to the total number of the extracted words.
In step 1520, the word weight is used to indicate the weight proportion assigned to each word used to describe the sample object in the second description information.
In a specific implementation of this embodiment, first, the first word number is divided by the second word number to obtain a first ratio. Then, the weight of the word used to describe the reference object in the first description information is multiplied by the first ratio to obtain the word weight between the reference object and the sample object.
For example, when the weight of each word in the first description information is 1, the weight of the reference object "cat" is 1. At this time, since the ratio of the first word number to the second word number is 1:2, the word weight is 0.5, and the weight of each word of the sample object "Mally Cat" becomes 1×0.5=0.5.
In step 1530, the sample object feature refers to a portion of the feature elements in the sample object image attention result that describe the sample object.
Reference object features refer to a portion of feature elements in the background image attention results that describe the reference object.
In the specific implementation of this embodiment, first, feature weighting is performed on the sample object features in the attention result of the sample object image according to the word weights, so as to obtain weighted sample object features. Next, the reference object features in the background image attention results are replaced with the weighted sample object features, and the feature-replaced background image attention results are determined as replaced background image attention results.
For example, when the reference object ("cat") is replaced with the sample object ("Mally Cat"), the sample object feature generated by each word of the sample object "Mally Cat" in the sample object image attention result is weighted by 0.5, so that, taking the word length into account, the sum of weights of the features generated by the words describing the sample object in the replaced background image attention result is 1.
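The word-weighted feature replacement can be sketched as follows. The way the reference-object and sample-object columns are located (`ref_start`/`ref_stop`, `obj_start`/`obj_stop`) is an assumption made for illustration; the embodiment does not fix how those positions are determined.

```python
import torch

def word_weighted_replacement(background_attention, sample_object_attention,
                              ref_start, ref_stop, obj_start, obj_stop,
                              first_word_number, second_word_number):
    """Equalization-weight feature replacement (steps 1510-1530), sketched."""
    # Word weight, e.g. 1 / 2 = 0.5 for replacing "cat" with "Mally Cat".
    word_weight = first_word_number / second_word_number
    # Weight the sample object features so that the replacing words jointly
    # keep a total weight of 1.
    weighted_object = word_weight * sample_object_attention[..., obj_start:obj_stop]
    # Substitute the weighted sample object columns for the reference object
    # columns in the background image attention result.
    return torch.cat([background_attention[..., :ref_start],
                      weighted_object,
                      background_attention[..., ref_stop:]], dim=-1)
```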
The method has the advantages that a feature replacement mode based on balanced weights is adopted, a word weight is calculated according to the first word number of a reference object in first description information and the second word number of a sample object in second description information, when the sample object feature in a sample object image attention result is used for replacing the reference object feature in a background image attention result, the weight ratio of the sample object feature is adjusted according to the word weight, so that the word weights of all features before and after feature replacement are balanced, good weight consistency is kept between the generated predicted image feature and all features of a sample reference image, the influence of object replacement on the background feature can be reduced, the harmony of the object and the background of a generated image is improved, and the image generation accuracy is further improved.
As shown in fig. 16, a schematic diagram of an implementation of the sample downsampling result generated by a downsampling attention network with one attention module is shown. Specifically, when the downsampling attention network has only one attention module, the first background fine-tuning information includes a first background query feature (LoRA Q1), a first background key feature (LoRA K1), and a first background value feature (LoRA V1); the first object fine-tuning information includes a first object query feature (LoRA Q2), a first object key feature (LoRA K2), and a first object value feature (LoRA V2). Based on this, first, the first background query feature (LoRA Q1) and the sample hidden space features serving as the query component are weighted to obtain a first weighted image feature, the first background key feature (LoRA K1) and the background image description embedding feature serving as the key component are weighted to obtain a second weighted image feature, and then the first weighted image feature and the second weighted image feature are subjected to matrix multiplication to obtain the background image attention result. Then, the first object query feature (LoRA Q2) and the sample hidden space features serving as the query component are weighted to obtain a third weighted image feature, the first object key feature (LoRA K2) and the sample image description embedding feature serving as the key component are weighted to obtain a fourth weighted image feature, and the third weighted image feature and the fourth weighted image feature are subjected to matrix multiplication to obtain the sample object image attention result. Further, based on the sample object features in the sample object image attention result and the reference object features in the background image attention result, the feature replacement operation is executed with the word weight w applied, and the replaced background image attention result is obtained. Finally, the first background value feature (LoRA V1) and the first object value feature (LoRA V2) are weighted with the reference image description embedding feature serving as the value component to obtain the sample weighted image feature, and then the sample weighted image feature and the replaced background image attention result are subjected to matrix multiplication to obtain the sample downsampling result.
It should be noted that, when there are a plurality of attention downsampling modules of the attention network, each of the attention downsampling modules may also perform the downsampling process according to steps 1410-1440. Wherein the input of the first attention downsampling module is a sample hidden space feature. For other attention downsampling modules than the first one, the output of the previous one is taken as the input of the next one and the output of the last one is taken as the sample downsampling result.
Furthermore, in the disclosed embodiments, the case where the up-sampling attention network in the denoising network has one attention up-sampling module, or has a plurality of attention up-sampling modules, is similar to the various cases of the down-sampling attention network described above. For the sake of space saving, details will not be described.
Step 350 is described in detail below.
In step 350, an image generation model is trained based on the predicted image features and sample reference images of the plurality of pairs of teletext samples.
Referring to FIG. 17, in one embodiment, step 350 includes, but is not limited to, steps 1710-1740 including:
Step 1710, for each graphic sample pair, acquiring reference noise in a sample reference image, and performing noise prediction based on the predicted image features to obtain predicted noise;
Step 1720, calculating a sub-loss function based on the comparison of the reference noise and the predicted noise;
step 1730, determining a total loss function based on the sub-loss functions of each of the pairs of image-text samples;
step 1740, training an image generation model based on the total loss function.
Steps 1710-1740 are described in detail below.
In step 1710, the reference noise is used to indicate the degree to which noise is added to the desired result.
The prediction noise is obtained by prediction of a plurality of prediction time steps. The prediction noise is used to indicate noise contained in a denoising result (predicted image feature) generated by the image generation model at the last prediction time step.
The prediction time step is used for indicating the noise adding frequency and the noise removing frequency of the sample image coding characteristics corresponding to the sample reference image of the input image generation model.
In a specific implementation of this embodiment, for each image-text sample pair, noise extraction is performed on the sample reference image to obtain the reference noise in the sample reference image. The reference noise is consistent with the random noise map added in step 530 described above.
Further, since the image generation model denoises, at each prediction time step, the denoising result (denoised image) generated at the previous prediction time step, and the predicted image features are generated after denoising over a plurality of prediction time steps, the prediction noise can be obtained simply by extracting the noise from the predicted image features.
The noise included in the denoising result of different prediction time steps is different.
In step 1720, a sub-loss function is used to indicate the degree of difference between the reference noise and the prediction noise for a single pair of teletext samples.
When the embodiment is specifically implemented, noise difference calculation is performed on the reference noise and the prediction noise, and a noise difference result is obtained. Finally, calculating a sub-loss function according to the noise difference result.
In step 1730, the total loss function is used to indicate the overall degree of difference between the reference noise and the predicted noise for all pairs of teletext samples. The smaller the total loss function, the smaller the overall difference between the reference noise and the prediction noise of all the pairs of picture and text samples, and the higher the image generation accuracy of the image generation model.
In a specific implementation of this embodiment, the sub-loss functions of the image-text sample pairs are averaged to obtain the total loss function. Specifically, first, the total number of image-text sample pairs is determined. Then, the sub-loss functions of all the image-text sample pairs are added to obtain the sum of the sub-loss functions. Finally, the sum of the sub-loss functions is divided by the total number of image-text sample pairs to obtain the total loss function.
In step 1740, the model parameters of the image generation model are adjusted with minimizing the total loss function as the training target, and steps 310-350 are repeated to realize iterative training of the image generation model. The model parameters that minimize the total loss function are taken as the final model parameters, and the image generation model with the final model parameters is taken as the trained image generation model.
The method has the advantage that, based on a supervised learning mode, the sub-loss function of each image-text sample pair is determined according to the noise difference between the reference noise in the sample reference image of the image-text sample pair and the prediction noise in the predicted image features, and the total loss function is constructed based on these sub-loss functions.
In the disclosed embodiment, the prediction noise is obtained by prediction of a plurality of prediction time steps.
The prediction time step is used for indicating the noise adding frequency and the noise removing frequency of the sample image coding characteristics corresponding to the sample reference image of the input image generation model.
Referring to FIG. 18, in one embodiment, step 1720 includes, but is not limited to, the following steps 1810-1820:
Step 1810, performing difference calculation on the reference noise and the predicted noise to obtain a noise difference;
Step 1820, performing regular term calculation on the noise difference value to obtain a sub-loss function.
Steps 1810 to 1820 are described in detail below.
In step 1810, the noise difference value is used to indicate the magnitude of the difference in value between the reference noise and the prediction noise of the pixel point for each pixel point on the sample reference image.
In the embodiment, first, for each pixel point on the sample reference image, the reference noise and the prediction noise of the pixel point are subjected to a difference process (reference noise minus prediction noise), and a noise difference value of each pixel point is obtained. And then, integrating the noise difference values of the pixel points to obtain the noise difference value of the image-text sample pair.
In step 1820, first, for each image-text sample pair, the noise difference is regularized to obtain a regularized item calculation result, where the regularized item calculation result is used to indicate regularized differences of the reference noise and the prediction noise of the image-text sample pair. And then, converting the regular term calculation result into a form of a conditional loss function, and determining the conditional loss function obtained by conversion as a sub-loss function. Wherein, the sub-loss function of the embodiments of the present disclosure may be expressed as shown in formula (1):
$L_{sub} = \mathbb{E}_{\mathcal{E}(x),\, y,\, c,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\, \left\| \epsilon - \epsilon_\theta(z_t, t, y, c) \right\|_2^2 \,\right]$    (1)
wherein $L_{sub}$ is the sub-loss function; $\epsilon$ is the reference noise, and $\epsilon \sim \mathcal{N}(0,1)$ indicates that the reference noise follows a standard normal distribution (mean 0, variance 1); $t$ represents a prediction time step, which is used for gradually updating the sample hidden space features in the image generation process; $z_t$ represents the sample hidden space features at prediction time step $t$, i.e., the features obtained from the sample image coding feature $z$ at prediction time step $t$; $y$ and $c$ both refer to the control information (including the denoising control information, the background fine-tuning control information and the object fine-tuning control information) used as conditional constraints on the image generation process; $x$ is the sample reference image used for training; $\mathbb{E}_{\mathcal{E}(x)}$ denotes the expectation taken over the encoder $\mathcal{E}$ applied to the input $x$; $\epsilon_\theta(z_t, t, y, c)$ refers to the prediction noise obtained by denoising the sample hidden space features $z_t$ at prediction time step $t$ under the conditional constraints $y$ and $c$; and $\left\| \cdot \right\|_2^2$ is the regular term calculation result.
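A sketch of the corresponding loss computation is given below; it assumes the reference noise and the prediction noise of each image-text sample pair are available as tensors of the same shape.

```python
import torch
import torch.nn.functional as F

def training_loss(reference_noises, predicted_noises):
    """Total loss over all image-text sample pairs.

    Each sub-loss is the regular-term (L2) difference between the reference
    noise and the prediction noise of one sample pair; the total loss is the
    mean of the sub-losses over all pairs.
    """
    sub_losses = [F.mse_loss(pred, ref, reduction="mean")   # ||eps - eps_theta||^2
                  for ref, pred in zip(reference_noises, predicted_noises)]
    return torch.stack(sub_losses).mean()                    # total loss function
```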
The method has the advantage that the regular term loss between the reference noise contained in the sample reference image of each image-text sample pair and the prediction noise contained in the predicted image features is calculated to obtain the regular term calculation result of each image-text sample pair, and this regular term calculation result (L2 loss) is used as the sub-loss function for training the image generation model. This better regulates the image generation process of the image generation model and makes the generation process more controllable. Meanwhile, the condition information (denoising control information, background fine-tuning control information and object fine-tuning control information) is incorporated into the sub-loss function, so that the generated predicted image features are closely related to the given conditions, which improves the degree of harmony between the object and the background when the model performs image processing in which an object is replaced in the background image, further improves the accuracy of the model's replacement processing, and gives the finally generated target image better image quality.
As shown in fig. 19, is an overall flow chart of model training of an embodiment of the present disclosure. Specifically, a sample background image, a sample object image, an expected effect map, image description information of the expected effect map, first description information of the sample background image, and second description information of the sample object image are taken as training data for model training.
First, a reference noise is generated using the random number i and added to the expected effect map, resulting in a sample reference image, the detailed process of which is similar to steps 510-530 described above.
Then, the sample reference image is encoded by using an encoder to obtain a sample encoded image feature Z, diffusion processing is performed on the sample encoded image feature Z through a diffusion network, and T times of noise adding is performed on the sample encoded image feature Z by using the diffusion network to obtain a sample hidden space feature Z_T, wherein the specific process is similar to steps 1110-1120. Further, the image description information of the expected effect map, the first description information of the sample background image, and the second description information of the sample object image are converted into text forms conforming to the input requirements of the text encoder τ as text condition constraints, the image description information of the expected effect map, the first description information of the sample background image, and the second description information of the sample object image are input to the text encoder, and the reference image description embedding feature, the background image description embedding feature, and the sample image description embedding feature are output by the text encoder, with the specific procedures similar to those of steps 710-740 described above. Further, the sample background image and the sample object image are respectively input to the fine-tuning network LoRA to obtain the background fine-tuning control information and the object fine-tuning control information, wherein the background fine-tuning control information comprises a first background feature QKV-A1, a second background feature QKV-A1, a third background feature QKV-A1 and a fourth background feature QKV-A1; the object fine-tuning control information includes a first object feature QKV-A2, a second object feature QKV-A2, a third object feature QKV-A2, and a fourth object feature QKV-A2, with the specific procedures similar to steps 910-930 and 1010-1030 described above.
Next, when the denoising network denoises the sample hidden space feature Z_T, the sample hidden space feature Z_T is downsampled by each attention downsampling module of the downsampling attention network of the denoising network based on the first background feature QKV-A1, the second background feature QKV-A1, the first object feature QKV-A2, the second object feature QKV-A2, the reference image description embedding feature, the background image description embedding feature and the sample image description embedding feature, to obtain a sample downsampling result. Then, based on the third background feature QKV-A1, the fourth background feature QKV-A1, the third object feature QKV-A2, the fourth object feature QKV-A2, the reference image description embedding feature, the background image description embedding feature and the sample image description embedding feature, the sample downsampling result is upsampled by each attention upsampling module of the upsampling attention network of the denoising network to obtain a denoising result Z_(T-1)' that has undergone one denoising operation. Further, the denoising result Z_(T-1)' obtained at prediction time step T is denoised by the denoising network according to the above process, and the predicted image feature Z' is obtained after the remaining T-1 denoising operations; the specific process is similar to steps 1210-1220 above.
Finally, noise prediction is performed based on the predicted image feature Z' to obtain the predicted noise, a loss function LOSS is constructed from the reference noise and the predicted noise, and the image generation model is iteratively trained based on the loss function LOSS until the image generation model meets the training requirement; the specific process is similar to steps 1710-1740 above. For the sake of space saving, the description is omitted.
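A minimal, non-authoritative sketch of one training step corresponding to the flow of fig. 19 is given below. All module and field names (encoder, diffusion, text_encoder, lora_bg, lora_obj, unet and the batch keys) are assumptions introduced only for illustration; the T denoising iterations are collapsed into a single noise-prediction call, and the routing of the fine tuning features through the individual attention modules is omitted:

```python
import torch
import torch.nn.functional as F

def training_step(batch, encoder, diffusion, text_encoder, lora_bg, lora_obj, unet, T):
    # Add reference noise to the expected effect map to obtain the sample reference image
    ref_noise = torch.randn_like(batch["expected_effect_map"])
    sample_ref = batch["expected_effect_map"] + ref_noise

    # Encode into latent space and add noise T times (Z -> Z_T) via the diffusion network
    z = encoder(sample_ref)
    z_T = diffusion.add_noise(z, steps=T)

    # Denoising control information: text embeddings of the three kinds of description
    cond_text = text_encoder(batch["sample_image_desc"],
                             batch["background_desc"],
                             batch["object_desc"])

    # Background / object fine tuning control information from the LoRA networks
    cond_bg = lora_bg(batch["sample_background_image"])
    cond_obj = lora_obj(batch["sample_object_image"])

    # The denoising network predicts the noise contained in Z_T under all conditions
    pred_noise = unet(z_T, cond_text, cond_bg, cond_obj)

    # Regular term (L2) sub-loss between the reference noise and the predicted noise
    return F.mse_loss(pred_noise, ref_noise)
```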
An image generating method according to an embodiment of the present disclosure is described in detail below.
According to one embodiment of the present disclosure, an image generation method is provided.
The image generation method is generally applied to business scenes, such as the article display scenes shown in fig. 2A-2D, in which a certain object (a person, an animal, an article, etc.) is to be embedded in a certain background image and a high degree of coordination between the embedded object and the background is required. The embodiment of the disclosure provides a scheme for generating an image through an image generation model based on the image description, the background image and its background description, and the object image and its object description, which can improve the coordination between the object and the background in the image processing of replacing an object in the background image, thereby improving the accuracy of the replacement processing.
As shown in fig. 20, the image generating method according to one embodiment of the present disclosure may be performed by an electronic device, which may be the image processing server or the object terminal shown in fig. 1. The image generation method according to an embodiment of the present disclosure may include:
Step 2010, acquiring a target object image, a target background image, object description information of the target object image, background description information of the target background image and target image description information of the target object;
step 2020, generating denoising prompt information based on the background description information, the object description information and the target image description information;
step 2030, generating background fine adjustment prompt information based on the target background image, and generating object fine adjustment prompt information based on the target object image;
step 2040, based on the preset noise image, the denoising prompt information, the object fine adjustment prompt information and the background fine adjustment prompt information, performing image generation through the trained image generation model, and obtaining the target image.
Steps 2010-2040 are described in detail below.
In step 2010, a target object image, a target background image, object description information of the target object image, background description information of the target background image, and target image description information of the target object are acquired.
The target object image of a target object refers to one or more images that reflect the object characteristics of the target object.
The target background image refers to a background image to which a target object is to be added, wherein the target background image contains a reference object to be replaced by the target object in the target object image.
The background description information of the target background image is used to describe the image scene and background content of the target background image, and the like.
The object description information of the target object image is used to describe the object category, the object characteristics, and the like of the target object in the target object image.
The target image description information is used to describe the replacement from the reference object to the target object.
In the specific implementation of this embodiment, the specific process of step 2010 is similar to the specific process of 310 described above. For the sake of space saving, the description is omitted.
In step 2020, denoising hint information is generated based on the background description information, the object description information, and the target image description information.
The denoising prompt information serves as a text condition constraint that assists the image generation model in image denoising during image generation, so as to improve the accuracy of the image denoising and make the image denoising effect fit the actual requirement.
In the specific implementation of this embodiment, step 2020 is similar to steps 710-720 described above. For the sake of space saving, the description is omitted.
In step 2030, a background fine adjustment hint is generated based on the target background image and an object fine adjustment hint is generated based on the target object image.
The background fine adjustment prompt information is used for guiding the image generation model to conduct fine adjustment on background characteristics in the denoising process.
The object fine adjustment prompt information is used for guiding the image generation model to conduct fine adjustment on object characteristics in the denoising process.
In a specific implementation of this embodiment, the specific process of generating the background trim hints information based on the target background image is similar to the specific process of steps 910-930 described above. The specific process of generating object fine tuning hints based on the target object image is similar to the specific process of steps 1010-1030 described above. For the sake of space saving, the description is omitted.
In step 2040, image generation is performed by the trained image generation model based on the preset noise image, the denoising prompt, the object fine adjustment prompt and the background fine adjustment prompt, so as to obtain a target image.
The preset noise image is a noise map that is generated based on random numbers and follows a Gaussian distribution, in a manner similar to that described above with respect to step 520. For the sake of space saving, the description is omitted.
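As a hedged illustration of how such a preset noise image may be produced, the following sketch draws every pixel value from a standard Gaussian distribution; the image shape, the seeding with the random number i and the use of NumPy are assumptions:

```python
import numpy as np

def make_preset_noise_image(height, width, channels=3, seed=None):
    # Random number generator, optionally seeded with the random number i
    rng = np.random.default_rng(seed)
    # Every pixel value follows a Gaussian (normal) distribution
    return rng.normal(loc=0.0, scale=1.0,
                      size=(height, width, channels)).astype(np.float32)
```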
The trained image generation model is generated according to the training method of the image generation model of the above embodiment.
The target image is used to indicate the result of replacing the reference object in the target background image with the target object of the target object image.
The specific process of performing image generation through the image generation model based on the preset noise image, the denoising prompt information, the object fine adjustment prompt information and the background fine adjustment prompt information to obtain the target image according to the embodiment of the present disclosure is described in detail below, and is therefore not repeated here.
Through steps 2010-2040, in the embodiment of the present disclosure, the denoising prompt information is generated based on the background description information of the target background image, the object description information of the target object image, and the target image description information of the image to be generated. Because the background description information focuses on describing the background from the textual aspect, the object description information focuses on describing the target object from the textual aspect, and the target image description information focuses on describing the whole image after object replacement from the textual aspect, the denoising prompt information comprehensively expresses, from the textual aspect, the features of all aspects involved in the object replacement process. The background fine adjustment prompt information focuses on the background information obtained from the content reflected by the image, and the object fine adjustment prompt information focuses on the object information obtained from the content reflected by the image. Therefore, by combining the background fine adjustment prompt information, the object fine adjustment prompt information and the denoising prompt information, and performing denoising processing on the target hidden space features corresponding to the preset noise image, the factors considered in the denoising processing are comprehensive, and the coordination between the object and the background in the image processing of replacing the object is fully considered, so that the feature quality of the generated target image features can be improved. Finally, the decoding network is used to perform feature decoding on the target image features to obtain the target image to be generated, and the foreground and the background in the target image can have good coordination.
In an embodiment of the present disclosure, the trained image generation model includes a diffusion network, a denoising network, and a decoding network.
The decoding network is used to convert image features in the target denoising result from the latent vector space to the pixel space.
Referring to FIG. 21, in one embodiment, step 2040 includes, but is not limited to, steps 2110-2130:
step 2110, performing diffusion processing on the coded image features of the preset noise image based on a diffusion network to obtain target hidden space features;
Step 2120, denoising the target hidden space features through a denoising network based on the denoising prompt information, the object fine tuning prompt information and the background fine tuning prompt information to obtain a target denoising result;
And 2130, performing feature decoding on the target denoising result based on the decoding network to obtain a target image.
Steps 2110-2130 are described in detail below.
The encoded image features of the pre-set noise image are used to indicate the image features of the pre-set noise image in vector space (latent hidden space).
The target hidden space feature is used for indicating a result of adding noise to the coded image feature of the preset noise image in a fixed time step through the diffusion network.
The target denoising result indicates the image feature obtained by denoising the target hidden space feature over a fixed number of time steps through the denoising network, where the generated image feature meets the constraint requirements of the target image description information.
In the specific implementation of this embodiment, the specific process of step 2110 is similar to steps 1110-1120 described above, and the specific process of step 2120 is similar to steps 1210-1220 described above. For the sake of space saving, the description is omitted.
In step 2130, the target denoising result is input to the decoding network, and the image features in the target denoising result are mapped back from the latent vector space to the original pixel space by the decoding network, so as to generate a noise-free target image conforming to the target image description information.
This has the advantage that the diffusion network is used to add noise to the coded image features of the preset noise image a plurality of times to obtain the target hidden space features. In the subsequent denoising process, the denoising prompt information, the background fine adjustment prompt information and the object fine adjustment prompt information are used to finely adjust the predicted image features, so that the feature information is adjusted from both the image aspect and the textual aspect, and the target object in the finally predicted target denoising result is better coordinated with the background. Finally, the decoding network is used to decode the target denoising result, thereby obtaining the predicted target image and improving the coordination between the object and the background in the image processing of replacing the object.
Fig. 22 is a schematic diagram of specific application modules of the image generating method according to the embodiment of the disclosure. Specifically, first, feature extraction is performed on the background image by using a fine tuning generation model (fine tuning network) to obtain the background fine adjustment prompt information, and the specific process is similar to steps 910-930 above; feature extraction is performed on a foreground target image containing the target object by using a fine tuning generation model (fine tuning network) to obtain the object fine adjustment prompt information, and the specific process is similar to steps 1010-1030 above. Further, the background fine adjustment prompt information, the object fine adjustment prompt information and the target image description information are input into the image generation model, diffusion processing and denoising processing are performed on the target hidden space features corresponding to the preset noise image through the image generation model, and a fine adjustment feature replacement operation is introduced in the denoising process, so that the target image features are obtained. Finally, the decoding network is used to convert the target image features into the target image for output, so that a target image taking the background image as the background and the target object as the foreground is generated, and the specific process is similar to steps 2110-2130 above. For the sake of space saving, the description is omitted.
As shown in fig. 23, an overall flowchart of image generation based on an image generation model according to an embodiment of the present disclosure is shown. Specifically, a target background image, a target object image, target image description information corresponding to an image to be generated, background description information of the target background image, and object description information of the target object image are taken as inputs.
First, a random noise map is generated using the random number i, and the specific process is similar to step 520 described above.
Then, the random noise map is encoded by using an encoder to obtain a noise encoded image feature Z, and diffusion processing is performed on the noise encoded image feature Z through the diffusion network, which adds noise to the noise encoded image feature Z for T times to obtain a target hidden space feature Z_T; the specific process is similar to steps 1110-1120 above. Further, the target image description information, the background description information and the object description information, serving as text condition constraints, are converted into a text form meeting the input requirement of the text encoder τ and input to the text encoder, and the text encoder outputs the denoising prompt information; the specific process is similar to steps 710-740 above. Further, the target background image and the target object image are respectively input to the fine tuning network LoRA to obtain the background fine tuning prompt information and the object fine tuning prompt information, wherein the background fine tuning prompt information comprises a first background feature QKV-A1, a second background feature QKV-A1, a third background feature QKV-A1 and a fourth background feature QKV-A1, and the object fine tuning prompt information comprises a first object feature QKV-A2, a second object feature QKV-A2, a third object feature QKV-A2 and a fourth object feature QKV-A2; the specific processes are similar to steps 910-930 and 1010-1030 above.
Next, when the denoising network denoises the target hidden space feature Z_T, the target hidden space feature Z_T is downsampled by each attention downsampling module of the downsampling attention network of the denoising network based on the first background feature QKV-A1, the second background feature QKV-A1, the first object feature QKV-A2, the second object feature QKV-A2 and the denoising prompt information, so as to obtain a target downsampling result. Then, based on the third background feature QKV-A1, the fourth background feature QKV-A1, the third object feature QKV-A2, the fourth object feature QKV-A2 and the denoising prompt information, the target downsampling result is upsampled by each attention upsampling module of the upsampling attention network of the denoising network to obtain a denoising result Z_(T-1)' that has undergone one denoising operation. Further, the denoising result Z_(T-1)' obtained at prediction time step T is denoised by the denoising network according to the above process, and the target image feature Z' is obtained after the remaining T-1 denoising operations. Finally, feature decoding is performed on the target image feature Z' based on the decoding network to obtain the target image I; the specific process is similar to steps 2110-2130 above. For the sake of space saving, the description is omitted.
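The following sketch summarizes, in simplified and purely illustrative form, the generation flow of fig. 23; every module name is an assumption, and the per-module routing of the fine tuning prompt information through the attention modules is hidden inside the assumed unet.denoise_step interface:

```python
import torch

@torch.no_grad()
def generate_target_image(noise_img, target_desc, bg_desc, obj_desc,
                          target_bg, target_obj,
                          encoder, diffusion, text_encoder,
                          lora_bg, lora_obj, unet, decoder, T):
    # Encode the random noise map and add noise T times to obtain Z_T
    z_T = diffusion.add_noise(encoder(noise_img), steps=T)

    # Denoising prompt information (text) and fine tuning prompt information (image)
    cond_text = text_encoder(target_desc, bg_desc, obj_desc)
    cond_bg, cond_obj = lora_bg(target_bg), lora_obj(target_obj)

    # Denoise step by step from time step T down to 1
    z_t = z_T
    for t in range(T, 0, -1):
        z_t = unet.denoise_step(z_t, t, cond_text, cond_bg, cond_obj)

    # Decode the target image feature back from latent space to pixel space
    return decoder(z_t)
```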
The apparatus and devices of embodiments of the present disclosure are described below.
It will be appreciated that, although the steps in the various flowcharts described above are shown in succession in the order indicated by the arrows, the steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
In the embodiments of the present disclosure, when related processing is performed according to data related to characteristics of a target object, such as attribute information or an attribute information set of the target object, permission or consent of the target object is obtained first, and the collection, use and processing of such data comply with relevant laws, regulations and standards. In addition, when an embodiment of the present disclosure needs to acquire attribute information of the target object, individual permission or individual consent of the target object is obtained through a pop-up window or a jump to a confirmation page, and only after the individual permission or individual consent of the target object is explicitly obtained is the target-object-related data necessary for the normal operation of the embodiment acquired.
Fig. 24 is a schematic structural diagram of a training apparatus 2400 for an image generation model according to an embodiment of the present disclosure. The training apparatus 2400 for an image generation model includes:
A first obtaining unit 2410 for obtaining a plurality of pairs of graphic samples, wherein each pair of graphic samples includes a sample reference image, a sample background image, a sample object image of a sample object, first description information of the sample background image, second description information of the sample object image, and sample image description information of the sample reference image, the sample reference image is obtained by adding noise to an expected result of replacing a reference object in the sample background image with the sample object;
A first generation unit 2420 for generating denoising control information based on the first description information, the second description information, and the sample image description information;
A second generation unit 2430 for generating background fine adjustment control information based on the sample background image and generating object fine adjustment control information based on the sample object image;
The denoising unit 2440 is configured to denoise the sample hidden space feature corresponding to the sample reference image through the image generation model based on the background fine adjustment control information, the object fine adjustment control information, and the denoising control information, so as to obtain a predicted image feature;
The training unit 2450 is configured to train the image generation model based on the predicted image features and the sample reference images of the plurality of graphic sample pairs, and obtain a trained image generation model. The trained image generation model is used for generating images according to the target object image, the target background image, the object description information of the target object image, the background description information of the target background image and the target image description information.
Optionally, the image generation model includes a denoising network; the denoising network comprises an upsampling attention network and a downsampling attention network; the background fine tuning control information comprises first background fine tuning information and second background fine tuning information; the object fine adjustment control information comprises first object fine adjustment information and second object fine adjustment information;
The denoising unit 2440 includes:
A downsampling subunit (not shown) configured to perform downsampling processing on the sample hidden space feature through the downsampling attention network based on the first object fine tuning information, the first background fine tuning information, and the denoising control information, to obtain a sample downsampling result;
an up-sampling sub-unit (not shown) for up-sampling the sample down-sampling result through the up-sampling attention network based on the second object trimming information, the second background trimming information, and the denoising control information, to obtain a predicted image feature.
Optionally, the first background fine tuning information includes a first background query feature, a first background key feature and a first background value feature; the first object fine tuning information includes a first object query feature, a first object key feature and a first object value feature; the denoising control information includes a reference image description embedded feature, a sample image description embedded feature and a background image description embedded feature;
The downsampling subunit (not shown) comprises:
A first attention module (not shown) for performing a first attention calculation through the downsampling attention network based on the sample hidden space feature, the first background query feature, the first background key feature, and the background image description embedding feature to obtain a background image attention result;
A second attention module (not shown) for performing a second attention calculation through the downsampled attention network based on the sample hidden space feature, the first object query feature, the first object key feature, and the sample image description embedding feature to obtain a sample object image attention result;
A replacing module (not shown) for performing feature replacement on the background image attention result based on the sample object image attention result to obtain a replaced background image attention result;
a third attention module (not shown) for performing a third attention calculation through the downsampling attention network based on the replaced background image attention result, the first background value feature, the first object value feature and the reference image description embedding feature, to obtain a sample downsampling result.
Optionally, a first attention module (not shown) is used to:
weighting the sample hidden space features and the first background query features to obtain first weighted image features;
carrying out weighting treatment on the background image description embedded feature and the first background key feature to obtain a second weighted image feature;
And performing matrix multiplication on the first weighted image feature and the second weighted image feature to obtain a background image attention result.
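A minimal sketch of this first attention calculation is given below, assuming the two weighting operations are linear projections and omitting any scaling or softmax that a concrete implementation may apply; tensor shapes are assumptions made only for illustration:

```python
import torch

def first_attention(z, q_bg, k_bg, bg_desc_emb):
    # First weighted image feature: sample hidden space feature (n, d) weighted by
    # the first background query feature, taken here as a projection matrix (d, d_k)
    weighted_q = z @ q_bg
    # Second weighted image feature: background image description embedding (m, d_t)
    # weighted by the first background key feature, taken as a projection (d_t, d_k)
    weighted_k = bg_desc_emb @ k_bg
    # Matrix multiplication of the two weighted features gives the
    # background image attention result of shape (n, m)
    return weighted_q @ weighted_k.transpose(-2, -1)
```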
Optionally, a replacement module (not shown) is used to:
Determining a first word number of the reference object based on the first description information, and determining a second word number of the sample object based on the second description information;
determining a word weight between the reference object and the sample object based on the first word number and the second word number;
And replacing the reference object features in the background image attention result with the sample object features in the sample object image attention result based on the word weight to obtain a replaced background image attention result.
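One possible realization of this word-weighted feature replacement is sketched below; the ratio-of-word-counts weight and the slice-based token indexing are assumptions made only for illustration and do not fix the actual weighting rule:

```python
import torch

def replace_reference_with_sample(bg_attn, obj_attn,
                                  ref_word_count, obj_word_count,
                                  ref_token_slice, obj_token_slice):
    # Word weight between the reference object and the sample object,
    # assumed here to be the ratio of their word counts
    word_weight = ref_word_count / max(obj_word_count, 1)
    # Replace the reference object features in the background image attention result
    # with the word-weighted sample object features of the sample object image
    # attention result (the two slices are assumed to cover the same token count)
    replaced = bg_attn.clone()
    replaced[..., ref_token_slice] = word_weight * obj_attn[..., obj_token_slice]
    return replaced
```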
Optionally, a third attention module (not shown) is used to:
Weighting the first background value characteristic, the first object value characteristic and the reference image description embedded characteristic to obtain a sample weighted image characteristic;
and carrying out matrix multiplication on the sample weighted image characteristics and the replaced background image attention results to obtain sample downsampling results.
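For illustration only, the third attention calculation can be sketched as follows; the equal weighting of the three features and the requirement that they share a common shape are assumptions:

```python
import torch

def third_attention(replaced_attn, v_bg, v_obj, ref_desc_emb):
    # Sample weighted image feature: weighted combination of the first background value
    # feature, the first object value feature and the reference image description
    # embedded feature (equal weights assumed), all taken to be of shape (m, d_v)
    weighted_v = (v_bg + v_obj + ref_desc_emb) / 3.0
    # Matrix multiplication with the replaced background image attention result (n, m)
    # yields the sample downsampling result of shape (n, d_v)
    return replaced_attn @ weighted_v
```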
Alternatively, the first generating unit 2420 is configured to:
Word embedding processing is carried out on the first description information, and background image description embedding characteristics are obtained;
word embedding processing is carried out on the second description information, so that sample image description embedding characteristics are obtained;
word embedding processing is carried out on the sample image description information, so that reference image description embedding characteristics are obtained;
the reference image description embedding feature, the sample image description embedding feature, and the background image description embedding feature are integrated into denoising control information.
Optionally, the second generating unit 2430 is configured to:
Activating a preset fine tuning network based on the first description information;
Performing feature extraction on the sample background image based on the activated fine tuning network to obtain a first background query feature, a first background key feature and a first background value feature;
And integrating the first background inquiry feature, the first background key feature and the first background value feature into background fine adjustment control information.
Optionally, the second generating unit 2430 is configured to:
activating a preset fine tuning network based on the second description information;
performing feature extraction on the sample object image based on the activated fine tuning network to obtain a first object query feature, a first object key feature and a first object value feature;
integrating the first object query feature, the first object key feature and the first object value feature into object fine tuning control information.
Optionally, the sample hidden space feature corresponding to the sample reference image is generated by:
Performing coding processing on the sample reference image to obtain sample image coding characteristics;
based on a preset time step, the sample image coding features are subjected to diffusion processing through a diffusion network of the image generation model, and the sample hidden space features are obtained.
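One common closed-form realization of this diffusion (noise-adding) processing is the DDPM-style formulation sketched below; the noise schedule betas and the closed-form jump to step T are assumptions, since the embodiment only requires that noise be added over a preset time step:

```python
import torch

def diffuse_to_step_T(z0, T, betas):
    # betas: noise schedule tensor of length >= T (an assumed hyper-parameter)
    alphas = 1.0 - betas
    alpha_bar_T = torch.prod(alphas[:T])
    noise = torch.randn_like(z0)
    # z_T = sqrt(alpha_bar_T) * z_0 + sqrt(1 - alpha_bar_T) * noise
    return torch.sqrt(alpha_bar_T) * z0 + torch.sqrt(1.0 - alpha_bar_T) * noise
```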
Optionally, the sample reference image is generated by:
Acquiring an original reference image, wherein the original reference image is used for indicating an expected result of replacing a reference object in a sample background image with the sample object;
generating random numbers obeying Gaussian distribution based on a predetermined random number generation model;
and adding random numbers to pixel values of the pixel points for each pixel point in the original reference image to obtain a sample reference image.
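A hedged sketch of this noise-adding step is given below; the noise scale sigma and the use of NumPy are assumptions:

```python
import numpy as np

def make_sample_reference_image(original_reference, sigma=1.0, seed=None):
    # Random numbers obeying a Gaussian distribution, one per pixel point
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma, size=original_reference.shape)
    # Add the random number to the pixel value of each pixel point
    return original_reference.astype(np.float32) + noise.astype(np.float32)
```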
Optionally, the training unit 2450 includes:
An acquisition module (not shown) for acquiring reference noise in a sample reference image for each graphic sample pair, and performing noise prediction based on predicted image features to obtain predicted noise;
A calculation module (not shown) for calculating a sub-loss function based on a comparison of the reference noise and the predicted noise;
A determining module (not shown) for determining a total loss function based on the sub-loss functions of the respective image-text sample pairs;
A training module (not shown) for training the image generation model based on the total loss function.
Optionally, the prediction noise is derived by prediction of a plurality of prediction time steps;
a computing module (not shown) is used to:
Performing difference calculation on the reference noise and the predicted noise to obtain a noise difference;
And carrying out regular term calculation on the noise difference value to obtain a sub-loss function.
Optionally, the sample object image is generated by:
Acquiring a sample image of a sample object;
Image segmentation is carried out on the sample image based on a preset object segmentation model, so that a sample segmentation image with a sample object is obtained;
And carrying out image enhancement on the sample segmentation image to obtain a sample object image.
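The following sketch illustrates one possible way to realize this pipeline; segmentation_model and enhance are assumed callables and do not refer to any specific library:

```python
import numpy as np

def build_sample_object_image(sample_image, segmentation_model, enhance):
    # Image segmentation with a preset object segmentation model: returns a
    # binary mask (H, W) marking the pixels that belong to the sample object
    mask = segmentation_model(sample_image)
    # Keep only the sample object region of the sample image (H, W, C)
    segmented = sample_image * mask[..., None]
    # Image enhancement (e.g. brightness or contrast adjustment, an assumption)
    return enhance(segmented)
```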
Fig. 25 is a schematic structural diagram of an image generating apparatus 2500 according to an embodiment of the present disclosure. The image generating apparatus 2500 includes:
A second acquisition unit 2510 for acquiring a target object image of a target object, a target background image, object description information of the target object image, background description information of the target background image, and target image description information, wherein the target background image contains a reference object to be replaced by a target object in the target object image, the target image description information describing a replacement from the reference object to the target object;
a third generating unit 2520, configured to generate denoising hint information based on the background description information, the object description information, and the target image description information;
A fourth generating unit 2530, configured to generate background fine adjustment prompt information based on the target background image, and generate object fine adjustment prompt information based on the target object image;
The image generating unit 2540 is configured to perform image generation through a trained image generating model based on the preset noise image, the denoising prompt information, the object fine tuning prompt information and the background fine tuning prompt information, so as to obtain a target image, where the trained image generating model is generated according to the training method of the image generating model; the target image is used to indicate the result of replacing the reference object in the target background image with the target object of the target object image.
Optionally, the trained image generation model includes a diffusion network, a denoising network, and a decoding network;
The image generation unit 2540 is configured to:
Performing diffusion processing on the coded image features of the preset noise image based on a diffusion network to obtain target hidden space features;
Denoising the target hidden space features through a denoising network based on the denoising prompt information, the object fine tuning prompt information and the background fine tuning prompt information to obtain a target denoising result;
and performing feature decoding on the target denoising result based on the decoding network to obtain a target image.
Referring to fig. 26, fig. 26 is a block diagram of a portion of a terminal implementing the training method of an image generation model or the image generation method according to an embodiment of the present disclosure; the terminal may be the object terminal shown in fig. 1. The terminal includes: a Radio Frequency (RF) circuit 2610, a memory 2615, an input unit 2630, a display unit 2640, a sensor 2650, an audio circuit 2660, a wireless fidelity (Wireless Fidelity, WiFi) module 2670, a processor 2680, and a power supply 2690. It will be appreciated by those skilled in the art that the terminal structure shown in fig. 26 does not constitute a limitation on the terminal (which may be a mobile phone, a computer or the like), and the terminal may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
The RF circuit 2610 may be used for receiving and transmitting signals in the course of sending and receiving information or during a call; in particular, after downlink information of a base station is received, the information is handed to the processor 2680 for processing; in addition, uplink data is sent to the base station.
The memory 2615 may be used to store software programs and modules, and the processor 2680 executes the software programs and modules stored in the memory 2615, thereby performing various functional applications and data processing of the object terminal.
The input unit 2630 may be used to receive input numeric or character information and generate key signal inputs related to setting and function control of the object terminal. In particular, the input unit 2630 may include a touch panel 2631 and other input devices 2632.
The display unit 2640 may be used to display input information or provided information and various menus of the object terminal. The display unit 2640 may include a display panel 2641.
Audio circuitry 2660, speaker 2661, microphone 2662 may provide an audio interface.
In this embodiment, the processor 2680 included in the terminal may perform the training method or the image generation method of the image generation model of the previous embodiment.
Fig. 27 is a block diagram of a portion of a server implementing the training method of an image generation model or the image generation method according to an embodiment of the present disclosure. The server may be the image processing server shown in fig. 1. Servers may vary widely by configuration or performance, and may include one or more central processing units (Central Processing Units, simply CPUs) 2722 (e.g., one or more processors), a memory 2732, and one or more storage media 2730 (e.g., one or more mass storage devices) storing application programs 2742 or data 2744. The memory 2732 and the storage medium 2730 may be transitory or persistent storage. The program stored on the storage medium 2730 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Still further, the central processing unit 2722 may be configured to communicate with the storage medium 2730 and execute, on the server, the series of instruction operations in the storage medium 2730.
The server may also include one or more power supplies 2726, one or more wired or wireless network interfaces 2750, one or more input/output interfaces 2758, and/or one or more operating systems 2741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The central processor 2722 in the server may be used to perform a training method or an image generation method of the image generation model of the embodiments of the present disclosure.
The embodiments of the present disclosure also provide a computer-readable storage medium storing a computer program for executing the training method or the image generation method of the image generation model of the foregoing embodiments.
The disclosed embodiments also provide a computer program product comprising a computer program. The processor of the electronic device reads the computer program and executes it, so that the electronic device executes a training method or an image generation method implementing the image generation model described above.
The terms "first," "second," "third," "fourth," and the like in the description of the present disclosure and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this disclosure, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or a similar expression means any combination of these items, including any combination of a single item or plural items. For example, at least one (one item) of a, b or c may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be singular or plural.
It should be understood that in the description of the embodiments of the present disclosure, "a plurality of" (or "multiple") means two or more; "greater than", "less than", "exceeding" and the like are understood as not including the stated number itself, while "above", "below", "within" and the like are understood as including the stated number itself.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory RAM), a magnetic disk, or an optical disk, etc., which can store program codes.
It should also be appreciated that the various implementations provided by the embodiments of the present disclosure may be arbitrarily combined to achieve different technical effects.
The above is a specific description of the embodiments of the present disclosure, but the present disclosure is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present disclosure, and are included in the scope of the present disclosure as defined in the claims.

Claims (20)

1. A training method for an image generation model, the training method comprising:
Obtaining a plurality of graphic sample pairs, wherein each graphic sample pair comprises a sample reference image, a sample background image, a sample object image of a sample object, first descriptive information of the sample background image, second descriptive information of the sample object image and sample image descriptive information of the sample reference image, and the sample reference image is obtained by adding noise to an expected result of replacing a reference object in the sample background image with the sample object;
generating denoising control information based on the first description information, the second description information, and the sample image description information;
generating background fine adjustment control information based on the sample background image and generating object fine adjustment control information based on the sample object image;
Denoising the sample hidden space features corresponding to the sample reference image through the image generation model based on the background fine adjustment control information, the object fine adjustment control information and the denoising control information to obtain predicted image features;
And training the image generation model based on the predicted image features and the sample reference images of the image-text sample pairs to obtain a trained image generation model, wherein the trained image generation model is used for generating images according to a target object image, a target background image, object description information of the target object image, background description information of the target background image and target image description information.
2. The method of training an image generation model according to claim 1, wherein the image generation model comprises a denoising network; the denoising network comprises an upsampling attention network and a downsampling attention network; the background fine adjustment control information comprises first background fine adjustment information and second background fine adjustment information; the object fine adjustment control information comprises first object fine adjustment information and second object fine adjustment information;
the denoising processing is performed on the sample hidden space features corresponding to the sample reference image through the image generation model based on the background fine adjustment control information, the object fine adjustment control information and the denoising control information, so as to obtain predicted image features, including:
Based on the first object fine adjustment information, the first background fine adjustment information and the denoising control information, performing downsampling processing on the sample hidden space features through the downsampling attention network to obtain a sample downsampling result;
and based on the second object fine adjustment information, the second background fine adjustment information and the denoising control information, performing upsampling processing on the sample downsampling result through the upsampling attention network to obtain the predicted image feature.
3. The method of training an image generation model according to claim 2, wherein the first background fine adjustment information includes a first background query feature, a first background key feature and a first background value feature; the first object fine adjustment information includes a first object query feature, a first object key feature and a first object value feature; the denoising control information includes a reference image description embedded feature, a sample image description embedded feature and a background image description embedded feature;
based on the first object fine adjustment information, the first background fine adjustment information, and the denoising control information, the downsampling processing is performed on the sample hidden space features through the downsampling attention network to obtain a sample downsampling result, including:
performing first attention calculation through the downsampling attention network based on the sample hidden space features, the first background query features, the first background key features and the background image description embedded features to obtain a background image attention result;
performing second attention calculation through the downsampling attention network based on the sample hidden space features, the first object query features, the first object key features and the sample image description embedding features to obtain a sample object image attention result;
Performing feature replacement on the background image attention result based on the sample object image attention result to obtain a replaced background image attention result;
And performing third attention calculation through the downsampling attention network based on the attention result of the replaced background image, the first background value characteristic, the first object value characteristic and the reference image description embedded characteristic to obtain the sample downsampling result.
4. A training method of an image generation model according to claim 3, wherein the feature-replacing the background image attention result based on the sample object image attention result to obtain a replaced background image attention result comprises:
Determining a first word number of the reference object based on the first description information, and determining a second word number of the sample object based on the second description information;
determining a word weight between the reference object and the sample object based on the first word number and the second word number;
And replacing the reference object characteristic in the background image attention result with the sample object characteristic in the sample object image attention result based on the word weight to obtain the replaced background image attention result.
5. A method of training an image generation model according to claim 3, wherein said performing a first attention calculation through said downsampled attention network based on said sample hidden space feature, said first background query feature, said first background key feature, and said background image description embedding feature to obtain a background image attention result comprises:
Weighting the sample hidden space features and the first background query features to obtain first weighted image features;
weighting the background image description embedded feature and the first background key feature to obtain a second weighted image feature;
and carrying out matrix multiplication on the first weighted image feature and the second weighted image feature to obtain the background image attention result.
6. A training method of an image generation model according to claim 3, wherein said performing a third attention calculation through said down-sampling attention network based on said post-replacement background image attention result, said first background value feature, said first object value feature, and said reference image description embedding feature, to obtain said sample down-sampling result, comprises:
weighting the first background value characteristic, the first object value characteristic and the reference image description embedded characteristic to obtain a sample weighted image characteristic;
and carrying out matrix multiplication on the sample weighted image characteristics and the replaced background image attention result to obtain the sample downsampling result.
7. The method of training an image generation model according to claim 1, wherein the generating denoising control information based on the first description information, the second description information, and the sample image description information comprises:
word embedding processing is carried out on the first description information, so that background image description embedded characteristics are obtained;
word embedding processing is carried out on the second description information, so that sample image description embedded characteristics are obtained;
word embedding processing is carried out on the sample image description information, so that reference image description embedding characteristics are obtained;
integrating the reference image description embedding feature, the sample image description embedding feature and the background image description embedding feature into the denoising control information.
8. The method of training an image generation model according to claim 1, wherein the generating background fine adjustment control information based on the sample background image comprises:
Activating a preset fine tuning network based on the first description information;
Performing feature extraction on the sample background image based on the activated fine tuning network to obtain a first background query feature, a first background key feature and a first background value feature;
Integrating the first background query feature, the first background key feature, and the first background value feature into the background fine adjustment control information.
9. The training method of an image generation model according to claim 1, wherein the sample hidden space features corresponding to the sample reference image are generated by:
Performing coding processing on the sample reference image to obtain sample image coding characteristics;
And based on a preset time step, performing diffusion processing on the sample image coding features through a diffusion network of the image generation model to obtain the sample hidden space features.
10. The training method of an image generation model according to claim 1, wherein the sample reference image is generated by:
Acquiring an original reference image, wherein the original reference image is used for indicating an expected result of replacing a reference object in the sample background image with a sample object;
generating random numbers obeying Gaussian distribution based on a predetermined random number generation model;
And adding the random number to the pixel value of each pixel point in the original reference image to obtain the sample reference image.
11. The method of training an image generation model according to claim 1, wherein the training the image generation model based on the predicted image features and the sample reference images of the plurality of image-text sample pairs comprises:
for each image-text sample pair, acquiring reference noise in the sample reference image, and carrying out noise prediction based on the predicted image characteristics to obtain predicted noise;
calculating a sub-loss function based on a comparison of the reference noise and the prediction noise;
Determining a total loss function based on the sub-loss functions of the respective image-text sample pairs;
the image generation model is trained based on the total loss function.
12. The method of training an image generation model according to claim 11, wherein the prediction noise is obtained by prediction of a plurality of prediction time steps;
the calculating a sub-loss function based on the comparison of the reference noise and the prediction noise, comprising:
performing difference calculation on the reference noise and the prediction noise to obtain a noise difference;
and carrying out regular term calculation on the noise difference value to obtain the sub-loss function.
13. Training method for an image generation model according to any of the claims 1 to 12, characterized in that the sample object image is generated by:
Acquiring a sample image of the sample object;
performing image segmentation on the sample image based on a preset object segmentation model to obtain a sample segmentation image with the sample object;
and carrying out image enhancement on the sample segmentation image to obtain the sample object image.
14. An image generation method, characterized in that the image generation method comprises:
acquiring a target object image of a target object, a target background image, object description information of the target object image, background description information of the target background image, and target image description information, wherein the target background image contains a reference object to be replaced by a target object in the target object image, and the target image description information is used for describing replacement from the reference object to the target object;
Generating denoising prompt information based on the background description information, the object description information and the target image description information;
Generating background fine adjustment prompt information based on the target background image, and generating object fine adjustment prompt information based on the target object image;
Generating an image through a trained image generation model based on a preset noise image, the denoising prompt information, the object fine adjustment prompt information and the background fine adjustment prompt information to obtain a target image, wherein the trained image generation model is generated according to the training method of the image generation model of any one of claims 1 to 13; the target image is used to indicate a result of replacing the reference object in the target background image with the target object of the target object image.
15. The image generation method of claim 14, wherein the trained image generation model comprises a diffusion network, a denoising network, and a decoding network;
the image generation is performed through a trained image generation model based on a preset noise image, the denoising prompt information, the object fine adjustment prompt information and the background fine adjustment prompt information, so as to obtain a target image, and the method comprises the following steps:
performing diffusion processing on the coded image features of the preset noise image based on the diffusion network to obtain target hidden space features;
Denoising the target hidden space features through the denoising network based on the denoising prompt information, the object fine adjustment prompt information and the background fine adjustment prompt information to obtain a target denoising result;
and performing feature decoding on the target denoising result based on the decoding network to obtain the target image.
16. An image generation model training apparatus, characterized in that the image generation model training apparatus comprises:
a first obtaining unit, configured to obtain a plurality of image-text sample pairs, wherein each of the image-text sample pairs includes a sample reference image, a sample background image, a sample object image of a sample object, first description information of the sample background image, second description information of the sample object image, and sample image description information of the sample reference image, the sample reference image being obtained by adding noise to an expected result of replacing a reference object in the sample background image with the sample object;
a first generation unit, configured to generate denoising control information based on the first description information, the second description information, and the sample image description information;
a second generation unit, configured to generate background fine adjustment control information based on the sample background image, and generate object fine adjustment control information based on the sample object image;
a denoising unit, configured to denoise the sample hidden space features corresponding to the sample reference image through the image generation model based on the background fine adjustment control information, the object fine adjustment control information and the denoising control information, to obtain predicted image features;
a training unit, configured to train the image generation model based on the predicted image features of the plurality of image-text sample pairs and the sample reference image to obtain a trained image generation model, wherein the trained image generation model is used for generating images according to a target object image, a target background image, object description information of the target object image, background description information of the target background image and target image description information.
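A hypothetical single training step that mirrors the units of claim 16; the loss shown is an assumed mean-squared error between the predicted image features and features of the sample reference image, and every name below is illustrative rather than part of the claimed apparatus:

```python
import torch

def training_step(model, optimizer, text_encoder, image_encoder, batch):
    # First generation unit: denoising control information from the three descriptions.
    denoising_ctrl = text_encoder([batch["first_description"],
                                   batch["second_description"],
                                   batch["sample_image_description"]])
    # Second generation unit: fine-adjustment control information from the two images.
    background_ctrl = image_encoder(batch["sample_background_image"])
    object_ctrl = image_encoder(batch["sample_object_image"])
    # Denoising unit: denoise the hidden-space features of the sample reference image.
    predicted_features = model.denoise(batch["sample_latent"],
                                       background_ctrl, object_ctrl, denoising_ctrl)
    # Training unit: assumed MSE loss against features derived from the sample reference image.
    loss = torch.nn.functional.mse_loss(predicted_features, batch["reference_features"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```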
17. An image generation apparatus, characterized in that the image generation apparatus comprises:
a second acquisition unit, configured to acquire a target object image of a target object, a target background image, object description information of the target object image, background description information of the target background image, and target image description information, wherein the target background image contains a reference object to be replaced by the target object in the target object image, and the target image description information is used for describing the replacement of the reference object with the target object;
a third generating unit, configured to generate denoising prompt information based on the background description information, the object description information, and the target image description information;
a fourth generating unit, configured to generate background fine adjustment prompt information based on the target background image, and generate object fine adjustment prompt information based on the target object image;
an image generating unit, configured to generate an image through a trained image generation model based on a preset noise image, the denoising prompt information, the object fine adjustment prompt information and the background fine adjustment prompt information, to obtain a target image, wherein the trained image generation model is generated according to the training method of the image generation model of any one of claims 1 to 13; the target image is used to indicate a result of replacing the reference object in the target background image with the target object of the target object image.
18. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the training method of an image generation model according to any one of claims 1 to 13 or the image generation method according to any one of claims 14 to 15 when executing the computer program.
19. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the training method of an image generation model according to any one of claims 1 to 13 or the image generation method according to any one of claims 14 to 15.
20. A computer program product comprising a computer program which, when read and executed by a processor of an electronic device, causes the electronic device to perform the training method of an image generation model according to any one of claims 1 to 13 or the image generation method according to any one of claims 14 to 15.
CN202411081525.2A 2024-08-08 Training method, related device and medium for image generation model Active CN118747726B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411081525.2A 2024-08-08 Training method, related device and medium for image generation model

Publications (2)

Publication Number Publication Date
CN118747726A (en) 2024-10-08
CN118747726B (en) 2024-11-19

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315070A (en) * 2023-10-25 2023-12-29 Tencent Technology Shenzhen Co Ltd Image generation method, apparatus, electronic device, storage medium, and program product
CN118037759A (en) * 2024-01-26 2024-05-14 Beijing Dajia Internet Information Technology Co Ltd Image processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108734653B (en) Image style conversion method and device
CN112950471A (en) Video super-resolution processing method and device, super-resolution reconstruction model and medium
CN117576264B (en) Image generation method, device, equipment and medium
CN116205820A (en) Image enhancement method, target identification method, device and medium
CN118570054B (en) Training method, related device and medium for image generation model
CN113780326A (en) Image processing method and device, storage medium and electronic equipment
CN116503596A (en) Picture segmentation method, device, medium and electronic equipment
CN118096961B (en) Image processing method and device
CN111383289A (en) Image processing method, image processing device, terminal equipment and computer readable storage medium
CN111444331B (en) Content-based distributed feature extraction method, device, equipment and medium
CN118747726B (en) Training method, related device and medium for image generation model
CN118042246A (en) Video generation method, device, electronic equipment and readable storage medium
CN117835001A (en) Video editing method, device, equipment and medium
CN117671254A (en) Image segmentation method and device
CN114419517B (en) Video frame processing method, device, computer equipment and storage medium
CN118747726A (en) Training method, related device and medium for image generation model
CN117292122A (en) RGB-D significance object detection and semantic segmentation method and system
CN117496990A (en) Speech denoising method, device, computer equipment and storage medium
CN112052863B (en) Image detection method and device, computer storage medium and electronic equipment
CN116486009A (en) Monocular three-dimensional human body reconstruction method and device and electronic equipment
CN115311152A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN114596203A (en) Method and apparatus for generating images and for training image generation models
CN114724579A (en) Voice separation method and device, computer equipment and storage medium
CN114048349A (en) Method and device for recommending video cover and electronic equipment
CN114005063A (en) Video processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant