
CN117953108B - Image generation method, device, electronic equipment and storage medium

Info

Publication number
CN117953108B
CN117953108B (application CN202410328377.3A)
Authority
CN
China
Prior art keywords
image
text
image generation
loss
deviation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410328377.3A
Other languages
Chinese (zh)
Other versions
CN117953108A (en)
Inventor
冯鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410328377.3A
Publication of CN117953108A
Application granted
Publication of CN117953108B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/094 Adversarial learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/86 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiments of the present application disclose an image generation method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a text sample comprising a reference text, a positive deviation text and a negative deviation text, together with a first image generation network and a second image generation network; performing image generation processing on the reference text with the first image generation network to obtain a reference image, and performing image generation processing on the positive deviation text and the negative deviation text with the second image generation network to obtain a positive deviation image and a negative deviation image; determining a target loss of the second image generation network based on the reference image, the positive deviation image, and the negative deviation image; training the second image generation network based on the target loss to obtain a target image generation network; and performing image generation processing on a text to be processed with the target image generation network to obtain a target image. By introducing texts expressing three different concepts into network training, these embodiments improve the network's ability to represent different concepts and thereby improve the accuracy of the images it generates.

Description

Image generation method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular to an image generation method, an image generation apparatus, an electronic device, and a storage medium.
Background
Artificial Intelligence (AI) is a technology that uses digital computers to simulate how humans perceive the environment, acquire knowledge, and use knowledge, enabling machines to perform functions similar to human perception, reasoning, and decision-making. Artificial intelligence technology mainly includes computer vision, speech processing, natural language processing, machine learning, deep learning, and related directions.
Within artificial intelligence, text-to-image technology refers to the process of converting a text description into an image by combining natural language processing and computer vision techniques.
However, current text-to-image models often cannot accurately perceive the concepts expressed by a text description, and therefore cannot accurately generate the corresponding image from it.
Disclosure of Invention
The embodiments of the present application provide an image generation method, an image generation apparatus, an electronic device, and a storage medium, which can improve image generation accuracy.
The embodiment of the application provides an image generation method, which comprises the following steps:
Acquiring a text sample, a first image generation network and a second image generation network, wherein the text sample comprises a reference text, a positive deviation text and a negative deviation text, network parameters of the first image generation network are fixed, the positive deviation text has a positive semantic deviation from the reference text, and the negative deviation text has a negative semantic deviation from the reference text;
Performing image generation processing on the reference text by adopting the first image generation network to obtain a reference image, and performing image generation processing on the positive deviation text and the negative deviation text by adopting the second image generation network to obtain a positive deviation image and a negative deviation image;
determining a target loss of the second image generation network based on the reference image, the positive deviation image, and the negative deviation image;
training the second image generation network based on the target loss to obtain a target image generation network;
when the text to be processed is obtained, the target image generation network is adopted to perform image generation processing on the text to be processed, and a target image is obtained.
The embodiment of the application also provides an image generating device, which comprises:
An acquisition unit, configured to acquire a text sample, a first image generation network and a second image generation network, wherein the text sample comprises a reference text, a positive deviation text and a negative deviation text, network parameters of the first image generation network are fixed, the positive deviation text has a positive semantic deviation from the reference text, and the negative deviation text has a negative semantic deviation from the reference text;
The first generation unit is used for carrying out image generation processing on the reference text by adopting the first image generation network to obtain a reference image, and carrying out image generation processing on the positive deviation text and the negative deviation text by adopting the second image generation network to obtain a positive deviation image and a negative deviation image;
A loss determination unit configured to determine a target loss of the second image generation network based on the reference image, the positive deviation image, and the negative deviation image;
The training unit is used for training the second image generation network based on the target loss to obtain a target image generation network;
And the second generation unit is used for carrying out image generation processing on the text to be processed by adopting the target image generation network when the text to be processed is acquired, so as to obtain a target image.
In some embodiments, the loss determination unit includes:
a labeled image acquisition subunit, configured to acquire a labeled reference image corresponding to the reference text, a labeled positive deviation image corresponding to the positive deviation text, and a labeled negative deviation image corresponding to the negative deviation text;
a first determining subunit, configured to determine a target base loss and a target adversarial loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeled reference image, the labeled positive deviation image, and the labeled negative deviation image;
a second determining subunit, configured to determine the target loss based on the target base loss and the target adversarial loss.
In some embodiments, the first determining subunit is specifically configured to:
determining a first base loss according to the reference image and the labeled reference image;
determining a second base loss according to the positive deviation image and the labeled positive deviation image;
determining a third base loss according to the negative deviation image and the labeled negative deviation image;
and fusing the first base loss, the second base loss and the third base loss to obtain the target base loss.
In some embodiments, the first determining subunit is further specifically configured to:
determining a first adversarial loss according to the labeled reference image and the positive deviation image;
determining a second adversarial loss according to the labeled reference image and the negative deviation image;
determining a third adversarial loss according to the positive deviation image and the negative deviation image;
and fusing the first adversarial loss, the second adversarial loss and the third adversarial loss to obtain the target adversarial loss.
In some embodiments, the first determining subunit is further specifically configured to:
acquiring a first weight corresponding to the first adversarial loss, a second weight corresponding to the second adversarial loss, and a third weight corresponding to the third adversarial loss;
and fusing the first adversarial loss, the second adversarial loss and the third adversarial loss based on the first weight, the second weight and the third weight to obtain the target adversarial loss.
In some embodiments, the second determining subunit is specifically configured to:
and determining the difference between the target base loss and the target adversarial loss as the target loss.
In some embodiments, the labeled image acquisition subunit is specifically configured to:
perform image generation processing on the reference text using a third image generation network to obtain the labeled reference image, wherein the network structure of the third image generation network is the same as that of the first image generation network;
acquire a positive adjustment parameter and a negative adjustment parameter for an image;
adjust the labeled reference image with the positive adjustment parameter to obtain the labeled positive deviation image;
and adjust the labeled reference image with the negative adjustment parameter to obtain the labeled negative deviation image.
In some embodiments, the network structure of the first image generation network is the same as the network structure of the second image generation network, and the image generation apparatus further includes:
An input updating unit, configured to obtain, in a process of performing image generation processing using the second image generation network, an input feature of a first intermediate layer and an output feature of a second intermediate layer, where the first intermediate layer is any intermediate layer of the second image generation network, and the second intermediate layer is an intermediate layer corresponding to the first intermediate layer in the first image generation network;
and updating the input characteristics of the first middle layer based on the output characteristics of the second middle layer.
In some embodiments, the input updating unit is specifically configured to:
fusing the input features of the first intermediate layer with the output features of the second intermediate layer to obtain fusion features;
the fusion feature is determined as a new input feature of the first intermediate layer.
In some embodiments, the input updating unit is further specifically configured to:
Splicing the input characteristics of the first intermediate layer and the output characteristics of the second intermediate layer to obtain splicing characteristics;
and performing channel dimension reduction processing on the spliced features to obtain the fusion features.
In some embodiments, the reference text includes a base portion, the positive deviation text includes the base portion and a first auxiliary portion describing the base portion, and the negative deviation text includes the base portion and a second auxiliary portion describing the base portion, wherein the semantics of the first auxiliary portion are opposite to the semantics of the second auxiliary portion.
The embodiments of the present application also provide an electronic device, which includes a processor and a memory, wherein the memory stores a plurality of computer instructions; the processor loads the computer instructions from the memory to perform the steps in any of the image generation methods provided by the embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium, which stores a plurality of computer instructions, the computer instructions are suitable for being loaded by a processor to execute the steps in any of the image generation methods provided by the embodiment of the application.
The embodiments of the present application also provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps in any of the image generation methods provided by the embodiments of the present application.
According to the embodiment of the application, a text sample, a first image generation network and a second image generation network can be obtained, wherein the text sample comprises a reference text, a positive deviation text and a negative deviation text, network parameters of the first image generation network are fixed, the positive deviation text and the reference text have positive deviation in terms of semantics, and the negative deviation text and the reference text have negative deviation in terms of semantics; performing image generation processing on the reference text by adopting a first image generation network to obtain a reference image; adopting a second image generation network to respectively perform image generation processing on the positive deviation text and the negative deviation text to obtain a positive deviation image and a negative deviation image; determining a target loss of the second image generation network based on the reference image, the positive bias image, and the negative bias image; training the second image generation network based on the target loss to obtain a target image generation network; when the text to be processed is obtained, the target image generating network is adopted to perform image generating processing on the text to be processed, and a target image is obtained.
According to the embodiments of the present application, the second image generation network can be trained with the positive deviation text and the negative deviation text corresponding to the reference text, so that it deeply perceives two semantically opposite concepts relative to the reference text, which improves its understanding of specific concepts during training. In addition, the first image generation network performs image generation processing on the reference text to obtain the reference image; the second image generation network performs image generation processing on the positive deviation text and the negative deviation text to obtain the positive deviation image and the negative deviation image; and the target loss of the second image generation network is determined based on the reference image, the positive deviation image, and the negative deviation image. Because the network parameters of the first image generation network are fixed, that network can serve as a reference: the basic concept reflected by the reference image it generates prompts the second image generation network to distinguish and perceive the positive concept reflected by the positive deviation image and the negative concept reflected by the negative deviation image. This improves the efficiency and capability with which the second image generation network characterizes the various concepts without losing the trained target image generation model's perception of the basic concept, ensuring that the target image generation model accurately generates corresponding images from text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1a is a schematic view of a scenario of an image generating method according to an embodiment of the present application;
FIG. 1b is a flowchart of an image generation method according to an embodiment of the present application;
FIG. 1c is a schematic diagram of a data transfer direction between intermediate layers in a first image generation network and a second image generation network according to an embodiment of the present application;
FIG. 1d is a schematic diagram of a process for fusing input features of a first intermediate layer with output features of a second intermediate layer according to an embodiment of the present application;
FIG. 1e is a schematic diagram of the generation process of a labeled positive deviation image and a labeled negative deviation image according to an embodiment of the present application;
fig. 2a is a schematic diagram of an image generating method according to an embodiment of the present application applied to a server;
FIG. 2b is a schematic diagram of a model structure of a text-generated image model provided by an embodiment of the present application;
Fig. 3 is a schematic structural view of an image generating apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides an image generation method, an image generation device, electronic equipment and a storage medium.
The electronic device may be a terminal, a server, or the like. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a desktop computer, an intelligent television, a vehicle-mounted device and the like; the server may be a single server, a server cluster composed of a plurality of servers, or a cloud server.
For example, referring to fig. 1a, fig. 1a shows an application scenario schematic diagram of an image generating method according to an embodiment of the present application.
As shown in fig. 1a, the application scenario may include a server 100 and a terminal 200, where the server 100 may communicate with the terminal 200. In practical applications, the server 100 may deploy a pre-trained text-to-image model (hereinafter simply referred to as the model). The server 100 may receive an image generation request sent by the terminal 200, where the request carries a text to be processed, input the text to be processed into the text-to-image model to obtain a corresponding target image, and return the target image to the terminal 200.
It will be appreciated that in the specific embodiment of the present application, related data such as text to be processed sent by a user is involved, and when the above embodiments of the present application are applied to specific products or technologies, user permission or consent is required to be obtained, and the collection, use and processing of related data is required to comply with related laws and regulations and standards of related countries and regions.
The following will describe in detail. The numbers of the following examples are not intended to limit the preferred order of the examples.
Computer Vision (CV) is a technology that uses computers instead of human eyes to perform operations such as recognition and measurement on target images, and to carry out further processing. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, and intelligent transportation, as well as common biometric techniques such as face recognition and fingerprint recognition. Examples include image processing techniques such as image coloring and image stroke extraction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph techniques.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and how to reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
In this embodiment, an image generating method is provided, which may be executed by an electronic device, which may be the server 100 or the terminal 200 shown in fig. 1a, and as shown in fig. 1b, a specific flow of the image generating method may be as follows:
101. Acquire a text sample, a first image generation network and a second image generation network, wherein the text sample comprises a reference text, a positive deviation text and a negative deviation text, network parameters of the first image generation network are fixed, the positive deviation text has a positive semantic deviation from the reference text, and the negative deviation text has a negative semantic deviation from the reference text.
The text samples are texts used to train the text-to-image model; they mainly describe the target image that is to be generated, optionally covering aspects such as its content, scene, and context, for example "a puppy" or "a popular park". In some implementations, the text sample can include one or more text groups, where a text group can include one reference text, a positive deviation text corresponding to the reference text, and a negative deviation text corresponding to the reference text.
The reference text may be the text in a text group that serves as a reference for distinguishing the concepts of the other texts. The concept it describes is generally objective and neutral; the basic concept corresponding to the reference text carries no bias or tendency relative to the concepts of the other texts in the group. Typically the reference text describes the basic concept of an image, i.e., what the image mainly intends to present. As an example, the reference text may describe the object to be presented in the target image: the text "park" if the target image is to present a popular park, or the text "flower" if the target image is to present a beautiful flower; since the concept of the object itself carries no bias, such text can serve as the reference text. Alternatively, the reference text may also consist of text describing the object together with text expressing a neutral concept, such as "ordinary car" or "normal-sized puppy".
The positive deviation text may be text carrying a positive concept relative to the reference text, i.e., a concept more positive, favorable, or affirmative than the basic concept; hence the positive deviation text has a positive semantic deviation from the reference text. For example, relative to the reference text "car", the positive deviation text may be "nice car", "large car", and the like.
The negative deviation text may be text carrying a negative concept relative to the reference text, i.e., a concept more negative or unfavorable than the basic concept; hence the negative deviation text has a negative semantic deviation from the reference text. For example, relative to the reference text "car", the negative deviation text may be "unsightly car" and the like.
It will be appreciated that within the same text group, positive and negative bias texts are semantically corresponding and opposite, e.g., the semantics of the positive bias text embody the concept of "high" and the negative bias text embody the concept of "low". Alternatively, in the same text group, the semantic deviation amount of the positive deviation text and the semantic deviation amount of the negative deviation text also correspond, for example, the semantic meaning of the positive deviation text represents a "very high" concept, and the negative deviation text represents a "very low" concept; for another example, the semantics of positive bias text embody the concept of "higher", and then negative bias text embody the concept of "lower".
The first image generation network may be a feature extraction network in a complete text generation image model, where the feature extraction network may extract key features from text to form an abstract representation of image content, so that the complete text generation image model may generate a corresponding image according to the key features.
In some implementations, the complete text-to-image model can include a variational autoencoder (VAE), a text encoder, and a feature extraction network. The text encoder encodes the input text to obtain text features and inputs them into the feature extraction network; the feature extraction network extracts key features from the text features and inputs them into the variational autoencoder; the variational autoencoder may then generate a corresponding image based on the key features. The feature extraction network may be a U-shaped network (U-Net). In this embodiment, the network parameters of the first image generation network are fixed; these parameters may include the weights of the network's intermediate layers. Taking a U-Net as an example, the network parameters may include the weights of the upsampling layers, the weight parameters of the convolution layers, the weights of the fully connected layers, and the like. A minimal conceptual sketch of this pipeline follows.
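To make the division of roles concrete, here is a hedged PyTorch sketch of such a text encoder / feature extractor / decoder pipeline. All module internals, layer sizes, and names are illustrative assumptions, not the patent's actual networks:

```python
import torch
import torch.nn as nn

class TextToImageSketch(nn.Module):
    def __init__(self, vocab_size=1000, text_dim=256, seq_len=16, img_channels=3):
        super().__init__()
        # Text encoder: turns token ids into a flat text feature.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, text_dim),
            nn.Flatten(),
        )
        # Stand-in for the U-Net feature extraction network described above.
        self.feature_extractor = nn.Linear(seq_len * text_dim, 4 * 8 * 8)
        # Stand-in for the VAE decoder that maps key features to an image.
        self.vae_decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 16, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, img_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, token_ids):
        text_feat = self.text_encoder(token_ids)      # (B, seq_len * text_dim)
        key_feat = self.feature_extractor(text_feat)  # (B, 4 * 8 * 8)
        key_feat = key_feat.view(-1, 4, 8, 8)         # latent feature map
        return self.vae_decoder(key_feat)             # (B, C, 32, 32) image

image = TextToImageSketch()(torch.randint(0, 1000, (2, 16)))  # batch of 16-token texts
```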
The second image generating network may be a network having the same network structure as the first image generating network, and the specific description will refer to the first image generating network, so that the description is omitted herein. In this embodiment, the network parameters of the second image generating network may be updated, and may need to be determined through network training.
It will be appreciated that the first image generation network and the second image generation network may be disposed in the same text generation image model, or may be disposed in different text generation image models, which is not limited herein.
In some embodiments, when acquiring a text sample, texts in a text database may be labeled in advance according to their semantics. For example, among multiple texts describing the same object, a text without semantic deviation may be labeled as the reference text (such as "ordinary car"), a text with a positive semantic deviation from the reference text may be labeled as the positive deviation text (such as "nice car"), and a text with a negative semantic deviation from the reference text may be labeled as the negative deviation text (such as "ugly car"); the three labeled texts are then grouped into a text group to be extracted from the text database as a text sample. By dividing and labeling the texts in the database according to semantics, a large number of text samples can be obtained from the text database quickly and efficiently.
In other embodiments, the reference text may include a base portion, the positive deviation text includes the base portion and a first auxiliary portion describing it, and the negative deviation text includes the base portion and a second auxiliary portion describing it, where the semantics of the first auxiliary portion are opposite to those of the second auxiliary portion. Thus, in this embodiment, a text sample may be obtained by creating a base portion as the reference text, adding a first auxiliary portion to the base portion to obtain the positive deviation text, and adding a second auxiliary portion to the base portion to obtain the negative deviation text. For example, a base portion such as the text "car" may be created and used as the reference text; adding a first auxiliary portion (such as "beautiful") yields the positive deviation text "beautiful car", and adding a second auxiliary portion (such as "ugly") yields the negative deviation text "ugly car". In this way, text samples can be constructed accurately from a specific text structure template even without existing texts, as sketched below.
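A minimal illustration of this construction (the function name and example strings are hypothetical, not from the patent):

```python
def build_text_group(base: str, positive_aux: str, negative_aux: str) -> dict:
    """Build one text group from a base portion and two opposite auxiliary portions."""
    return {
        "reference": base,                     # e.g. "car"
        "positive": f"{positive_aux} {base}",  # e.g. "beautiful car"
        "negative": f"{negative_aux} {base}",  # e.g. "ugly car"
    }

sample = build_text_group("car", "beautiful", "ugly")
```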
102. And performing image generation processing on the reference text by adopting a first image generation network to obtain a reference image, and performing image generation processing on the positive deviation text and the negative deviation text by adopting a second image generation network to obtain a positive deviation image and a negative deviation image.
In some embodiments, the reference text may be input into a text-to-image model constructed based on the first image generation network, and the reference image output by the model may be acquired. Optionally, the text-to-image model may be a Stable Diffusion model, which is mainly composed of a variational autoencoder, a U-Net, and a text encoder.
In some embodiments, the positive deviation text may first be input into a text-to-image model constructed based on the second image generation network, and the positive deviation image output by the model may be acquired; the negative deviation text is then input into the same model, and the negative deviation image output by the model is acquired. The order in which the positive deviation text and the negative deviation text are input into the model may be set as desired and is not limited here.
In some embodiments, the network structure of the first image generation network is the same as the network structure of the second image generation network, and the image generation method may further include:
In the process of performing image generation processing by adopting the second image generation network, the input characteristics of the first intermediate layer and the output characteristics of the second intermediate layer are acquired, wherein the first intermediate layer is any intermediate layer of the second image generation network, and the second intermediate layer is an intermediate layer corresponding to the first intermediate layer in the first image generation network.
Wherein, since the network structure of the first image generation network is the same as the network structure of the second image generation network, the plurality of intermediate layers in the first image generation network and the plurality of intermediate layers in the second image generation network are in one-to-one correspondence.
The first intermediate layer and the second intermediate layer may in particular be hidden layers in the network.
Then, the input features of the first intermediate layer are updated based on the output features of the second intermediate layer.
It will be appreciated that the step of performing the image generation processing of the reference text using the first image generation network and the step of performing the image generation processing of the positive deviation text using the second image generation network may be performed simultaneously, or the step of performing the image generation processing of the reference text using the first image generation network and the step of performing the image generation processing of the negative deviation text using the second image generation network may be performed simultaneously, since the input features of the first intermediate layer are to be updated based on the output features of the second intermediate layer.
For example, referring to fig. 1c, fig. 1c shows the direction of data transfer between intermediate layers in the first and second image generation networks, wherein the arrows in fig. 1c represent the direction of data transfer. As can be seen from fig. 1c, the first image generating network includes a plurality of second intermediate layers, the second image generating network includes a plurality of first intermediate layers, in the second image generating network, two data are received at an input end of each first intermediate layer, one data is an output of a first intermediate layer before the first intermediate layer, and the other data is an output of a second intermediate layer corresponding to the first intermediate layer.
Because the first image generation network learns the basic concept expressed by the reference text, while the second image generation network learns the positive concept expressed by the positive deviation text and the negative concept expressed by the negative deviation text relative to that basic concept, in this implementation the input features of the first intermediate layer are updated based on the output features of the second intermediate layer during image generation with the second image generation network. In this way, the feature output of each intermediate layer in the first image generation network is transmitted to the corresponding intermediate layer in the second image generation network, so that the second image generation network is not influenced by the semantics of the positive and negative concepts when learning the basic concept, which ensures the accuracy of its understanding of the basic concept. A simplified sketch of this cross-network data flow follows.
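The following hedged PyTorch sketch illustrates the data flow of fig. 1c, under the assumption that both networks are plain layer stacks; for brevity the frozen layer's output is injected by simple addition, whereas the embodiment described below fuses by splicing followed by a 1x1 convolution:

```python
import torch
import torch.nn as nn

depth, dim = 3, 32
first_net = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])   # frozen network
second_net = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])  # trainable network

def forward_with_injection(x: torch.Tensor) -> torch.Tensor:
    h_first, h_second = x, x
    for frozen_layer, trainable_layer in zip(first_net, second_net):
        with torch.no_grad():
            h_first = frozen_layer(h_first)  # output of the second intermediate layer
        # Each layer of the second network receives its own input plus the
        # matching frozen layer's output (addition here stands in for the
        # splice-and-reduce fusion described in the embodiment below).
        h_second = trainable_layer(h_second + h_first)
    return h_second

out = forward_with_injection(torch.randn(1, dim))
```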
As an embodiment, a specific embodiment of updating the input characteristics of the first intermediate layer based on the output characteristics of the second intermediate layer may include:
and fusing the input characteristics of the first middle layer with the output characteristics of the second middle layer to obtain fusion characteristics.
The fusion feature is determined as a new input feature of the first intermediate layer.
For example, referring to fig. 1d, in fig. 1d, after a series of fusion processes are performed on the input features of the first intermediate layer and the output features of the second intermediate layer, corresponding fusion features may be obtained.
Because the input of the first intermediate layer must satisfy the input rule corresponding to that layer before the layer can perform its feature extraction operation, in this embodiment the input feature of the first intermediate layer and the output feature of the second intermediate layer are fused after being obtained, so that the resulting fusion feature conforms to the input rule of the first intermediate layer; this ensures that the second image generation network can perform feature extraction stably. The input rule may include a specified feature size, where the feature size may refer to the channel dimension, i.e., the dimension of a tensor used to represent a feature or attribute of the data.
In some embodiments, the step of fusing the input features of the first intermediate layer with the output features of the second intermediate layer to obtain the fused features may include:
and splicing the input characteristics of the first intermediate layer with the output characteristics of the second intermediate layer to obtain splicing characteristics.
And performing channel dimension reduction processing on the spliced features to obtain fusion features.
As shown in fig. 1d, when fusing the input feature of the first intermediate layer with the output feature of the second intermediate layer, the two features may first be spliced to obtain a spliced feature. Since the feature size corresponding to the first intermediate layer is the same as that corresponding to the second intermediate layer, the feature size of the spliced feature is twice the feature size expected by the first intermediate layer; the spliced feature therefore needs dimension reduction so that it conforms to the feature size of the first intermediate layer. Specifically, a 1x1 multi-channel convolution can be applied to reduce the spliced feature along the channel dimension, yielding the fusion feature, as sketched below.
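A minimal PyTorch sketch of this splice-and-reduce fusion (shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

channels = 64
# 1x1 multi-channel convolution that halves the channel count of the spliced feature.
channel_reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

x_in = torch.randn(1, channels, 32, 32)    # input feature of the first intermediate layer
y_out = torch.randn(1, channels, 32, 32)   # output feature of the second intermediate layer

spliced = torch.cat([x_in, y_out], dim=1)  # spliced feature: (1, 128, 32, 32)
fused = channel_reduce(spliced)            # fusion feature:  (1, 64, 32, 32)
```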
103. Determine a target loss of the second image generation network based on the reference image, the positive deviation image, and the negative deviation image.
In some embodiments, in step 103, determining the target loss of the second image generation network based on the reference image, the positive deviation image, and the negative deviation image may include:
A1, obtaining a labeled reference image corresponding to the reference text, a labeled positive deviation image corresponding to the positive deviation text, and a labeled negative deviation image corresponding to the negative deviation text.
The labeled reference image may be the image expected to be generated from the reference text, and may be a pre-labeled image associated with the reference text. During training of the text-to-image model, the labeled reference image can serve as ground truth against which the reference image generated from the reference text is verified, so as to evaluate the performance of the model and adjust its parameters. Specifically, the model may adjust its parameters by comparing the labeled reference image with the generated reference image and computing the difference between them, so that the predicted result moves closer to the target result.
The labeled positive deviation image may be the image expected to be generated from the positive deviation text, and may be a pre-labeled image associated with the positive deviation text. Its role is analogous to that of the labeled reference image and is not repeated here.
The labeled negative deviation image may be the image expected to be generated from the negative deviation text, and may be a pre-labeled image associated with the negative deviation text. Its role is likewise analogous to that of the labeled reference image and is not repeated here.
In some embodiments, if the object described by the reference text is determined, the object may be photographed with normal camera parameters to obtain the labeled reference image. For example, if the object described by the reference text is a flower, the flower may be photographed by a camera under normal camera parameters (such as the initial camera parameters) to obtain the labeled reference image.
In some embodiments, in step A1, obtaining the labeled reference image corresponding to the reference text may include:
screening out an image corresponding to the reference text from an image database by manual screening, and labeling that image to obtain the labeled reference image. Similarly, the labeled positive deviation image corresponding to the positive deviation text and the labeled negative deviation image corresponding to the negative deviation text can also be obtained from the image database by manual screening and labeling.
In other embodiments, in step A1, obtaining the labeled reference image corresponding to the reference text, the labeled positive deviation image corresponding to the positive deviation text, and the labeled negative deviation image corresponding to the negative deviation text may include:
A11, performing image generation processing on the reference text using a third image generation network to obtain the labeled reference image, wherein the network structure of the third image generation network is the same as that of the first image generation network.
The third image generation network may be a network with the same function as the first image generation network, so it is not described again here. Optionally, the network parameters of the third image generation network may be adjusted in advance so that the error between the result it predicts from an input text and the target result expected for that text does not exceed a specified threshold.
For example, the reference text may be input into the third image generation network, and the image output by the third image generation network may be taken as the labeled reference image.
A12, acquiring a positive adjustment parameter and a negative adjustment parameter for the image.
The positive adjustment parameter may be a parameter for beautifying the object presented in the image; for example, for a face presented in an image, the positive adjustment parameter may include parameters for enlarging the eyes, slimming the face, and the like.
The negative adjustment parameter may be a parameter whose effect is opposite to that of the positive adjustment parameter; for example, for a face presented in an image, the negative adjustment parameter may include parameters for shrinking the eyes, widening the face, and the like.
Optionally, the positive adjustment parameter and the negative adjustment parameter may apply adjustments of equal magnitude in opposite directions; for example, if the positive adjustment parameter enlarges the eyes in a face by 10%, the negative adjustment parameter may shrink them by 10%. A toy sketch of such paired parameters follows.
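A toy illustration of equal-magnitude, opposite-direction adjustments, here using brightness via torchvision as an assumed example (the patent's adjustment parameters are not limited to brightness):

```python
import torch
from torchvision.transforms.functional import adjust_brightness

labeled_reference = torch.rand(3, 64, 64)                      # labeled reference image
labeled_positive = adjust_brightness(labeled_reference, 1.10)  # +10% brightness
labeled_negative = adjust_brightness(labeled_reference, 0.90)  # -10% brightness
```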
For example, referring to fig. 1e, the positive adjustment parameter and the negative adjustment parameter may be obtained through image editing software. Specifically, the labeled reference image may be input into the image editing software, and different editing instructions may then be issued to obtain the corresponding adjustment parameters. For example, issuing editing instructions such as refining, increasing brightness, and increasing contrast on the labeled reference image yields the corresponding positive adjustment parameters, while issuing editing instructions such as coarsening, decreasing brightness, and decreasing contrast yields the corresponding negative adjustment parameters.
A13, adjusting the labeled reference image with the positive adjustment parameter to obtain the labeled positive deviation image.
Following the above example, referring again to fig. 1e, an editing instruction corresponding to a positive adjustment parameter is input into the image editing software, which adjusts the labeled reference image based on that parameter to obtain the labeled positive deviation image. For example, the labeled reference image corresponding to "an ordinary flower" can be edited into the labeled positive deviation image with positive adjustment parameters that may include, but are not limited to: a "refine" instruction (adjusting the outline of the flower to be finer), a "gorgeous" instruction (adjusting the color of the flower to be brighter), and a "blooming" instruction (adjusting the shape, size, and posture of the flower toward blooming).
A14, adjusting the labeled reference image with the negative adjustment parameter to obtain the labeled negative deviation image.
Following the above example, referring again to fig. 1e, an editing instruction corresponding to a negative adjustment parameter is input into the image editing software, which adjusts the labeled reference image based on that parameter to obtain the labeled negative deviation image. For example, editing "an ordinary flower" with the negative adjustment parameters yields "an unsightly flower"; the negative adjustment parameters may include, but are not limited to: a "coarsen" instruction (adjusting the outline of the flower to be rougher), a "dim" instruction (adjusting the color of the flower to be darker), and a "wither" instruction (adjusting the shape, size, and posture of the flower toward withering).
In this embodiment, by acquiring the positive adjustment parameter and the negative adjustment parameter for the image and adjusting the labeled reference image with each of them, the labeled positive deviation image and the labeled negative deviation image are obtained accurately and efficiently.
It should be understood that, in this embodiment, the manner of obtaining the labeled reference image corresponding to the reference text may be replaced by manual screening, and the labeled positive deviation image and the labeled negative deviation image can then still be obtained through steps A12 to A14.
A2, determining the target base loss and the target adversarial loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeled reference image, the labeled positive deviation image, and the labeled negative deviation image.
In some embodiments, in step A2, determining the target base loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeled reference image, the labeled positive deviation image, and the labeled negative deviation image may include:
determining a first base loss according to the reference image and the labeled reference image;
determining a second base loss according to the positive deviation image and the labeled positive deviation image;
determining a third base loss according to the negative deviation image and the labeled negative deviation image;
and fusing the first base loss, the second base loss and the third base loss to obtain the target base loss.
The first base loss can be determined from the reference image and the labeled reference image through a preset loss function; likewise, the second base loss can be determined from the positive deviation image and the labeled positive deviation image, and the third base loss from the negative deviation image and the labeled negative deviation image. Optionally, the preset loss function may be the mean squared error (MSE). Other common loss functions may also be used instead; the specific loss function can be chosen according to actual requirements and is not limited here.
Illustratively, taking the preset loss function to be the mean squared error, the expression of the first base loss may be as follows:
$L_{b1} = \operatorname{MSE}(\hat{x}_r, x_r)$
where $L_{b1}$ is the first base loss, $\hat{x}_r$ is the labeled reference image, and $x_r$ is the reference image.
The expression of the second base loss may be as follows:
$L_{b2} = \operatorname{MSE}(\hat{x}_p, x_p)$
where $L_{b2}$ is the second base loss, $\hat{x}_p$ is the labeled positive deviation image, and $x_p$ is the positive deviation image.
The expression of the third base loss may be as follows:
$L_{b3} = \operatorname{MSE}(\hat{x}_n, x_n)$
where $L_{b3}$ is the third base loss, $\hat{x}_n$ is the labeled negative deviation image, and $x_n$ is the negative deviation image.
As an embodiment, fusing the first base loss, the second base loss and the third base loss to obtain the target base loss may include:
taking the sum of the first base loss, the second base loss and the third base loss as the target base loss. Specifically, the expression of the target base loss may be as follows:
$L_{base} = L_{b1} + L_{b2} + L_{b3}$
where $L_{base}$ is the target base loss.
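A short code sketch of this target base loss, mirroring the three MSE terms above (tensor arguments and names are illustrative):

```python
import torch.nn.functional as F

def target_base_loss(ref, ref_gt, pos, pos_gt, neg, neg_gt):
    """Sum of three MSE terms between generated images and their labeled counterparts."""
    l_b1 = F.mse_loss(ref, ref_gt)  # reference image vs labeled reference image
    l_b2 = F.mse_loss(pos, pos_gt)  # positive deviation image vs its labeled image
    l_b3 = F.mse_loss(neg, neg_gt)  # negative deviation image vs its labeled image
    return l_b1 + l_b2 + l_b3
```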
In some embodiments, in step A2, determining the target adversarial loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeled reference image, the labeled positive deviation image, and the labeled negative deviation image may include:
A21, determining the first adversarial loss according to the labeled reference image and the positive deviation image.
A22, determining the second adversarial loss according to the labeled reference image and the negative deviation image.
A23, determining the third adversarial loss according to the positive deviation image and the negative deviation image.
The first adversarial loss can be determined from the labeled reference image and the positive deviation image, the second adversarial loss from the labeled reference image and the negative deviation image, and the third adversarial loss from the positive deviation image and the negative deviation image, each through a preset loss function. Optionally, the preset loss function may be the mean squared error (MSE).
Illustratively, again taking the preset loss function as the mean square error algorithm, the expression of the first countermeasure loss may be as follows:

L_a1 = MSE(I'_base, I_pos)

where L_a1 is the first countermeasure loss.
The expression of the second countermeasure loss may be as follows:

L_a2 = MSE(I'_base, I_neg)

where L_a2 is the second countermeasure loss.
The expression of the third countermeasure loss may be as follows:

L_a3 = MSE(I_pos, I_neg)

where L_a3 is the third countermeasure loss.
A24, fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss to obtain the target countermeasure loss.
In some embodiments, in step A24, fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss to obtain the target countermeasure loss may include:
Acquiring a first weight corresponding to the first countermeasure loss, a second weight corresponding to the second countermeasure loss and a third weight corresponding to the third countermeasure loss.
And fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss based on the first weight, the second weight and the third weight to obtain the target countermeasure loss.
The first weight, the second weight and the third weight may be set according to actual requirements. For example, if the trained model is expected to pay more attention to distinguishing the positive concept from the negative concept, the third weight may be increased.
In some embodiments, the training effect of the model may be checked at regular intervals, and the first weight, the second weight and the third weight may be updated according to the training effect. For example, if after a period of training the image output by the model for the negative deviation text or the positive deviation text falls in an intermediate zone between the positive concept and the negative concept, the model has not clearly distinguished the two concepts, so the third weight may be increased to strengthen the training effect of the model.
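Continuing the sketch above, one hypothetical implementation of this periodic check measures the gap between the positive and negative outputs with the same mean square error and raises the third weight when the gap is small; check_interval, gap_threshold and increment are assumed hyper-parameters, not values fixed by this embodiment:

    def maybe_raise_third_weight(step, pos_img, neg_img, w3,
                                 check_interval=1000, gap_threshold=0.05,
                                 increment=0.1):
        # Every check_interval steps, test whether the positive and negative
        # outputs are still in the intermediate zone (barely distinguished).
        if step % check_interval == 0:
            gap = F.mse_loss(pos_img, neg_img).item()
            if gap < gap_threshold:
                # No obvious distinction between positive and negative concepts:
                # increase the third weight to strengthen the adversarial signal.
                w3 += increment
        return w3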
Illustratively, suppose the first weight is w1, the second weight is w2, and the third weight is w3. Then the expression of the target countermeasure loss may be as follows:

L_adv = w1·L_a1 + w2·L_a2 + w3·L_a3

where L_adv is the target countermeasure loss.
A3, determining the target loss based on the target base loss and the target countermeasure loss.
In some embodiments, in step A3, determining the target loss based on the target base loss and the target countermeasure loss may include:
The difference between the target base loss and the target countermeasure loss is determined as the target loss.
Following the above example, the expression of the target loss may be as follows:

L = L_base - L_adv

In this embodiment, the target loss is calculated from the above three countermeasure losses. The first countermeasure loss and the second countermeasure loss enable the model to perceive the difference between the semantic expression of the auxiliary concepts (such as the above positive concept and negative concept) and that of the basic concept. The third countermeasure loss, calculated from the positive deviation image and the negative deviation image, enables the model to perceive the opposition between the positive and negative concepts, thereby improving the accuracy of the text-to-image model. In addition, so that the model can focus on a given concept as required during training, the weights w1, w2 and w3 are applied to the countermeasure losses, allowing the model to focus more on learning the distance between different concepts.
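Continuing the same sketch, the countermeasure losses, their weighted fusion and the final target loss might be assembled as follows (target_base_loss is the function sketched earlier; w1, w2 and w3 are the loss weights):

    def target_loss(ref_img, pos_img, neg_img, gt_ref, gt_pos, gt_neg,
                    w1=1.0, w2=1.0, w3=1.0):
        l_base = target_base_loss(ref_img, pos_img, neg_img,
                                  gt_ref, gt_pos, gt_neg)
        # First countermeasure loss: labeling reference image vs. positive deviation image.
        l_a1 = F.mse_loss(gt_ref, pos_img)
        # Second countermeasure loss: labeling reference image vs. negative deviation image.
        l_a2 = F.mse_loss(gt_ref, neg_img)
        # Third countermeasure loss: positive deviation image vs. negative deviation image.
        l_a3 = F.mse_loss(pos_img, neg_img)
        # Weighted fusion of the three countermeasure losses.
        l_adv = w1 * l_a1 + w2 * l_a2 + w3 * l_a3
        # Target loss: minimizing l_base pulls outputs toward their labels,
        # while the subtracted l_adv pushes the three outputs apart.
        return l_base - l_adv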
104. And training the second image generation network based on the target loss to obtain a target image generation network.
In some embodiments, the network parameters of the second image generation network may be adjusted according to the target loss, for example by adjusting the weights of the intermediate layers of the second image generation network, and the operations of steps 101 to 104 are repeated until the text generation image model containing the second image generation network converges; the second image generation network obtained at convergence is taken as the target image generation network.
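By way of illustration only, the convergence loop might look like the following sketch; first_net, second_net, generate_with, batches and converged are placeholders rather than APIs of any particular library:

    optimizer = torch.optim.AdamW(second_net.parameters(), lr=1e-5)

    while not converged():
        for base_txt, pos_txt, neg_txt, gt_ref, gt_pos, gt_neg in batches:
            # Steps 101-103: the frozen first network generates the reference
            # image; the trainable second network generates the deviation images.
            with torch.no_grad():
                ref_img = generate_with(first_net, base_txt)
            pos_img = generate_with(second_net, pos_txt)
            neg_img = generate_with(second_net, neg_txt)
            # Step 104: compute the target loss and update only the second network.
            loss = target_loss(ref_img, pos_img, neg_img, gt_ref, gt_pos, gt_neg)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Once the loop exits, second_net is taken as the target image generation network and is used on its own for the inference of step 105.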
105. When the text to be processed is obtained, the target image generating network is adopted to perform image generating processing on the text to be processed, and a target image is obtained.
In some embodiments, the obtained text to be processed may be input into the text generation image model including the target image generation network, and the image output by the model may be acquired as the target image.
It can be seen that, in this embodiment, training the second image generation network with the positive deviation text and the negative deviation text corresponding to the reference text lets the second image generation network deeply perceive two concepts whose semantics are opposite relative to the reference text, which improves its understanding of the specified concepts during training. In addition, the first image generation network performs image generation processing on the reference text to obtain a reference image, and the second image generation network performs image generation processing on the positive deviation text and the negative deviation text to obtain a positive deviation image and a negative deviation image; the target loss of the second image generation network is then determined based on the reference image, the positive deviation image and the negative deviation image. Because the network parameters of the first image generation network are fixed, the first image generation network can serve as a reference: the basic concept reflected by the reference image it generates prompts the second image generation network to distinguish and perceive the positive concept reflected by the positive deviation image and the negative concept reflected by the negative deviation image. This improves the efficiency and capability with which the second image generation network characterizes the various concepts, without losing the trained target image generation network's perception of the basic concept, thereby ensuring that the target image generation network accurately generates the corresponding image according to the text.
The method described in the above embodiments will be described in further detail below.
An image generation method may be performed by an electronic device, and in this embodiment, a method of an embodiment of the present application will be described in detail by taking the method performed by a server as an example.
As shown in fig. 2a, a specific flow of an image generating method is as follows:
201. The server obtains a text sample, a first image generation network, and a second image generation network, the text sample including a reference text, a positive deviation text, and a negative deviation text. The network parameters of the first image generation network are fixed, the positive deviation text has a positive semantic deviation from the reference text, and the negative deviation text has a negative semantic deviation from the reference text.
For example, referring to fig. 2b, in practical application, the first image generation network and the second image generation network may be connected in parallel to form a text generation image model as shown in fig. 2b. The text generation image model may be a Stable Diffusion model and may include: the first image generation network and the second image generation network connected in parallel at multiple stages, a text encoder connected to the inputs of the two parallel networks, and a variational autoencoder (VAE) connected to their outputs. The text encoder is configured to encode the input text to obtain a text embedding: encoding the reference text yields a base text embedding (Base Embedding), encoding the positive deviation text yields a positive deviation text embedding (Positive Embedding), and encoding the negative deviation text yields a negative deviation text embedding (Negative Embedding). The first image generation network performs feature extraction on the base text embedding, and the VAE converts the extracted features into the reference image. The second image generation network performs feature extraction on the positive deviation text embedding and the negative deviation text embedding respectively, and the VAE converts the extracted features into the positive deviation image and the negative deviation image. The first image generation network and the second image generation network may be U-networks (U-Net).
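A schematic, non-authoritative sketch of how these components might be wired together follows; the module interfaces (in particular the guide argument and the hidden features returned by the first U-Net) are assumptions for illustration, and the iterative denoising loop of a real Stable Diffusion model is elided:

    import torch.nn as nn

    class TwinUNetText2Image(nn.Module):
        """Text encoder -> two parallel U-Nets -> shared VAE (schematic)."""
        def __init__(self, text_encoder, first_unet, second_unet, vae):
            super().__init__()
            self.text_encoder = text_encoder   # text -> text embedding
            self.first_unet = first_unet       # frozen; preserves the basic concept
            self.second_unet = second_unet     # trainable; learns pos/neg concepts
            self.vae = vae                     # latent features -> image
            # Lock the weight parameters of the first image generation network.
            for p in self.first_unet.parameters():
                p.requires_grad = False

        def forward(self, base_txt, pos_txt, neg_txt, latents):
            base_emb = self.text_encoder(base_txt)
            pos_emb = self.text_encoder(pos_txt)
            neg_emb = self.text_encoder(neg_txt)
            # The first U-Net also returns its per-layer hidden features,
            # which are forwarded to the second U-Net (see the fusion step below).
            ref_feat, hidden = self.first_unet(latents, base_emb)
            pos_feat = self.second_unet(latents, pos_emb, guide=hidden)
            neg_feat = self.second_unet(latents, neg_emb, guide=hidden)
            return (self.vae.decode(ref_feat),
                    self.vae.decode(pos_feat),
                    self.vae.decode(neg_feat))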
In some embodiments, the reference text includes a base portion, the positive deviation text includes the base portion and a first auxiliary portion for describing the base portion, and the negative deviation text includes the base portion and a second auxiliary portion for describing the base portion, the semantics of the first auxiliary portion being opposite to the semantics of the second auxiliary portion.
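For instance, a single text sample might be organized as the following triplet; the concrete words are purely illustrative and are not taken from this embodiment:

    text_sample = {
        "base": "flower",                # base portion only
        "positive": "beautiful flower",  # base portion + first auxiliary portion
        "negative": "withered flower",   # base portion + second auxiliary portion,
                                         # semantically opposite to the first
    }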
202. And the server adopts a first image generation network to perform image generation processing on the reference text to obtain a reference image.
203. And the server adopts a second image generation network to respectively perform image generation processing on the positive deviation text and the negative deviation text, so as to obtain a positive deviation image and a negative deviation image.
In step 203, the specific implementation manner of the step of performing image generation processing on the positive deviation text and the negative deviation text by using the second image generation network to obtain the positive deviation image and the negative deviation image may include:
In the process of performing image generation processing by adopting the second image generation network, the input characteristics of the first intermediate layer and the output characteristics of the second intermediate layer are acquired, wherein the first intermediate layer is any intermediate layer of the second image generation network, and the second intermediate layer is an intermediate layer corresponding to the first intermediate layer in the first image generation network.
The input features of the first intermediate layer are updated based on the output features of the second intermediate layer.
In some embodiments, the specific implementation of the step of updating the input features of the first intermediate layer based on the output features of the second intermediate layer may include:
and fusing the input characteristics of the first middle layer with the output characteristics of the second middle layer to obtain fusion characteristics.
The fusion feature is determined as a new input feature of the first intermediate layer.
In some embodiments, the specific implementation of the step of "fusing the input features of the first intermediate layer with the output features of the second intermediate layer" may include:
and splicing the input characteristics of the first intermediate layer with the output characteristics of the second intermediate layer to obtain splicing characteristics.
And performing channel dimension reduction processing on the spliced features to obtain fusion features.
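A sketch of this splice-then-reduce fusion follows, assuming feature maps of shape (batch, channels, height, width) and a 1x1 convolution as the channel dimension-reduction step (the convolution is an assumption; the embodiment does not fix the reduction operator):

    import torch
    import torch.nn as nn

    class FeatureFusion(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # A 1x1 convolution maps the spliced 2*channels back to channels.
            self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, first_layer_input, second_layer_output):
            # Splice the input features of the first intermediate layer (second
            # network) with the output features of the corresponding layer of
            # the first network along the channel dimension.
            spliced = torch.cat([first_layer_input, second_layer_output], dim=1)
            # Channel dimension reduction yields the fusion feature, used as
            # the new input feature of the first intermediate layer.
            return self.reduce(spliced)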
In practical application, the first image generation network and the second image generation network are constructed based on the U-Net architecture in the Stable Diffusion model, so the feature output of each layer of the first image generation network can be forwarded to the corresponding layer of the second image generation network. In this way, the text generation image model keeps its understanding of the basic concept while amplifying, during training, the differentiation between different concepts; the first image generation network serves to preserve the basic concept in the text generation image model. In addition, so that the text generation image model retains the original basic concept and is not affected by the semantics of the two auxiliary concepts (such as the positive concept and the negative concept), the weight parameters of the first image generation network remain locked during the training phase, and only the parameter weights of the second image generation network are updated.
204. The server acquires a labeling reference image corresponding to the reference text, a labeling positive deviation image corresponding to the positive deviation text and a labeling negative deviation image corresponding to the negative deviation text.
In step 204, the specific implementation manner of "obtaining the labeling reference image corresponding to the reference text, the labeling positive deviation image corresponding to the positive deviation text, and the labeling negative deviation image corresponding to the negative deviation text" in step may include:
Performing image generation processing on the reference text by adopting a third image generation network to obtain the labeling reference image, wherein the network structure of the third image generation network is the same as that of the first image generation network.
Acquiring positive adjustment parameters and negative adjustment parameters for the image.
Adjusting the labeling reference image through the positive adjustment parameters to obtain the labeling positive deviation image.
Adjusting the labeling reference image through the negative adjustment parameters to obtain the labeling negative deviation image.
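A sketch of this labeling pipeline follows; third_net_generate and apply_adjust are hypothetical helpers standing in for the third image generation network and for whatever image adjustment the positive/negative adjustment parameters drive:

    def build_labels(base_txt, third_net_generate, apply_adjust,
                     pos_params, neg_params):
        # The third image generation network (same structure as the first)
        # generates the labeling reference image from the reference text.
        gt_ref = third_net_generate(base_txt)
        # Adjusting with the positive adjustment parameters yields the
        # labeling positive deviation image.
        gt_pos = apply_adjust(gt_ref, pos_params)
        # Adjusting with the negative adjustment parameters yields the
        # labeling negative deviation image.
        gt_neg = apply_adjust(gt_ref, neg_params)
        return gt_ref, gt_pos, gt_neg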
205. The server determines a target base loss and a target countermeasure loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeling reference image, the labeling positive deviation image, and the labeling negative deviation image.
Wherein, in step 205, the specific implementation of the step of determining the target base loss of the second image generating network based on the reference image, the positive deviation image, the negative deviation image, the labeling reference image, the labeling positive deviation image, and the labeling negative deviation image may include:
A first base loss is determined according to the reference image and the labeling reference image.
A second base loss is determined according to the positive deviation image and the labeling positive deviation image.
A third base loss is determined according to the negative deviation image and the labeling negative deviation image.
And the first base loss, the second base loss and the third base loss are fused to obtain the target base loss.
In some embodiments, the specific implementation of the step of determining the target countermeasure loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeling reference image, the labeling positive deviation image, and the labeling negative deviation image may include:
A first countermeasure loss is determined according to the labeling reference image and the positive deviation image.
A second countermeasure loss is determined according to the labeling reference image and the negative deviation image.
A third countermeasure loss is determined according to the positive deviation image and the negative deviation image.
And the first countermeasure loss, the second countermeasure loss and the third countermeasure loss are fused to obtain the target countermeasure loss.
In some embodiments, the specific implementation of the step of fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss to obtain the target countermeasure loss may include:
Acquiring a first weight corresponding to the first countermeasure loss, a second weight corresponding to the second countermeasure loss and a third weight corresponding to the third countermeasure loss.
And fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss based on the first weight, the second weight and the third weight to obtain the target countermeasure loss.
In some embodiments, the specific implementation of the step of determining the target loss based on the target base loss and the target countermeasure loss may include:
Determining the difference between the target base loss and the target countermeasure loss as the target loss.
206. The server determines the target loss based on the target base loss and the target countermeasure loss.
207. The server trains the second image generation network based on the target loss to obtain a target image generation network.
208. When the server acquires the text to be processed, the server adopts a target image generation network to perform image generation processing on the text to be processed, and a target image is obtained.
In this embodiment, through the positive-and-negative-concept countermeasure training mechanism, auxiliary concept inputs with completely opposite semantics are provided during model training, so the model can deeply perceive the difference between the positive and negative concepts. The model thus learns the semantics that actually need attention in the picture content of the auxiliary concepts, which improves the model's understanding of the specified concepts during training and enhances the accuracy of the final text-to-image model in generating pictures of the concept content.
In addition, based on the network architecture in which the two U-Nets of the first image generation network and the second image generation network are connected in parallel, the parameter weights of the first image generation network are locked during training while those of the second image generation network are open for updating, and the output of each hidden layer of the first image generation network is forwarded to the corresponding layer of the second image generation network, so that the first image generation network deeply guides the training of the second image generation network. Because there are conceptual differences between the reference text input to the first image generation network and the positive deviation text and negative deviation text input to the second image generation network, the model can deeply perceive the semantics of the auxiliary concepts corresponding to the positive deviation text and the negative deviation text, strengthening its ability to learn concepts accurately. Meanwhile, because each hidden layer's output from the first image generation network is forwarded to the corresponding layer of the second image generation network, the second image generation network does not lose its perception of the basic concept.
In addition, this embodiment uses a text-group input format in which each text group comprises a basic concept, a positive concept and a negative concept, so the differences in semantic representation among different concepts can be fully expressed. This reduces the model's demand for training data and improves training performance for the same amount of training data.
In addition, the first countermeasure loss, the second countermeasure loss and the third countermeasure loss are obtained by calculating the differences among the reference image, the positive deviation image and the negative deviation image, and the target loss is calculated by combining the loss weights corresponding to each of them. The strength with which the model reinforces the learning of different concepts can thus be flexibly controlled, preventing the concepts from interfering with one another. Meanwhile, the learning speed of the model for different concepts can be increased and convergence accelerated. Moreover, the antagonism between the positive concept and the negative concept enlarges the semantic distance between their characterizations in the model, improving the model's effectiveness.
In order to better implement the method, the embodiment of the application also provides an image generation device.
For example, as shown in fig. 3, the image generating apparatus may include an acquisition unit 301, a first generation unit 302, a loss determination unit 303, a training unit 304, and a second generation unit 305, as follows:
The obtaining unit 301 is configured to obtain a text sample, a first image generating network, and a second image generating network, where the text sample includes a reference text, a positive deviation text, and a negative deviation text, network parameters of the first image generating network are fixed, the positive deviation text and the reference text have a positive deviation in terms of semantics, and the negative deviation text and the reference text have a negative deviation in terms of semantics;
a first generating unit 302, configured to perform image generation processing on the reference text by using a first image generating network to obtain a reference image, and perform image generation processing on the positive deviation text and the negative deviation text by using a second image generating network to obtain a positive deviation image and a negative deviation image;
a loss determination unit 303 for determining a target loss of the second image generation network based on the reference image, the positive deviation image, and the negative deviation image;
a training unit 304, configured to train the second image generation network based on the target loss to obtain a target image generation network;
And the second generating unit 305 is configured to perform image generation processing on the text to be processed by using the target image generating network when the text to be processed is acquired, so as to obtain a target image.
In some embodiments, the loss determination unit 303 includes:
a labeling image acquisition subunit, configured to acquire a labeling reference image corresponding to the reference text, a labeling positive deviation image corresponding to the positive deviation text, and a labeling negative deviation image corresponding to the negative deviation text;
a first determining subunit, configured to determine a target base loss and a target countermeasure loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeling reference image, the labeling positive deviation image, and the labeling negative deviation image;
and a second determining subunit, configured to determine the target loss based on the target base loss and the target countermeasure loss.
In some embodiments, the first determining subunit is specifically configured to:
determining a first base loss according to the reference image and the labeling reference image;
determining a second base loss according to the positive deviation image and the labeling positive deviation image;
determining a third base loss according to the negative deviation image and the labeling negative deviation image;
and fusing the first base loss, the second base loss and the third base loss to obtain the target base loss.
In some embodiments, the first determining subunit is specifically further configured to:
determining a first countermeasure loss according to the labeling reference image and the positive deviation image;
determining a second countermeasure loss according to the labeling reference image and the negative deviation image;
determining a third countermeasure loss according to the positive deviation image and the negative deviation image;
and fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss to obtain the target countermeasure loss.
In some embodiments, the first determining subunit is specifically further configured to:
acquiring a first weight corresponding to the first countermeasure loss, a second weight corresponding to the second countermeasure loss and a third weight corresponding to the third countermeasure loss;
and fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss based on the first weight, the second weight and the third weight to obtain the target countermeasure loss.
In some embodiments, the second determining subunit is specifically configured to:
The difference between the target base loss and the target countermeasure loss is determined as the target loss.
In some embodiments, the labeling image acquisition subunit is specifically configured to:
perform image generation processing on the reference text by adopting a third image generation network to obtain a labeling reference image, wherein the network structure of the third image generation network is the same as that of the first image generation network;
acquire positive adjustment parameters and negative adjustment parameters for an image;
adjust the labeling reference image through the positive adjustment parameters to obtain a labeling positive deviation image;
and adjust the labeling reference image through the negative adjustment parameters to obtain a labeling negative deviation image.
In some embodiments, the network structure of the first image generation network is the same as the network structure of the second image generation network, the image generation apparatus further comprising:
The input updating unit is used for acquiring the input characteristics of the first intermediate layer and the output characteristics of the second intermediate layer in the process of performing image generation processing by adopting the second image generation network, wherein the first intermediate layer is any intermediate layer of the second image generation network, and the second intermediate layer is an intermediate layer corresponding to the first intermediate layer in the first image generation network; and updating the input features of the first intermediate layer based on the output features of the second intermediate layer.
In some embodiments, the input updating unit is specifically configured to:
fusing the input features of the first intermediate layer with the output features of the second intermediate layer to obtain fusion features;
The fusion feature is determined as a new input feature of the first intermediate layer.
In some embodiments, the input updating unit is specifically further configured to:
Splicing the input characteristics of the first intermediate layer with the output characteristics of the second intermediate layer to obtain splicing characteristics;
and performing channel dimension reduction processing on the spliced features to obtain fusion features.
In some embodiments, the reference text includes a base portion, the positive deviation text includes the base portion and a first auxiliary portion for describing the base portion, and the negative deviation text includes the base portion and a second auxiliary portion for describing the base portion, the semantics of the first auxiliary portion being opposite to the semantics of the second auxiliary portion.
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
The embodiment of the application also provides an electronic device, for example, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the embodiment of the application, specifically:
The electronic device may include a processor 401 with one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power supply 403, an input module 404, and a communication module 405, among other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components. Wherein:
The processor 401 is the control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the electronic device as a whole. In some embodiments, processor 401 may include one or more processing cores; in some embodiments, processor 401 may integrate an application processor, which primarily handles the operating system, user interfaces, applications and the like, with a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as an image presentation function), and the like; the storage data area may store data created according to the use of the electronic device, and the like. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device also includes a power supply 403 for powering the various components, and in some embodiments, the power supply 403 may be logically connected to the processor 401 by a power management system, such that charge, discharge, and power consumption management functions are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may also include an input module 404, which input module 404 may be used to receive entered numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The electronic device may also include a communication module 405, and in some embodiments the communication module 405 may include a wireless module, through which the electronic device may wirelessly transmit over a short distance, thereby providing wireless broadband internet access to the user. For example, the communication module 405 may be used to assist a user in e-mail, browsing web pages, accessing streaming media, and so forth.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to computer instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions in the embodiments of the present application.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by computer instructions, or by control of associated hardware, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer readable storage medium having stored therein a plurality of computer instructions capable of being loaded by a processor to perform the steps of any of the image generation methods provided by the embodiments of the present application.
Wherein the storage medium may include: read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, and the like.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided in the above-described embodiment.
Because the computer instructions stored in the storage medium can execute the steps in any image generation method provided by the embodiment of the present application, the beneficial effects that any image generation method provided by the embodiment of the present application can be achieved, and detailed descriptions of the previous embodiments are omitted herein.
The foregoing has described in detail the methods, apparatuses, electronic devices and computer readable storage medium for generating images according to the embodiments of the present application, and specific examples have been applied to illustrate the principles and embodiments of the present application, where the foregoing examples are provided to assist in understanding the methods and core ideas of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in light of the ideas of the present application, the present description should not be construed as limiting the present application.

Claims (15)

1. An image generation method, comprising:
Acquiring a text sample, a first image generation network and a second image generation network, wherein the text sample comprises a reference text, a positive deviation text and a negative deviation text, network parameters of the first image generation network are fixed, the positive deviation text and the reference text have a positive deviation in terms of semantics, and the negative deviation text and the reference text have a negative deviation in terms of semantics;
Performing image generation processing on the reference text by adopting the first image generation network to obtain a reference image, and performing image generation processing on the positive deviation text and the negative deviation text by adopting the second image generation network to obtain a positive deviation image and a negative deviation image;
determining a target loss of the second image generation network based on the reference image, the positive deviation image and the negative deviation image;
training the second image generation network based on the target loss to obtain a target image generation network;
when the text to be processed is obtained, the target image generation network is adopted to perform image generation processing on the text to be processed, and a target image is obtained.
2. The image generation method according to claim 1, wherein the determining the target loss of the second image generation network based on the reference image, the positive deviation image, and the negative deviation image includes:
Acquiring a labeling reference image corresponding to the reference text, a labeling positive deviation image corresponding to the positive deviation text and a labeling negative deviation image corresponding to the negative deviation text;
Determining a target base loss and a target countermeasure loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeling reference image, the labeling positive deviation image and the labeling negative deviation image;
The target loss is determined based on the target base loss and the target countermeasure loss.
3. The image generation method according to claim 2, wherein the determining the target base loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeling reference image, the labeling positive deviation image, and the labeling negative deviation image comprises:
determining a first base loss according to the reference image and the labeling reference image;
determining a second base loss according to the positive deviation image and the labeling positive deviation image;
determining a third base loss according to the negative deviation image and the labeling negative deviation image;
and fusing the first base loss, the second base loss and the third base loss to obtain a target base loss.
4. The image generation method according to claim 2, wherein the determining the target countermeasure loss of the second image generation network based on the reference image, the positive deviation image, the negative deviation image, the labeling reference image, the labeling positive deviation image, and the labeling negative deviation image includes:
determining a first countermeasure loss according to the labeling reference image and the positive deviation image;
determining a second countermeasure loss according to the labeling reference image and the negative deviation image;
determining a third countermeasure loss according to the positive deviation image and the negative deviation image;
and fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss to obtain a target countermeasure loss.
5. The image generation method according to claim 4, wherein the fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss to obtain a target countermeasure loss includes:
acquiring a first weight corresponding to the first countermeasure loss, a second weight corresponding to the second countermeasure loss and a third weight corresponding to the third countermeasure loss;
and fusing the first countermeasure loss, the second countermeasure loss and the third countermeasure loss based on the first weight, the second weight and the third weight to obtain the target countermeasure loss.
6. The image generation method according to claim 2, wherein the determining the target loss based on the target base loss and the target countermeasure loss includes:
and determining a difference between the target base loss and the target countermeasure loss as the target loss.
7. The method according to claim 2, wherein the obtaining the labeling reference image corresponding to the reference text, the labeling positive deviation image corresponding to the positive deviation text, and the labeling negative deviation image corresponding to the negative deviation text includes:
performing image generation processing on the reference text by adopting a third image generation network to obtain the labeling reference image, wherein the network structure of the third image generation network is the same as that of the first image generation network;
acquiring positive adjustment parameters and negative adjustment parameters for an image;
adjusting the labeling reference image through the positive adjustment parameters to obtain the labeling positive deviation image;
and adjusting the labeling reference image through the negative adjustment parameters to obtain the labeling negative deviation image.
8. The image generation method according to any one of claims 1 to 7, wherein a network structure of the first image generation network is the same as a network structure of the second image generation network, the method further comprising:
Acquiring input characteristics of a first intermediate layer and output characteristics of a second intermediate layer in the process of performing image generation processing by adopting the second image generation network, wherein the first intermediate layer is any intermediate layer of the second image generation network, and the second intermediate layer is an intermediate layer corresponding to the first intermediate layer in the first image generation network;
and updating the input characteristics of the first middle layer based on the output characteristics of the second middle layer.
9. The image generation method according to claim 8, wherein the updating the input features of the first intermediate layer based on the output features of the second intermediate layer includes:
fusing the input features of the first intermediate layer with the output features of the second intermediate layer to obtain fusion features;
the fusion feature is determined as a new input feature of the first intermediate layer.
10. The image generating method according to claim 9, wherein the fusing the input features of the first intermediate layer with the output features of the second intermediate layer to obtain fused features includes:
Splicing the input characteristics of the first intermediate layer and the output characteristics of the second intermediate layer to obtain splicing characteristics;
and performing channel dimension reduction processing on the spliced features to obtain the fusion features.
11. The image generation method according to any one of claims 1 to 7, wherein the reference text includes a base portion, the positive deviation text includes the base portion and a first auxiliary portion for describing the base portion, and the negative deviation text includes the base portion and a second auxiliary portion for describing the base portion, the semantics of the first auxiliary portion being opposite to the semantics of the second auxiliary portion.
12. An image generating apparatus, comprising:
an acquisition unit, configured to acquire a text sample, a first image generation network and a second image generation network, wherein the text sample comprises a reference text, a positive deviation text and a negative deviation text, network parameters of the first image generation network are fixed, the positive deviation text and the reference text have a positive deviation semantically, and the negative deviation text and the reference text have a negative deviation semantically;
The first generation unit is used for carrying out image generation processing on the reference text by adopting the first image generation network to obtain a reference image, and carrying out image generation processing on the positive deviation text and the negative deviation text by adopting the second image generation network to obtain a positive deviation image and a negative deviation image;
A loss determination unit configured to determine a target loss of the second image generation network based on the reference image, the positive deviation image, and the negative deviation image;
The training unit is used for training the second image generation network based on the target loss to obtain a target image generation network;
And the second generation unit is used for carrying out image generation processing on the text to be processed by adopting the target image generation network when the text to be processed is acquired, so as to obtain a target image.
13. An electronic device comprising a processor and a memory, the memory storing a plurality of computer instructions; the processor loads computer instructions from the memory to perform the steps in the image generation method of any of claims 1 to 11.
14. A computer readable storage medium, characterized in that it stores a plurality of computer instructions adapted to be loaded by a processor for executing the steps of the image generation method according to any of claims 1-11.
15. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the image generation method of any of claims 1 to 11.
CN202410328377.3A 2024-03-20 2024-03-20 Image generation method, device, electronic equipment and storage medium Active CN117953108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410328377.3A CN117953108B (en) 2024-03-20 2024-03-20 Image generation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117953108A (en) 2024-04-30
CN117953108B (en) 2024-07-05

Family

ID=90805367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410328377.3A Active CN117953108B (en) 2024-03-20 2024-03-20 Image generation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117953108B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052948A (en) * 2020-08-19 2020-12-08 腾讯科技(深圳)有限公司 Network model compression method and device, storage medium and electronic equipment
CN113961736A (en) * 2021-09-14 2022-01-21 华南理工大学 Method and device for generating image by text, computer equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796619B (en) * 2019-10-28 2022-08-30 腾讯科技(深圳)有限公司 Image processing model training method and device, electronic equipment and storage medium
WO2021097845A1 (en) * 2019-11-22 2021-05-27 驭势(上海)汽车科技有限公司 Simulation scene image generation method, electronic device and storage medium
CN111968193B (en) * 2020-07-28 2023-11-21 西安工程大学 Text image generation method based on StackGAN (secure gas network)
JP2022178243A (en) * 2021-05-19 2022-12-02 国立大学法人電気通信大学 Image generator, image generation method and program
CN114639139B (en) * 2022-02-16 2024-11-08 南京邮电大学 Emotion image description method and system based on reinforcement learning
CN117437516A (en) * 2022-07-11 2024-01-23 北京字跳网络技术有限公司 Semantic segmentation model training method and device, electronic equipment and storage medium
CN117058673A (en) * 2023-06-21 2023-11-14 北京交通大学 Text generation image model training method and system and text generation image method and system
CN117058276B (en) * 2023-10-12 2024-01-26 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN117557708A (en) * 2023-11-15 2024-02-13 腾讯科技(上海)有限公司 Image generation method, device, storage medium and computer equipment
CN117540023A (en) * 2024-01-08 2024-02-09 南京信息工程大学 Image joint text emotion analysis method based on modal fusion graph convolution network

Also Published As

Publication number Publication date
CN117953108A (en) 2024-04-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant