
CN113961736B - Method, apparatus, computer device and storage medium for text generation image - Google Patents


Info

Publication number
CN113961736B
Authority
CN
China
Prior art keywords
text
image
generator
inputting
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111072292.6A
Other languages
Chinese (zh)
Other versions
CN113961736A (en)
Inventor
陆璐
叶锡洪
冼允廷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Yousuan Technology Co ltd
South China University of Technology SCUT
Original Assignee
Guangdong Yousuan Technology Co ltd
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Yousuan Technology Co ltd, South China University of Technology SCUT
Priority to CN202111072292.6A
Publication of CN113961736A
Application granted
Publication of CN113961736B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval of still image data
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval using metadata automatically derived from the content
    • G06F 16/5846: Retrieval using metadata automatically derived from the content, using extracted text
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, an apparatus, a computer device and a storage medium for generating an image from text, wherein the method comprises the following steps: acquiring a text-image pair from a database and taking the text in the pair as the original text; inputting the original text into a multi-stage generative adversarial network (GAN) to obtain a corresponding image; inputting the corresponding image into a trained image annotation network to generate a predicted text; inputting the predicted text and the original text into a trained Siamese neural network to obtain the similarity between the predicted text and the original text; training the multi-stage GAN according to the similarity to obtain a trained multi-stage GAN; and inputting text provided by a user into the trained multi-stage GAN to generate the corresponding image. By adopting a multi-stage GAN, the invention gradually increases the resolution and quality of the generated image, while an added attention mechanism improves the realism of the generated image and thus its semantic consistency with the text.

Description

Method, apparatus, computer device and storage medium for text generation image
Technical Field
The present invention relates to the fields of natural language processing and computer vision, and in particular to a method, an apparatus, a computer device and a storage medium for generating an image from text.
Background
Computer vision and natural language processing each handle a single type of data, namely images or text. Computer vision mainly focuses on understanding pictures, including subtasks such as image semantic segmentation, image classification and target retrieval, while natural language processing mainly focuses on modeling text, including subtasks such as machine translation, named entity recognition and word segmentation. In recent years, multimodal tasks that combine several data types, such as images, text and video, have received increasing attention from researchers, since they can link different types of data through relationships such as mapping and fusion. The two most common data types in multimodal tasks are text and images, and cross-modal retrieval and image caption generation are typical research directions.
Text and images are two different kinds of information carriers, and both play an important role in daily life. An image displays its content intuitively and shows details that text omits, while a textual description is simple and complete: a short description can stand in for a large number of images. Combining the two therefore allows an object to be described comprehensively, in the manner of an illustrated text. Such scenarios are everywhere in daily life: a picture produced by a designer often fails to match the client's description, and even after repeated revisions it may still not satisfy the client; at a crime scene, witnesses can usually describe a suspect's appearance only verbally, and turning that description into a picture for public reference requires trained professionals, which is time-consuming and labor-intensive and often still yields poor results.
The text-to-image task takes a textual description as input and generates a corresponding image. GAN-INT-CLS, proposed by Reed et al. in 2016 for this task, first enabled the conversion of hand-written descriptive text into corresponding images. StackGAN stacks two conditional GANs (cGANs): the first-stage cGAN generates a low-resolution image containing the outline and colors of the main object, and the second stage upscales this low-resolution image into a high-resolution image containing more vivid objects. AttnGAN was proposed to improve semantic consistency: the model encodes the text description into both sentence features and word features, uses the sentence features as network input to generate an initial low-resolution image, and uses the word features during subsequent generation to pick out important words and locate the image sub-regions corresponding to them, increasing attention on those regions so that fine-grained details are generated in the important sub-regions and the semantic consistency of the image is improved.
Disclosure of Invention
In order to overcome these shortcomings of the prior art, the invention provides a method, an apparatus, a computer device and a storage medium for generating an image from text using a multi-stage generative adversarial network (GAN). The method gradually increases the resolution and quality of the generated image, avoiding the low resolution and poor quality typical of images produced by a single GAN, and inserts attention mechanisms between the cascaded generators to focus on the important parts of the output features, further improving the realism of the generated image and its semantic consistency with the text.
A first object of the present invention is to provide a method for generating an image from text.
A second object of the present invention is to provide an apparatus for generating an image from text.
A third object of the present invention is to provide a computer device.
A fourth object of the present invention is to provide a storage medium.
The first object of the present invention can be achieved by adopting the following technical scheme:
a method for generating an image from text, the method comprising:
acquiring a text-image pair from a database, the text-image pair comprising a text and an image, wherein the text is a description of the image and serves as the original text;
inputting the original text into a multi-stage GAN to obtain a corresponding image;
inputting the corresponding image into a trained image annotation network to generate a predicted text;
inputting the predicted text and the original text into a trained Siamese neural network to obtain the similarity between the predicted text and the original text;
training the multi-stage GAN according to the similarity between the predicted text and the original text to obtain a trained multi-stage GAN;
and inputting text provided by a user into the trained multi-stage GAN to generate the image corresponding to that text.
Further, inputting the original text into the multi-stage GAN to obtain a corresponding image specifically comprises:
inputting the original text into a text encoder to obtain a text feature vector before it enters the multi-stage GAN;
and inputting the sentence embedding feature vector of the text feature vector into the multi-stage GAN to obtain the corresponding image.
Further, the multi-stage GAN comprises n generators and n-1 attention mechanism modules, where n is a positive integer greater than 1.
Inputting the sentence embedding feature vector of the text feature vector into the multi-stage GAN to obtain the corresponding image specifically comprises:
applying conditioning augmentation to the sentence embedding feature vector before it enters a generator of the multi-stage GAN, obtaining an enhanced sentence embedding feature vector;
when i is 1, inputting the enhanced sentence embedding feature vector into the i-th generator to obtain the output features of the i-th generator;
when i is a positive integer greater than 1 and less than or equal to n, inputting the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module to obtain the important parts of those output features, then inputting these important parts into the i-th generator to obtain the output features of the i-th generator (a sketch of this cascade is given below);
taking the output features of the n-th generator as the corresponding image.
As the generator index increases, the resolution of the generator's output image gradually increases.
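As an illustration only, a minimal PyTorch-style sketch of this cascade; the class name, the entries of `generators` and `attentions`, and the tensor shapes are assumptions for exposition, not details taken from the patent:

```python
import torch
import torch.nn as nn

class MultiStageCascade(nn.Module):
    """Sketch of the cascade: generator 1 consumes the enhanced sentence
    embedding; each later generator consumes the attention-weighted output
    features of its predecessor."""
    def __init__(self, generators: nn.ModuleList, attentions: nn.ModuleList):
        super().__init__()
        assert len(generators) == len(attentions) + 1  # n generators, n-1 attention modules
        self.generators = generators
        self.attentions = attentions

    def forward(self, c: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        feats = self.generators[0](c)               # stage i = 1
        for gen, attn in zip(list(self.generators)[1:], self.attentions):
            important = attn(feats, words)          # important parts of stage i-1
            feats = gen(important)                  # stage i
        return feats                                # output of the n-th generator
```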
Further, inputting the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module to obtain the important parts of those output features specifically comprises:
inputting the word embedding feature matrix of the text feature vector and the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module;
and computing, through the attention mechanism of the (i-1)-th attention mechanism module, the parts of the (i-1)-th generator's output features most relevant to the keywords of the original text, thereby obtaining the important parts of those output features.
Further, the multi-stage GAN also comprises n discriminators, each corresponding to one generator.
Training the multi-stage GAN according to the similarity between the predicted text and the original text to obtain the trained multi-stage GAN specifically comprises:
each round of multi-stage GAN training consists of two phases:
fixing the parameters of all generators and updating the discriminator parameters using the discriminator loss function;
fixing the parameters of all discriminators and updating the generator parameters using the generator loss function and the similarity between the predicted text and the original text;
and performing multiple rounds of training on the multi-stage GAN using the similarities between the predicted texts and the original texts, thereby obtaining the trained multi-stage GAN.
Further, the image in the text-image pair corresponding to the original text is taken as the real image.
The input of the t-th discriminator comprises the output features of the t-th generator and the real image, where t is a positive integer with 1 ≤ t ≤ n.
When t is 1, the input of the t-th discriminator further comprises the sentence embedding feature vector;
when t is greater than 1 and less than or equal to n, the input of the t-th discriminator further comprises the word embedding feature matrix of the text feature vector.
The loss function of the discriminator is as follows:

L_D = -\mathbb{E}_{I \sim p_{data}}[\log D(I, e)] - \mathbb{E}_{s_0 \sim p_G}[\log(1 - D(G(s_0, c), e))]

where e is the sentence embedding feature vector or the word embedding feature matrix, I is a real image, s_0 denotes the output features of the previous generator, c is the enhanced sentence embedding feature vector, G(s_0, c) is the output feature of the generator, IC is the image annotation network, D(·) is the output of the discriminator, and sim is the similarity between the predicted text and the original text.
The loss function of the generator is as follows:

L_G = -\mathbb{E}_{s_0 \sim p_G}[\log D(G(s_0, c), e)] + (1 - \mathrm{sim}) + D_{KL}\big(\mathcal{N}(\mu(\bar{e}), \Sigma(\bar{e})) \,\|\, \mathcal{N}(0, I)\big)

where D_{KL} is the KL divergence between the Gaussian distribution of the text feature vector and the standard Gaussian distribution.
Further, the image annotation network comprises an encoder and a decoder, where the encoder comprises a convolutional neural network and a linear transformation, and the decoder comprises an LSTM network.
Inputting the corresponding image into the trained image annotation network to generate a predicted text specifically comprises:
inputting the corresponding image into the convolutional neural network to obtain the feature matrix of the image;
applying the linear transformation to the feature matrix of the image to obtain the transformed feature matrix;
and inputting the transformed feature matrix into the LSTM network to generate the predicted text.
Further, the Siamese neural network comprises a text feature extraction network and a pooling layer.
Inputting the predicted text and the original text into the trained Siamese neural network to obtain the similarity between them specifically comprises:
inputting the predicted text and the original text separately into the text feature extraction network to obtain their extracted text features;
inputting the extracted text features into the pooling layer to obtain a feature vector U and a feature vector V;
and computing the similarity of the feature vectors U and V as their cosine similarity:

sim(U, V) = \frac{\sum_{i} U_i V_i}{\sqrt{\sum_{i} U_i^2}\,\sqrt{\sum_{i} V_i^2}}

where U_i and V_i are the i-th components of U and V, respectively.
The second object of the invention can be achieved by adopting the following technical scheme:
an apparatus for generating an image from text, the apparatus comprising:
a text-image pair acquisition module, configured to acquire a text-image pair from a database, the text-image pair comprising a text and an image, wherein the text is a description of the image and serves as the original text;
a predicted image generation module, configured to input the original text into a multi-stage GAN to obtain a corresponding image;
a predicted text generation module, configured to input the corresponding image into a trained image annotation network to generate a predicted text;
a similarity calculation module, configured to input the predicted text and the original text into a trained Siamese neural network to obtain the similarity between them;
a multi-stage GAN training module, configured to train the multi-stage GAN according to the similarity between the predicted text and the original text to obtain a trained multi-stage GAN;
and a text-to-image module, configured to input text provided by a user into the trained multi-stage GAN and generate the image corresponding to that text.
The third object of the present invention can be achieved by adopting the following technical scheme:
A computer device comprising a processor and a memory storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the above method for generating an image from text.
The fourth object of the present invention can be achieved by adopting the following technical scheme:
a storage medium storing a program which, when executed by a processor, implements the above method for generating an image from text.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention adopts a progressive multi-stage GAN that gradually increases the resolution and quality of the generated image, avoiding the low resolution and poor quality typical of images produced by a single GAN. At the same time, attention mechanisms inserted between the cascaded generators focus on the important parts of the output features, further improving the realism of the generated image.
2. The invention adopts a text-alignment scheme: the image annotation network and the Siamese neural network are pre-trained first, and during training of the progressive multi-stage GAN the text alignment strengthens supervision, adding a text-alignment constraint on top of the discriminator's conditional constraint and further improving the semantic consistency between the generated image and the text.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of the method for generating an image from text according to embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the overall network according to embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of the image annotation network structure according to embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of the Siamese neural network according to embodiment 1 of the present invention.
Fig. 5 is a block diagram of the apparatus for generating an image from text according to embodiment 2 of the present invention.
Fig. 6 is a block diagram showing the structure of a computer device according to embodiment 3 of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments, and all other embodiments obtained by those skilled in the art without making any inventive effort based on the embodiments of the present application are within the scope of protection of the present application. It should be understood that the detailed description is intended to illustrate the application, and is not intended to limit the application.
Example 1:
as shown in fig. 1, this embodiment provides a method for generating an image from text, comprising the following steps:
S101, acquiring a text-image pair from a database; the text-image pair comprises a text and an image, wherein the text is a description of the image and serves as the original text.
Image-text pair data are collected from the web, for example with a crawler, and used as the text-image pairs in the database. The text in a text-image pair is a descriptive sentence for the image, so the text and the image are semantically consistent.
In this embodiment, the overall network comprises a multi-stage GAN, an image annotation network and a Siamese neural network.
S102, inputting the original text into the multi-stage GAN to obtain a corresponding image.
Further, step S102 comprises:
(1) Before the original text enters the multi-stage GAN, inputting it into a text encoder to obtain a text feature vector.
(2) The structure of the multi-stage GAN.
The multi-stage GAN is a progressive network comprising n generators, n discriminators (each corresponding to one generator) and n-1 attention mechanism modules.
As shown in fig. 2, n is 3 in this embodiment, i.e. there are three generators. The first generator consists of 4 deconvolution blocks, each comprising an upsampling layer and a spectral normalization layer. Each upsampling layer halves the number of channels of the three-dimensional feature tensor while doubling its width and height; the feature tensor produced by the first generator has dimensions 3 × 64 × 64. The spectral normalization layer improves the stability of the GAN during training and avoids problems such as mode collapse. The two subsequent generators each consist of 4 deconvolution blocks, each mainly comprising a convolution layer, a residual layer and an upsampling layer, which together produce a new feature matrix while enlarging the output image; specifically, the convolution and residual layers process the feature map generated by the previous generator, and the upsampling layer increases the image resolution (a sketch of one such block follows).
The output features of each generator in the multi-stage GAN serve as the input of the next generator. The resolution of the generators' output images increases stage by stage, from 128 × 128 to 256 × 256 and finally 512 × 512.
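A minimal sketch of one such upsampling block, assuming PyTorch; the channel halving, spatial doubling and spectral normalization follow the description above, while the remaining layer choices are illustrative assumptions:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def upsample_block(in_channels: int) -> nn.Sequential:
    """Halve the channel count and double width/height, with spectral
    normalization for training stability, as described for generator 1."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        spectral_norm(nn.Conv2d(in_channels, in_channels // 2,
                                kernel_size=3, padding=1)),
        nn.BatchNorm2d(in_channels // 2),
        nn.ReLU(inplace=True),
    )
```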
(3) Inputting the text feature vector into the multi-stage GAN to obtain the corresponding image.
Further, step (3) specifically comprises:
(3-1) Before the text feature vector enters the multi-stage GAN, conditioning augmentation is first applied to its sentence embedding to produce additional conditional variables, yielding the enhanced sentence embedding feature vector. These conditional variables are randomly sampled from an independent Gaussian distribution.
The sentence embedding of the text feature vector is \bar{e} \in \mathbb{R}^D, a D-dimensional vector. Because the amount of data is limited, conditioning augmentation is applied before the sentence embedding enters the network in order to improve the generalization ability of the network model; it is computed as:

c = \mu(\bar{e}) + \sigma(\bar{e}) \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)

where c denotes the enhanced sentence embedding feature vector. The essence of this conditioning augmentation is that the conditional variables are randomly sampled from an independent Gaussian distribution.
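A minimal sketch of this conditioning augmentation in PyTorch, assuming (as in StackGAN-style models) that the mean and log-variance are predicted by a linear layer; the layer and dimension names are illustrative:

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sample an enhanced sentence embedding c from a Gaussian whose mean and
    (log-)variance are predicted from the sentence embedding e_bar."""
    def __init__(self, embed_dim: int, c_dim: int):
        super().__init__()
        self.fc = nn.Linear(embed_dim, 2 * c_dim)

    def forward(self, e_bar: torch.Tensor):
        mu, logvar = self.fc(e_bar).chunk(2, dim=-1)
        eps = torch.randn_like(mu)                # eps ~ N(0, I)
        c = mu + torch.exp(0.5 * logvar) * eps
        # KL(N(mu, sigma^2) || N(0, I)), later used in the generator loss
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl
```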
(3-2) Inputting the enhanced sentence embedding feature vector into the multi-stage GAN to obtain the corresponding image.
The sentence embedding feature vector is concatenated with random noise of mean 0 and variance 1, and the result is fed into the 1st generator to obtain the output features of the 1st generator.
Before the output of one generator is passed to the next, the important parts of the generated image are obtained through an attention mechanism module. The attention mechanism module takes two inputs: the word embedding feature matrix e \in \mathbb{R}^{D \times T} of the text feature vector, where D is the dimension of each word embedding and T is the text length, and the output features of the generator. Through the attention mechanism, the parts of each stage's generated sub-image most relevant to the keywords of the original text can be computed, improving sub-image quality.
The output of the 2nd attention mechanism module is input into the 3rd generator, whose output features form the corresponding image, i.e. the generated image corresponding to the original text.
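A minimal sketch of such an attention mechanism module, assuming PyTorch; the projection of word embeddings into the feature space and the concatenation of the context with the original features are illustrative assumptions in the spirit of AttnGAN:

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """Attend from each spatial location of the previous stage's features to
    the T word embeddings, returning word-aware features for the next stage."""
    def __init__(self, feat_dim: int, word_dim: int):
        super().__init__()
        self.project = nn.Linear(word_dim, feat_dim)   # map words into feature space

    def forward(self, feats: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); words: (B, D, T)
        b, ch, h, w = feats.shape
        q = feats.view(b, ch, h * w).transpose(1, 2)          # (B, HW, C)
        k = self.project(words.transpose(1, 2))               # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)   # (B, HW, T)
        context = attn @ k                                    # (B, HW, C)
        out = context.transpose(1, 2).view(b, ch, h, w)
        return torch.cat([feats, out], dim=1)  # next generator sees both
```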
The discriminator applies two kinds of constraints: unconditional and conditional, and takes three inputs. The first discriminator receives the sentence embedding feature vector, the generated image of the first generator and the real image; each of the other discriminators receives the word embedding feature matrix, the generated image of its generator and the real image. The unconditional constraint judges whether the generated image is a realistic natural image, using the generator's output and the real image as the judgment conditions; the conditional constraint judges whether the generated image is consistent with the text description, using the sentence embedding feature vector or the word embedding feature matrix together with the generated image as the judgment conditions.
Conditional constraints are employed in this embodiment.
The discriminator loss function is:

L_D = -\mathbb{E}_{I \sim p_{data}}[\log D(I, e)] - \mathbb{E}_{s_0 \sim p_G}[\log(1 - D(G(s_0, c), e))]

where e is the sentence embedding feature vector or the word embedding feature matrix of the text feature vector, I is a real image, s_0 denotes the output features of the previous generator, c is the enhanced sentence embedding feature vector, G(s_0, c) is the output feature of the generator, IC is the image annotation network, D(·) is the output of the discriminator, and sim is the similarity between the predicted text and the original text.

The loss function of the generator is:

L_G = -\mathbb{E}_{s_0 \sim p_G}[\log D(G(s_0, c), e)] + (1 - \mathrm{sim}) + D_{KL}\big(\mathcal{N}(\mu(\bar{e}), \Sigma(\bar{e})) \,\|\, \mathcal{N}(0, I)\big)

where D_{KL} is the KL divergence between the Gaussian distribution of the text feature vector and the standard Gaussian distribution, whose purpose is to avoid overfitting.
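A minimal sketch of these two losses in PyTorch, assuming the forms given above and a discriminator that outputs probabilities; the fixed weighting coefficients `lambda_sim` and `lambda_kl` are assumptions:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Conditional adversarial loss for one discriminator: a real image with
    matching text should score 1, a generated image should score 0."""
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
            F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

def generator_loss(d_fake: torch.Tensor, sim: torch.Tensor, kl: torch.Tensor,
                   lambda_sim: float = 1.0, lambda_kl: float = 1.0) -> torch.Tensor:
    """Adversarial term plus the text-alignment term (1 - sim) from the
    Siamese network and the KL term from conditioning augmentation."""
    adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    return adv + lambda_sim * (1.0 - sim) + lambda_kl * kl
```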
S103, inputting the corresponding image into the trained image annotation network to generate a predicted text.
The image annotation network generates descriptive text semantically consistent with the input image.
As shown in fig. 3, the image annotation network consists of two parts: an encoder and a decoder. The encoder comprises a convolutional neural network and a linear transformation; it uses the pre-trained convolutional network ResNet-152 with the last fully-connected layer removed. The encoder first obtains the feature matrix of the input image with the convolutional network, then applies the linear transformation to convert this feature matrix into a representation suitable as decoder input, and feeds it to the decoder. The decoder is an LSTM network (long short-term memory network) comprising multiple LSTM steps, in which the descriptive text is predicted and the text output codes are obtained.
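A minimal sketch of this encoder-decoder in PyTorch; `embed_dim`, `hidden_dim` and the teacher-forcing interface are illustrative assumptions (in practice the ResNet-152 weights would be pre-trained, as stated above):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageAnnotationNet(nn.Module):
    """Encoder: ResNet-152 without its final FC layer, followed by a linear
    projection. Decoder: an LSTM that emits word logits per step."""
    def __init__(self, embed_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        resnet = models.resnet152(weights=None)           # pre-trained in practice
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.project = nn.Linear(resnet.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images).flatten(1)           # (B, 2048)
        feats = self.project(feats).unsqueeze(1)          # (B, 1, E)
        inputs = torch.cat([feats, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                           # word logits per step
```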
The image annotation network is trained as follows:
inputting the real image into the image annotation network to obtain text output codes;
and comparing the text output codes with the text features corresponding to the real image for training, thereby obtaining the trained image annotation network.
The corresponding image is then input into the trained image annotation network to generate the predicted text.
S104, inputting the predicted text and the original text into the trained Siamese neural network to obtain the similarity between them.
As shown in fig. 4, the Siamese neural network mainly comprises a text feature extraction network and a pooling layer.
In the text-image pair database, texts belonging to different images are taken as negative sample pairs and texts of the same image as positive sample pairs; the target similarity of positive pairs is set to 0.8 and that of negative pairs to 0.5. The Siamese network takes two texts as input, obtains their representations in an embedded high-dimensional space, and then computes the degree of similarity between these two representations. Specifically, the two texts are each fed into the text feature extraction network; BERT is the extraction network chosen in this embodiment. To map the extracted features into the same dimension so that similarity can be computed, the features are passed through a pooling layer, yielding two feature vectors U and V. The similarity of the two vectors is then computed with cosine similarity, so that positive pairs score higher and negative pairs lower:

sim(U, V) = \frac{\sum_{i} U_i V_i}{\sqrt{\sum_{i} U_i^2}\,\sqrt{\sum_{i} V_i^2}}

where U_i and V_i denote the i-th components of U and V, respectively.
The positive and negative sample pairs are fed into the Siamese network to obtain predicted similarities, and the network parameters are updated from the difference between the target and predicted similarities, yielding the trained Siamese neural network.
The predicted text and the original text are then input into the trained Siamese network, which computes the similarity of the two texts.
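A minimal sketch of this similarity computation, assuming the Hugging Face `transformers` API with mean pooling; the checkpoint name `bert-base-uncased` is an illustrative assumption (the embodiment only states that BERT is used):

```python
import torch
from transformers import AutoModel, AutoTokenizer

def text_similarity(text_a: str, text_b: str) -> float:
    """Embed both texts with a shared BERT encoder, mean-pool the token
    states, and return their cosine similarity."""
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = AutoModel.from_pretrained("bert-base-uncased")
    with torch.no_grad():
        u = enc(**tok(text_a, return_tensors="pt")).last_hidden_state.mean(1)
        v = enc(**tok(text_b, return_tensors="pt")).last_hidden_state.mean(1)
    return torch.cosine_similarity(u, v).item()
```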
S105, training the multi-stage GAN according to the similarity between the predicted text and the original text to obtain the trained multi-stage GAN.
Through steps S101-S104, multiple similarities between predicted texts and original texts are obtained, and the multi-stage GAN is trained according to them.
Each round of multi-stage GAN training is divided into two phases:
(1) First, the parameters of all generators are fixed, and the discriminator parameters are updated using the discriminator loss function;
(2) Then, the parameters of all discriminators are fixed, and the generator parameters are updated using the generator loss function and the similarity obtained from the Siamese neural network.
Training proceeds in this alternating fashion (as sketched below) for 600 rounds with the learning rate set to 0.0002, yielding the trained multi-stage GAN.
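A minimal sketch of one such alternating training round; every name here (`loader`, `text_encoder`, `cond_aug`, `multi_stage_gan`, `word_features`, `D`, `siamese`, `captioner_decode`, the optimizers) is an assumed placeholder for a component described above, not an identifier from the patent:

```python
# One training epoch: phase 1 updates discriminators, phase 2 updates generators.
for epoch in range(600):
    for text, real_image in loader:
        c, kl = cond_aug(text_encoder(text))       # conditioning augmentation
        fake = multi_stage_gan(c, word_features(text))

        # Phase 1: freeze generators, update discriminators.
        d_loss = discriminator_loss(D(real_image, c), D(fake.detach(), c))
        d_optim.zero_grad(); d_loss.backward(); d_optim.step()

        # Phase 2: freeze discriminators, update generators using the
        # adversarial loss plus the Siamese-network similarity.
        sim = siamese(captioner_decode(fake), text)
        g_loss = generator_loss(D(fake, c), sim, kl)
        g_optim.zero_grad(); g_loss.backward(); g_optim.step()
```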
S106, inputting text provided by a user into the trained multi-stage GAN to generate the image corresponding to the input text.
By inputting text into the trained multi-stage GAN, the user can generate the corresponding target image; the discriminators, the image annotation network and the Siamese neural network are no longer needed at this stage.
Those skilled in the art will appreciate that all or part of the steps in a method implementing the above embodiments may be implemented by a program to instruct related hardware, and the corresponding program may be stored in a computer readable storage medium.
It should be noted that although the method operations of the above embodiments are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order or that all illustrated operations be performed in order to achieve desirable results. Rather, the depicted steps may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
Example 2:
As shown in fig. 5, this embodiment provides an apparatus for generating an image from text. The apparatus comprises a text-image pair acquisition module 501, a predicted image generation module 502, a predicted text generation module 503, a similarity calculation module 504, a multi-stage GAN training module 505 and a text-to-image module 506, wherein:
the text-image pair acquisition module 501 is configured to acquire a text-image pair from a database, the text-image pair comprising a text and an image, wherein the text is a description of the image and serves as the original text;
the predicted image generation module 502 is configured to input the original text into a multi-stage GAN to obtain a corresponding image;
the predicted text generation module 503 is configured to input the corresponding image into a trained image annotation network to generate a predicted text;
the similarity calculation module 504 is configured to input the predicted text and the original text into a trained Siamese neural network to obtain the similarity between them;
the multi-stage GAN training module 505 is configured to train the multi-stage GAN according to the similarity between the predicted text and the original text to obtain a trained multi-stage GAN;
and the text-to-image module 506 is configured to input text provided by a user into the trained multi-stage GAN and generate the image corresponding to that text.
For the specific implementation of each module in this embodiment, refer to embodiment 1 above; details are not repeated here. It should be noted that the apparatus provided in this embodiment is described only in terms of the division of the above functional modules; in practical applications, these functions may be assigned to different functional modules as needed, i.e. the internal structure may be divided into different functional modules to perform all or part of the functions described above.
Example 3:
This embodiment provides a computer device, which may be a computer. As shown in fig. 6, its components are connected through a system bus 601: the processor 602 provides computing and control capabilities, and the memory comprises a nonvolatile storage medium 606 and an internal memory 607. The nonvolatile storage medium 606 stores an operating system, a computer program and a database, while the internal memory 607 provides the runtime environment for the operating system and the computer program in the nonvolatile storage medium. When the processor 602 executes the computer program stored in the memory, the method for generating an image from text of embodiment 1 is implemented as follows:
acquiring a text-image pair from a database, the text-image pair comprising a text and an image, wherein the text is a description of the image and serves as the original text;
inputting the original text into a multi-stage GAN to obtain a corresponding image;
inputting the corresponding image into a trained image annotation network to generate a predicted text;
inputting the predicted text and the original text into a trained Siamese neural network to obtain the similarity between the predicted text and the original text;
training the multi-stage GAN according to the similarity between the predicted text and the original text to obtain a trained multi-stage GAN;
and inputting text provided by a user into the trained multi-stage GAN to generate the image corresponding to that text.
Example 4:
This embodiment provides a storage medium, namely a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for generating an image from text of embodiment 1, as follows:
acquiring a text-image pair from a database, the text-image pair comprising a text and an image, wherein the text is a description of the image and serves as the original text;
inputting the original text into a multi-stage GAN to obtain a corresponding image;
inputting the corresponding image into a trained image annotation network to generate a predicted text;
inputting the predicted text and the original text into a trained Siamese neural network to obtain the similarity between the predicted text and the original text;
training the multi-stage GAN according to the similarity between the predicted text and the original text to obtain a trained multi-stage GAN;
and inputting text provided by a user into the trained multi-stage GAN to generate the image corresponding to that text.
The computer readable storage medium of the present embodiment may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In summary, the multi-stage GAN constructed by the invention mainly comprises generators, discriminators and attention mechanism modules: the generators produce the corresponding images, the discriminators judge whether a generated image is a real image, and the attention mechanism obtains an attention-weighted feature map of the image generated by the lower-stage generator. The text alignment mainly comprises the image annotation network and the Siamese neural network: the image annotation network takes the image produced by a generator as input and outputs the descriptive text corresponding to that image, while the Siamese network computes the similarity between two pieces of text. The original text input to the GAN and the text output by the image annotation network serve as the two inputs of the Siamese network, yielding the degree of similarity between the two texts, i.e. the degree of text alignment, and this alignment degree is used to update the GAN parameters. Compared with the prior art, the invention can improve the semantic consistency between the generated image and the text.
The above embodiments are only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or modification that a person skilled in the art can make according to the technical solution and the inventive concept disclosed in this patent falls within the protection scope of the present invention.

Claims (6)

1. A method for generating an image from text, the method comprising:
acquiring a text-image pair from a database, the text-image pair comprising a text and an image, wherein the text is a description of the image and serves as the original text;
inputting the original text into a multi-stage generative adversarial network (GAN) to obtain a corresponding image, the multi-stage GAN comprising n generators, n discriminators and n-1 attention mechanism modules, each discriminator corresponding to one generator, n being a positive integer greater than 1;
inputting the corresponding image into a trained image annotation network to generate a predicted text;
inputting the predicted text and the original text into a trained Siamese neural network to obtain the similarity between the predicted text and the original text;
training the multi-stage GAN according to the similarity between the predicted text and the original text to obtain a trained multi-stage GAN;
and inputting text provided by a user into the trained multi-stage GAN to generate the image corresponding to that text;
wherein inputting the original text into the multi-stage GAN to obtain a corresponding image specifically comprises:
inputting the original text into a text encoder to obtain a text feature vector before it enters the multi-stage GAN;
applying conditioning augmentation to the sentence embedding feature vector of the text feature vector before it enters a generator of the multi-stage GAN, obtaining an enhanced sentence embedding feature vector;
when i is 1, inputting the enhanced sentence embedding feature vector into the i-th generator to obtain the output features of the i-th generator;
when i is a positive integer greater than 1 and less than or equal to n, inputting the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module to obtain the important parts of those output features, then inputting these important parts into the i-th generator to obtain the output features of the i-th generator;
taking the output features of the n-th generator as the corresponding image;
as the generator index increases, the resolution of the generator's output image gradually increases;
wherein training the multi-stage GAN according to the similarity between the predicted text and the original text specifically comprises:
each round of multi-stage GAN training consisting of two phases:
fixing the parameters of all generators and updating the discriminator parameters using the discriminator loss function;
fixing the parameters of all discriminators and updating the generator parameters using the generator loss function and the similarity between the predicted text and the original text;
and performing multiple rounds of training on the multi-stage GAN using the similarities between the predicted texts and the original texts, thereby obtaining the trained multi-stage GAN;
wherein the image in the text-image pair corresponding to the original text is taken as the real image;
the input of the t-th discriminator comprises the output features of the t-th generator and the real image, t being a positive integer with 1 ≤ t ≤ n;
when t is 1, the input of the t-th discriminator further comprises the sentence embedding feature vector;
when t is greater than 1 and less than or equal to n, the input of the t-th discriminator further comprises the word embedding feature matrix of the text feature vector;
the loss function of the discriminator being:

L_D = -\mathbb{E}_{I \sim p_{data}}[\log D(I, e)] - \mathbb{E}_{s_0 \sim p_G}[\log(1 - D(G(s_0, c), e))]

where e is the sentence embedding feature vector or the word embedding feature matrix, I is a real image, s_0 denotes the output features of the previous generator, c is the enhanced sentence embedding feature vector, G(s_0, c) is the output feature of the generator, IC is the image annotation network, D(·) is the output of the discriminator, and sim is the similarity between the predicted text and the original text;
and the loss function of the generator being:

L_G = -\mathbb{E}_{s_0 \sim p_G}[\log D(G(s_0, c), e)] + (1 - \mathrm{sim}) + D_{KL}\big(\mathcal{N}(\mu(\bar{e}), \Sigma(\bar{e})) \,\|\, \mathcal{N}(0, I)\big)

where D_{KL} is the KL divergence between the Gaussian distribution of the text feature vector and the standard Gaussian distribution.
2. The method for generating an image from text according to claim 1, wherein inputting the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module to obtain the important parts of those output features specifically comprises:
inputting the word embedding feature matrix of the text feature vector and the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module;
and computing, through the attention mechanism of the (i-1)-th attention mechanism module, the parts of the (i-1)-th generator's output features most relevant to the keywords of the original text, thereby obtaining the important parts of those output features.
3. The method for generating an image from text according to claim 1, wherein the image annotation network comprises an encoder and a decoder, the encoder comprising a convolutional neural network and a linear transformation, and the decoder comprising an LSTM network;
and wherein inputting the corresponding image into the trained image annotation network to generate a predicted text specifically comprises:
inputting the corresponding image into the convolutional neural network to obtain the feature matrix of the image;
applying the linear transformation to the feature matrix of the image to obtain the transformed feature matrix;
and inputting the transformed feature matrix into the LSTM network to generate the predicted text.
4. The method for generating an image from text according to claim 1, wherein the Siamese neural network comprises a text feature extraction network and a pooling layer;
and wherein inputting the predicted text and the original text into the trained Siamese neural network to obtain their similarity specifically comprises:
inputting the predicted text and the original text separately into the text feature extraction network to obtain their extracted text features;
inputting the extracted text features into the pooling layer to obtain a feature vector U and a feature vector V;
and computing the similarity of the feature vectors U and V as their cosine similarity:

sim(U, V) = \frac{\sum_{i} U_i V_i}{\sqrt{\sum_{i} U_i^2}\,\sqrt{\sum_{i} V_i^2}}

where U_i and V_i are the i-th components of U and V, respectively.
5. An apparatus for generating an image from text, the apparatus comprising:
a text-image pair acquisition module, configured to acquire a text-image pair from a database, the text-image pair comprising a text and an image, wherein the text is a description of the image and serves as the original text;
a predicted image generation module, configured to input the original text into a multi-stage generative adversarial network (GAN) to obtain a corresponding image, the multi-stage GAN comprising n generators, n discriminators and n-1 attention mechanism modules, each discriminator corresponding to one generator, n being a positive integer greater than 1;
a predicted text generation module, configured to input the corresponding image into a trained image annotation network to generate a predicted text;
a similarity calculation module, configured to input the predicted text and the original text into a trained Siamese neural network to obtain the similarity between the predicted text and the original text;
a multi-stage GAN training module, configured to train the multi-stage GAN according to the similarity between the predicted text and the original text to obtain a trained multi-stage GAN;
and a text-to-image module, configured to input text provided by a user into the trained multi-stage GAN and generate the image corresponding to that text;
wherein inputting the original text into the multi-stage GAN to obtain a corresponding image specifically comprises:
inputting the original text into a text encoder to obtain a text feature vector before it enters the multi-stage GAN;
applying conditioning augmentation to the sentence embedding feature vector of the text feature vector before it enters a generator of the multi-stage GAN, obtaining an enhanced sentence embedding feature vector;
when i is 1, inputting the enhanced sentence embedding feature vector into the i-th generator to obtain the output features of the i-th generator;
when i is a positive integer greater than 1 and less than or equal to n, inputting the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module to obtain the important parts of those output features, then inputting these important parts into the i-th generator to obtain the output features of the i-th generator;
taking the output features of the n-th generator as the corresponding image;
as the generator index increases, the resolution of the generator's output image gradually increases;
wherein training the multi-stage GAN according to the similarity between the predicted text and the original text specifically comprises:
each round of multi-stage GAN training consisting of two phases:
fixing the parameters of all generators and updating the discriminator parameters using the discriminator loss function;
fixing the parameters of all discriminators and updating the generator parameters using the generator loss function and the similarity between the predicted text and the original text;
and performing multiple rounds of training on the multi-stage GAN using the similarities between the predicted texts and the original texts, thereby obtaining the trained multi-stage GAN;
wherein the image in the text-image pair corresponding to the original text is taken as the real image;
the input of the t-th discriminator comprises the output features of the t-th generator and the real image, t being a positive integer with 1 ≤ t ≤ n;
when t is 1, the input of the t-th discriminator further comprises the sentence embedding feature vector;
when t is greater than 1 and less than or equal to n, the input of the t-th discriminator further comprises the word embedding feature matrix of the text feature vector;
the loss function of the discriminator being:

L_D = -\mathbb{E}_{I \sim p_{data}}[\log D(I, e)] - \mathbb{E}_{s_0 \sim p_G}[\log(1 - D(G(s_0, c), e))]

where e is the sentence embedding feature vector or the word embedding feature matrix, I is a real image, s_0 denotes the output features of the previous generator, c is the enhanced sentence embedding feature vector, G(s_0, c) is the output feature of the generator, IC is the image annotation network, D(·) is the output of the discriminator, and sim is the similarity between the predicted text and the original text;

and the loss function of the generator being:

L_G = -\mathbb{E}_{s_0 \sim p_G}[\log D(G(s_0, c), e)] + (1 - \mathrm{sim}) + D_{KL}\big(\mathcal{N}(\mu(\bar{e}), \Sigma(\bar{e})) \,\|\, \mathcal{N}(0, I)\big)

where D_{KL} is the KL divergence between the Gaussian distribution of the text feature vector and the standard Gaussian distribution.
6. A storage medium storing a program which, when executed by a processor, implements the method for generating an image from text according to any one of claims 1 to 4.
CN202111072292.6A (priority date 2021-09-14, filing date 2021-09-14): Method, apparatus, computer device and storage medium for text generation image; granted as CN113961736B (en); status Active

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111072292.6A (granted as CN113961736B): Method, apparatus, computer device and storage medium for text generation image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111072292.6A (granted as CN113961736B): Method, apparatus, computer device and storage medium for text generation image

Publications (2)

Publication Number Publication Date
CN113961736A CN113961736A (en) 2022-01-21
CN113961736B (en) 2024-07-23

Family

ID=79461560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111072292.6A Active CN113961736B (en) 2021-09-14 2021-09-14 Method, apparatus, computer device and storage medium for text generation image

Country Status (1)

Country Link
CN (1) CN113961736B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119811B (en) * 2022-01-28 2022-04-01 北京智谱华章科技有限公司 Image generation method and device and electronic equipment
CN115622806B (en) * 2022-12-06 2023-03-31 南京众智维信息科技有限公司 Network intrusion detection method based on BERT-CGAN
CN116309913B (en) * 2023-03-16 2024-01-26 沈阳工业大学 Method for generating image based on ASG-GAN text description of generation countermeasure network
CN116645668B (en) * 2023-07-21 2023-10-20 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN117953108B (en) * 2024-03-20 2024-07-05 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259180A (en) * 2020-01-14 2020-06-09 广州视源电子科技股份有限公司 Image pushing method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10452902B1 (en) * 2018-12-21 2019-10-22 Capital One Services, Llc Patent application image generation systems
CN111260740B (en) * 2020-01-16 2023-05-23 华南理工大学 Text-to-image generation method based on generation countermeasure network
GB2597111A (en) * 2020-07-14 2022-01-19 Carv3D Ltd Computer vision method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259180A (en) * 2020-01-14 2020-06-09 广州视源电子科技股份有限公司 Image pushing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113961736A (en) 2022-01-21

Similar Documents

Publication Publication Date Title
CN113961736B (en) Method, apparatus, computer device and storage medium for text generation image
Agnese et al. A survey and taxonomy of adversarial neural networks for text‐to‐image synthesis
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
Walker et al. Predicting video with vqvae
CN113361250A (en) Bidirectional text image generation method and system based on semantic consistency
CN111325660A (en) Remote sensing image style conversion method based on text data
CN113420179B (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN114821050A (en) Named image segmentation method based on transformer
CN116128998A (en) Multi-path parallel text-to-image generation method and system
CN116309913B (en) Method for generating image based on ASG-GAN text description of generation countermeasure network
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
CN112949707A (en) Cross-mode face image generation method based on multi-scale semantic information supervision
US20230186600A1 (en) Method of clustering using encoder-decoder model based on attention mechanism and storage medium for image recognition
CN116168394A (en) Image text recognition method and device
CN114529785A (en) Model training method, video generation method and device, equipment and medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN113642480A (en) Character recognition method, device, equipment and storage medium
CN117235605B (en) Sensitive information classification method and device based on multi-mode attention fusion
CN115292439A (en) Data processing method and related equipment
CN117011416A (en) Image processing method, device, equipment, medium and program product
CN116975347A (en) Image generation model training method and related device
CN114140858A (en) Image processing method, device, equipment and storage medium
CN115116470A (en) Audio processing method and device, computer equipment and storage medium
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant