
CN112614199B - Semantic segmentation image conversion method, device, computer equipment and storage medium


Info

Publication number: CN112614199B (granted publication of application CN202011321375.XA; earlier published as CN112614199A)
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 孟云龙
Assignee (original and current): Shanghai Eye Control Technology Co Ltd
Legal status: Active
Prior art keywords: image, live-action, initial, adversarial network
Events: application filed by Shanghai Eye Control Technology Co Ltd; publication of CN112614199A; application granted; publication of CN112614199B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 - 2D [Two Dimensional] image generation
    • G06T11/001 - Texturing; Colouring; Generation of texture or colour
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V10/267 - Segmentation by performing operations on regions, e.g. growing, shrinking or watersheds


Abstract

The application relates to a semantic segmentation image conversion method, apparatus, computer device, and storage medium. The method comprises the following steps: receiving an initial live-action image acquired by an acquisition device; performing semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to it; and inputting the semantic segmentation image into a multimodal conditional generative adversarial network (multimodal cGAN), which outputs a plurality of live-action images corresponding to the semantic segmentation image, each with a different content modality. The multimodal cGAN is trained with a difference index determined by a preset regularization function, and the preset regularization function is determined from the live-action images of different content modalities and the generation parameters of the corresponding live-action images. The method improves the diversity of the generated live-action images.

Description

Semantic segmentation image conversion method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of image generation, and in particular, to a semantic segmentation image conversion method, apparatus, computer device, and storage medium.
Background
Before a neural network model is applied to business processing, it must be trained on a large number of training set images. When such images are difficult to collect, image conversion can be applied to the initial images that have been collected, producing a large volume of converted training set images.
In the conventional method, image conversion is generally performed with conditional generative adversarial networks (cGANs).
However, this method suffers from mode collapse: only a one-to-one mapping can be produced, that is, only a single corresponding image can be output for a given input image, so the output is uniform and lacks diversity.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a semantic segmentation image conversion method, apparatus, computer device, and storage medium capable of improving the diversity of the live-action images converted from semantic segmentation images.
A semantic segmentation image conversion method, the method comprising:
Receiving an initial live-action image acquired by an acquisition device;
Carrying out semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image;
Inputting the semantic segmentation image into a multimodal conditional generative adversarial network (hereinafter, multimodal cGAN) and outputting a plurality of live-action images corresponding to the semantic segmentation image, wherein each of the plurality of live-action images has a different content modality, the multimodal cGAN is trained with a difference index determined by a preset regularization function, and the preset regularization function is determined from the live-action images of different content modalities and the generation parameters of the corresponding live-action images.
In one embodiment, before the semantic segmentation processing is performed on the initial live-action image to generate the semantic segmentation image, the method further includes:
Performing image size normalization on the initial live-action image to obtain a size-normalized initial live-action image;
In this case, performing semantic segmentation processing on the initial live-action image to generate the semantic segmentation image corresponding to it includes:
Performing semantic segmentation processing on the size-normalized initial live-action image to generate the semantic segmentation image corresponding to the size-normalized initial live-action image.
In one embodiment, inputting the semantic segmentation image into the multimodal cGAN and outputting a plurality of live-action images corresponding to it includes:
Performing feature extraction on the semantic segmentation image through an encoder in the multimodal cGAN to generate a feature map of the semantic segmentation image;
Decoding and converting the feature map through a generator in the multimodal cGAN to obtain a plurality of live-action images corresponding to the semantic segmentation image.
In one embodiment, decoding and converting the feature map through the generator in the multimodal cGAN to obtain a plurality of live-action images corresponding to the semantic segmentation image includes:
Configuring a plurality of different generation parameters through the generator, and generating a plurality of live-action images corresponding to the semantic segmentation image from each generation parameter and the feature map.
In one embodiment, the multimodal cGAN is generated as follows:
Acquiring training set images;
Inputting the training set images into a constructed initial multimodal cGAN and generating corresponding predicted live-action images based on determined generation parameters;
Generating image sets corresponding to the predicted live-action images from the predicted live-action images and the training set images;
Inputting each image set into a discriminator for real/fake discrimination and outputting the corresponding discrimination results;
Adjusting the network parameters of the initial multimodal cGAN based on the discrimination results, and iteratively training the parameter-adjusted initial multimodal cGAN for a preset number of iterations to obtain the trained multimodal cGAN.
In one embodiment, after the parameter-adjusted initial multimodal cGAN has been iteratively trained for the preset number of iterations, the method further includes:
Storing the initial multimodal cGAN of each training iteration together with the corresponding training index value;
In this case, obtaining the trained multimodal cGAN includes:
Determining the stored initial multimodal cGAN corresponding to the highest training index value to be the trained multimodal cGAN.
In one embodiment, adjusting the network parameters of the initial multimodal cGAN based on the discrimination results and iteratively training the parameter-adjusted network for the preset number of iterations to obtain the trained multimodal cGAN includes:
Calculating a loss value of the initial multimodal cGAN from the discrimination results, and performing a first adjustment of its network parameters based on the loss value to obtain a first-adjusted initial multimodal cGAN;
Determining the difference index between the predicted live-action images whose discrimination results are real, based on the preset regularization function and the corresponding generation parameters, and performing a second adjustment of the network parameters based on the difference index to obtain a second-adjusted initial multimodal cGAN;
Iteratively training the first- and second-adjusted initial multimodal cGAN for the preset number of iterations to obtain the trained multimodal cGAN.
A semantic segmentation image conversion apparatus, the apparatus comprising:
An initial live-action image receiving module, configured to receive the initial live-action image acquired by the acquisition device;
A semantic segmentation processing module, configured to perform semantic segmentation processing on the initial live-action image and generate the semantic segmentation image corresponding to it;
A multimodal live-action image generation module, configured to input the semantic segmentation image into the multimodal cGAN and output a plurality of live-action images corresponding to it, wherein each of the plurality of live-action images has a different content modality, the multimodal cGAN is trained with a difference index determined by a preset regularization function, and the preset regularization function is determined from the live-action images of different content modalities and the generation parameters of the corresponding live-action images.
A computer device, comprising a memory storing a computer program and a processor that, when executing the computer program, implements the steps of the method of any of the embodiments above.
A computer-readable storage medium, on which a computer program is stored that, when executed by a processor, implements the steps of the method of any of the embodiments above.
According to the above semantic segmentation image conversion method, apparatus, computer device, and storage medium, the semantic segmentation image is obtained and input into the multimodal cGAN, which outputs a plurality of live-action images corresponding to it, each with a different content modality. Because the multimodal cGAN is trained with the difference index determined by the preset regularization function, and the preset regularization function is determined from the live-action images of different content modalities and the generation parameters of the corresponding live-action images, the diversity of the live-action images generated by the multimodal cGAN is improved.
Drawings
FIG. 1 is an application scenario diagram of the semantic segmentation image conversion method in one embodiment;
FIG. 2 is a flow diagram of the semantic segmentation image conversion method in one embodiment;
FIG. 3 is a flow diagram of the semantic segmentation image conversion step in another embodiment;
FIG. 4 is a flow diagram of the steps for generating a plurality of live-action images in one embodiment;
FIG. 5 is a flow diagram of the iterative training steps of the network in one embodiment;
FIG. 6 is a block diagram of the semantic segmentation image conversion apparatus in one embodiment;
FIG. 7 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
The present application is described in further detail below with reference to the drawings and embodiments, in order to make its objects, technical solutions, and advantages more apparent. It should be understood that the specific embodiments described here are for purposes of illustration only and are not intended to limit the scope of the application.
The semantic segmentation image conversion method provided by the application can be applied in the environment shown in FIG. 1, in which the acquisition device 102 communicates with the server 104 over a network. The acquisition device 102 captures an initial live-action image and sends it to the server 104 over the network. After receiving the initial live-action image, the server 104 may perform semantic segmentation processing on it to generate the corresponding semantic segmentation image. The server 104 may then input the semantic segmentation image into a multimodal conditional generative adversarial network (multimodal cGAN) and output a plurality of live-action images corresponding to it. Each of these live-action images has a different content modality; the multimodal cGAN is trained with a difference index determined by a preset regularization function, and that function is determined from the live-action images of different content modalities and the generation parameters of the corresponding live-action images. The acquisition device 102 may be, but is not limited to, any device with image capture and transmission capability, such as a camera or video recorder, and the server 104 may be implemented as a stand-alone server or as a cluster of servers.
In one embodiment, as shown in FIG. 2, a semantic segmentation image conversion method is provided. Taking its application to the server in FIG. 1 as an illustration, the method includes the following steps:
Step S202: receive an initial live-action image acquired by an acquisition device.
An initial live-action image is an image captured from a real-life scene by an acquisition device; a road scene image is one example.
In this embodiment, real scenes impose limits: when live-action images are captured by the acquisition device, images meeting the preset requirements may be unobtainable, or too few images may be obtainable. The method is therefore used to generate a large number of live-action images from the small number of initial live-action images actually acquired.
In this embodiment, after the acquisition device captures the initial live-action image, it may transmit the image to the server over the network for further processing.
Step S204: perform semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to it.
A semantic segmentation image is produced by classifying every pixel of an image, determining the category of each point, and rendering the original image in category form; the conditional input image in FIG. 3 is an example.
In this embodiment, once the server has the initial live-action image, it may run semantic segmentation processing on it to obtain the corresponding semantic segmentation image, using, for example, region-based semantic segmentation, fully convolutional network semantic segmentation, or weakly supervised semantic segmentation.
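As an illustrative sketch only: the patent does not name a specific segmentation network, so torchvision's FCN-ResNet50 stands in for whatever model the server actually uses, and the input shape and value range are assumptions.

    import torch
    from torchvision import models

    # Stand-in segmentation network; any of the methods named above would do.
    seg_model = models.segmentation.fcn_resnet50(weights="DEFAULT").eval()

    with torch.no_grad():
        x = torch.rand(1, 3, 256, 256)       # initial live-action image in [0, 1] (shape assumed)
        logits = seg_model(x)["out"]         # (1, num_classes, 256, 256) per-pixel class scores
        label_map = logits.argmax(dim=1)     # per-pixel category: the semantic segmentation image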
Step S206: input the semantic segmentation image into a multimodal cGAN and output a plurality of live-action images corresponding to the semantic segmentation image, where each of the plurality of live-action images has a different content modality, the multimodal cGAN is trained with a difference index determined by a preset regularization function, and the preset regularization function is determined from the live-action images of different content modalities and the generation parameters of the corresponding live-action images.
In this embodiment, after the server obtains the semantic segmentation image, it may input the image into the pre-trained multimodal cGAN to output a plurality of live-action images of different content modalities.
With continued reference to FIG. 3, as shown at 301 and 302, the live-action images generated by the multimodal cGAN differ in content modality, for example in color and pixel values.
In this embodiment, the multimodal cGAN may be trained with a difference index determined by a preset regularization function, which can be expressed as formula (1):

    L_diff = ||G(E(x), z₁) − G(E(x), z₂)||₁ / ||z₁ − z₂||₁        (1)

where G(E(x), z₁) is the live-action image output for generation parameter z₁ and G(E(x), z₂) is the live-action image output for generation parameter z₂. When training the multimodal cGAN, the server drives the value of the difference index determined by this function as high as possible, that is, it makes the difference between the output live-action images G(E(x), z₁) and G(E(x), z₂) as large as possible, so that the trained multimodal cGAN can generate multiple live-action images with large differences between them.
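A minimal PyTorch reading of formula (1), assuming an L1 distance for both the images and the latent vectors (the text does not pin down the norm):

    import torch

    def difference_index(img1, img2, z1, z2, eps=1e-8):
        """Difference index of formula (1): distance between the two generated
        live-action images G(E(x), z1) and G(E(x), z2), normalised by the
        distance between the generation parameters z1 and z2.
        eps guards against z1 == z2."""
        num = torch.mean(torch.abs(img1 - img2))
        den = torch.mean(torch.abs(z1 - z2))
        return num / (den + eps)

    # Training drives this value up, pushing the two outputs apart.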
In the semantic segmentation image conversion method above, the semantic segmentation image is obtained and input into the multimodal cGAN, which outputs a plurality of live-action images corresponding to it, each with a different content modality. Because the multimodal cGAN is trained with the difference index determined by the preset regularization function, and that function is determined from the live-action images of different content modalities and the generation parameters of the corresponding live-action images, the diversity of the generated live-action images is improved.
In one embodiment, before the semantic segmentation processing is performed on the initial live-action image to generate the semantic segmentation image, the method may further include: performing image size normalization on the initial live-action image to obtain a size-normalized initial live-action image.
Specifically, different terminals may capture initial live-action images of different sizes, so the initial live-action images the server receives may be inconsistent in size.
In this embodiment, after receiving the initial live-action image, the server may preprocess it, for example by normalizing the image size, adjusting the initial live-action image to a preset size.
Specifically, to normalize the image size, the server may determine whether the aspect ratio of the initial live-action image matches that of the preset size. If not, the server pads the initial live-action image with a preset pixel value until the aspect ratios match and then scales the image to the preset size; if so, the server directly scales the image to the preset size.
For example, with a preset size of 256×256, if the initial live-action image received from the terminal is 255×256, the server may pad its length to 256 using a preset pixel value, for example 0-pixels; likewise, when the image is 256×255, the server may pad its width to 256 with 0-pixels.
Optionally, when the initial live-action image received by the server is 512×510, it may first be padded to 512×512 with 0-pixels to match the preset aspect ratio and then downscaled to 256×256; when the image is 128×123, it may be padded to 128×128 with 0-pixels and then upscaled to 256×256.
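The padding-then-scaling rule above can be sketched as follows; the 0-pixel fill and the 256×256 preset size come from the examples, while the tensor layout and bilinear resampling are assumptions.

    import torch
    import torch.nn.functional as F

    def normalize_size(img: torch.Tensor, target: int = 256, fill: float = 0.0) -> torch.Tensor:
        """Pad a (C, H, W) image with the preset pixel value until it is square
        (matching the 1:1 preset aspect ratio), then scale it to target x target."""
        _, h, w = img.shape
        side = max(h, w)
        # F.pad pads the last two dims in (left, right, top, bottom) order
        img = F.pad(img, (0, side - w, 0, side - h), value=fill)
        img = F.interpolate(img.unsqueeze(0), size=(target, target),
                            mode="bilinear", align_corners=False)
        return img.squeeze(0)

    print(normalize_size(torch.rand(3, 510, 512)).shape)  # torch.Size([3, 256, 256])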
In other embodiments, the server may also apply further preprocessing to the initial live-action image, such as brightness adjustment, and then operate on the preprocessed image.
In this embodiment, performing semantic segmentation processing on the initial live-action image to generate the corresponding semantic segmentation image may include: performing semantic segmentation processing on the size-normalized initial live-action image to generate the semantic segmentation image corresponding to it.
Specifically, after obtaining the size-normalized initial live-action image, the server may run semantic segmentation processing on it, for example region-based semantic segmentation, fully convolutional network semantic segmentation, or weakly supervised semantic segmentation, to obtain the semantic segmentation image corresponding to the size-normalized initial live-action image.
Those skilled in the art will understand that, in this embodiment, the image size normalization may instead follow the semantic segmentation: the initial live-action image is segmented first to obtain a semantic segmentation image, and the size normalization is then applied to that image so that its size meets the input requirements of the multimodal cGAN.
In the above embodiment, normalizing the size of the initial live-action image before semantic segmentation ensures that the output semantic segmentation image meets the input requirements of the multimodal cGAN, which improves the accuracy of the live-action images the network generates.
In one embodiment, referring to FIG. 4, the server inputting the semantic segmentation image into the multimodal cGAN and outputting a plurality of live-action images corresponding to it may include:
Step S402: perform feature extraction on the semantic segmentation image through an encoder in the multimodal cGAN to generate a feature map of the semantic segmentation image.
In this embodiment, the multimodal cGAN may include an encoder and a corresponding generator (or decoder). After the server inputs the semantic segmentation image into the multimodal cGAN, the encoder performs feature extraction on it to obtain the corresponding feature map.
In this embodiment, the encoder may include several convolution layers of decreasing spatial size and gradually increasing channel count, for example eight convolution layers whose channel counts are, in order, 64, 128, 512, 512, 512, 512, 512, and 512.
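A sketch of such an encoder under pix2pix-style assumptions (4×4 kernels, stride 2, LeakyReLU), none of which the text specifies; only the channel widths come from the description.

    import torch
    import torch.nn as nn

    def down(c_in, c_out):
        # stride-2 convolution halves the spatial size (kernel/stride assumed)
        return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                             nn.LeakyReLU(0.2, inplace=True))

    widths = [64, 128, 512, 512, 512, 512, 512, 512]   # channel counts from the text
    blocks, prev = [], 3
    for c in widths:
        blocks.append(down(prev, c))
        prev = c
    encoder = nn.Sequential(*blocks)

    feat = encoder(torch.rand(1, 3, 256, 256))   # a 256x256 input ends up as (1, 512, 1, 1)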
Step S404: decode and convert the feature map through a generator in the multimodal cGAN to obtain a plurality of live-action images corresponding to the semantic segmentation image.
The generator may adopt a multi-layer network structure; examples include a generator with skip connections and a generic generator.
In one embodiment, the generator with skip connections may include several convolution layers, Dropout layers, and a Tanh layer; specifically, a first 512-channel convolution layer, a 1024-channel convolution layer, a Dropout layer, a 1024-channel convolution layer, a 512-channel convolution layer, a 128-channel convolution layer, a 3-channel convolution layer, and a Tanh layer.
In another embodiment, the generic generator may include multiple convolution layers, for example a 64-channel convolution layer, a 128-channel convolution layer, a 256-channel convolution layer, and several 512-channel convolution layers.
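A simplified decoder sketch following the listed layer widths; the transposed convolutions, the broadcast-and-concatenate latent injection, and the 8-dimensional z are all assumptions, and the skip connections of a full U-Net are omitted for brevity.

    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        """Skip-free sketch of the listed stack: 512 -> 1024 -> Dropout -> 1024
        -> 512 -> 128 -> 3 -> Tanh, each transposed convolution doubling the
        spatial size. z is broadcast and concatenated with the feature map."""
        def __init__(self, feat_ch=512, z_dim=8):
            super().__init__()
            layers, prev = [], feat_ch + z_dim
            for i, c in enumerate([512, 1024, 1024, 512, 128]):
                layers += [nn.ConvTranspose2d(prev, c, 4, stride=2, padding=1),
                           nn.ReLU(inplace=True)]
                if i == 1:                      # Dropout after the first 1024-channel layer
                    layers.append(nn.Dropout(0.5))
                prev = c
            layers += [nn.ConvTranspose2d(prev, 3, 4, stride=2, padding=1), nn.Tanh()]
            self.net = nn.Sequential(*layers)

        def forward(self, feat, z):
            zmap = z.view(z.size(0), -1, 1, 1).expand(-1, -1, feat.size(2), feat.size(3))
            return self.net(torch.cat([feat, zmap], dim=1))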
In this embodiment, the server may input the feature map generated by the encoder into the generator for decoding and conversion, obtaining the live-action images corresponding to the semantic segmentation image.
Those skilled in the art will appreciate that the encoder and generator configurations above are merely examples; other configurations are possible in other embodiments, and the application is not limited in this regard.
In one embodiment, decoding and converting the feature map through the generator in the multimodal cGAN to obtain a plurality of live-action images corresponding to the semantic segmentation image may include: configuring a plurality of different generation parameters through the generator, and generating a plurality of live-action images corresponding to the semantic segmentation image from each generation parameter and the feature map.
Specifically, with continued reference to FIG. 3, after the feature map is generated, the server samples two unequal generation parameters, namely latent vectors z₁ and z₂, from the normal distribution N(0, 1) and injects them through the latent space between the encoder and the generator.
In this embodiment, the generator produces a plurality of live-action images corresponding to the semantic segmentation image from the generation parameters and the feature map; that is, it generates the live-action image data G(E(x), z₁) and G(E(x), z₂) based on z₁ and z₂ respectively, as sketched below.
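Reusing the Generator sketch above, the two-latent step looks like this; the batch size, feature-map shape, and z dimension are assumptions.

    import torch

    feat = torch.rand(1, 512, 4, 4)                 # E(x): feature map from the encoder (shape assumed)
    z1, z2 = torch.randn(1, 8), torch.randn(1, 8)   # two unequal latents drawn from N(0, 1)

    g = Generator(feat_ch=512, z_dim=8)
    img1 = g(feat, z1)                              # G(E(x), z1)
    img2 = g(feat, z2)                              # G(E(x), z2): a different content modality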
In the above embodiment, configuring different generation parameters and generating the live-action images from them and the feature map makes the generated live-action images differ from one another.
In one embodiment, the multimodal cGAN may be generated as follows: acquire training set images; input the training set images into a constructed initial multimodal cGAN and generate corresponding predicted live-action images based on determined generation parameters; generate image sets corresponding to the predicted live-action images from the predicted live-action images and the training set images; input each image set into a discriminator for real/fake discrimination and output the corresponding discrimination results; adjust the network parameters of the initial multimodal cGAN based on the discrimination results, and iteratively train the parameter-adjusted network for a preset number of iterations to obtain the trained multimodal cGAN.
In this embodiment, the server acquires a set of paired two-dimensional planar maps and satellite map images and uses the two-dimensional planar maps as the training set images.
The server then randomly samples two unequal generation parameters, latent vectors z₁ and z₂, from the normal distribution N(0, 1) and feeds them into the initial multimodal cGAN through the latent space between the encoder and the generator.
Next, the server performs feature extraction on the training set images through the initial multimodal cGAN and generates the corresponding test map images from the generation parameters and the extracted feature maps.
The server then forms image sets from the test map images and the training set images, inputs each image set into the discriminator for discrimination, and outputs the corresponding discrimination results.
In this embodiment, when the discrimination result output by the discriminator is fake, the server may calculate a loss value from the obtained test map image and the satellite map image paired with the two-dimensional planar map, and update the network parameters of the initial multimodal cGAN based on that loss value.
The server then iteratively trains the parameter-adjusted initial multimodal cGAN for the preset number of iterations, until the discrimination result output by the discriminator is real, obtaining the trained multimodal cGAN.
In the above embodiment, the generated images undergo real/fake discrimination and the corresponding discrimination results are output; the network parameters of the initial multimodal cGAN are then adjusted based on those results, and the adjusted network is iteratively trained for the preset number of iterations. This makes the resulting network more accurate and improves the accuracy of the generated live-action images.
In one embodiment, after the parameter-adjusted initial multimodal cGAN has been iteratively trained for the preset number of iterations, the method may further include: storing the initial multimodal cGAN of each training iteration together with its training index value.
In this embodiment, the server may store the initial multimodal cGAN and the corresponding training index value for each iteration, either while the model is training or after the iterative training of the parameter-adjusted network.
The training index value is an index describing how completely and how well the initial multimodal cGAN has trained; concretely, it may be an index grade or a score.
In other embodiments, the server may instead store only the network parameters of the initial multimodal cGAN of each iteration together with the training index values; the application is not limited in this regard.
In this embodiment, obtaining the trained multimodal cGAN may include: determining the stored initial multimodal cGAN corresponding to the highest training index value to be the trained multimodal cGAN.
Specifically, after completing the preset number of training iterations, the server may sort the training index values of the iterations, determine the initial multimodal cGAN with the highest value to be the trained multimodal cGAN, and use it in subsequent testing and application.
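One plain way to implement the store-and-select rule; the storage format and the index metric are left open by the text, so both are assumptions here.

    import copy

    checkpoints = []   # (training_index_value, network_state) recorded per iteration

    def record(net, index_value):
        """Store this iteration's network parameters with its training index value."""
        checkpoints.append((index_value, copy.deepcopy(net.state_dict())))

    def best_state():
        """Return the stored parameters with the highest training index value."""
        return max(checkpoints, key=lambda c: c[0])[1]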
In the above embodiment, the initial multimodal cGAN of each iteration is stored along with its training index value, and the one with the highest value is selected as the final trained multimodal cGAN. The final network is therefore the one that trained best, which improves the accuracy of its predictions.
In one embodiment, referring to FIG. 5, the server adjusting the network parameters of the initial multimodal cGAN based on the discrimination results and iteratively training the parameter-adjusted network for the preset number of iterations to obtain the trained multimodal cGAN may include:
Step S502: calculate a loss value of the initial multimodal cGAN from the discrimination results, and perform a first adjustment of its network parameters based on the loss value to obtain a first-adjusted initial multimodal cGAN.
Specifically, after obtaining a discrimination result, the server may determine whether it is real or fake. In this embodiment, when the server determines that the discriminator's output is fake, it may calculate a loss value from the obtained test map image and the satellite map image paired with the two-dimensional planar map, and perform the first adjustment of the network parameters of the initial multimodal cGAN based on that loss value.
In this embodiment, the loss value may be computed with different loss functions, for example a cross-entropy loss, a binary classification loss, or a multi-class classification loss; the application is not limited in this regard.
Step S504: determine the difference index between the predicted live-action images whose discrimination results are real, using the preset regularization function together with the corresponding generation parameters, and perform a second adjustment of the network parameters of the initial multimodal cGAN based on the difference index to obtain a second-adjusted initial multimodal cGAN.
In this embodiment, the server may further calculate, with the preset regularization function, the difference index between the predicted live-action images generated from different generation parameters, using those images and their corresponding generation parameters.
The regularization function is as shown in formula (1) above and is not repeated here.
In this embodiment, to make the differences among the live-action images output by the trained multimodal cGAN as large as possible, the server may perform the second adjustment of the network parameters of the initial multimodal cGAN according to the calculated difference index, obtaining the second-adjusted initial multimodal cGAN.
Step S506: iteratively train the first- and second-adjusted initial multimodal cGAN for the preset number of iterations to obtain the trained multimodal cGAN.
In this embodiment, the server runs the preset number of training iterations on the first- and second-adjusted initial multimodal cGAN, obtaining the trained multimodal cGAN; a combined sketch of both adjustments follows.
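A compact sketch of one training iteration combining both adjustments. The adversarial loss form, the module interfaces, the latent dimension, and the weighting lam of the difference index are all assumptions; the difference index itself is the formula (1) ratio from above, and subtracting it (weighted) from the generator loss is what drives it upward.

    import torch
    import torch.nn.functional as F

    def train_step(E, G, D, opt_g, opt_d, seg, real, z_dim=8, lam=1.0):
        """One iteration: first adjustment from the adversarial loss value,
        second adjustment from the difference index of formula (1)."""
        b = seg.size(0)
        feat = E(seg)
        z1, z2 = torch.randn(b, z_dim), torch.randn(b, z_dim)
        fake1, fake2 = G(feat, z1), G(feat, z2)

        # Discriminator: label real images 1, generated images 0
        d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(b, 1)) +
                  F.binary_cross_entropy_with_logits(D(fake1.detach()), torch.zeros(b, 1)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # First adjustment: loss value pushing generated images toward "real"
        g_adv = F.binary_cross_entropy_with_logits(D(fake1), torch.ones(b, 1))
        # Second adjustment: maximise the difference index between the two outputs
        diff = torch.mean(torch.abs(fake1 - fake2)) / (torch.mean(torch.abs(z1 - z2)) + 1e-8)
        g_loss = g_adv - lam * diff
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()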
In the above embodiment, the loss value and the difference index are calculated separately, and the initial multimodal cGAN is adjusted and iterated with each of them. This improves the accuracy of training and, in turn, the prediction accuracy of the final multimodal cGAN.
In this embodiment, the multimodal cGAN and the discriminator may be trained by cross-iterative training, which proceeds as follows. A first-generation multimodal cGAN generates poor images, and a first-generation discriminator can accurately separate the generated images from the real ones; in short, the discriminator is a binary classifier that outputs 0 for images generated by the multimodal cGAN and 1 for real images. Based on the discriminator's results, the server then trains a second-generation multimodal cGAN, which generates slightly better images, good enough that the first-generation discriminator takes them for real. The server next trains a second-generation discriminator that again accurately separates the real images from the images generated by the second-generation multimodal cGAN. By analogy, there will be third-, fourth-, ..., nth-generation multimodal cGANs and discriminators. Eventually the final discriminator cannot distinguish the images generated by the multimodal cGAN from the real images, and training is complete.
It should be understood that, although the steps in the flowcharts of FIGS. 2, 4, and 5 are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, their order of execution is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 2, 4, and 5 may include multiple sub-steps or stages, which are not necessarily performed at the same moment and may run at different moments, and which need not execute sequentially but may alternate with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 6, a semantic segmentation image conversion apparatus is provided, including an initial live-action image receiving module 100, a semantic segmentation processing module 200, and a multimodal live-action image generation module 300, wherein:
The initial live-action image receiving module 100 is configured to receive the initial live-action image acquired by the acquisition device.
The semantic segmentation processing module 200 is configured to perform semantic segmentation processing on the initial live-action image and generate the semantic segmentation image corresponding to it.
The multimodal live-action image generation module 300 is configured to input the semantic segmentation image into the multimodal cGAN and output a plurality of live-action images corresponding to it, where each of the plurality of live-action images has a different content modality, the multimodal cGAN is trained with a difference index determined by a preset regularization function, and the preset regularization function is determined from the live-action images of different content modalities and the generation parameters of the corresponding live-action images.
In one embodiment, the apparatus may further include:
A normalization processing module, configured to perform image size normalization on the initial live-action image before the semantic segmentation processing module 200 performs semantic segmentation, obtaining a size-normalized initial live-action image.
In this embodiment, the semantic segmentation processing module 200 is configured to perform semantic segmentation processing on the size-normalized initial live-action image and generate the semantic segmentation image corresponding to it.
In one embodiment, the multimodal live-action image generation module 300 may include:
A feature extraction sub-module, configured to perform feature extraction on the semantic segmentation image through the encoder in the multimodal cGAN and generate the feature map of the semantic segmentation image.
A decoding conversion sub-module, configured to decode and convert the feature map through the generator in the multimodal cGAN to obtain a plurality of live-action images corresponding to the semantic segmentation image.
In one embodiment, the decoding conversion sub-module is configured to configure a plurality of different generation parameters through the generator and to generate a plurality of live-action images corresponding to the semantic segmentation image from each generation parameter and the feature map.
In one embodiment, the apparatus may further include:
A training module, configured to train the multimodal cGAN.
In this embodiment, the training module may include:
A training set image acquisition sub-module, configured to acquire training set images.
A predicted image generation sub-module, configured to input the training set images into the constructed initial multimodal cGAN and generate the corresponding predicted live-action images based on the determined generation parameters.
An image set generation sub-module, configured to generate the image sets corresponding to the predicted live-action images from the predicted live-action images and the training set images.
A discrimination sub-module, configured to input each image set into the discriminator for real/fake discrimination and output the corresponding discrimination results.
An iterative training sub-module, configured to adjust the network parameters of the initial multimodal cGAN based on the discrimination results and to iteratively train the parameter-adjusted network for the preset number of iterations, obtaining the trained multimodal cGAN.
In one embodiment, the apparatus may further include:
A storage module, configured to store the initial multimodal cGAN of each training iteration together with its training index value after the iterative training sub-module has iteratively trained the parameter-adjusted network for the preset number of iterations.
In this embodiment, the iterative training sub-module may determine the stored initial multimodal cGAN corresponding to the highest training index value to be the trained multimodal cGAN.
In one embodiment, the iterative training sub-module may include:
A first adjustment unit, configured to calculate the loss value of the initial multimodal cGAN from the discrimination results and to perform the first adjustment of its network parameters based on the loss value, obtaining the first-adjusted initial multimodal cGAN.
A second adjustment unit, configured to determine the difference index between the predicted live-action images whose discrimination results are real, based on the preset regularization function and the corresponding generation parameters, and to perform the second adjustment of the network parameters based on the difference index, obtaining the second-adjusted initial multimodal cGAN.
An iteration unit, configured to iteratively train the first- and second-adjusted initial multimodal cGAN for the preset number of iterations, obtaining the trained multimodal cGAN.
For the specific definition of the semantic segmentation image conversion apparatus, reference may be made to the definition of the semantic segmentation image conversion method above, which is not repeated here. Each module of the apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server whose internal structure is shown in FIG. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor provides computing and control capability. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment for running the operating system and the computer program. The database stores data such as the semantic segmentation images and the generated live-action images. The network interface communicates with external terminals over a network. The computer program, when executed by the processor, implements the semantic segmentation image conversion method.
Those skilled in the art will appreciate that the structure shown in FIG. 7 is merely a block diagram of part of the structure related to the present solution and does not limit the computer device to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, a computer device is provided comprising a memory storing a computer program and a processor that when executing the computer program performs the steps of: receiving an initial live-action image acquired by acquisition equipment; carrying out semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image; inputting the semantic segmentation images into a multi-mode condition generation countermeasure network, outputting a plurality of live-action images corresponding to the semantic segmentation images, wherein the content modes of the live-action images in the live-action images are different, the multi-mode condition generation countermeasure network is generated by training a difference index determined based on a preset regulation function, and the preset regulation function is determined according to the live-action images of different content modes and the generation parameters of the live-action images corresponding to the different content modes.
In one embodiment, the processor performs the semantic segmentation processing on the initial live-action image when executing the computer program, and before generating the semantic segmented image, the following steps may be further implemented: and carrying out image size normalization processing on the initial live-action image to obtain the initial live-action image after the image size normalization processing.
In this embodiment, when the processor executes the computer program, the semantic segmentation processing is implemented on the initial live-action image, and the generating the semantic segmentation image corresponding to the initial live-action image may include: and carrying out semantic segmentation processing on the initial live-action image subjected to the image size normalization processing to generate a semantic segmentation image corresponding to the initial live-action image subjected to the image size normalization processing.
In one embodiment, when the processor executes the computer program, inputting the semantic segmentation image into the multi-modal condition generation countermeasure network and outputting the plurality of live-action images corresponding to the semantic segmentation image may include: performing feature extraction on the semantic segmentation image through an encoder in the multi-modal condition generation countermeasure network to generate a feature map corresponding to the semantic segmentation image; and decoding and converting the feature map through a generator in the multi-modal condition generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image.
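The encoder/generator split can be sketched as below, again assuming PyTorch; the layer counts, channel widths, and the way the generation parameter z is fused with the feature map are illustrative assumptions, not the embodiment's specified architecture.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Extracts a feature map from the semantic segmentation image."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch * 2, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, seg):
        return self.net(seg)

class Generator(nn.Module):
    """Decodes the feature map, conditioned on a generation parameter z."""
    def __init__(self, feat_ch=128, z_dim=8, out_ch=3):
        super().__init__()
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(feat_ch + z_dim, feat_ch // 2, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_ch // 2, out_ch, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, feat, z):
        # Broadcast z over the spatial grid and fuse it with the feature map.
        z_map = z[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        return self.decode(torch.cat([feat, z_map], dim=1))
```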
In one embodiment, when the processor executes the computer program, decoding and converting the feature map through the generator in the multi-modal condition generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image may include: configuring a plurality of different generation parameters through the generator, and generating a live-action image corresponding to the semantic segmentation image according to each generation parameter and the feature map.
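Continuing the sketch modules above, encoding once and decoding with several sampled generation parameters yields the plurality of live-action images; the batch shape and the choice of four parameters are illustrative.

```python
import torch

enc, gen = Encoder(), Generator()
seg = torch.rand(1, 3, 256, 256)        # a size-normalized segmentation image
feat = enc(seg)                          # encode once
zs = [torch.randn(1, 8) for _ in range(4)]   # four different generation parameters
images = [gen(feat, z) for z in zs]      # four live-action images, one content mode each
```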
In one embodiment, when the processor executes the computer program, the manner of generating the multi-modal condition generation countermeasure network may include: acquiring training set images; inputting the training set images into a constructed initial multi-modal condition generation countermeasure network, and generating corresponding predicted live-action images based on determined generation parameters; generating, according to each predicted live-action image and the training set images, an image set corresponding to each predicted live-action image; inputting each image set into a discriminator for true-false discrimination, and outputting a corresponding discrimination result; and adjusting network parameters of the initial multi-modal condition generation countermeasure network based on each discrimination result, and iteratively training the parameter-adjusted initial multi-modal condition generation countermeasure network for a preset number of iterations to obtain the trained multi-modal condition generation countermeasure network.
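One adversarial training iteration of this procedure might look as follows. The use of binary cross-entropy for true-false discrimination, the optimizer handling, and feeding the discriminator single images rather than the embodiment's image sets are common-practice simplifications, not details fixed by the patent.

```python
import torch
import torch.nn.functional as F

def train_step(seg, real, enc, gen, disc, opt_g, opt_d, z_dim=8):
    """One iteration: discriminate true/false, then adjust the generator side."""
    z = torch.randn(real.size(0), z_dim)
    fake = gen(enc(seg), z)                          # predicted live-action image

    # Discriminator update: real images labeled true, predicted ones false.
    opt_d.zero_grad()
    real_logits = disc(real)
    fake_logits = disc(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_loss.backward()
    opt_d.step()

    # Generator/encoder update: try to make the discriminator output "true".
    opt_g.zero_grad()
    g_logits = disc(fake)
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```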
In one embodiment, after iteratively training the parameter-adjusted initial multi-modal condition generation countermeasure network for the preset number of iterations, the processor may further implement the following step when executing the computer program: storing the initial multi-modal condition generation countermeasure network of each iteration of training together with its corresponding training index value.

In this embodiment, obtaining the trained multi-modal condition generation countermeasure network may include: determining the initial multi-modal condition generation countermeasure network corresponding to the highest training index value as the trained multi-modal condition generation countermeasure network.
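A sketch of this store-and-select step; `index_value` stands in for the embodiment's unspecified training index value (for example, a validation score), and the file layout is an assumption.

```python
import torch

def save_checkpoint(epoch, enc, gen, index_value, history):
    """Store this iteration's network together with its training index value."""
    path = f"checkpoint_epoch{epoch}.pt"
    torch.save({"encoder": enc.state_dict(),
                "generator": gen.state_dict(),
                "index": index_value}, path)
    history.append((index_value, path))

def best_checkpoint(history):
    """The network with the highest training index value is the trained network."""
    return max(history, key=lambda item: item[0])[1]
```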
In one embodiment, when the processor executes the computer program, adjusting the network parameters of the initial multi-modal condition generation countermeasure network based on each discrimination result and iteratively training the parameter-adjusted network for the preset number of iterations to generate the trained multi-modal condition generation countermeasure network may include: calculating a loss value of the initial multi-modal condition generation countermeasure network according to each discrimination result, and performing a first adjustment of the network parameters based on the loss value to obtain a first-adjusted initial multi-modal condition generation countermeasure network; determining the difference index between the predicted live-action images based on the preset regulation function, according to the predicted live-action images whose discrimination results are true and the corresponding generation parameters, and performing a second adjustment of the network parameters based on the difference index to obtain a second-adjusted initial multi-modal condition generation countermeasure network; and iteratively training the first- and second-adjusted initial multi-modal condition generation countermeasure network for the preset number of iterations to obtain the trained multi-modal condition generation countermeasure network.
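The second adjustment can be read as a mode-seeking style regularizer: the difference index grows when distinct generation parameters produce distinct images, and training makes it as large as possible. The L1-ratio form below is an assumption consistent with how the regulation function is described here, not necessarily the patent's exact formula; minimizing the negated index maximizes it.

```python
import torch

def difference_index(img1, img2, z1, z2, eps=1e-6):
    """Mean image distance per unit distance in the generation parameters."""
    num = torch.mean(torch.abs(img1 - img2))       # how different the two modes look
    den = torch.mean(torch.abs(z1 - z2)) + eps     # how different the two z's are
    return num / den

def second_adjustment_loss(enc, gen, seg, z_dim=8):
    """Negated difference index; gradient descent then maximizes the index."""
    feat = enc(seg)
    z1 = torch.randn(seg.size(0), z_dim)
    z2 = torch.randn(seg.size(0), z_dim)
    return -difference_index(gen(feat, z1), gen(feat, z2), z1, z2)
```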
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, implements the following steps: receiving an initial live-action image acquired by an acquisition device; performing semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image; and inputting the semantic segmentation image into a multi-modal condition generation countermeasure network and outputting a plurality of live-action images corresponding to the semantic segmentation image, wherein the content modes of the respective live-action images differ from one another, the multi-modal condition generation countermeasure network is trained based on a difference index determined by a preset regulation function, and the preset regulation function is determined according to live-action images of different content modes and the generation parameters of the live-action images corresponding to the different content modes.
In one embodiment, when executed by the processor, the computer program may further implement the following step before the semantic segmentation processing is performed on the initial live-action image to generate the semantic segmentation image: performing image size normalization processing on the initial live-action image to obtain an initial live-action image after the image size normalization processing.

In this embodiment, when the computer program is executed by the processor, performing the semantic segmentation processing on the initial live-action image to generate the semantic segmentation image corresponding to the initial live-action image may include: performing semantic segmentation processing on the image-size-normalized initial live-action image to generate a semantic segmentation image corresponding to the image-size-normalized initial live-action image.
In one embodiment, when the computer program is executed by the processor, inputting the semantic segmentation image into the multi-modal condition generation countermeasure network and outputting the plurality of live-action images corresponding to the semantic segmentation image may include: performing feature extraction on the semantic segmentation image through an encoder in the multi-modal condition generation countermeasure network to generate a feature map corresponding to the semantic segmentation image; and decoding and converting the feature map through a generator in the multi-modal condition generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image.

In one embodiment, when the computer program is executed by the processor, decoding and converting the feature map through the generator in the multi-modal condition generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image may include: configuring a plurality of different generation parameters through the generator, and generating a live-action image corresponding to the semantic segmentation image according to each generation parameter and the feature map.

In one embodiment, when the computer program is executed by the processor, the manner of generating the multi-modal condition generation countermeasure network may include: acquiring training set images; inputting the training set images into a constructed initial multi-modal condition generation countermeasure network, and generating corresponding predicted live-action images based on determined generation parameters; generating, according to each predicted live-action image and the training set images, an image set corresponding to each predicted live-action image; inputting each image set into a discriminator for true-false discrimination, and outputting a corresponding discrimination result; and adjusting network parameters of the initial multi-modal condition generation countermeasure network based on each discrimination result, and iteratively training the parameter-adjusted initial multi-modal condition generation countermeasure network for a preset number of iterations to obtain the trained multi-modal condition generation countermeasure network.
In one embodiment, after iteratively training the parameter-adjusted initial multi-modal condition generation countermeasure network for the preset number of iterations, the computer program, when executed by the processor, may further implement the following step: storing the initial multi-modal condition generation countermeasure network of each iteration of training together with its corresponding training index value.

In this embodiment, obtaining the trained multi-modal condition generation countermeasure network may include: determining the initial multi-modal condition generation countermeasure network corresponding to the highest training index value as the trained multi-modal condition generation countermeasure network.

In one embodiment, when the computer program is executed by the processor, adjusting the network parameters of the initial multi-modal condition generation countermeasure network based on each discrimination result and iteratively training the parameter-adjusted network for the preset number of iterations to generate the trained multi-modal condition generation countermeasure network may include: calculating a loss value of the initial multi-modal condition generation countermeasure network according to each discrimination result, and performing a first adjustment of the network parameters based on the loss value to obtain a first-adjusted initial multi-modal condition generation countermeasure network; determining the difference index between the predicted live-action images based on the preset regulation function, according to the predicted live-action images whose discrimination results are true and the corresponding generation parameters, and performing a second adjustment of the network parameters based on the difference index to obtain a second-adjusted initial multi-modal condition generation countermeasure network; and iteratively training the first- and second-adjusted initial multi-modal condition generation countermeasure network for the preset number of iterations to obtain the trained multi-modal condition generation countermeasure network.
Those skilled in the art will appreciate that all or part of the processes in the above-described method embodiments may be implemented by a computer program; the computer program may be stored on a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this description.
The above examples represent only a few embodiments of the application; their description is specific and detailed, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, and these all fall within the protection scope of the application. Accordingly, the scope of protection of the present application is determined by the appended claims.

Claims (10)

1. A semantic segmentation image conversion method, the method comprising:
receiving an initial live-action image acquired by an acquisition device;
performing semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image;
inputting the semantic segmentation image into a multi-modal condition generation countermeasure network and outputting a plurality of live-action images corresponding to the semantic segmentation image, wherein the content mode of each live-action image in the plurality of live-action images is different, the multi-modal condition generation countermeasure network is trained based on a difference index determined by a preset regulation function, the value of the difference index determined by the regulation function is made as large as possible during training, and the preset regulation function is determined according to live-action images of different content modes and the generation parameters of the live-action images corresponding to the different content modes;
the regulation function adopts the following formula:

L_reg = ||G(E(X), Z1) − G(E(X), Z2)|| / ||Z1 − Z2||

wherein G(E(X), Z1) is the live-action image output corresponding to the generation parameter Z1, and G(E(X), Z2) is the live-action image output corresponding to the generation parameter Z2.
2. The method of claim 1, wherein before the performing semantic segmentation processing on the initial live-action image to generate a semantic segmentation image, the method further comprises:
performing image size normalization processing on the initial live-action image to obtain an initial live-action image after the image size normalization processing;
and wherein the performing semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image comprises:
performing semantic segmentation processing on the image-size-normalized initial live-action image to generate a semantic segmentation image corresponding to the image-size-normalized initial live-action image.
3. The method of claim 1, wherein the inputting the semantic segmentation image into a multi-modal condition generation countermeasure network and outputting a plurality of live-action images corresponding to the semantic segmentation image comprises:
performing feature extraction on the semantic segmentation image through an encoder in the multi-modal condition generation countermeasure network to generate a feature map corresponding to the semantic segmentation image;
and decoding and converting the feature map through a generator in the multi-modal condition generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image.
4. The method of claim 3, wherein the decoding and converting the feature map through the generator in the multi-modal condition generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image comprises:
configuring a plurality of different generation parameters through the generator, and generating a live-action image corresponding to the semantic segmentation image according to each generation parameter and the feature map.
5. The method of claim 1, wherein the manner of generating the multi-modal condition generation countermeasure network comprises:
acquiring training set images;
inputting the training set images into a constructed initial multi-modal condition generation countermeasure network, and generating corresponding predicted live-action images based on determined generation parameters;
generating, according to each predicted live-action image and the training set images, an image set corresponding to each predicted live-action image;
inputting each image set into a discriminator for true-false discrimination, and outputting a corresponding discrimination result;
and adjusting network parameters of the initial multi-modal condition generation countermeasure network based on each discrimination result, and iteratively training the parameter-adjusted initial multi-modal condition generation countermeasure network for a preset number of iterations to obtain the trained multi-modal condition generation countermeasure network.
6. The method of claim 5, further comprising, after the iteratively training the parameter-adjusted initial multi-modal condition generation countermeasure network for the preset number of iterations:
storing the initial multi-modal condition generation countermeasure network of each iteration of training together with its corresponding training index value;
wherein obtaining the trained multi-modal condition generation countermeasure network comprises:
determining the initial multi-modal condition generation countermeasure network corresponding to the highest training index value as the trained multi-modal condition generation countermeasure network.
7. The method of claim 5, wherein the adjusting the network parameters of the initial multi-modal condition generation countermeasure network based on each discrimination result and iteratively training the parameter-adjusted initial multi-modal condition generation countermeasure network for the preset number of iterations to generate the trained multi-modal condition generation countermeasure network comprises:
calculating a loss value of the initial multi-modal condition generation countermeasure network according to each discrimination result, and performing a first adjustment of the network parameters of the initial multi-modal condition generation countermeasure network based on the loss value to obtain a first-adjusted initial multi-modal condition generation countermeasure network;
determining the difference index between the predicted live-action images based on the preset regulation function, according to the predicted live-action images whose discrimination results are true and the corresponding generation parameters, and performing a second adjustment of the network parameters of the initial multi-modal condition generation countermeasure network based on the difference index to obtain a second-adjusted initial multi-modal condition generation countermeasure network;
and iteratively training the first- and second-adjusted initial multi-modal condition generation countermeasure network for the preset number of iterations to obtain the trained multi-modal condition generation countermeasure network.
8. A semantic segmentation image conversion apparatus, the apparatus comprising:
an initial live-action image receiving module, configured to receive an initial live-action image acquired by an acquisition device;
a semantic segmentation processing module, configured to perform semantic segmentation processing on the initial live-action image and generate a semantic segmentation image corresponding to the initial live-action image;
a multi-modal live-action image generation module, configured to input the semantic segmentation image into a multi-modal condition generation countermeasure network and output a plurality of live-action images corresponding to the semantic segmentation image, wherein the content mode of each live-action image in the plurality of live-action images is different, the multi-modal condition generation countermeasure network is trained based on a difference index determined by a preset regulation function, the value of the difference index determined by the regulation function is made as large as possible during training, and the preset regulation function is determined according to live-action images of different content modes and the generation parameters of the live-action images corresponding to the different content modes;
the regulation function adopts the following formula:

L_reg = ||G(E(X), Z1) − G(E(X), Z2)|| / ||Z1 − Z2||

wherein G(E(X), Z1) is the live-action image output corresponding to the generation parameter Z1, and G(E(X), Z2) is the live-action image output corresponding to the generation parameter Z2.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202011321375.XA 2020-11-23 2020-11-23 Semantic segmentation image conversion method, device, computer equipment and storage medium Active CN112614199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011321375.XA CN112614199B (en) 2020-11-23 2020-11-23 Semantic segmentation image conversion method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112614199A CN112614199A (en) 2021-04-06
CN112614199B (en) 2024-09-20

Family

ID=75225322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011321375.XA Active CN112614199B (en) 2020-11-23 2020-11-23 Semantic segmentation image conversion method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112614199B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113706645A (en) * 2021-06-30 2021-11-26 酷栈(宁波)创意科技有限公司 Information processing method for landscape painting

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934198A (en) * 2019-03-22 2019-06-25 北京市商汤科技开发有限公司 Face identification method and device
GB201910114D0 (en) * 2019-07-15 2019-08-28 Facesoft Ltd Generating three-dimensional facial data
CN110544239A (en) * 2019-08-19 2019-12-06 中山大学 Multi-modal MRI conversion method, system and medium for generating countermeasure network based on conditions
CN110969634A (en) * 2019-11-29 2020-04-07 国网湖北省电力有限公司检修公司 Infrared image power equipment segmentation method based on generation countermeasure network
CN111325212A (en) * 2020-02-18 2020-06-23 北京奇艺世纪科技有限公司 Model training method and device, electronic equipment and computer readable storage medium
CN111667483A (en) * 2020-07-03 2020-09-15 腾讯科技(深圳)有限公司 Training method of segmentation model of multi-modal image, image processing method and device

Also Published As

Publication number Publication date
CN112614199A (en) 2021-04-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant