
CN116823647A - Image complement method based on fast Fourier transform and selective attention mechanism - Google Patents


Info

Publication number
CN116823647A
Authority
CN
China
Prior art keywords
image
fourier transform
convolution
fast fourier
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310661898.6A
Other languages
Chinese (zh)
Inventor
胡靖
吴颖超
陶宇可
张红湖
杨飞扬
刘李骏坤
邓中臣
吴锡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN202310661898.6A
Publication of CN116823647A
Legal status: Pending

Classifications

    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an image completion method based on deep learning. It provides an efficient completion network with a built-in Fourier transform operation, which establishes long-range dependencies while keeping the model efficient. A fast Fourier transform module carrying an aggregated selective convolution kernel network is embedded in the completion network: the fast Fourier transform module gives the completion network a global receptive field, and the aggregated selective convolution kernel network adaptively adjusts the weight of each scale according to its contribution, supplying the model with more accurate local information. The result is image completion with higher global coherence and richer local detail. Compared with existing completion methods, the images generated by the proposed method show better fidelity and sharpness, and in particular retain local semantic coherence and global fit at high occlusion rates.

Description

Image complement method based on fast Fourier transform and selective attention mechanism
Technical Field
The invention relates to the field of image processing, and in particular to an image completion method based on the fast Fourier transform and a selective attention mechanism.
Background
The image completion task infers missing content from the available information and a model when part of the image data is lost, so as to generate a complete image. Image completion based on a fast Fourier transform module and an aggregated selective convolution network can be applied in many fields, such as document restoration, object removal and criminal investigation. In cultural-relic work, the technology can repair damaged pictures, improving restoration quality and fidelity. In object removal, it supports intelligent video analysis and image recovery, improving the real-time performance and accuracy of monitoring systems. In criminal investigation, it can repair evidence photographs and complete faces, assisting case work.
With the development of technology and the spread of artificial intelligence, the technology has very broad application prospects across these fields. In cultural-relic restoration, it can raise the efficiency and quality of repairs, lowering the difficulty of restoring relics with large damaged areas; in object removal, it can improve the removal result, ensuring the image remains globally coherent after the target is removed; in criminal investigation, it can reduce errors when recovering evidence and improve the efficiency of the investigative process. The image completion technology based on the fast Fourier transform and a selective attention mechanism therefore has broad application prospects and market demand.
Image completion estimates missing content from the available information and a model, producing a complete image. Among the techniques involved, the fast Fourier transform and selective attention mechanisms are two key technologies widely used in the field of image completion at present.
The fast Fourier transform (Fast Fourier Transform, FFT) is an efficient algorithm for computing the discrete Fourier transform, which converts a signal from one domain to another. In this method, the fast Fourier transform maps spatial features to the spectral domain, performs an efficient global update on the resulting spectral data, and finally converts the processed data back into spatial features. Relative to a convolution operation, the fast Fourier transform moves the image between the Fourier (i.e., frequency) domain and the spatial domain, and the spectral conversion updates all pixels at once. It therefore expands the receptive field of a network layer to every pixel of the image, i.e., a global receptive field, raising the degree of information association among all pixels and letting distant pixels exchange information efficiently. The technique is thus well suited to image completion, where global agreement of the result is the goal.
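As a concrete illustration, the spectral round trip described above can be sketched in a few lines of PyTorch (the framework is an assumption of this description; the patent does not name one):

```python
import torch

x = torch.randn(1, 64, 32, 32)           # (batch, channels, H, W) feature map

# Map spatial features to the spectral domain. For real input, rfft2 keeps
# only the non-redundant half of the spectrum along the last axis.
spec = torch.fft.rfft2(x, norm="ortho")   # complex tensor of shape (1, 64, 32, 17)

# Any pointwise update applied here mixes information from every spatial
# position at once; this is what gives the layer a global receptive field.
spec = spec * 1.0                         # placeholder for a learned spectral update

# Map the processed spectrum back to spatial features.
y = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
assert torch.allclose(x, y, atol=1e-5)    # exact round trip when no update is applied
```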
Attention mechanisms weight the input information in a neural network, focusing computation on the parts most useful to the current task. An attention mechanism serves two functional goals: deciding which parts of the input image need attention, and allocating limited processing resources to the important parts of the object being processed. By paying differentiated attention to key information and ignoring non-key information, it improves the operation of the network; extensive experiments have shown that image completion algorithms with an attention mechanism achieve better completion efficiency and quality, so the present invention also introduces one. In addition, to improve the completion network's resistance to interference, the invention adopts an aggregated selective convolution kernel network as its attention module. This module weights the features processed by convolution kernels of different scales to adaptively generate the corresponding kernel combination, adjusting the network receptive field so that the network adapts to changes in the scale of the input image and of the missing region. Such an attention mechanism handles the complex inputs of the image completion task well.
In the field of image completion, attention mechanisms can strengthen the network's focus on the missing parts and complete them from the available information and the model. Besides selective attention, there are other completion techniques, such as interpolation-based and sparse-coding-based methods, which are also widely used in the field.
The prior art has the following shortcomings:
1. Current deep-learning completion methods achieve better results than traditional methods but mostly neglect to establish long-range information interaction among pixels, so they perform poorly on large missing areas and the generated images lack global coherence.
2. Schemes that retain image detail by fusing features of different scales ignore the differing contributions of those scales, so model accuracy is insufficient and the generated results contain considerable artifacts.
Disclosure of Invention
Aiming at the defects of the prior art, an image completion method based on the fast Fourier transform and a selective attention mechanism is provided. The method builds an efficient completion network with a built-in Fourier transform operation and embeds in it a fast Fourier transform module carrying an aggregated selective convolution kernel network: the fast Fourier transform module gives the completion network a global receptive field, while the aggregated selective convolution kernel network adaptively adjusts the weight of each scale according to its contribution, helping the model retain detail information and strengthening its robustness to changes in input scale. The method comprises the following steps:
Step 1: prepare the image completion datasets, namely the CelebA-HQ dataset and the Places2 dataset;
Step 2: randomly extract a set number of images from the CelebA-HQ and Places2 datasets as the training set, with the remaining images as the validation set;
Step 3: randomly generate irregular masks for the pictures in the training and validation sets to obtain masked images;
Step 4: input the training images into the completion network for training. The backbone of the completion network is a generative adversarial network comprising a generator and a patch discriminator; the generator is a stack of convolution modules, deconvolution modules, multi-scale convolution fusion modules and fast Fourier transform modules, and the discriminator is a patch discriminator;
the generator takes as input the concatenation of the masked image and the binary mask marking the missing region, and outputs the completed image. The generator produces the completed image through the following steps:
Step 41: the input is encoded into a low-dimensional space by a stack of three plain convolution layers; this reduces the dimensionality of the input and yields a feature vector carrying the important information of the input image;
Step 42: the feature vector encoded in step 41 is sent to a multi-scale convolution fusion module, which splits a plain convolution into four dilated sub-convolutions with dilation rates of 1, 2, 4 and 8 and then integrates the feature information of the several kernels with a plain convolution;
Step 43: the feature information integrated by the multi-scale fusion module is input into a combination of at least three stacked fast Fourier transform modules, each of which embeds an aggregated selective convolution kernel network;
Step 431: the feature information entering the fast Fourier transform module first passes through a 3×3 convolution layer and is then mapped to the frequency domain by a real Fourier transform module;
Step 432: a 3×3 convolution is then applied and the features enter the aggregated selective convolution kernel network, whose built-in mechanism for dynamically fusing multiple receptive fields supplies local information to the fast Fourier transform module while dynamically assigning receptive-field weights, helping the generator adapt to changes in the scale of the input image;
Step 433: the features processed by the aggregated selective convolution kernel network enter a real inverse Fourier transform module, which restores them to the spatial structure; a skip connection adds these spatial-structure features element-wise to the input features of the fast Fourier transform module, and a 1×1 convolution then yields a real-valued feature map;
Step 434: the real-valued feature map produced by the first fast Fourier transform module is passed in turn through the following fast Fourier transform modules, repeating the operations of steps 431 to 433, to obtain the final real-valued feature map;
Step 44: the final real-valued feature map obtained in step 43 is input into a multi-scale convolution fusion module to obtain a fused feature map;
Step 45: finally, several deconvolution layers decode the fused feature map into the completed generated image, which serves as the output of the generator;
Step 46: the patch discriminator applies three convolution layers to the generated image to obtain a score matrix, with which it distinguishes generated images from real ones, and feeds parameters back to the generator to guide it towards images closer to the real ones; training ends when the patch discriminator can no longer distinguish real images from generated ones.
Compared with the prior art, the invention has the following beneficial effects:
1. Inspired by the parameter efficiency of the fast Fourier transform module, the invention provides an efficient completion network with a built-in fast Fourier transform operation, which establishes long-range dependencies while keeping the model efficient.
2. An improved aggregated selective convolution kernel network is added to the fast Fourier transform module; it adjusts the weight of each scale according to its contribution, supplying the model with more accurate local information, and finally achieves image completion with higher global coherence and richer local detail.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the image completion network model of the present invention;
FIG. 2 is a schematic diagram of the architecture of a multi-scale convolution fusion module of the present invention;
FIG. 3 is a schematic diagram of the structure of the fast Fourier transform module of the present invention;
FIG. 4 is a schematic diagram of the structure of the aggregated selective convolution kernel network of the invention;
FIG. 5 shows experimental results with a regular mask applied to the Celeb-A dataset;
FIG. 6 is an experimental result of applying an irregular mask to a Celeb-A dataset;
FIG. 7 is a comparison of image restoration results after large-area damage.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
The following detailed description refers to the accompanying drawings.
The invention aims to improve the quality and efficiency of image completion by combining the fast Fourier transform with an aggregated selective convolution kernel network, addressing the prior art's slow processing of large high-resolution images and its poor completion of complex textures and structures. The method restores the details and structure of the original image more accurately, improving the visual and practical effect of image completion and giving completion based on the Fourier transform and a selective attention mechanism broader application prospects and market value.
Aiming at the defects of the prior art, the image completion method based on the fast Fourier transform and a selective attention mechanism provided by the invention builds an efficient completion network with a built-in Fourier transform operation and embeds in it a fast Fourier transform module carrying an aggregated selective convolution kernel network: the fast Fourier transform module gives the completion network a global receptive field, while the selective convolution kernel network adaptively adjusts the weight of each scale feature according to its contribution, helping the model retain detail and strengthening its robustness to changes in input scale. The specific method comprises the following steps:
step 1: image complement datasets were prepared, the Cele-A HQ dataset and the Place2 dataset, respectively.
Step 2: randomly extracting a set number of images in the Cele-A HQ data set and the Place2 data set to serve as a training set, and taking the rest images as verification sets.
Specifically, 20000 person pictures in the Celeb-a HQ dataset are used as a training set, 5000 pictures are selected from the rest of pictures as a test set, 800 ten thousand randomly selected pictures are used as the training set in the plane 2 dataset, and the rest 2 ten thousand pictures are used as the test set.
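As a small illustration of this split, a minimal Python sketch (the directory path and file extension are hypothetical; the counts come from the description above):

```python
import random
from pathlib import Path

# Hypothetical location of the CelebA-HQ images.
files = sorted(Path("celeba_hq").glob("*.jpg"))

random.seed(0)                     # fixed seed so the split is reproducible
random.shuffle(files)

train_files = files[:20000]        # 20,000 training images
test_files = files[20000:25000]    # 5,000 of the remaining images for testing
```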
Step 3: randomly generate irregular masks for the pictures in the training and validation sets to obtain masked images.
Specifically, 5,000 irregular masks are generated for training the model, and 1,000 masks randomly selected from the remainder serve as the mask input during testing.
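The patent does not spell out how the irregular masks are drawn. One common free-form approach, sketched here as an assumption (NumPy and OpenCV), draws random thick polyline strokes:

```python
import numpy as np
import cv2  # OpenCV, assumed here for drawing the strokes

def random_irregular_mask(h=256, w=256, max_strokes=8, rng=None):
    """Return an h x w uint8 mask in which 1 marks missing pixels.
    A hypothetical generator; the patent does not specify the procedure."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), np.uint8)
    for _ in range(int(rng.integers(1, max_strokes + 1))):
        x, y = int(rng.integers(0, w)), int(rng.integers(0, h))
        for _ in range(int(rng.integers(4, 16))):        # segments per stroke
            angle = rng.uniform(0.0, 2.0 * np.pi)
            length = int(rng.integers(10, 60))
            nx = int(np.clip(x + length * np.cos(angle), 0, w - 1))
            ny = int(np.clip(y + length * np.sin(angle), 0, h - 1))
            cv2.line(mask, (x, y), (nx, ny), 1, thickness=int(rng.integers(5, 25)))
            x, y = nx, ny
    return mask

masks = [random_irregular_mask() for _ in range(5000)]   # masks for training
```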
Step 4: input the training images into the completion network for training. The backbone of the completion network is a generative adversarial network. The model comprises a generator and a patch discriminator: the generator is a highly modular deep network stacked from convolution modules, deconvolution modules, the multi-scale convolution fusion module MSFM (Multi-scale Convolution Fusion Module) and the fast Fourier transform module FSKAM (Fast Fourier Transform And Selective Kernels Attention Mechanism), while the discriminator is the same patch discriminator as in PatchGAN, which is well suited to image data.
FIG. 1 illustrates the overall framework of the image completion network model of the present invention. The generator adopts an encoder-decoder framework. The completion network first applies a three-layer convolution to the input, which is the mask and the damaged image concatenated along the channel dimension, obtaining a high-dimensional feature. The network then passes this high-dimensional feature to the bottleneck layer, where multiple core modules are stacked. The multi-scale convolution fusion module is embedded as a core module at the front and the end of the bottleneck layer, and the middle of the bottleneck layer is a series of fast Fourier transform modules with built-in selective attention modules. The fast Fourier transform module mainly executes the fast Fourier transform operation, and a single operation has high parameter efficiency; the model can therefore moderately increase the number of fast Fourier transform modules at a predictable computational cost. In addition, each fast Fourier selective attention module embeds an aggregated selective convolution kernel network, ASKNET (Aggregative Selective Kernels Network). Finally, the network applies three deconvolution operations to the features processed by the bottleneck layer, decoding them into an output with the same scale as the real image. The model trains the network with an adversarial loss, a perceptual loss, a style loss and an L1 loss, forcing it to learn a more accurate mapping. The adversarial loss shapes the training of the generator and discriminator, improving the quality of the generator's images. The perceptual loss describes the difference between two pictures as perceived by human vision: the high-level abstraction of convolution layers is used to extract high-dimensional features of the images, and penalizing the difference between the features of the generated and real images forces the generated image to approach the real one in texture and structure. The main effect of the style loss is to force the generated image to approach the real image in color and texture. The L1 loss is the absolute difference between corresponding pixels of the generated and real images and reduces the discrepancy between their pixel values.
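A minimal PyTorch sketch of the four-term objective (the loss weights and the hinge-style adversarial term are illustrative assumptions, not values stated in the patent; vgg_feats_fake and vgg_feats_real stand for pre-extracted VGG feature lists):

```python
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram matrix used by the style loss."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def generator_loss(fake, real, d_fake_logits, vgg_feats_fake, vgg_feats_real,
                   w_l1=1.0, w_adv=0.1, w_perc=0.1, w_style=250.0):
    """Weighted sum of L1, adversarial, perceptual and style terms.
    The weights are illustrative, not values given in the patent."""
    l1 = F.l1_loss(fake, real)                        # pixel-wise absolute difference
    adv = -d_fake_logits.mean()                       # generator side of a hinge GAN loss
    perc = sum(F.l1_loss(a, b)                        # feature-space (perceptual) distance
               for a, b in zip(vgg_feats_fake, vgg_feats_real))
    style = sum(F.l1_loss(gram(a), gram(b))           # color/texture statistics
                for a, b in zip(vgg_feats_fake, vgg_feats_real))
    return w_l1 * l1 + w_adv * adv + w_perc * perc + w_style * style
```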
The generator produces the completed image through the following steps:
Step 41: the input is encoded into a low-dimensional space by a stack of three plain convolution layers; this reduces the dimensionality of the input and yields a feature vector carrying the important information of the input image.
Step 42: the feature vector encoded in step 41 is sent to a multi-scale convolution fusion module, which splits a plain convolution into four dilated sub-convolutions with dilation rates of 1, 2, 4 and 8 and then integrates the feature information of the several kernels with a plain convolution. This module helps the generator obtain more long-range context, enriching the detail of the generated image and strengthening the plausibility of the generated content.
FIG. 2 shows the structure of the multi-scale convolution fusion module, which adopts a multi-scale feature fusion mechanism. To reduce parameters, the module places four dilated convolutions, with dilation rates of 1, 2, 4 and 8, on its four branches. Each dilated convolution outputs a quarter of the number of input feature channels. The outputs are then concatenated along the channel dimension, and a 3×3 convolution integrates the features of the multi-scale kernels, as sketched below.
The dilated convolutions with larger dilation rates gather context over wider areas, while those with smaller rates gather context over smaller areas. The multi-scale kernels can therefore make fuller use of pixel information at both the near and far ends of the missing region, helping the network integrate features from different levels and ultimately refining the completed image.
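A minimal PyTorch sketch of this module as described (the class name and channel convention are assumptions; the channel count is assumed divisible by 4):

```python
import torch
import torch.nn as nn

class MSFM(nn.Module):
    """Multi-scale convolution fusion module: four dilated 3x3 branches,
    each emitting a quarter of the channels, then concatenation and 3x3 fusion."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels // 4, 3, padding=d, dilation=d)
            for d in (1, 2, 4, 8)           # the four dilation rates from the text
        ])
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # four receptive fields
        return self.fuse(torch.cat(feats, dim=1))         # concat, then 3x3 integration
```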
Step 43: the feature information processed by the multi-scale fusion module is input into a combination of at least three stacked fast Fourier transform modules. It should be emphasized that each fast Fourier transform module embeds an aggregated selective convolution kernel network (Aggregative Selective Kernels Network, ASKNET). The fast Fourier transform module helps the generator establish long-range dependencies among pixels, ensuring the global coherence of the generated image, while the aggregated selective convolution kernel network improves the generator's receptive-field adaptivity.
FIG. 3 is a schematic diagram of the fast-Fourier-transform-based selective attention module. It first extracts features with a 3×3 convolution, then applies batch normalization and an activation. A real fast Fourier transform is then applied to the processed features to obtain real-valued features: unlike the complex fast Fourier transform, the real transform keeps only the non-redundant half of the spectrum, and the real and imaginary parts of the tensor are then fused and converted into a real-valued representation. These real-valued features are fed into the aggregated selective convolution kernel network ASKNET, which selectively assigns different weights according to the contribution of each convolution kernel, dynamically adjusting the receptive field. Finally, a real inverse fast Fourier transform restores the resulting features to the spatial structure.
Step 431: the feature information entering the fast Fourier transform module first passes through a 3×3 convolution layer and is then mapped to the frequency domain by the real Fourier transform module.
Step 432: a 3×3 convolution is then applied and the features enter the aggregated selective convolution kernel network ASKNET. Its built-in mechanism for dynamically fusing multiple receptive fields supplies local information to the fast Fourier transform module FSKAM, while dynamically assigning receptive-field weights to help the generator adapt to changes in the scale of the input image.
FIG. 4 shows the structure of the aggregated selective convolution kernel network proposed by the invention. A selective kernel network generates convolution kernels of different importance for different images, and this dynamic-learning design helps the network adapt to the scale of the input image and to changes in the scale of targets within it; for image completion this means adapting both to the scale of the whole damaged image and to changes in the missing target. It should be noted in particular that this network replaces the multi-scale feature-fusion approach of the original selective kernel network with a feature aggregation operation. As shown in FIG. 4, the feature map is first fed into four dilated convolution layers split from a plain convolution, with dilation rates of 1, 2, 4 and 8 respectively. These four layers produce four features with different receptive fields, which are added element-wise to obtain a multi-scale fused feature. This feature is passed to a fusion module: an average pooling layer produces a 1×c feature vector, a fully connected layer compresses it to 1×(c/r), and four fully connected layers then expand it back into four 1×c feature vectors. These four vectors are assigned to the four selectivity modules, and the features of the original four dilated convolution layers are multiplied element-wise by the four expanded vectors to obtain four weighted feature maps. Each of the four feature maps of different scales is reduced along the channel dimension by a 1×1 convolution to a quarter of its original channel count; the features are then concatenated along the channel dimension and aggregated with a 1×1 convolution. Compared with direct addition, this design retains detail information more effectively while keeping the network parameter-efficient. A sketch follows.
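A minimal PyTorch sketch of ASKNET following the description above (the softmax over branches follows the SKNet convention, which the text does not state explicitly; c is assumed divisible by 4 and by r):

```python
import torch
import torch.nn as nn

class ASKNET(nn.Module):
    """Aggregated selective kernel network (sketch). c: channels, r: reduction ratio."""
    def __init__(self, c, r=8):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(c, c, 3, padding=d, dilation=d) for d in (1, 2, 4, 8)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)                 # 1 x c descriptor
        self.compress = nn.Linear(c, c // r)                # squeeze to c/r
        self.expand = nn.ModuleList([nn.Linear(c // r, c) for _ in range(4)])
        self.reduce = nn.ModuleList([nn.Conv2d(c, c // 4, 1) for _ in range(4)])
        self.aggregate = nn.Conv2d(c, c, 1)                 # final 1x1 aggregation

    def forward(self, x):
        feats = [b(x) for b in self.branches]               # four receptive fields
        fused = torch.stack(feats).sum(dim=0)               # element-wise addition
        z = self.compress(self.pool(fused).flatten(1))      # (B, c/r) vector
        attn = torch.stack([fc(z) for fc in self.expand])   # (4, B, c) selectivity vectors
        attn = torch.softmax(attn, dim=0)[..., None, None]  # weights across branches
        weighted = [f * a for f, a in zip(feats, attn)]     # four weighted feature maps
        reduced = [conv(f) for conv, f in zip(self.reduce, weighted)]  # to c/4 each
        return self.aggregate(torch.cat(reduced, dim=1))    # concat + 1x1 fusion
```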
Step 433: the features processed by the aggregated selective convolution kernel network module ASKNET enter a real inverse Fourier transform module, which restores them to the spatial structure; a skip connection adds these spatial-structure features element-wise to the input features of the fast Fourier transform module FSKAM, and a 1×1 convolution then yields a real-valued feature map.
Step 434: the feature map produced by the first fast Fourier transform module is passed in turn through the following fast Fourier transform modules, repeating the operations of steps 431 to 433 to obtain the final real-valued feature map.
This fast Fourier transform operation is repeated three times to strengthen the global integration of the features. The above process is the feature-processing stage of the fast Fourier transform module combination; the resulting feature map then enters the multi-scale fusion module again, which helps the network retain more detail.
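Putting steps 431 to 433 together, a minimal PyTorch sketch of one fast Fourier transform module (it reuses the ASKNET class sketched above; the class name and layer widths are assumptions):

```python
import torch
import torch.nn as nn

class FFTModule(nn.Module):
    """One fast Fourier transform module with an embedded ASKNET (steps 431-433)."""
    def __init__(self, c):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                                 nn.BatchNorm2d(c), nn.ReLU(inplace=True))
        # The spectral branch works on real and imaginary parts stacked channel-wise.
        self.spec_conv = nn.Conv2d(2 * c, 2 * c, 3, padding=1)
        self.asknet = ASKNET(2 * c)
        self.post = nn.Conv2d(c, c, 1)                 # final 1x1 convolution

    def forward(self, x):
        h, w = x.shape[-2:]
        y = self.pre(x)
        spec = torch.fft.rfft2(y, norm="ortho")        # step 431: to the frequency domain
        z = torch.cat([spec.real, spec.imag], dim=1)   # fuse real and imaginary parts
        z = self.asknet(self.spec_conv(z))             # step 432: 3x3 conv, then ASKNET
        re, im = z.chunk(2, dim=1)
        spec = torch.complex(re, im)
        y = torch.fft.irfft2(spec, s=(h, w), norm="ortho")  # step 433: back to space
        return self.post(x + y)                        # skip connection, then 1x1 conv
```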
Step 44: the final real-valued feature map obtained in step 43 is input into a multi-scale convolution fusion module to obtain a fused feature map.
Step 45: finally, several deconvolution layers decode the fused feature map into the repaired generated image, which serves as the output of the generator.
Step 46: the patch discriminator applies three convolution layers to the generated image to obtain a score matrix, with which it distinguishes generated images from real ones, and feeds parameters back to the generator to guide it towards images ever closer to the real ones; training ends when the discriminator can no longer distinguish real images from generated ones. The above belongs to the training phase, after which the relevant parameters are saved. In the application phase, the mask to be processed and the image occluded by that mask are concatenated and used as the input of the model; the trained parameters then let the generator produce the completion result.
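A minimal sketch of the patch discriminator (PatchGAN-style; the layer widths, kernel sizes and strides are assumptions, as the text only fixes the three-layer structure and the score matrix):

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Three strided convolutions plus a 1-channel projection; every entry of
    the output score matrix rates the realism of one image patch."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 4, 1, 4, padding=1),      # per-patch realness scores
        )

    def forward(self, x):
        return self.net(x)        # (B, 1, h', w') score matrix
```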
FIG. 5 compares completion results with a regular mask applied to the Celeb-A dataset. FIG. 5(a) is the original image, FIG. 5(b) the damaged image, FIG. 5(c) the result of AOT GAN (Aggregated Contextual-Transformation GAN), FIG. 5(d) the result of the pyramid-context encoder network PEN (Learning Pyramid-Context Encoder Network), FIG. 5(e) the result of the gated convolution network GC (Gated Convolution Network), and FIG. 5(f) the result of the present method. Compared with the other methods, the present method generates richer texture detail and achieves a globally coherent result.
FIG. 6 compares completion results with an irregular mask applied to the Celeb-A dataset. FIGS. 6(a) to 6(f) show the original image, the damaged picture, and the completion results of the AOT GAN, PEN, GC and present methods, respectively. The repair results show that the GC and PEN models tend to generate blurred textures and details with color deviation, with PEN leaving more visible repair marks. The repair content of the AOT GAN model is clearer, but the facial features it generates deviate from reality; for example, the right eye generated for the second subject in FIG. 6(c) deviates considerably from a real human eye. The repair results of the present model show higher fidelity than those of the other methods.
FIG. 7 shows a comparison of image restoration results after large-area damage. At low occlusion rates of 20% to 40%, the present method produces clearer results than the other methods: the model generates images whose textures match better, and the results hold a clear quality advantage, as shown in the first and second rows of FIG. 7. When the occlusion rate reaches 70% to 90%, the invention generates more reasonable details and its advantage in realism is obvious; the model evidently depends less on neighboring-pixel information, as shown in the third and fourth rows of FIG. 7. As FIGS. 7(c) and 7(d) show, even on images with more than 80% of their area missing, the generative capacity of the model is not greatly affected, and the generated images retain local semantic coherence and global fit. Other methods rely excessively on the valid information of neighboring pixels, producing large-area artifacts and texture faults in the missing region, and handle images with little visible information poorly.
It should be noted that the above-described embodiments are exemplary, and that a person skilled in the art may, in light of the present disclosure, devise various alternative solutions that fall within the scope of the present disclosure. It should be understood that the description and drawings are illustrative and do not limit the claims. The scope of the invention is defined by the claims and their equivalents.

Claims (1)

1. An image completion method based on the fast Fourier transform and a selective attention mechanism, characterized in that the method builds an efficient completion network with a built-in Fourier transform operation and embeds in the completion network a fast Fourier transform module carrying an aggregated selective convolution kernel network, the fast Fourier transform module giving the completion network a global receptive field, and the aggregated selective convolution kernel network adaptively adjusting the weight of each scale according to its contribution so as to help the model retain detail information while strengthening its robustness to changes in input scale, the specific method comprising the following steps:
Step 1: prepare the image completion datasets, namely the CelebA-HQ dataset and the Places2 dataset;
Step 2: randomly extract a set number of images from the CelebA-HQ and Places2 datasets as the training set, with the remaining images as the validation set;
Step 3: randomly generate irregular masks for the pictures in the training and validation sets to obtain masked images;
Step 4: input the training images into the completion network for training, the backbone of the completion network being a generative adversarial network comprising a generator and a patch discriminator, the generator being a stack of convolution modules, deconvolution modules, multi-scale convolution fusion modules and fast Fourier transform modules, and the discriminator being a patch discriminator;
the generator takes as input the concatenation of the masked image and the binary mask marking the missing region, and outputs the completed image, the generator producing the completed image through the following steps:
Step 41: the input is encoded into a low-dimensional space by a stack of three plain convolution layers; this reduces the dimensionality of the input and yields a feature vector carrying the important information of the input image;
Step 42: the feature vector encoded in step 41 is sent to a multi-scale convolution fusion module, which splits a plain convolution into four dilated sub-convolutions with dilation rates of 1, 2, 4 and 8 and then integrates the feature information of the several kernels with a plain convolution;
Step 43: the feature information integrated by the multi-scale fusion module is input into a combination of at least three stacked fast Fourier transform modules, each of which embeds an aggregated selective convolution kernel network;
Step 431: the feature information entering the fast Fourier transform module first passes through a 3×3 convolution layer and is then mapped to the frequency domain by a real Fourier transform module;
Step 432: a 3×3 convolution is then applied and the features enter the aggregated selective convolution kernel network, whose built-in mechanism for dynamically fusing multiple receptive fields supplies local information to the fast Fourier transform module while dynamically assigning receptive-field weights, helping the generator adapt to changes in the scale of the input image;
Step 433: the features processed by the aggregated selective convolution kernel network enter a real inverse Fourier transform module, which restores them to the spatial structure; a skip connection adds these spatial-structure features element-wise to the input features of the fast Fourier transform module, and a 1×1 convolution then yields a real-valued feature map;
Step 434: the real-valued feature map produced by the first fast Fourier transform module is passed in turn through the following fast Fourier transform modules, repeating the operations of steps 431 to 433, to obtain the final real-valued feature map;
Step 44: the final real-valued feature map obtained in step 43 is input into a multi-scale convolution fusion module to obtain a fused feature map;
Step 45: finally, several deconvolution layers decode the fused feature map into the completed generated image, which serves as the output of the generator;
Step 46: the patch discriminator applies three convolution layers to the generated image to obtain a score matrix, with which it distinguishes generated images from real ones, and feeds parameters back to the generator to guide it towards images closer to the real ones; training ends when the patch discriminator can no longer distinguish real images from generated ones.
CN202310661898.6A 2023-06-06 2023-06-06 Image complement method based on fast Fourier transform and selective attention mechanism Pending CN116823647A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310661898.6A CN116823647A (en) 2023-06-06 2023-06-06 Image complement method based on fast Fourier transform and selective attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310661898.6A CN116823647A (en) 2023-06-06 2023-06-06 Image complement method based on fast Fourier transform and selective attention mechanism

Publications (1)

Publication Number Publication Date
CN116823647A true CN116823647A (en) 2023-09-29

Family

ID=88119630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310661898.6A Pending CN116823647A (en) 2023-06-06 2023-06-06 Image complement method based on fast Fourier transform and selective attention mechanism

Country Status (1)

Country Link
CN (1) CN116823647A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115064A (en) * 2023-10-17 2023-11-24 南昌大学 Image synthesis method based on multi-mode control
CN117115064B (en) * 2023-10-17 2024-02-02 南昌大学 Image synthesis method based on multi-mode control

Similar Documents

Publication Publication Date Title
CN108875935B (en) Natural image target material visual characteristic mapping method based on generation countermeasure network
CN111861961A (en) Multi-scale residual error fusion model for single image super-resolution and restoration method thereof
CN110163801B (en) Image super-resolution and coloring method, system and electronic equipment
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
CN111275637A (en) Non-uniform motion blurred image self-adaptive restoration method based on attention model
CN111968053A (en) Image restoration method based on gate-controlled convolution generation countermeasure network
CN113284100B (en) Image quality evaluation method based on recovery image to mixed domain attention mechanism
CN110363068B (en) High-resolution pedestrian image generation method based on multiscale circulation generation type countermeasure network
CN113112416B (en) Semantic-guided face image restoration method
CN115546032B (en) Single-frame image super-resolution method based on feature fusion and attention mechanism
CN111161158B (en) Image restoration method based on generated network structure
CN111833261A (en) Image super-resolution restoration method for generating countermeasure network based on attention
CN115829876A (en) Real degraded image blind restoration method based on cross attention mechanism
CN115731138A (en) Image restoration method based on Transformer and convolutional neural network
CN117173024B (en) Mine image super-resolution reconstruction system and method based on overall attention
CN115484410A (en) Event camera video reconstruction method based on deep learning
CN116958534A (en) Image processing method, training method of image processing model and related device
CN114841859A (en) Single-image super-resolution reconstruction method based on lightweight neural network and Transformer
CN117252936A (en) Infrared image colorization method and system adapting to multiple training strategies
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
CN116823647A (en) Image complement method based on fast Fourier transform and selective attention mechanism
CN112785502A (en) Light field image super-resolution method of hybrid camera based on texture migration
Liu et al. Facial image inpainting using multi-level generative network
CN116152061A (en) Super-resolution reconstruction method based on fuzzy core estimation
Fan et al. Image inpainting based on structural constraint and multi-scale feature fusion

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination