CN117422783A - Training method of image sample generation model, image sample generation method and device - Google Patents
Training method of image sample generation model, image sample generation method and device
- Publication number
- CN117422783A CN117422783A CN202311206816.5A CN202311206816A CN117422783A CN 117422783 A CN117422783 A CN 117422783A CN 202311206816 A CN202311206816 A CN 202311206816A CN 117422783 A CN117422783 A CN 117422783A
- Authority
- CN
- China
- Prior art keywords
- image
- text
- content
- model
- constraint condition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
Embodiments of the present specification disclose a training method for an image sample generation model, comprising: acquiring at least one bank card image as an original image; determining a forged region and forged content in the original image based on a forgery task; determining a position constraint condition text based on the forged region; determining a content constraint condition text describing the forged content; determining a content guide text describing the forgery task; and inputting the original image into a pre-trained diffusion model for image sample generation, fine-tuning the diffusion model through the position constraint condition text, the content constraint condition text, and the content guide text during the back diffusion process. Correspondingly, an image sample generation method and apparatus are also disclosed.
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a training method for an image sample generation model, and an image sample generation method and apparatus.
Background
In recent years, with the vigorous development of deep-learning-based image generation technology, image forgery technology has also created serious hidden dangers for personal privacy information. To ensure the safety of users' funds, in a bank card recognition scenario a bank card recognition system is generally deployed to recognize input bank card images. To improve the system's ability to recognize attack images constructed by forgery technology, a large number of attack samples must be collected to train or attack-test the bank card recognition system; however, collecting a large number of bank card images as attack samples is time-consuming, labor-intensive, and costly.
Disclosure of Invention
One or more embodiments of the present specification provide a training method for an image sample generation model, an image sample generation method, and an image sample generation apparatus, which can quickly generate a large number of realistic image samples on demand, so as to improve the resistance of a bank card identification system to attacks.
According to a first aspect, there is provided a training method of an image sample generation model, comprising:
acquiring at least one bank card image as an original image;
determining a fake region and fake content in the original image based on a fake task;
determining a position constraint condition text based on the fake region;
determining content constraint text to describe the counterfeit content;
determining content guide text to describe the forgery task;
and inputting the original image into a pre-trained diffusion model to generate an image sample, and fine-tuning the diffusion model through the position constraint condition text, the content constraint condition text and the content guidance text in a back diffusion process.
As an optional implementation manner of the method of the first aspect, fine tuning the diffusion model by the position constraint text, the content constraint text and the content guidance text specifically includes:
acquiring first characteristic data of the original image after forward diffusion;
inputting the first characteristic data and the content guidance text into the diffusion model for reverse diffusion to obtain a first intermediate characteristic;
inputting the first characteristic data, the position constraint condition text and the content constraint condition text into the diffusion model for reverse diffusion to obtain a second intermediate characteristic;
performing feature fusion on the first intermediate feature and the second intermediate feature to obtain second feature data;
inputting the second characteristic data and the content guide text into the diffusion model for reverse diffusion to obtain a reconstructed image of the original image;
determining a loss function based on the original image and the reconstructed image;
and updating the diffusion model through the loss function.
Specifically, determining the loss function based on the original image and the reconstructed image specifically includes:
determining a reconstruction error between the original image and the reconstructed image;
the loss function is determined based on the reconstruction error.
Specifically, determining the loss function based on the original image and the reconstructed image specifically includes:
determining a similarity between the original image and the reconstructed image;
based on the similarity, the loss function is determined.
As an optional implementation manner of the method of the first aspect, inputting the original image into the diffusion model for generating an image sample, and in a back diffusion process, fine tuning the diffusion model through the position constraint text, the content constraint text and the content guide text specifically includes:
dividing the original image into image blocks which are not overlapped with each other;
inputting the image block into the diffusion model to generate an image block;
in the back-diffusion process, the diffusion model is fine-tuned based on the position constraint text, the content constraint text, and the content guide text for image blocks containing some or all of the counterfeit content.
According to a second aspect, there is provided an image sample generation method comprising:
acquiring at least one bank card image as a target image;
determining a target forging area and target forging contents based on a preset forging task;
determining a position constraint condition text based on the target forging area;
determining content constraint condition text based on the target forged content;
inputting the target image, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a bank card image sample;
the image sample generation model is obtained by training the training method of any one of the image sample generation models.
As an alternative embodiment of the method of the second aspect, the method further comprises:
taking a target forging area in the bank card image sample as a foreground image;
taking other areas except the target forging area in the bank card image sample as background images;
and carrying out harmonization processing on the foreground image and the background image.
As an alternative embodiment of the method of the second aspect, the method further comprises:
taking a target forging area in the bank card image sample as a foreground image;
taking other areas except the target forging area in the bank card image sample as background images;
and carrying out transparency (alpha) blending processing on the foreground image and the background image.
According to a third aspect, there is provided a training method of an image recognition model, comprising:
acquiring at least one bank card image as a positive sample;
determining a fake task based on a preset identification task;
determining a target fake area and target fake content based on the fake task;
determining a position constraint condition text based on the target forging area;
determining content constraint condition text based on the target forged content;
inputting the target image, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a generated image; the image sample generation model is obtained by training the training method of any one of the image sample generation models;
and training the image recognition model based on the positive sample and the negative sample by taking the generated image as a negative sample.
According to a fourth aspect, there is provided an attack testing method of an image recognition model, including:
acquiring at least one bank card image as an original image;
determining a fake task based on a preset test task;
determining a target fake area and target fake content based on the fake task;
determining a position constraint condition text based on the target forging area;
determining content constraint condition text based on the target forged content;
inputting the target image, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a generated image; the image sample generation model is obtained by training the training method of any one of the image sample generation models;
taking the generated image as an attack sample, and carrying out attack test on the image recognition model;
and determining the result of the attack test based on the identification result of the image identification model on the attack sample.
According to a fifth aspect, there is provided a training apparatus of an image sample generation model, comprising:
the first data acquisition module is configured to acquire at least one bank card image as an original image;
a first constraint condition generation module configured to determine a falsified region and falsified content in the original image based on a falsification task; generating a position constraint condition text based on the fake region; generating content constraint text to describe the counterfeit content; generating content guide text to describe the forgery task;
the training module is configured to input the original image into a pre-trained diffusion model for image sample generation, and fine-tune the diffusion model through the position constraint condition text, the content constraint condition text and the content guidance text in a back diffusion process.
According to a sixth aspect, there is provided an image sample generation apparatus comprising:
the second data acquisition module is configured to acquire at least one bank card image as a target image;
the second constraint condition generation module is configured to determine a target forged region and target forged contents based on a preset forged task; generating a position constraint condition text based on the target forging area; generating content constraint condition text based on the target forged content;
the image generation module is configured to input the target image, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a bank card image sample; the image sample generation model is obtained by training the training method of any one of the image sample generation models.
According to a seventh aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a training method of any one of the image sample generation models described above.
According to an eighth aspect, there is provided an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read for execution by the one or more processors, perform the steps of the training method of any of the image sample generation models described above.
According to a ninth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described image sample generation methods.
According to a tenth aspect, there is provided an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read for execution by the one or more processors, perform the steps of any of the image sample generation methods described above.
The training method for the image sample generation model has the advantage that image sample generation can be guided by the information obtained from analyzing the image forgery task, namely the content and position to be forged in the image, together with information about other regions of the image. First, the original image is input into a pre-trained diffusion model for forward diffusion to obtain first feature data; then, during back diffusion of the first feature data, the information from the image forgery task is injected separately, the resulting intermediate features are fused to obtain the required reconstructed image, and computing the loss makes the generated image more realistic. The trained image sample generation model can be applied in an image sample generation method, so that a large number of realistic image samples can be generated quickly according to test requirements, and a bank card recognition system can be trained or tested more efficiently, thereby strengthening its defensive capability.
The training device and the image sample generation device for the image sample generation model described in the embodiments of the present specification have the same advantageous effects described above.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present specification, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 schematically shows a flowchart of a training method of an image sample generation model according to an embodiment of the present disclosure in one implementation manner.
Fig. 2 schematically illustrates a method for tuning a diffusion model according to an embodiment of the present disclosure in one scenario.
Fig. 3 schematically illustrates a method for tuning a diffusion model according to an embodiment of the present disclosure in a preferred scenario.
Fig. 4 schematically shows a flowchart of an image sample generation method according to an embodiment of the present disclosure in one implementation manner.
Fig. 5 schematically shows a flowchart of a training method of an image recognition model according to an embodiment of the present disclosure in an implementation manner.
Fig. 6 schematically shows a flowchart of an attack testing method of the image recognition model according to an embodiment of the present disclosure in an implementation manner.
Fig. 7 is a block diagram schematically showing a structure of an image sample generation model training apparatus according to an embodiment of the present specification in one embodiment.
Fig. 8 is a block diagram schematically showing a configuration of an image sample generating apparatus according to an embodiment of the present specification in one embodiment.
Fig. 9 exemplarily shows a block diagram of an electronic device provided in an embodiment of the present specification.
Detailed Description
It is first noted that the terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present specification. Those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
In recent years, deep-learning-based image generation technology has developed vigorously, and image forgery technology, represented by deep-fake technology, can edit and tamper with an original image according to a text instruction, achieving an effect in which the fake passes for the real. While it provides entertainment, image forgery technology also carries certain safety hazards and has therefore drawn increasing attention. In the field of certificate recognition, a malicious attacker may use image forgery technology to tamper with the information on a certificate image and perform illegal actions for profit, creating a hidden risk of disclosure of users' private information. To ensure the safety of users' funds and information, in a bank card recognition scenario a bank card recognition system is generally deployed to recognize input bank card images and intercept forged ones. To improve the system's ability to recognize attack image samples constructed by image forgery technology, a large number of attack image samples must be collected to train or attack-test the bank card recognition system; however, collecting a large number of bank card images as attack samples is time-consuming, labor-intensive, and costly.
In view of this, it is desirable to have an efficient image sample generation scheme that can generate a large number of realistic image samples for use in a bank card identification system training or attack testing, depending on the specific needs.
The training method of the image sample generation model, the image sample generation method and the device described in the embodiments of the present specification will be further described in detail with reference to the accompanying drawings and specific embodiments of the present specification, but the detailed description does not limit the embodiments of the present specification.
In one embodiment of the present specification, a training method of an image sample generation model is presented. FIG. 1 schematically illustrates a flow chart of a training method for generating a model from image samples according to an embodiment of the present disclosure, in an embodiment.
As shown in fig. 1, the method includes steps S100-S110:
s100: at least one bank card image is acquired as an original image.
Optionally, a mobile terminal device with an image acquisition function, such as a smartphone, tablet computer, or wearable device, may be used to acquire the bank card image; alternatively, a large number of bank card image samples may be obtained from public datasets. At least one original image is required, and in order to ensure the integrity and diversity of image features, multiple bank card images may be acquired as original images.
S102: based on the forgery task, a forgery area and forgery content in the original image are determined.
S104: based on the forgery area, a location constraint text is determined.
S106: content constraint text is determined to describe the counterfeit content.
S108: content guide text is determined to describe the forgery task.
In summary, steps S102-S108 describe a method of extracting image generation constraints from the forgery task. Specifically, the forgery task indicates where the original image content needs to be forged and how the content at that position is to be forged, thereby determining the forged region and the forged content; for example, part of the multi-digit card number of a bank card may be falsified. Based on the forged region, a position constraint condition text can be determined to describe the specific position and extent of the forged region in the original image; based on the forged content, a content constraint condition text can be determined to describe the target content into which the original image content in the forged region is to be forged. In addition, a content guide text can be determined based on the forgery task, to assist image sample generation with information about other regions of the original image. For example, when the first four digits of the card number in a bank card image need to be forged, the forged region may be determined as the region where those four digits are located, the forged content may be determined as the four target forged digits, and the content guide text may be determined based on the card number digits other than the first four.
S110: and inputting the original image into a pre-trained diffusion model to generate an image sample, and fine-tuning the diffusion model through the position constraint condition text, the content constraint condition text and the content guidance text in the back diffusion process.
The diffusion model can denoise input noise to obtain a desired picture, so it can be used for image generation. Training a diffusion model involves forward diffusion and back diffusion: the forward diffusion process continually adds random noise to a picture until the picture becomes random noise; the back diffusion process continually denoises random noise by predicting the noise, so as to restore the true original image, and is also the process by which images are generated.
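By way of illustration, the forward process just described can be written in closed form as $z_t = \sqrt{\bar{\alpha}_t}\,z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$. The following is a minimal sketch of that step; the linear beta schedule and tensor shapes are illustrative assumptions, not part of the disclosure.

```python
import torch

def forward_diffuse(z0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Noise a clean latent z0 to time step t in closed form:
    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(z0)                 # the true noise sample
    a = alpha_bar[t].view(-1, 1, 1, 1)         # broadcast over (B, C, H, W)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps
    return z_t, eps

# Illustrative linear beta schedule, a common DDPM choice.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

z0 = torch.randn(2, 4, 32, 32)                 # toy latents
z_t, eps = forward_diffuse(z0, torch.tensor([10, 500]), alpha_bar)
```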
Specifically, latent diffusion models (LDMs) pre-trained on a large-scale dataset can be used as the base model, and realistic image samples can be generated by fine-tuning the pre-trained LDMs. LDMs are a class of denoising diffusion probabilistic models (DDPMs) that comprise an autoencoder trained on images and a diffusion model learned over the latent space built by the autoencoder. The encoder encodes the original image into a latent representation by adding noise; the diffusion model reconstructs the original image from the latent representation by estimating the noise and denoising.
In some more specific embodiments, training may be performed on LDM-based models conditioned on category labels, images, text prompts, and the like, with the conditional LDM learning objective as follows:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\big\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \big\|_2^2\Big] \qquad (1)$$

where $\mathcal{E}$ denotes the encoder and $x$ the original image; the latent representation obtained by adding noise to the original image is denoted $z$; $\epsilon$ is the unscaled true noise sample; $\epsilon_\theta$ is the diffusion model that estimates the true noise $\epsilon$; $t$ is the time step and $z_t$ is the latent noise corresponding to time $t$; $y$ is the conditioning input, which $\tau_\theta$ maps to a condition vector.
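The following sketch shows how the loss of Equation (1) might be computed for one training step. The names `eps_model`, `text_encoder`, and `encoder` are placeholders for $\epsilon_\theta$, $\tau_\theta$, and $\mathcal{E}$; they are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def ldm_loss(eps_model, text_encoder, encoder, x, y, t, alpha_bar):
    """One conditional-LDM training step: predict the injected noise (Eq. 1)."""
    z0 = encoder(x)                               # latent of the clean image
    eps = torch.randn_like(z0)                    # unscaled true noise sample
    a = alpha_bar[t].view(-1, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps    # latent noise at time step t
    cond = text_encoder(y)                        # tau_theta maps y to a condition vector
    eps_hat = eps_model(z_t, t, cond)             # epsilon_theta estimates the noise
    return F.mse_loss(eps_hat, eps)               # mean square error of Eq. (1)
```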
Based on a conditional input such as a text prompt, a random noise tensor can be sampled and denoised according to the inference schedule to generate new latent noise, which is progressively removed so as to reconstruct the original image under the conditional input. In addition, individual images can be fine-tuned by constructing text prompts such as "a photo/painting of a [*] [class noun]", where [*] is an identifier and [class noun] is a class description.
Based on the above equation, the loss function may be determined by computing the mean square error between the true noise and the noise estimated by the diffusion model, thereby training the diffusion model.
Fig. 2 schematically illustrates a method for tuning a diffusion model according to an embodiment of the present disclosure in one scenario.
In some embodiments, the original image is input into a diffusion model for image sample generation, and in the back diffusion process, the diffusion model is fine-tuned by a position constraint text, a content constraint text and a content guidance text, which specifically comprises:
dividing an original image into image blocks which are not overlapped with each other;
inputting the image block into a diffusion model to generate the image block;
in the back-diffusion process, for image blocks containing part or all of the forged content, the diffusion model is fine-tuned based on the position constraint text, the content constraint text, and the content guide text.
As shown in fig. 2, the original image is first divided into several non-overlapping image blocks, which are input into the encoder $\mathcal{E}$ to obtain the corresponding latent representation $z$; the latent representation $z$ is then input into the diffusion model for forward diffusion to obtain random noise $z_T$.
During back diffusion, the block position and the position constraint text are determined from the image blocks that contain part or all of the forged content, and a general-purpose text encoder $\tau_\theta$ is used to generate a language description containing the content constraint text and the content guide text, uniformly recorded as $c$, for example the content of the card number segment that needs to be forged. The random noise $z_T$, the position constraint text, and the language description $c$ are input into the diffusion model; after the latent representation $z$ of the original image is reconstructed, it is passed through a decoder $\mathcal{D}$ to generate the reconstructed image $\hat{x}$. The loss function may then be determined by computing the difference between the reconstructed image and the original image, so as to fine-tune the diffusion model.
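A minimal sketch of the block-splitting step in fig. 2 follows; square blocks and boxes given as (x1, y1, x2, y2) are illustrative assumptions.

```python
import torch

def split_into_blocks(img: torch.Tensor, block: int) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping (C, block, block) tiles."""
    c, h, w = img.shape
    assert h % block == 0 and w % block == 0, "pad the image first"
    tiles = img.unfold(1, block, block).unfold(2, block, block)
    return tiles.reshape(c, -1, block, block).transpose(0, 1)  # (N, C, block, block)

def block_overlaps_forgery(block_box, forged_box) -> bool:
    """True if an image block intersects the forged region.
    Boxes are (x1, y1, x2, y2); the format is an illustrative assumption."""
    x1, y1, x2, y2 = block_box
    fx1, fy1, fx2, fy2 = forged_box
    return not (x2 <= fx1 or fx2 <= x1 or y2 <= fy1 or fy2 <= y1)
```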
In some embodiments, fine tuning the diffusion model by location constraint text, content constraint text, and content guidance text specifically includes:
acquiring first characteristic data of an original image after forward diffusion;
inputting the first characteristic data and the content guide text into a diffusion model to carry out back diffusion to obtain a first intermediate characteristic;
inputting the first characteristic data, the position constraint condition text and the content constraint condition text into a diffusion model for reverse diffusion to obtain a second intermediate characteristic;
performing feature fusion on the first intermediate feature and the second intermediate feature to obtain second feature data;
inputting the second characteristic data and the content guide text into a diffusion model for back diffusion to obtain a reconstructed image of the original image;
determining a loss function based on the original image and the reconstructed image;
the diffusion model is updated by the loss function.
The inputs to the back diffusion process include first feature data, text guidance, and location constraint text, and the text guidance includes content constraint text and content guidance text. The fine tuning of the diffusion model is divided into three phases: the first stage comprises the steps of firstly, respectively combining input data to obtain intermediate features; the second stage fuses all the obtained intermediate features, and then combines the content guide text to carry out back diffusion, so as to reconstruct the feature representation of the reconstructed image, and finally, a decoder is used for decoding to obtain the reconstructed image; the third stage updates the diffusion model based on the differences between the original image and the reconstructed image.
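The three stages can be sketched as follows. Here `model` and `decoder` stand in for the diffusion model and the decoder, the embeddings stand in for the encoded texts, and the weighted sum is one simple fusion choice; the embodiment does not fix the fusion operator, and a real loop would iterate this over all time steps.

```python
import torch
import torch.nn.functional as F

def finetune_step(model, decoder, z_t, t, guide_emb, constraint_emb, original, w=0.5):
    """Sketch of the staged fine-tuning described above.

    Stage 1: two back-diffusion branches, one conditioned on the content guide
    text and one on the position/content constraint texts.
    Stage 2: fuse the intermediate features, finish back diffusion under the
    content guide text, and decode the reconstruction.
    Stage 3: compare the reconstruction against the original image.
    """
    feat1 = model(z_t, t, guide_emb)           # first intermediate feature
    feat2 = model(z_t, t, constraint_emb)      # second intermediate feature
    fused = w * feat1 + (1 - w) * feat2        # one simple fusion choice
    z_hat = model(fused, t, guide_emb)         # final back-diffusion pass
    recon = decoder(z_hat)                     # reconstructed image
    return F.mse_loss(recon, original)         # reconstruction error as the loss
```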
Specifically, determining a loss function based on the original image and the reconstructed image specifically includes:
determining a reconstruction error between the original image and the reconstructed image;
based on the reconstruction error, a loss function is determined.
After determining the loss function based on the reconstruction error, the diffusion model may be fine-tuned with the goal of minimizing the reconstruction error.
Optionally, determining the loss function based on the original image and the reconstructed image specifically includes:
Determining the similarity between the original image and the reconstructed image;
based on the similarity, a loss function is determined.
Specifically, the loss function may be determined by calculating cosine similarity between the original image and the reconstructed image, and fine tuning the diffusion model with the cosine similarity maximized as a target.
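The two loss variants described above might be written as follows; both operate directly on image tensors and are minimal sketches.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(original: torch.Tensor, recon: torch.Tensor) -> torch.Tensor:
    """Variant 1: minimize the pixel-wise reconstruction error."""
    return F.mse_loss(recon, original)

def similarity_loss(original: torch.Tensor, recon: torch.Tensor) -> torch.Tensor:
    """Variant 2: maximize cosine similarity, expressed as a loss to minimize."""
    a = original.flatten(start_dim=1)          # (B, C*H*W)
    b = recon.flatten(start_dim=1)
    return 1.0 - F.cosine_similarity(a, b, dim=1).mean()
```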
Fig. 3 schematically illustrates a method for tuning a diffusion model according to an embodiment of the present disclosure in a preferred scenario.
Referring to fig. 3, in the specific case of bank card number forgery, additional text guidance on the target generated image is added during back diffusion to edit the image. The first feature data obtained by forward diffusion of the original image is recorded as random noise $z_T$; when the diffusion model samples from $z_T$, its dimensions correspond to the desired output resolution. The content guide text $c$ and the content constraint text $\hat{c}$ are given by the pre-trained text encoders $\tau_\theta$ and $\hat{\tau}_\theta$, respectively, under the direction of the forgery task: the content constraint text $\hat{c}$ describes the card number segment to be forged, while the content guide text $c$ may describe the digits other than that segment, as shown in fig. 3. For fine-tuning the diffusion model, the position constraint text describing the forged region is input in addition to the content constraint text $\hat{c}$. For the first $K$ steps, results are computed with both the fine-tuned model and the pre-trained-only model $\epsilon_\theta$, and a linear combination of the two prediction results is then used.
Alternatively, inspired by classifier-free guidance, a method can be introduced for the diffusion model that overcomes over-fitting when fine-tuning on a single image. Classifier-free guidance is a technique widely adopted by prior text-to-image diffusion models: a single diffusion model is trained with both conditional and unconditional objectives by randomly dropping the condition during training. During back diffusion, a linear combination of the conditional and unconditional estimates is used:
$$\tilde{\epsilon}_\theta(z_t, c) = w\,\epsilon_\theta(z_t, c) + (1 - w)\,\epsilon_\theta(z_t) \qquad (2)$$

where $\tilde{\epsilon}_\theta$ is the prediction result of the diffusion model; $\epsilon_\theta(z_t, c)$ and $\epsilon_\theta(z_t)$ are the conditional and unconditional diffusion model predictions, respectively; $c$ is the condition vector generated by $\tau_\theta$, representing the content guide text; and $w$ is the guidance weight. Optionally, Tweedie's formula, $\hat{z}_0 = \big(z_t - \sqrt{1-\bar{\alpha}_t}\,\tilde{\epsilon}_\theta(z_t, c)\big)\,/\,\sqrt{\bar{\alpha}_t}$, may be used for noise prediction, where $\bar{\alpha}_t$ is a function of the time step $t$ that can affect the sampling quality.
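Equation (2) can be sketched as below; the convention that passing `None` as the condition yields the unconditional estimate is an assumption of this sketch.

```python
import torch

def cfg_prediction(eps_model, z_t, t, cond, w: float = 7.5):
    """Classifier-free guidance per Eq. (2): a linear combination of the
    conditional and unconditional noise estimates."""
    eps_cond = eps_model(z_t, t, cond)    # epsilon_theta(z_t, c)
    eps_uncond = eps_model(z_t, t, None)  # epsilon_theta(z_t), condition dropped
    return w * eps_cond + (1.0 - w) * eps_uncond
```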
Since only a single image, together with a corresponding text descriptor for that image, is used as training data, the diffusion model is prone to overfitting, and language drift occurs after fine-tuning, so that the fine-tuned model can no longer synthesize images containing features directed by other texts. This overfitting problem may be due to the fact that only one repeated text prompt is used during fine-tuning, so that other text prompts are no longer accurate enough to control editing.
In a preferred embodiment, to mitigate overfitting of the fine-tuned diffusion model, the pre-trained text-to-image diffusion model can further be used to generate an image under the provided text guidance, and the fine-tuned diffusion model is used to provide the image content features by combining the predictions of the two models, similar to classifier-free guidance. Specifically, let $\hat{\epsilon}_\theta$ denote the fine-tuned diffusion model and $\epsilon_\theta$ the pre-trained diffusion model. During back diffusion, within a specified number of steps, the fine-tuned model guides the pre-trained model through a linear combination of the predictions from each model. The diffusion model prediction of Equation (2) is therefore converted into:

$$\tilde{\epsilon}_\theta = v\,\hat{\epsilon}_\theta(z_t, \hat{c}) + (1 - v)\,\epsilon_\theta(z_t, c) \qquad (3)$$

where $v$ is the model combination weight; $\hat{c}$ is the content constraint text; and $c$ is the content guide text.
Alternatively, to prevent artifacts from the over-fitted model and to preserve the fidelity of the generated image, Equation (3) can be used for prediction when $t > K$, and Equation (2) when $t \le K$. From $K$ down to $0$, the denoising process depends only on the pre-trained model, so this method makes full use of the generalization ability of the pre-trained model. It should be noted that this method can be generalized to multiple condition prompts and even multiple modalities.
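The timestep-dependent switch between Equations (2) and (3) might look like this; `eps_pre` and `eps_tuned` are placeholders for the pre-trained and fine-tuned models.

```python
def guided_prediction(eps_pre, eps_tuned, z_t, t, c, c_hat, K: int, w: float, v: float):
    """Timestep-dependent switch between the two guidance rules:
    Eq. (3), mixing fine-tuned and pre-trained predictions, while t > K;
    plain classifier-free guidance on the pre-trained model once t <= K."""
    if t > K:
        return v * eps_tuned(z_t, t, c_hat) + (1.0 - v) * eps_pre(z_t, t, c)
    eps_cond = eps_pre(z_t, t, c)
    eps_uncond = eps_pre(z_t, t, None)   # condition dropped, as in Eq. (2)
    return w * eps_cond + (1.0 - w) * eps_uncond
```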
Through this model-based classifier-free guidance, images can be edited and manipulated under the given text guidance, the over-fitting of the model is mitigated, and the quality of the generated images is improved.
In some more specific embodiments, it is further shown how the fine-tuning process on a single training image can be improved so that the fine-tuned diffusion model better understands the content and geometry of the image. It can thereby provide better content guidance for large-scale text-to-image diffusion models at sampling time, and shows the potential to generate images of arbitrary resolution.
In the current diffusion model fine-tuning process, given an input image $I$ of resolution $H \times W$, a downsampled latent representation $z$ of the image can be obtained from the encoder of the pre-trained diffusion model. Since the text-to-image diffusion model is pre-trained at a fixed resolution $p \times p$, the input image needs to be resized to a corresponding resolution $sp \times sp$, where $s$ is the downscaling factor of the encoder, to match the resolution and thereby reduce the fine-tuning cost. Essentially, the diffusion model learns prior knowledge of the correlation between position and content information; consequently, when sampling from a higher-resolution noise tensor, the generated latent representations can lead to artifacts such as repetition or positional offset.
To address the above limitation, a simple and effective fine-tuning method can be adopted: a single training image is treated as a function of the coordinates of each pixel, with $[0, H] \times [0, W]$ as the boundary. The diffusion model still produces latent representations at the fixed resolution $p \times p$, but each latent representation now corresponds to a sub-region of the image, denoted $v = [h_1, w_1, h_2, w_2]$, where $(h_1, w_1) \in [0, H] \times [0, W]$ and $(h_2, w_2) \in (h_1, H] \times (w_1, W]$ are the upper-left and lower-right coordinates of the region, respectively. During fine-tuning, different sub-regions $v$ are set to sample image blocks from the image, and the blocks are resized to the resolution $sp \times sp$. The sampled patch may be denoted $x^{(v)}$, and its encoded latent code is $z^{(v)} = \mathcal{E}(x^{(v)})$, where $F(v)$ denotes the normalization and Fourier embedding of the particular region.
The diffusion model takes the normalized Fourier embedding as input so that it learns the position-content correlation. Formally, the diffusion model is defined as $\epsilon_\theta(z_t^{(v)}, t, F(v))$.
After fine-tuning, latent representations at different resolutions can be generated by providing image position information directly to the diffusion model. Arbitrary-resolution image editing is performed by providing two inputs to the diffusion model: the position embedding of the entire image, and a randomly sampled noise latent whose dimensions correspond to the desired resolution. When sampled at any resolution, the model still keeps the structure of the original image unchanged; notably, given the correct position encoding, the diffusion model preserves the scale of the original image.
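A normalized Fourier embedding of a sub-region $v = [h_1, w_1, h_2, w_2]$ could be computed as below; the number of frequencies is an illustrative choice.

```python
import torch

def fourier_region_embedding(v, H: int, W: int, num_freqs: int = 8) -> torch.Tensor:
    """Normalize a sub-region v = [h1, w1, h2, w2] to [0, 1] and apply a
    Fourier feature embedding so the model can tie position to content."""
    h1, w1, h2, w2 = v
    coords = torch.tensor([h1 / H, w1 / W, h2 / H, w2 / W])
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)  # 1, 2, 4, ...
    angles = coords[:, None] * freqs[None, :] * torch.pi         # (4, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten()  # (8*num_freqs,)

emb = fourier_region_embedding([0, 0, 256, 256], H=512, W=512)
```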
In some embodiments of the present disclosure, there is further provided an image sample generating method, as shown in fig. 4, including:
s200: at least one bank card image is acquired as a target image.
S202: and determining the target fake area and the target fake content based on the preset fake task.
S204: based on the target forgery area, a location constraint text is determined.
S206: content constraint text is determined based on the targeted counterfeit content.
S208: inputting the target image, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a bank card image sample.
S210: the image sample generation model is obtained by training the training method of any one of the image sample generation models.
Taking card number forgery as an example, the preset forgery task includes the content and the digit positions of the card number to be forged: the target forged content and the content constraint condition text can be determined from the content of the forged card number, and the target forged region and the position constraint condition text from the positions of the forged digits. It should be noted that the digits other than the card number segment to be forged have already been learned during training of the image sample generation model, so they can guide the generation of the forged image. After the target image, the position constraint condition text, and the content constraint condition text are input into the pre-trained image sample generation model, the model generates the required image under the guidance of the two constraint condition texts, yielding a forged bank card image sample. This image sample generation method is highly efficient while guaranteeing the quality and fidelity of the image samples, so the samples can be used to train or test a bank card recognition system and strengthen its defensive capability.
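At inference time, steps S200-S210 reduce to a single call. The prompt templates below are illustrative placeholders; a deployed system would derive them from the preset forgery task.

```python
def generate_card_sample(gen_model, target_image, forged_box, forged_digits):
    """End-to-end use of a trained generator, following steps S200-S210.
    gen_model, the box format, and the prompt wording are all assumptions."""
    position_text = f"card number segment located at {forged_box}"
    content_text = f"replace the segment with the digits {forged_digits}"
    return gen_model(target_image, position_text, content_text)
```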
In some embodiments of the present disclosure, a training method of an image recognition model is provided, please refer to fig. 5, including:
s300: at least one bank card image is acquired as a positive sample.
S302: based on the preset identification task, a fake task is determined.
S304: based on the forgery task, a target forgery area and a target forgery content are determined.
S306: based on the target forgery area, a location constraint text is determined.
S308: content constraint text is determined based on the targeted counterfeit content.
S310: inputting a target image, a position constraint condition text and a content constraint condition text into a pre-trained image sample generation model to obtain a generated image; the image sample generation model is obtained by training the training method of any one of the image sample generation models.
S312: the generated image is taken as a negative sample, and an image recognition model is trained based on the positive sample and the negative sample.
The image sample generation method in this embodiment can forge on the basis of a real bank card image to generate a forged bank card image, thereby forming a positive-negative sample pair: the real bank card image is taken as a positive sample and given a positive label, and the forged bank card image is taken as a negative sample and given a negative label, together forming the training samples.
Taking a bank card recognition model as an example, the model needs to recognize the authenticity of a bank card from the input bank card image. During model training, an image sample is input and a corresponding recognition result is obtained, namely that the bank card in the image is real or forged; a loss function is then determined based on the difference between the recognition result and the sample label, and the image recognition model is trained with the goal of minimizing that difference.
In some specific embodiments, the image recognition model can be built based on the structure of the ResNet or Transformer network.
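A minimal sketch of step S312 with a ResNet backbone follows, using torchvision's resnet18 as a binary real-vs-forged classifier; full-batch training and the label convention (1 = real, 0 = forged) are simplifications for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def train_recognizer(real_images, forged_images, epochs: int = 5, lr: float = 1e-4):
    """Train a binary real-vs-forged classifier from positive/negative samples."""
    model = resnet18(num_classes=2)            # ResNet backbone, as suggested above
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    x = torch.cat([real_images, forged_images])
    y = torch.cat([torch.ones(len(real_images), dtype=torch.long),     # 1 = real
                   torch.zeros(len(forged_images), dtype=torch.long)]) # 0 = forged
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)    # difference vs. the sample labels
        loss.backward()
        opt.step()
    return model
```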
In other embodiments of the present disclosure, an attack testing method of an image recognition model is further provided, as shown in fig. 6, including:
s400: at least one bank card image is acquired as an original image.
S402: based on a preset test task, a fake task is determined.
S404: based on the forgery task, a target forgery area and a target forgery content are determined.
S406: based on the target forgery area, a location constraint text is determined.
S408: content constraint text is determined based on the targeted counterfeit content.
S410: inputting a target image, a position constraint condition text and a content constraint condition text into a pre-trained image sample generation model to obtain a generated image; the image sample generation model is obtained by training the training method of any one of the image sample generation models.
S412: and taking the generated image as an attack sample, and carrying out attack test on the image recognition model.
S414: and determining the result of the attack test based on the identification result of the image identification model on the attack sample.
In the attack testing process of the image recognition model, the pre-trained image sample generation model can be used as a policy model that adjusts the attack strategy according to the recognition results of the image recognition model and generates new attack samples, continually updating the attack so as to test the attack resistance of the image recognition model quickly and effectively.
After an attack sample is generated by using the image sample generation model, the attack sample is input into an image recognition model to be tested, and an image recognition result is obtained. In some specific embodiments, the image recognition model can be built based on the structure of the ResNet or Transformer network.
Based on the image recognition results, the attack result of each round can be determined, and the breach rate of that round computed; the image recognition model being breached means that it recognizes a forged bank card image as a real one. The breach rate can be computed as the ratio of the number of times the image recognition model is breached to the total number of attacks within a certain period. In each round of attack, the higher the breach rate, the weaker the defensive capability of the image recognition model and the greater the need for optimization. Therefore, for each round of attack, a positive excitation signal is determined in response to the breach rate of the round being higher than that of the previous round, and a negative excitation signal in response to it being lower. The positive excitation signal marks generated images that effectively breach the image recognition model, so that the policy model is updated in the direction of increasing the breach rate and generates a new attack strategy, allowing the image recognition model to be optimized in a targeted manner later.
The excitation signal and the image recognition result are then input into the policy model, i.e. the image sample generation model, so that it updates its attack strategy by reinforcement learning and generates new forged images as attack samples. With the increase of the breach rate as positive excitation, round-by-round learning and updating enables the attack samples generated by the image sample generation model to breach the image recognition model stably.
When the image recognition model receives each round of attack, if the breach rate of the round is higher than a preset threshold, negative labels are added to the forged images corresponding to that round, and these images are used as training samples. Specifically, an image sample is input into the image recognition model to obtain a recognition result, the difference between the recognition result and the negative sample label is computed, and the image recognition model is optimized with the goal of minimizing this difference, until an image recognition model that stably recognizes forged images is obtained.
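The breach-rate bookkeeping and excitation signal described above are simple to state in code; the ±1 encoding of the excitation signal is an illustrative convention.

```python
def breach_rate(num_breaches: int, num_attacks: int) -> float:
    """Fraction of attack samples the recognizer accepted as genuine."""
    return num_breaches / max(num_attacks, 1)

def excitation_signal(current_rate: float, previous_rate: float) -> int:
    """+1 (positive excitation) if the breach rate rose versus the previous
    round, -1 (negative excitation) if it fell, 0 if unchanged."""
    if current_rate > previous_rate:
        return 1
    if current_rate < previous_rate:
        return -1
    return 0
```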
Some embodiments of the present disclosure provide a training apparatus for generating a model from image samples, as shown in fig. 7, including:
a first data acquisition module 50 configured to acquire at least one bank card image as an original image;
A first constraint condition generation module 52 configured to determine a falsified region and falsified content in the original image based on the falsification task; generating a position constraint condition text based on the fake region; generating content constraint text to describe the counterfeit content; generating content guide text to describe the forgery task;
the training module 54 is configured to input the raw image into a pre-trained diffusion model for image sample generation and fine-tune the diffusion model during back-diffusion with position constraint text, content constraint text, and content guide text.
Some embodiments of the present disclosure further provide an image sample generating apparatus, as shown in fig. 8, including:
a second data acquisition module 60 configured to acquire at least one bank card image as a target image;
a second constraint condition generation module 62 configured to determine a target falsified region and a target falsified content based on a preset falsification task; generating a position constraint condition text based on the target forging area; generating content constraint condition text based on the target forged content;
the image generation module 64 is configured to input the target image, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a bank card image sample; the image sample generation model is obtained by training the training method of any one of the image sample generation models.
One embodiment of the present specification provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a training method of any one of the image sample generation models described above.
One embodiment in the present specification also provides an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read for execution by the one or more processors, perform the steps of the training method of any of the image sample generation models described above.
An embodiment in the present specification provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any one of the above-described image sample generation methods.
One embodiment in the present specification also provides an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read for execution by the one or more processors, perform the steps of any of the image sample generation methods described above.
Fig. 9 exemplarily shows a block diagram of an electronic device provided in an embodiment of the present specification, namely a schematic structural diagram of a computer system 700 of a terminal device or server suitable for implementing an embodiment of the present invention. The terminal device or server shown in fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
In one typical configuration, computer system 700 includes one or more processors (CPUs) 702, an input interface 704, an output interface 706, a network interface 708, and a memory 710.
Memory 710 may include computer-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage, quantum memory, graphene-based storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present specification to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present specification. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
It should be noted that the above-mentioned embodiments are merely examples of the present invention; the present invention is obviously not limited to these embodiments, and many similar variations are possible. All modifications that a person skilled in the art could derive directly from, or readily conceive of based on, the present disclosure shall be deemed to fall within the scope of the present invention.
Claims (16)
1. A training method of an image sample generation model, comprising:
acquiring at least one bank card image as an original image;
determining a forgery region and forgery content in the original image based on a forgery task;
determining a position constraint condition text based on the forgery region;
determining a content constraint condition text describing the forgery content;
determining a content guidance text describing the forgery task;
and inputting the original image into a pre-trained diffusion model for image sample generation, and fine-tuning the diffusion model through the position constraint condition text, the content constraint condition text and the content guidance text during reverse diffusion.
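As an editorial illustration only (the claims prescribe no data format or API), the following Python sketch shows one way the three conditioning texts of claim 1 could be derived from a forgery task; every class, field and string template here is a hypothetical assumption.

```python
# Hypothetical sketch of claim 1's text construction; nothing below is
# mandated by the patent.
from dataclasses import dataclass

@dataclass
class ForgeryTask:
    region: tuple          # (x, y, w, h) of the region to be forged
    content: str           # e.g. "expiry date 12/29"
    description: str       # e.g. "tamper with the expiry date field"

def build_condition_texts(task: ForgeryTask):
    # Position constraint condition text: derived from the forgery region.
    x, y, w, h = task.region
    position_text = f"region at x={x}, y={y}, width={w}, height={h}"
    # Content constraint condition text: describes the forgery content.
    content_text = f"replace field content with '{task.content}'"
    # Content guidance text: describes the forgery task itself.
    guidance_text = task.description
    return position_text, content_text, guidance_text
```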
2. The method of claim 1, wherein fine-tuning the diffusion model through the position constraint condition text, the content constraint condition text and the content guidance text comprises:
acquiring first feature data of the original image after forward diffusion;
inputting the first feature data and the content guidance text into the diffusion model for reverse diffusion to obtain a first intermediate feature;
inputting the first feature data, the position constraint condition text and the content constraint condition text into the diffusion model for reverse diffusion to obtain a second intermediate feature;
performing feature fusion on the first intermediate feature and the second intermediate feature to obtain second feature data;
inputting the second feature data and the content guidance text into the diffusion model for reverse diffusion to obtain a reconstructed image of the original image;
determining a loss function based on the original image and the reconstructed image;
and updating the diffusion model through the loss function.
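Purely as an illustration, a minimal PyTorch-style sketch of the fine-tuning pass in claim 2 might look as follows; the diffusion model, text encoder and forward-diffusion routine are hypothetical callables, and the averaging used for feature fusion is an assumption, since the claim does not fix the fusion operator.

```python
import torch

def finetune_step(diffusion, original, position_text, content_text,
                  guidance_text, encode_text, forward_diffuse, optimizer):
    """One fine-tuning pass following the steps of claim 2 (sketch only)."""
    # First feature data: the original image after forward diffusion.
    z1 = forward_diffuse(original)

    # First intermediate feature: reverse diffusion conditioned on the
    # content guidance text alone.
    h1 = diffusion(z1, cond=encode_text(guidance_text))

    # Second intermediate feature: reverse diffusion conditioned on the
    # position and content constraint condition texts.
    h2 = diffusion(z1, cond=encode_text(position_text + " ; " + content_text))

    # Feature fusion; a plain average is assumed here.
    z2 = 0.5 * (h1 + h2)

    # Reverse diffusion of the fused features yields the reconstructed image.
    recon = diffusion(z2, cond=encode_text(guidance_text))

    # Loss from the reconstruction, then a parameter update of the model.
    loss = torch.nn.functional.mse_loss(recon, original)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```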
3. The method of claim 2, wherein determining the loss function based on the original image and the reconstructed image comprises:
determining a reconstruction error between the original image and the reconstructed image;
determining the loss function based on the reconstruction error.
4. The method of claim 2, wherein determining the loss function based on the original image and the reconstructed image comprises:
determining a similarity between the original image and the reconstructed image;
determining the loss function based on the similarity.
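The two loss variants of claims 3 and 4 admit many concrete instantiations; the sketch below assumes mean squared error for the reconstruction error and cosine similarity for the similarity measure, neither of which is mandated by the claims. Batched tensors of shape (N, ...) are assumed.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(original: torch.Tensor, reconstructed: torch.Tensor):
    # Claim 3 variant: loss from a pixel-wise reconstruction error (MSE assumed).
    return F.mse_loss(reconstructed, original)

def similarity_loss(original: torch.Tensor, reconstructed: torch.Tensor):
    # Claim 4 variant: loss from a similarity measure (cosine assumed);
    # higher similarity yields lower loss.
    sim = F.cosine_similarity(original.flatten(1), reconstructed.flatten(1), dim=1)
    return (1.0 - sim).mean()
```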
5. The method according to claim 1, wherein inputting the original image into the diffusion model for image sample generation, and fine-tuning the diffusion model through the position constraint condition text, the content constraint condition text and the content guidance text during reverse diffusion, specifically comprises:
dividing the original image into image blocks that do not overlap one another;
inputting the image blocks into the diffusion model for image sample generation;
and, during reverse diffusion, fine-tuning the diffusion model based on the position constraint condition text, the content constraint condition text and the content guidance text for the image blocks containing some or all of the forgery content.
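A minimal sketch of claim 5's tiling step, assuming square blocks and image dimensions divisible by the block size (both assumptions, since the claim only requires non-overlapping blocks):

```python
import torch

def split_into_blocks(image: torch.Tensor, block: int):
    # image: (C, H, W); H and W are assumed to be multiples of `block`.
    c, h, w = image.shape
    return (image.unfold(1, block, block)   # tile along height
                 .unfold(2, block, block)   # tile along width
                 .permute(1, 2, 0, 3, 4)    # (rows, cols, C, block, block)
                 .reshape(-1, c, block, block))

def blocks_touching_region(num_rows, num_cols, block, region):
    # region: (x, y, w, h) of the forgery area; returns indices of the
    # blocks containing some or all of the forgery content.
    x, y, w, h = region
    hits = []
    for r in range(num_rows):
        for col in range(num_cols):
            bx, by = col * block, r * block
            if bx < x + w and bx + block > x and by < y + h and by + block > y:
                hits.append(r * num_cols + col)
    return hits
```

Only the blocks returned by `blocks_touching_region` would then pass through the constrained fine-tuning of claim 1.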
6. An image sample generation method, comprising:
acquiring at least one bank card image as a target image;
determining a target forgery region and target forgery content based on a preset forgery task;
determining a position constraint condition text based on the target forgery region;
determining a content constraint condition text based on the target forgery content;
inputting the target image, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a bank card image sample;
the image sample generation model being trained by the method of any one of claims 1 to 5.
7. The method of claim 6, further comprising:
taking the target forgery region in the bank card image sample as a foreground image;
taking the areas of the bank card image sample other than the target forgery region as a background image;
and performing harmonization processing on the foreground image and the background image.
8. The method of claim 6, further comprising:
taking the target forgery region in the bank card image sample as a foreground image;
taking the areas of the bank card image sample other than the target forgery region as a background image;
and performing transparency blending on the foreground image and the background image.
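Claim 8's transparency blending corresponds to standard alpha blending of the forged foreground over the original background; the sketch below assumes a scalar mixing weight and a binary region mask. Claim 7's harmonization step is typically performed by a learned harmonization model and is not sketched here.

```python
import numpy as np

def alpha_blend(foreground: np.ndarray, background: np.ndarray,
                mask: np.ndarray, alpha: float = 0.9) -> np.ndarray:
    # foreground/background: (H, W, C); mask: (H, W), 1 inside the target
    # forgery region and 0 elsewhere. Inside the region the forged pixels
    # are mixed with the original background so the edit blends in; outside
    # it the background is kept unchanged.
    weight = alpha * mask[..., None]          # broadcast over channels
    blended = weight * foreground + (1.0 - weight) * background
    return blended.astype(background.dtype)
```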
9. A training method of an image recognition model, comprising:
acquiring at least one bank card image as a positive sample;
determining a forgery task based on a preset recognition task;
determining a target forgery region and target forgery content based on the forgery task;
determining a position constraint condition text based on the target forgery region;
determining a content constraint condition text based on the target forgery content;
inputting the positive sample, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a generated image, the image sample generation model being trained by the method of any one of claims 1 to 5;
and taking the generated image as a negative sample, and training the image recognition model based on the positive sample and the negative sample.
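A schematic rendering of claim 9's final step, training a binary recognizer on real bank card images as positive samples and generated forgeries as negative samples; the model, learning rate and tensor shapes are hypothetical placeholders.

```python
import torch

def train_recognizer(model, positives, negatives, epochs=1, lr=1e-4):
    # positives/negatives: lists of (C, H, W) tensors; label 1 = genuine,
    # label 0 = generated forgery.
    samples = torch.stack(positives + negatives)
    labels = torch.tensor([1.0] * len(positives) + [0.0] * len(negatives))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        logits = model(samples).squeeze(-1)   # (N,) genuineness logits
        loss = loss_fn(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```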
10. An attack test method of an image recognition model, comprising the following steps:
acquiring at least one bank card image as an original image;
determining a forgery task based on a preset test task;
determining a target forgery region and target forgery content based on the forgery task;
determining a position constraint condition text based on the target forgery region;
determining a content constraint condition text based on the target forgery content;
inputting the original image, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a generated image, the image sample generation model being trained by the method of any one of claims 1 to 5;
taking the generated image as an attack sample and performing an attack test on the image recognition model;
and determining a result of the attack test based on the recognition result of the image recognition model on the attack sample.
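Claim 10's attack test can be summarized as measuring how often the recognizer accepts a generated forgery as genuine; the score semantics and threshold in this sketch are assumptions, not part of the claim.

```python
def attack_test(recognizer, attack_samples, threshold=0.5):
    # recognizer(sample) is assumed to return a scalar probability that the
    # image is genuine; a forgery scoring at or above the threshold has
    # evaded detection.
    evaded = sum(1 for s in attack_samples if recognizer(s) >= threshold)
    # The attack success rate is the share of forged samples the model misses.
    return evaded / max(len(attack_samples), 1)
```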
11. A training apparatus for an image sample generation model, comprising:
a first data acquisition module configured to acquire at least one bank card image as an original image;
a first constraint condition generation module configured to determine a forgery region and forgery content in the original image based on a forgery task; generate a position constraint condition text based on the forgery region; generate a content constraint condition text describing the forgery content; and generate a content guidance text describing the forgery task;
and a training module configured to input the original image into a pre-trained diffusion model for image sample generation, and fine-tune the diffusion model through the position constraint condition text, the content constraint condition text and the content guidance text during reverse diffusion.
12. An image sample generation apparatus comprising:
a second data acquisition module configured to acquire at least one bank card image as a target image;
a second constraint condition generation module configured to determine a target forgery region and target forgery content based on a preset forgery task; generate a position constraint condition text based on the target forgery region; and generate a content constraint condition text based on the target forgery content;
and an image generation module configured to input the target image, the position constraint condition text and the content constraint condition text into a pre-trained image sample generation model to obtain a bank card image sample; the image sample generation model being trained by the method of any one of claims 1 to 5.
13. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1 to 5.
14. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions which, when read and executed by the one or more processors, perform the steps of the method of any one of claims 1 to 5.
15. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 6 to 8.
16. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions which, when read and executed by the one or more processors, perform the steps of the method of any one of claims 6 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311206816.5A CN117422783A (en) | 2023-09-19 | 2023-09-19 | Training method of image sample generation model, image sample generation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311206816.5A CN117422783A (en) | 2023-09-19 | 2023-09-19 | Training method of image sample generation model, image sample generation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117422783A true CN117422783A (en) | 2024-01-19 |
Family
ID=89523721
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311206816.5A Pending CN117422783A (en) | 2023-09-19 | 2023-09-19 | Training method of image sample generation model, image sample generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117422783A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118429755A (en) * | 2024-06-28 | 2024-08-02 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for training draft graph model and image prediction method, device and equipment |
CN118711198A (en) * | 2024-08-30 | 2024-09-27 | 苏州元脑智能科技有限公司 | Information identification method and device |
CN118711198B (en) * | 2024-08-30 | 2024-11-15 | 苏州元脑智能科技有限公司 | Information identification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111814794B (en) | Text detection method and device, electronic equipment and storage medium | |
US20230022387A1 (en) | Method and apparatus for image segmentation model training and for image segmentation | |
CN113935365B (en) | Depth fake video identification method and system based on spatial domain and frequency domain dual characteristics | |
CN114549913B (en) | Semantic segmentation method and device, computer equipment and storage medium | |
CN112313645A (en) | Learning method and testing method for data embedded network for generating labeled data by synthesizing original data and labeled data, and learning apparatus and testing apparatus using the same | |
Ren et al. | Learning selection channels for image steganalysis in spatial domain | |
CN114612289A (en) | Stylized image generation method and device and image processing equipment | |
Bhuyan et al. | Development of secrete images in image transferring system | |
CN117422783A (en) | Training method of image sample generation model, image sample generation method and device | |
CN113723352B (en) | Text detection method, system, storage medium and electronic equipment | |
CN113469111A (en) | Image key point detection method and system, electronic device and storage medium | |
CN116704269B (en) | Data processing method, device, equipment and storage medium | |
CN116720214A (en) | Model training method and device for privacy protection | |
CN116152542A (en) | Training method, device, equipment and storage medium for image classification model | |
CN115565108A (en) | Video camouflage and salient object detection method based on decoupling self-supervision | |
Tyagi et al. | ForensicNet: Modern convolutional neural network‐based image forgery detection network | |
CN117593619B (en) | Image processing method, device, electronic equipment and storage medium | |
CN118097154B (en) | Lightweight crack segmentation method and device, terminal equipment and storage medium | |
CN118675215B (en) | Training method and device for face image generation model and computer equipment | |
CN117830314B (en) | Microscopic coded image reproduction detection method and device, mobile terminal and storage medium | |
CN111815631B (en) | Model generation method, device, equipment and readable storage medium | |
Nirmala Priya et al. | Squirrel Henry Gas Solubility Optimization driven Deep Maxout Network with multi‐texture feature descriptors for digital image forgery detection | |
CN118799697A (en) | Model training method, image processing method, device, equipment, product and medium | |
CN115690501A (en) | Depth counterfeit image detection method and system based on fine-grained features | |
CN115880608A (en) | Video processing, video analysis model training, video analysis method and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||