CN116129502B - Training method and device for human face fake video detection model and computing equipment
- Publication number: CN116129502B
- Application number: CN202310112068.8A
- Authority: CN (China)
- Prior art keywords: video, face, target person, detection model, training
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Abstract
The invention discloses a training method, a training device and a computing device for a face fake video detection model. The training method comprises the following steps: collecting an original face video and face fake videos of a target person; extracting features from the original face video and the face fake videos respectively to obtain corresponding multivariate time series; and training based on a dual-stream neural network, with the multivariate time series as training data, to obtain a face fake video detection model. The dual-stream neural network includes: a spatial-domain branch network for extracting the target person's speaking pattern features, a temporal-domain branch network for extracting video temporal correlation features, and a prediction layer for fusing the target person's speaking pattern features with the video temporal correlation features and making a prediction. Based on this training method, the detection model's requirement on the quality of the video to be detected is effectively reduced, robustness to repeatedly compressed videos is improved, and the generalization capability of the detection model is improved while accuracy is maintained.
Description
Technical Field
The present invention relates to the field of face fake video detection, and in particular to a training method, a training apparatus, a computing device, and a computer storage medium for a face fake video detection model.
Background
With the successive emergence of various face forgery algorithms and software in recent years, a large number of face fake videos and pictures that cannot be distinguished by the human eye have appeared on major network platforms. Some people use these techniques to replace a celebrity's face with another, or to directly manipulate the face or lips in a celebrity's video so that the person appears to say or do things against their will. The malicious spread of these fake videos poses a great threat to society as a whole. There is therefore an urgent need to mitigate the threat of face fake videos, and research on detecting face fake videos and pictures has received a great deal of attention.
To counter such face fake videos and pictures, researchers have recently proposed detection methods from various angles, which can be roughly classified into three types: methods based on single-frame artifact features, methods based on biometric features, and methods based on temporal features.
Methods based on single-frame artifact features mainly use artifacts left in the RGB (red, green, blue) domain or the frequency domain during the generation of a forged face as clues. However, because such artifact-mining methods depend too heavily on specific artifact information, they generally generalize poorly to forged face data not contained in the training data and cannot meet practical application requirements. Methods based on biometric features distinguish real faces from fake faces mainly by mining contradictions between the real face and certain specific biometric features of the face fake video; however, most biometric cues can be eliminated by targeted fine post-processing, which renders such detection methods ineffective. Methods based on temporal features detect fake faces by mining the temporal correlation information of face fake videos, but with the rapid development of face forgery generation technology the variants of fake faces are numerous, so these face forgery detectors also generalize insufficiently.
Disclosure of Invention
The present invention has been made in view of the above-mentioned problems, and has as its object to provide a training method, apparatus, computing device and computer storage medium for a face-counterfeit video detection model that overcomes or at least partially solves the above-mentioned problems.
According to a first aspect of the present invention, there is provided a training method of a face-counterfeit video detection model, the method comprising:
collecting an original face video and a face fake video of a target person;
extracting features from the original face video and the face fake video respectively to obtain corresponding multivariable time sequences; and
training based on a dual-stream neural network, with the multivariate time series as training data, to obtain a face fake video detection model; wherein the dual-stream neural network includes: a spatial-domain branch network for extracting the target person's speaking pattern features, a temporal-domain branch network for extracting video temporal correlation features, and a prediction layer for fusing the target person's speaking pattern features with the video temporal correlation features and making a prediction.
Optionally, extracting features from the original face video and the face fake video respectively to obtain the corresponding multivariate time series further includes:
tracking the facial action unit data, gaze estimation data and head motion data of each frame of the original face video and the face fake video respectively to obtain the corresponding multivariate time series.
Optionally, the spatial-domain branch network is used for: projecting the multivariate time series through a fully connected layer and performing nonlinear activation through a nonlinear activation function; and extracting the target person's speaking pattern features from the nonlinearly activated multivariate time series through a plurality of sequentially connected encoders; in each encoder, multi-head self-attention calculation is performed on the nonlinearly activated multivariate time series, and the target person's speaking pattern features are then obtained through a feed-forward network.
Optionally, the temporal-domain branch network is used for: projecting the multivariate time series through a fully connected layer and performing nonlinear activation through a nonlinear activation function; position-encoding the nonlinearly activated multivariate time series through a position encoder; and obtaining video temporal correlation features from the position-encoded multivariate time series through a plurality of sequentially connected encoders; in each encoder, masked multi-head attention calculation is performed on the position-encoded data, and the video temporal correlation features are then obtained through a feed-forward network.
Optionally, collecting the original face video and the face counterfeit video of the target person, further comprising:
collecting an original face video of a target person;
based on the original face video, generating a face fake video, wherein the method for generating the face fake video comprises face replacement, lip synchronization and/or expression action migration.
According to a second aspect of the present invention, there is provided a method of detecting a face falsified video, the method comprising:
acquiring a face video to be detected;
extracting features from the face video to be detected to obtain a corresponding multivariable time sequence;
inputting the multivariable time sequence into a face fake video detection model to obtain a detection result of the face video to be detected; the face fake video detection model is obtained according to the training method of the face fake video detection model provided in the first aspect.
According to a third aspect of the present invention, there is provided a training apparatus for a face-falsification video detection model, the training apparatus comprising:
the video collecting module is suitable for collecting the original face video and the face fake video of the target person;
the first feature extraction module is suitable for extracting features from the original face video and the face fake video respectively to obtain corresponding multivariable time sequences; and
The training module is adapted to train based on a dual-stream neural network, with the multivariate time series as training data, to obtain a face fake video detection model; wherein the dual-stream neural network includes: a spatial-domain branch network for extracting the target person's speaking pattern features, a temporal-domain branch network for extracting video temporal correlation features, and a prediction layer for fusing the target person's speaking pattern features with the video temporal correlation features and making a prediction.
Optionally, the first feature extraction module is further adapted to:
tracking the facial action unit data, gaze estimation data and head motion data of each frame of the original face video and the face fake video respectively to obtain the corresponding multivariate time series.
Optionally, the spatial-domain branch network is used for: projecting the multivariate time series through a fully connected layer and performing nonlinear activation through a nonlinear activation function; extracting the target person's speaking pattern features from the nonlinearly activated multivariate time series through a plurality of sequentially connected encoders; in each encoder, multi-head self-attention calculation is performed on the nonlinearly activated multivariate time series, and the target person's speaking pattern features are then obtained through a feed-forward network.
Optionally, the temporal-domain branch network is used for: projecting the multivariate time series through a fully connected layer and performing nonlinear activation through a nonlinear activation function; position-encoding the nonlinearly activated multivariate time series through a position encoder; obtaining video temporal correlation features from the position-encoded multivariate time series through a plurality of sequentially connected encoders; in each encoder, masked multi-head attention calculation is performed on the position-encoded data, and the video temporal correlation features are then obtained through a feed-forward network.
Optionally, the video collection module is further adapted to:
collecting an original face video of a target person;
based on the original face video, generating a face fake video, wherein the method for generating the face fake video comprises face replacement, lip synchronization and/or expression action migration.
According to a fourth aspect of the present invention, there is provided a detection apparatus for face-falsified video, the detection apparatus comprising:
the video acquisition module is suitable for acquiring face videos to be detected;
the second feature extraction module is suitable for extracting features from the face video to be detected to obtain a corresponding multivariable time sequence; and
The detection module is adapted to input the multivariate time series into the face fake video detection model to obtain a detection result of the face video to be detected; the face fake video detection model is obtained according to the training method of the face fake video detection model provided in the first aspect.
According to a fifth aspect of the present invention, there is provided a computing device comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface are communicated with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to execute an operation corresponding to the training method for the face counterfeit video detection model provided in the first aspect or an operation corresponding to the detection method for the face counterfeit video provided in the second aspect.
According to a sixth aspect of the present invention, there is provided a computer storage medium having stored therein at least one executable instruction for causing a processor to perform the operation corresponding to the training method for a face-counterfeit video detection model provided in the first aspect or the operation corresponding to the detection method for a face-counterfeit video provided in the second aspect.
According to the training method, apparatus, computing device and computer storage medium for a face fake video detection model of the present invention, an original face video and face fake videos of a target person are collected, features are extracted from them respectively to obtain corresponding multivariate time series, and training is performed based on a dual-stream neural network with the multivariate time series as training data to obtain a face fake video detection model, wherein the dual-stream neural network includes: a spatial-domain branch network for extracting the target person's speaking pattern features, a temporal-domain branch network for extracting video temporal correlation features, and a prediction layer for fusing the target person's speaking pattern features with the video temporal correlation features and making a prediction. Based on this training method, the detection model's requirement on the quality of the video to be detected can be effectively reduced, robustness to repeatedly compressed videos is improved, and the generalization capability of the detection model is improved while accuracy is maintained.
The foregoing is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the content of the specification, and in order that the above and other objects, features and advantages of the present invention may be more readily apparent, specific embodiments of the present invention are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a training method of a face counterfeit video detection model according to an embodiment of the invention;
FIG. 2 is a flow chart of a training method of a face counterfeit video detection model according to another embodiment of the present invention;
FIG. 3 shows a schematic diagram of a training method of a face-counterfeit video detection model according to a further embodiment of the invention;
FIG. 4 is a schematic diagram of extracting speech pattern features of a target person based on a spatial branch network according to an embodiment of the present invention;
FIG. 5 shows a schematic diagram of extracting video temporal correlation features based on a time domain branching network in accordance with an embodiment of the present invention;
fig. 6 is a flowchart of a method for detecting a face counterfeit video according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a training apparatus of a face-counterfeit video detection model according to an embodiment of the present invention;
Fig. 8 is a schematic diagram showing the structure of a face-counterfeit video detection device according to another embodiment of the present invention; and
FIG. 9 illustrates a schematic diagram of a computing device, according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flow chart of a training method of a face counterfeit video detection model according to an embodiment of the invention. As shown in fig. 1, the method comprises the steps of:
step S101, collecting an original face video and a face fake video of a target person.
A target person is determined according to the training requirements, and based on the determined target person, an original face video and a large number of face fake videos of the target person are collected from the network and preprocessed. For example, the SceneDetect and FFMPEG (Fast Forward MPEG) open-source libraries are used to preprocess the collected original face video and face fake videos of the target person respectively to obtain the video data of the original face video and the face fake videos, where the preprocessing includes video scene segmentation, video cropping, and the like; the specific preprocessing method is not limited. The number of target persons may be one or more.
Specifically, original face videos and a large number of face fake videos of a plurality of target persons (preferably public figures such as television stars, internet celebrities, and well-known scholars, without particular limitation) are collected from the network; the original face videos and face fake videos need to contain all facial features of the target person, with the face oriented toward the camera. The SceneDetect open-source tool library is used to perform scene segmentation on the collected original face videos and face fake videos of the target person so as to select original face video segments and face fake video segments whose shot length is not less than a specified number of frames; the FFMPEG tool library is then used to split the selected original face video and face fake video segments into equal lengths, for example by applying equal-length, equal-interval sliding segmentation to the video segments with a sliding-window method.
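As a concrete illustration, a minimal sliding-window segmentation sketch is given below (Python); the 300-frame window and 50-frame stride follow the example values given later in this description, and the function name is purely illustrative.

```python
def sliding_windows(n_frames, window=300, stride=50):
    """Yield [start, end) frame ranges for equal-length, equal-interval clips of one shot."""
    for start in range(0, n_frames - window + 1, stride):
        yield start, start + window

# A 1000-frame shot yields clips [0, 300), [50, 350), ..., [700, 1000).
clips = list(sliding_windows(1000))
```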
Step S102, extracting features from the original face video and the face fake video respectively to obtain corresponding multivariate time sequences.
Most face forgery detection methods in the prior art use artifacts left in the spatial domain or the frequency domain during the forgery generation process as clues, and mine these artifacts to distinguish real faces from fake faces. However, with the rapid development of face forgery generation technology and targeted refinement post-processing, these clues are gradually eliminated, causing such detection methods to fail. In this embodiment, the OpenFace open-source tool library is used to extract features from the original face video and the face fake videos of the target person to obtain the corresponding multivariate time series. This feature extraction allows the face fake video detection model to learn the specific facial and head motion patterns exhibited when the target person speaks, and to use them to distinguish real faces from fake faces, improving the model's resistance to refined (post-processed) samples. Specifically, the feature extraction includes tracking the facial action unit data, gaze estimation data and head motion data of each frame of the original face video and the face fake videos.
Step S103, training based on a dual-stream neural network, with the multivariate time series as training data, to obtain a face fake video detection model.
Specifically, the dual-stream neural network includes: a spatial-domain branch network for extracting the target person's speaking pattern features, a temporal-domain branch network for extracting video temporal correlation features, and a prediction layer for fusing the target person's speaking pattern features with the video temporal correlation features and making a prediction.
The spatial-domain branch network is used for: projecting the multivariate time series through a fully connected layer and performing nonlinear activation through a nonlinear activation function; and extracting the target person's speaking pattern features from the nonlinearly activated multivariate time series through a plurality of sequentially connected encoders. In each encoder of the spatial-domain branch network, multi-head self-attention calculation is performed on the nonlinearly activated multivariate time series, and the target person's speaking pattern features are then obtained through a feed-forward network. The temporal-domain branch network is used for: projecting the multivariate time series through a fully connected layer and performing nonlinear activation through a nonlinear activation function; position-encoding the nonlinearly activated multivariate time series through a position encoder; and obtaining video temporal correlation features from the position-encoded multivariate time series through a plurality of sequentially connected encoders. In each encoder of the temporal-domain branch network, masked multi-head attention calculation is performed on the position-encoded data, and the video temporal correlation features are then obtained through a feed-forward network. The features extracted according to this embodiment do not depend on artifact details, are not affected by repeated compression, and offer good robustness.
Fig. 2 shows a flowchart of a training method of a face-counterfeit video detection model according to another embodiment of the present invention. As shown in fig. 2, the method comprises the steps of:
step S201, collecting an original face video of a target person.
A target person is determined according to the training requirements, and based on the determined target person, an original face video of the target person is collected from the network and preprocessed. For example, the SceneDetect and FFMPEG open-source libraries are used to preprocess the collected original face video of the target person to obtain the video data of the original face video, where the preprocessing includes video scene segmentation, video cropping, and the like; the specific preprocessing method is not limited.
Step S202, generating a face fake video based on the original face video.
Whether a face fake video needs to be generated depends on the target person: if a large number of face fake videos of the target person already exist on the network, they can be downloaded directly; if not, face fake videos need to be generated based on the original face video. Based on the original face video, the facial region of the person in the original face video is cropped out. In this embodiment, the Dlib open-source tool library is used to perform face detection on the original face video to obtain the face region box coordinates [x, y, w, h] in each original face video frame, where x, y are the upper-left corner coordinates of the face region box and w, h are its width and height. In this embodiment, keeping the center of the face region box unchanged, the width and height of the face region box are each doubled to obtain new face region box coordinates [x-0.5w, y-0.5h, 2w, 2h], and the face region is cropped from the original face video frame according to the new coordinates. This ensures that the face region box contains the person's entire head, making the motion information extracted by OpenFace more accurate.
In particular, methods of generating face fake videos include face replacement, lip synchronization and/or expression and action migration. In this embodiment, the target person's face is swapped onto the face region of an impersonator to generate face-replacement-type face fake videos of the target person; the lips in the original face video of the target person are tampered with to generate lip-synchronization-type face fake videos of the target person; and a source video is used to drive the facial expressions and head movements of the target person's original video to generate expression-and-action-migration-type face fake videos of the target person. A preset number of face fake videos of each type is finally obtained for each target person. The methods of generating face fake videos in this embodiment are merely exemplary, and any face forgery generation method may be selected by those skilled in the art as needed.
Step S203, extracting features from the original face video and the face fake video respectively to obtain corresponding multivariate time sequences.
Step S204, training based on a dual-stream neural network, with the multivariate time series as training data, to obtain a face fake video detection model, wherein the dual-stream neural network includes: a spatial-domain branch network for extracting the target person's speaking pattern features, a temporal-domain branch network for extracting video temporal correlation features, and a prediction layer for fusing the target person's speaking pattern features with the video temporal correlation features and making a prediction. The features extracted according to this embodiment do not depend on artifact details, are not affected by repeated compression, and offer good robustness.
The specific execution of steps 203-204 is similar to that of steps 102-103, and will not be described again.
Fig. 3 shows a schematic diagram of a training method of a face fake video detection model according to an embodiment of the present invention. As shown in fig. 3, an original face video of a target person is first gathered from the network; the original face video contains all facial features of the target person, with the face oriented toward the camera. The SceneDetect open-source tool library is used to perform scene segmentation on the collected original face video, and original face video segments whose shot length is not less than 300 frames are selected. The FFMPEG tool library is used to split the selected segments into equal lengths, applying equal-length, equal-interval sliding segmentation with a sliding-window method; the window length and the sliding stride can be set to 300 frames and 50 frames respectively, yielding 10000 original face video clips of the target person, each 300 frames long. It should be noted that the shot length, window length, sliding stride and number of original face video clips can all be set flexibly according to the specific situation.
Based on the original face video, the face region of the person in the original face video is cropped out. In this embodiment, the Dlib open-source tool library is used to perform face detection on the original video to obtain the face region box coordinates [x, y, w, h] in each frame of the original face video, where x, y are the upper-left corner coordinates of the face region box and w, h are its width and height. Keeping the center of the region box unchanged, the width and height of the face region box in the original face video are each doubled to obtain new face region box coordinates [x-0.5w, y-0.5h, 2w, 2h], and the face region is cropped from the original face video frame according to the new coordinates. The target person's original video is processed frame by frame in this way to obtain the target person's original face region video, which is used to generate the corresponding face fake videos.
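For illustration only, a minimal sketch of this box-doubling step follows (Python); the helper name is hypothetical, and the original box is assumed to come from a face detector such as Dlib's frontal face detector.

```python
def expand_face_box(x, y, w, h):
    """Double the detected face box [x, y, w, h] around its centre: [x - 0.5w, y - 0.5h, 2w, 2h]."""
    return x - 0.5 * w, y - 0.5 * h, 2.0 * w, 2.0 * h

# Example: a 60x60 box at (100, 80) becomes a 120x120 box at (70.0, 50.0).
print(expand_face_box(100, 80, 60, 60))
```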
In this embodiment, the FaceSwap algorithm is used to swap the target person's face onto the face region of each impersonator to generate face-replacement-type face fake videos of the target person; a plurality of audio clips (e.g. 20) are collected from a network data set, and the Wav2Lip algorithm is used with the audio as the driving source to tamper with the lips of the target person's original face video, generating lip-synchronization-type face fake videos of the target person; the FOMM algorithm is used to drive the facial expressions and head movements of the target person's original face video, generating expression-and-action-migration-type face fake videos of the target person. Finally a preset number (e.g. 3000) of face fake videos of each type is obtained for the target person. The methods of generating face fake videos in this embodiment are merely exemplary, and any face forgery generation method may be selected by those skilled in the art as needed.
To address the failure of existing face forgery detection methods under refinement post-processing, this embodiment uses the OpenFace open-source tool library to extract facial action unit, gaze estimation, head motion and similar data from the original face video and the face fake videos of the target person. These data allow the face fake video detection model to learn the specific facial and head motion patterns exhibited when the target person speaks and to use them to distinguish real faces from fake faces, improving the model's resistance to refined samples. Specifically, multiple variables are tracked per frame of the original face video and the face fake videos of the target person, including head pose, gaze estimation, and facial Action Units (AU): 17 AU intensities, 6 three-dimensional head pose estimates and 8 three-dimensional gaze estimates, 31 variables in total, which are saved to a CSV (Comma Separated Values) file. The specific data variables extracted by the OpenFace open-source tool library are shown in Table 1:
TABLE 1
- Facial action units (AU intensities): 17 variables
- Head three-dimensional pose estimation: 6 variables
- Gaze (line-of-sight) three-dimensional estimation: 8 variables
- Total: 31 variables
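As a non-authoritative sketch, the snippet below shows one way such a per-frame CSV could be loaded into the 31-variable series used later; the column-prefix selection assumes OpenFace's standard output naming (AU*_r intensities, pose_*, gaze_*), and the function name is illustrative.

```python
import pandas as pd

def load_multivariate_series(csv_path, n_frames=300):
    """Read an OpenFace per-frame CSV and keep the 31 tracked variables:
    17 AU intensities, 6 head-pose values and 8 gaze values."""
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]   # OpenFace column names may carry padding spaces
    cols = [c for c in df.columns
            if (c.startswith("AU") and c.endswith("_r"))
            or c.startswith("pose_") or c.startswith("gaze_")]
    series = df[cols].to_numpy()[:n_frames]        # shape (300, 31): one row per frame
    return series.T                                # shape (31, 300): one row per variable
```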
Multiple rounds of compression can erase some of the details and artifacts of a face fake video, so face forgery detectors trained on high-quality face fake videos do not perform well on repeatedly compressed ones; moreover, as face forgery generation technology keeps improving and forgery methods keep changing, existing face forgery detection methods suffer from insufficient generalization capability. To enhance the robustness of the proposed method to repeatedly compressed videos and its generalization to multiple forgery variants, this embodiment designs a dual-stream neural network (also referred to as a dual-stream Transformer network). The dual-stream neural network comprises a spatial-domain branch network (spatial-domain Transformer network) and a temporal-domain branch network (temporal-domain Transformer network): the spatial-domain branch network extracts the target person's speaking pattern features from the original face video and the face fake videos of the target person, and the temporal-domain branch network extracts video temporal correlation features from them. The features extracted according to this embodiment do not depend on artifact details, are not affected by repeated compression, and offer good robustness.
Fig. 4 is a schematic diagram of extracting the target person's speaking pattern features based on the spatial-domain branch network according to an embodiment of the present invention. As shown in fig. 4, compared with a standard Transformer, the linear embedding is replaced by a fully connected layer with nonlinear activation, and the position encoder is removed, because the spatial-domain branch network computes correlations among variables and does not need the positional relationship between variables. Accordingly, the spatial-domain branch network of this embodiment does not include the decoder part of a standard Transformer and contains only the encoder part, used for extracting variable-correlation features. In this embodiment the number of encoders is set to 8; the specific number of encoders can be set as needed and is not limited here. Since the position encoder is removed, the network of this embodiment has a simpler structure and saves resources. The spatial-domain branch network of this embodiment thus consists of a nonlinear embedding layer and 8 identical, sequentially connected encoders. The specific calculation steps are as follows:
In the spatial-domain branch network, the input is a multivariate time series of dimension 31×300. The embedding layer treats each variable in the input data as a whole and processes the 31 variable sequences in parallel through a fully connected layer projection followed by ReLU (Rectified Linear Unit) activation; the embedding layer input size is 300 and its output size is a uniform 512. The formula is:

X' = ReLU(XW + b)   (1)

Based on equation (1), the input X has dimension 31×300, and the nonlinear mapping of the embedding layer yields X' with dimension 31×512.
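A minimal PyTorch-style sketch of this nonlinear embedding layer follows, assuming the 31×300 input and 512-dimensional output described above; the class name and batch handling are illustrative.

```python
import torch
import torch.nn as nn

class NonlinearEmbedding(nn.Module):
    """Eq. (1): project each 300-frame variable sequence to 512 dims and apply ReLU."""
    def __init__(self, in_dim=300, d_model=512):
        super().__init__()
        self.fc = nn.Linear(in_dim, d_model)   # W, b of Eq. (1)

    def forward(self, x):                      # x: (batch, 31, 300)
        return torch.relu(self.fc(x))          # X': (batch, 31, 512)
```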
After the nonlinear projection of the embedding layer, the input data are fed into the following encoders to extract variable-correlation information. The 8 encoders are connected in sequence, the output of one encoder serving as the input of the next:

Output_i = Encoder_i(Output_{i-1})   (2)

where Output_i and Output_{i-1} denote the outputs of the i-th and (i-1)-th encoders respectively, each encoder has an output size of 512, and Encoder_i denotes the operation of the i-th encoder.
In this embodiment, each encoder includes two parts: the first part is a multi-head self-attention mechanism (Multi-Head Attention), and the second part is a feed-forward network composed of fully connected layers. Each part uses a residual connection followed by a normalization layer, and the input and output sizes of each part are kept at 512, consistent with the output of the embedding layer.
The specific calculation steps of a single encoder are as follows:
First, the input feature F undergoes multi-head self-attention calculation. The feature F is linearly projected into three different hidden spaces using three different parameter matrices W, yielding the three quantities Q, K and V:

Q = F·W_Q   (3)
K = F·W_K   (4)
V = F·W_V   (5)

Unlike single-head self-attention, the multi-head self-attention mechanism performs the attention function in parallel on separate projections of Q, K and V, producing a self-attention value head_i for each attention head:

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)   (6)

where the Attention calculation is the scaled dot-product attention:

Attention(Q, K, V) = softmax(QK^T / √d_k)·V   (7)

with d_k the dimension of the keys. After the self-attention value of each head is obtained, the outputs of all heads are concatenated and linearly projected again to obtain the final value of the multi-head self-attention calculation:

MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O   (8)
The final value obtained from the multi-head self-attention calculation of feature F is connected to F by a residual connection and passed through a normalization layer (add and normalize), giving a new feature F':

F' = Normalization(F ⊕ MultiHead(Q, K, V))   (9)

F' then passes through the second part, the feed-forward network: after further mapping and activation through two fully connected layers, another residual connection and normalization (add and normalize) are applied, giving the output F'' of the spatial-domain branch. The calculation of the second part is:

F'' = Normalization(F' ⊕ FFN(F')),  where FFN(F') = ReLU(F'·W_1 + b_1)·W_2 + b_2   (10)
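A hedged PyTorch sketch of one such encoder block follows; the number of attention heads, the feed-forward width and the dropout rate are not specified in the description and are assumed here, and the class name is illustrative.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder: multi-head (self-)attention + feed-forward, each with residual + LayerNorm (Eqs. 3-10)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, f, attn_mask=None):
        # attn_mask stays None in the spatial-domain branch; the temporal-domain branch passes a causal mask.
        a, _ = self.attn(f, f, f, attn_mask=attn_mask)   # Q = K = V = F (Eqs. 3-8)
        f = self.norm1(f + a)                            # F' = Normalization(F + MultiHead)   (Eq. 9)
        return self.norm2(f + self.ffn(f))               # F'' = Normalization(F' + FFN(F'))   (Eq. 10)
```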
Fig. 5 shows a schematic diagram of extracting video temporal correlation features based on the temporal-domain branch network according to an embodiment of the present invention. As shown in fig. 5, the temporal-domain branch network also uses a fully connected layer followed by nonlinear activation as its embedding layer; the activation function is the ReLU function, and the embedding layer parameters are shared with those of the spatial-domain branch network. The embedding layer of the temporal-domain branch network treats the variable data of each frame as a whole and projects it nonlinearly: the embedding layer input size is 31 and the output size is 512, so an input X of dimension 300×31 is nonlinearly mapped to X' of dimension 300×512. As shown in fig. 5, the temporal-domain branch network of this embodiment further includes a position encoding module; this embodiment uses sine and cosine functions of different frequencies to position-encode the embeddings of different frames:

PE(i, 2k) = sin(i / 10000^(2k/d_model))   (11)
PE(i, 2k+1) = cos(i / 10000^(2k/d_model))   (12)

where i is the frame index, k indexes the embedding dimensions, and d_model = 512.
In this embodiment, a masked multi-head attention mechanism is used so that, when computing attention values over the data of the first i frames, the influence of subsequent frames is masked out. The masked multi-head attention mechanism works together with the position encoder to ensure that the prediction at a given position can depend only on the known outputs of the frames before that position, which effectively improves detection accuracy.
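A minimal sketch of these two temporal-branch ingredients follows, assuming the standard sinusoidal form of Eqs. (11)-(12) and a boolean mask convention in which True marks a disallowed position (as used by PyTorch attention); the function names are illustrative.

```python
import math
import torch

def positional_encoding(n_frames=300, d_model=512):
    """Sinusoidal position codes over the frame index i (Eqs. 11-12)."""
    pe = torch.zeros(n_frames, d_model)
    pos = torch.arange(n_frames, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                    # (300, 512), added to the frame embeddings

def causal_mask(n_frames=300):
    """Masked multi-head attention: frame i may only attend to frames <= i."""
    return torch.triu(torch.ones(n_frames, n_frames), diagonal=1).bool()
```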
After the spatial-domain branch network and the temporal-domain branch network have extracted the target person's speaking pattern features and the video temporal correlation features respectively, the two features are fused, and the fused feature size remains 512, consistent with both branch features. The fused feature is then fed into a subsequent fully connected neural network (Multi-Layer Perceptron, MLP) layer and a fully connected (FC) layer, further compressed into a 2-dimensional feature denoted f, and finally classified with a normalization function (Softmax) to obtain the final result. A deep network for face forgery detection of the target person is thus constructed and trained on the pre-constructed training set. The loss function used in training is the cross-entropy loss:

L = -[y·log(p) + (1-y)·log(1-p)]   (13)

where y is the ground-truth label of the sample (real or fake) and p is the predicted probability produced by the Softmax output.
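An illustrative PyTorch sketch of the fusion and prediction stage is given below; element-wise addition is assumed as the fusion that keeps the 512-dimensional size, and the MLP hidden width and the pooling of each branch's encoder output into a single 512-d vector are assumptions not fixed by the description.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Fuses the two 512-d branch features and maps them to the 2-way real/fake score f."""
    def __init__(self, d_model=512, d_hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU())   # MLP layer
        self.fc = nn.Linear(d_hidden, 2)                                    # FC layer -> 2-d feature f

    def forward(self, spatial_feat, temporal_feat):
        # e.g. mean-pool each branch's encoder output (31x512 / 300x512) to a 512-d vector beforehand
        fused = spatial_feat + temporal_feat        # element-wise fusion keeps the 512-d size
        return self.fc(self.mlp(fused))             # logits f; Softmax / cross-entropy applied outside
```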
The specific training steps are as follows: collect the original face video and face fake video data of the target person as required; use the OpenFace open-source tool library to extract the facial action unit, gaze estimation and head motion data of the original face video and the face fake videos of the target person and save them to CSV files; and train the face fake video detection model on the extracted data.
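For completeness, a hedged sketch of one training epoch under these steps is shown below; the loader is assumed to yield (31×300 series, label) pairs, the composite model and the optimizer are taken as given, and nn.CrossEntropyLoss combines Softmax with the cross-entropy of Eq. (13).

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, device="cuda"):
    """One supervised epoch over multivariate-series clips labelled 0 (real) / 1 (fake)."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for series, labels in loader:                       # series: (batch, 31, 300)
        series, labels = series.to(device), labels.to(device)
        logits = model(series)                          # dual-stream network output f: (batch, 2)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```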
According to the training method of a face fake video detection model of this embodiment, a dual-stream neural network comprising a spatial-domain branch network and a temporal-domain branch network is designed. Learning the target person's speaking pattern features through the spatial-domain branch network effectively reduces the detection model's requirement on the quality of the video to be detected and improves robustness to repeatedly compressed videos; mining the video's temporal correlation features through the temporal-domain branch network, without attending to specific artifact details, strengthens the detection model's generalization to different face forgery methods. In addition, because it does not rely on specific artifact information, the method of this embodiment is suitable for detecting all types of face fake videos and maintains good generalization on face fake video data outside the data set.
Fig. 6 shows a flow chart of a method for detecting a face counterfeit video according to an embodiment of the present invention. As shown in fig. 6, the method comprises the steps of:
step S301, acquiring a face video to be detected.
A target person is determined according to the detection requirements; based on the determined target person, the face video of the target person is collected from the network and preprocessed.
Step S302, extracting features from the face video to be detected to obtain a corresponding multivariate time sequence.
In this embodiment, the OpenFace open source tool library is used to extract features from the face video of the target person to obtain a corresponding multivariate time sequence, and specifically, the extracted features include face motion unit data, line-of-sight estimation data, and head motion data of each frame of the face video of the target person.
Step S303, inputting the multivariable time sequence into a face fake video detection model to obtain a detection result of the face video to be detected; the face fake video detection model is obtained according to the training method of the face fake video detection model.
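A short illustrative inference wrapper follows, assuming the trained dual-stream model from the previous sections and a 31×300 series extracted as in step S302; the function name is hypothetical.

```python
import torch

@torch.no_grad()
def detect(model, series, device="cuda"):
    """Return the predicted probability that a 31x300 multivariate series comes from a fake video."""
    model.eval()
    x = torch.as_tensor(series, dtype=torch.float32, device=device).unsqueeze(0)  # (1, 31, 300)
    probs = torch.softmax(model(x), dim=-1)
    return probs[0, 1].item()                                                     # P(fake)
```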
The face fake video detection method does not rely on specific artifact information, is suitable for detecting all types of face fake videos, and maintains good generalization on face fake video data outside the data set.
Fig. 7 shows a schematic structural diagram of a training device of a face-counterfeit video detection model according to an embodiment of the present invention. As shown in fig. 7, the training device includes: a video collection module 410, a first feature extraction module 420, and a training module 430.
A video collection module 410 adapted to collect an original face video and a face counterfeit video of a target person;
the first feature extraction module 420 is adapted to extract features from the original face video and the face fake video, respectively, to obtain a corresponding multivariate time sequence; and
the training module 430 is adapted to train based on a dual-stream neural network, with the multivariate time series as training data, to obtain a face fake video detection model; wherein the dual-stream neural network includes: a spatial-domain branch network for extracting the target person's speaking pattern features, a temporal-domain branch network for extracting video temporal correlation features, and a prediction layer for fusing the target person's speaking pattern features with the video temporal correlation features and making a prediction.
Optionally, the first feature extraction module 420 is further adapted to:
tracking the facial action unit data, gaze estimation data and head motion data of each frame of the original face video and the face fake video respectively to obtain the corresponding multivariate time series.
Optionally, the spatial-domain branch network is used for: projecting the multivariate time series through a fully connected layer and performing nonlinear activation through a nonlinear activation function; extracting the target person's speaking pattern features from the nonlinearly activated multivariate time series through a plurality of sequentially connected encoders; in each encoder, multi-head self-attention calculation is performed on the nonlinearly activated multivariate time series, and the target person's speaking pattern features are then obtained through a feed-forward network.
Optionally, the temporal-domain branch network is used for: projecting the multivariate time series through a fully connected layer and performing nonlinear activation through a nonlinear activation function; position-encoding the nonlinearly activated multivariate time series through a position encoder; obtaining video temporal correlation features from the position-encoded multivariate time series through a plurality of sequentially connected encoders; in each encoder, masked multi-head attention calculation is performed on the position-encoded data, and the video temporal correlation features are then obtained through a feed-forward network.
Optionally, the video collection module 410 is further adapted to:
Collecting an original face video of a target person;
based on the original face video, generating a face fake video, wherein the method for generating the face fake video comprises face replacement, lip synchronization and/or expression action migration.
According to the training apparatus of a face fake video detection model of this embodiment, a dual-stream neural network comprising a spatial-domain branch network and a temporal-domain branch network is designed. Learning the target person's speaking pattern features through the spatial-domain branch network effectively reduces the detection model's requirement on the quality of the video to be detected and improves robustness to repeatedly compressed videos; mining the video's temporal correlation features through the temporal-domain branch network, without attending to specific artifact details, strengthens the detection model's generalization to different face forgery methods. In addition, because it does not rely on specific artifact information, the training apparatus of this embodiment is suitable for detecting all types of face fake videos and maintains good generalization on face fake video data outside the data set.
Fig. 8 shows a schematic structural diagram of a face fake video detection device according to an embodiment of the present invention. As shown in fig. 8, the detection device includes: a video acquisition module 510, a second feature extraction module 520, and a detection module 530.
The video acquisition module 510 is adapted to acquire a face video to be detected;
the second feature extraction module 520 is adapted to extract features from the face video to be detected, so as to obtain a corresponding multivariate time sequence; and
the detection module 530 is adapted to input the multivariate time series into the face fake video detection model to obtain a detection result of the face video to be detected; the face fake video detection model is obtained according to the training method of the face fake video detection model provided in the first aspect.
The face fake video detection device does not rely on specific artifact information, is suitable for detecting all types of face fake videos, and maintains good generalization on face fake video data outside the data set.
The embodiment of the invention provides a nonvolatile computer storage medium, which stores at least one executable instruction, and the computer executable instruction can execute the training method of the human face fake video detection model or the detection method of the human face fake video in any method embodiment.
FIG. 9 illustrates a schematic diagram of a computing device, according to an embodiment of the invention, the particular embodiment of the invention not being limited to a particular implementation of the computing device.
As shown in fig. 9, the computing device may include: a processor 602, a communication interface (Communications Interface) 604, a memory 606, and a communication bus 608.
Wherein: processor 602, communication interface 604, and memory 606 perform communication with each other via communication bus 608. Communication interface 604 is used to communicate with network elements of other devices, such as clients or other servers. The processor 602 is configured to execute the program 610, and may specifically perform relevant steps in the method embodiments described above.
In particular, program 610 may include program code including computer-operating instructions.
The processor 602 may be a central processing unit CPU or a specific integrated circuit ASIC (Application Specific Integrated Circuit) or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included by the computing device may be the same type of processor, such as one or more CPUs; but may also be different types of processors such as one or more CPUs and one or more ASICs.
A memory 606 for storing a program 610. The memory 606 may comprise high-speed RAM memory or may further comprise non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 610 may be specifically configured to cause the processor 602 to perform the training method of the face fake video detection model or the detection method of the face fake video in any of the method embodiments described above. For the specific implementation of each step in the program 610, reference may be made to the corresponding steps and the corresponding descriptions of the units in the above embodiments, which are not repeated here. It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the apparatus and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, which are not repeated here.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, embodiments of the present application are not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the embodiments of the present application as described herein, and the above description of specific languages is provided for disclosure of enablement and best mode of the embodiments of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the application, various features of the embodiments are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the apparatus of an embodiment may be adaptively changed and arranged in one or more apparatuses different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. All features disclosed in this specification (including any accompanying claims, abstract, and drawings), and all processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the embodiments of the present application and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components according to embodiments of the present application may in practice be implemented using a microprocessor or a digital signal processor (DSP). Embodiments of the present application may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the embodiments of the present application may be stored on a computer-readable medium, or may take the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the embodiments of the application, and those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The embodiments of the application may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order; these words may be interpreted as names.
Claims (8)
1. A training method for a face counterfeit video detection model, the method comprising:
collecting an original face video of a target person, and generating a face counterfeit video of the target person based on the original face video of the target person;
respectively tracking facial action unit data, gaze estimation data, and head motion data of each frame of the original face video and the face counterfeit video to obtain corresponding multivariate time series; and
training a dual-stream neural network with the multivariate time series as training data to obtain a face counterfeit video detection model; wherein the dual-stream neural network comprises: a spatial-domain branch network for extracting target person speaking pattern features, a time-domain branch network for extracting video temporal correlation features, and a prediction layer for fusing the target person speaking pattern features and the video temporal correlation features and making a prediction; the spatial-domain branch network is configured to: project the multivariate time series through a fully connected layer and perform nonlinear activation through a nonlinear activation function, and extract the target person speaking pattern features from the nonlinearly activated multivariate time series through a plurality of sequentially connected encoders; the time-domain branch network is configured to: project the multivariate time series through a fully connected layer and perform nonlinear activation through a nonlinear activation function, position-encode the nonlinearly activated multivariate time series through a position encoder, and obtain the video temporal correlation features from the position-encoded multivariate time series through a plurality of sequentially connected encoders, wherein in each encoder a masked multi-head attention calculation is performed on the position-encoded data and the video temporal correlation features are then obtained through a feed-forward network. (An illustrative sketch of the feature tracking and of this dual-stream architecture is given after the claims.)
2. The training method of claim 1, wherein in each encoder of the spatial-domain branch network, a multi-head self-attention calculation is performed on the nonlinearly activated multivariate time series, and the target person speaking pattern features are then obtained through a feed-forward network.
3. The training method of claim 1, wherein the face counterfeit video of the target person is generated by face replacement, lip synchronization, and/or expression and action migration.
4. A method for detecting a face counterfeit video, the method comprising:
acquiring a face video to be detected;
extracting features from the face video to be detected to obtain corresponding multivariate time series;
inputting the multivariate time series into a face counterfeit video detection model to obtain a detection result for the face video to be detected; wherein the face counterfeit video detection model is obtained according to the training method of the face counterfeit video detection model of any one of claims 1 to 3.
5. A training device for a face counterfeit video detection model, the training device comprising:
a video collection module adapted to collect an original face video of a target person and to generate a face counterfeit video of the target person based on the original face video of the target person;
a first feature extraction module adapted to respectively track facial action unit data, gaze estimation data, and head motion data of each frame of the original face video and the face counterfeit video to obtain corresponding multivariate time series; and
a training module adapted to train a dual-stream neural network with the multivariate time series as training data to obtain a face counterfeit video detection model; wherein the dual-stream neural network comprises: a spatial-domain branch network for extracting target person speaking pattern features, a time-domain branch network for extracting video temporal correlation features, and a prediction layer for fusing the target person speaking pattern features and the video temporal correlation features and making a prediction; the spatial-domain branch network is configured to: project the multivariate time series through a fully connected layer and perform nonlinear activation through a nonlinear activation function, and extract the target person speaking pattern features from the nonlinearly activated multivariate time series through a plurality of sequentially connected encoders; the time-domain branch network is configured to: project the multivariate time series through a fully connected layer and perform nonlinear activation through a nonlinear activation function, position-encode the nonlinearly activated multivariate time series through a position encoder, and obtain the video temporal correlation features from the position-encoded multivariate time series through a plurality of sequentially connected encoders, wherein in each encoder a masked multi-head attention calculation is performed on the position-encoded data and the video temporal correlation features are then obtained through a feed-forward network.
6. A detection apparatus for face counterfeit video, the detection apparatus comprising:
a video acquisition module adapted to acquire a face video to be detected;
a second feature extraction module adapted to extract features from the face video to be detected to obtain corresponding multivariate time series; and
a detection module adapted to input the multivariate time series into a face counterfeit video detection model to obtain a detection result for the face video to be detected; wherein the face counterfeit video detection model is obtained according to the training method of the face counterfeit video detection model of any one of claims 1 to 3.
7. A computing device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform an operation corresponding to a training method of a face counterfeit video detection model according to any one of claims 1 to 3 or an operation corresponding to a detection method of a face counterfeit video according to claim 4.
8. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the training method of a face counterfeit video detection model according to any one of claims 1 to 3 or operations corresponding to the detection method of a face counterfeit video according to claim 4.
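As a concrete reading of the per-frame feature tracking recited in claim 1, the following minimal sketch stacks facial action unit, gaze estimation, and head motion vectors frame by frame into a (T, D) multivariate time series. The three placeholder trackers and the feature dimensions (17 action unit intensities, a 2-D gaze vector, a 6-D head pose) are assumptions for illustration; the claims do not prescribe particular trackers or dimensions.

```python
# Illustrative sketch of turning a face video into a multivariate time series.
# The extract_* functions are placeholders returning zeros so the sketch runs;
# a real implementation would plug in actual AU, gaze, and head-pose trackers.
import cv2
import numpy as np


def extract_action_units(frame: np.ndarray) -> np.ndarray:
    return np.zeros(17, dtype=np.float32)   # e.g. 17 facial action unit intensities (assumed)


def extract_gaze(frame: np.ndarray) -> np.ndarray:
    return np.zeros(2, dtype=np.float32)    # e.g. yaw/pitch gaze angles (assumed)


def extract_head_pose(frame: np.ndarray) -> np.ndarray:
    return np.zeros(6, dtype=np.float32)    # e.g. 3-D rotation + 3-D translation (assumed)


def extract_multivariate_series(video_path: str) -> np.ndarray:
    """Track per-frame features and stack them into a (T, D) array, one row per frame."""
    cap = cv2.VideoCapture(video_path)
    rows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rows.append(np.concatenate([extract_action_units(frame),
                                    extract_gaze(frame),
                                    extract_head_pose(frame)]))
    cap.release()
    return np.stack(rows)                    # D = 17 + 2 + 6 = 25 variables per frame
```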
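Similarly, the dual-stream network recited in claims 1 and 5 can be read as two transformer-style branches over the same multivariate time series: a spatial-domain branch whose encoders use multi-head self-attention to capture the target person's speaking pattern, and a time-domain branch that adds position encoding and uses masked multi-head attention before a feed-forward network, with a prediction layer fusing the two feature vectors. The PyTorch sketch below is one possible reading under stated assumptions: GELU as the nonlinear activation, standard `nn.TransformerEncoderLayer` blocks as the sequentially connected encoders, mean pooling over time, and a single-logit prediction layer; none of these specifics are fixed by the claims.

```python
# Minimal PyTorch sketch of the dual-stream architecture; dimensions, activation,
# pooling, and the prediction head are illustrative assumptions.
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Standard sinusoidal position encoding added to a (batch, T, d_model) input."""
    def __init__(self, d_model: int, max_len: int = 1024):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pe[: x.size(1)]


class DualStreamDetector(nn.Module):
    """Spatial-domain branch (multi-head self-attention) + time-domain branch
    (position encoding + masked multi-head attention) + fusing prediction layer."""
    def __init__(self, n_vars: int = 25, d_model: int = 128, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        # Spatial-domain branch: fully connected projection + nonlinear activation + encoders.
        self.spatial_proj = nn.Sequential(nn.Linear(n_vars, d_model), nn.GELU())
        self.spatial_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True),
            n_layers)
        # Time-domain branch: projection + activation + position encoding + masked encoders.
        self.temporal_proj = nn.Sequential(nn.Linear(n_vars, d_model), nn.GELU())
        self.pos_enc = SinusoidalPositionalEncoding(d_model)
        self.temporal_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True),
            n_layers)
        # Prediction layer fusing speaking-pattern and temporal-correlation features.
        self.head = nn.Linear(2 * d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, n_vars) multivariate time series.
        speaking = self.spatial_encoder(self.spatial_proj(x)).mean(dim=1)
        t = self.pos_enc(self.temporal_proj(x))
        # Additive causal mask (-inf above the diagonal) makes the attention "masked".
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf"), device=x.device), diagonal=1)
        temporal = self.temporal_encoder(t, mask=causal).mean(dim=1)
        return self.head(torch.cat([speaking, temporal], dim=-1))  # (batch, 1) logit
```

With labels of 0 for original videos of the target person and 1 for the generated counterfeit videos, such a sketch could be trained with a standard binary cross-entropy objective (for example `torch.nn.BCEWithLogitsLoss`); the claims do not specify a particular loss function.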
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310112068.8A CN116129502B (en) | 2023-02-06 | 2023-02-06 | Training method and device for human face fake video detection model and computing equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310112068.8A CN116129502B (en) | 2023-02-06 | 2023-02-06 | Training method and device for human face fake video detection model and computing equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116129502A CN116129502A (en) | 2023-05-16 |
CN116129502B true CN116129502B (en) | 2024-03-01 |
Family
ID=86297162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310112068.8A Active CN116129502B (en) | 2023-02-06 | 2023-02-06 | Training method and device for human face fake video detection model and computing equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116129502B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112733625A (en) * | 2020-12-28 | 2021-04-30 | 华南理工大学 | False face video tampering detection method and system based on time domain self-attention mechanism |
CN112785671A (en) * | 2021-01-07 | 2021-05-11 | 中国科学技术大学 | False face animation synthesis method |
CN112951257A (en) * | 2020-09-24 | 2021-06-11 | 上海译会信息科技有限公司 | Audio image acquisition equipment and speaker positioning and voice separation method |
CN115273169A (en) * | 2022-05-23 | 2022-11-01 | 西安电子科技大学 | Face counterfeiting detection system and method based on time-space-frequency domain clue enhancement |
CN115272660A (en) * | 2022-07-29 | 2022-11-01 | 中国人民解放军国防科技大学 | Lip language identification method and system based on double-flow neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11687778B2 (en) * | 2020-01-06 | 2023-06-27 | The Research Foundation For The State University Of New York | Fakecatcher: detection of synthetic portrait videos using biological signals |
US20220156944A1 (en) * | 2020-11-13 | 2022-05-19 | Samsung Electronics Co., Ltd. | Apparatus and method with video processing |
- 2023-02-06: application CN202310112068.8A filed (CN); granted as patent CN116129502B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN116129502A (en) | 2023-05-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111709408B (en) | Image authenticity detection method and device | |
CN112016500B (en) | Group abnormal behavior identification method and system based on multi-scale time information fusion | |
CN109558832B (en) | Human body posture detection method, device, equipment and storage medium | |
CN109919977B (en) | Video motion person tracking and identity recognition method based on time characteristics | |
CN111753782B (en) | False face detection method and device based on double-current network and electronic equipment | |
CN113128271A (en) | Counterfeit detection of face images | |
CN111611873A (en) | Face replacement detection method and device, electronic equipment and computer storage medium | |
CN115115540B (en) | Unsupervised low-light image enhancement method and device based on illumination information guidance | |
CN109831648A (en) | Antitheft long-distance monitoring method, device, equipment and storage medium | |
CN117975577A (en) | Deep forgery detection method and system based on facial dynamic integration | |
CN116416678A (en) | Method for realizing motion capture and intelligent judgment by using artificial intelligence technology | |
CN116129502B (en) | Training method and device for human face fake video detection model and computing equipment | |
CN116824641B (en) | Gesture classification method, device, equipment and computer storage medium | |
CN116468638A (en) | Face image restoration method and system based on generation and balance countermeasure identification | |
Sun et al. | Transformer with spatio-temporal representation for video anomaly detection | |
CN116883900A (en) | Video authenticity identification method and system based on multidimensional biological characteristics | |
CN115731621A (en) | Knowledge distillation-based deep synthetic image video counterfeiting detection method and system | |
CN115731620A (en) | Method for detecting counter attack and method for training counter attack detection model | |
Kawai et al. | VAE/WGAN-based image representation learning for pose-preserving seamless identity replacement in facial images | |
CN114120198A (en) | Method, system and storage medium for detecting forged video | |
Mamtora et al. | Video Manipulation Detection and Localization Using Deep Learning | |
CN117152668B (en) | Intelligent logistics implementation method, device and equipment based on Internet of things | |
Akhtar et al. | DEEP-STA: Deep Learning-Based Detection and Localization of Various Types of Inter-Frame Video Tampering Using Spatiotemporal Analysis | |
CN116740795B (en) | Expression recognition method, model and model training method based on attention mechanism | |
CN118097799A (en) | Face data authenticity detection method and device, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||