CN116758604A - Deepfake detection method based on facial geometric relationship reasoning - Google Patents
Deepfake detection method based on facial geometric relationship reasoning
- Publication number: CN116758604A (application CN202310418813.1A)
- Authority: CN (China)
- Prior art keywords: face, graph, feature, geometric relationship
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V 40/161 — Human faces: detection; localisation; normalisation
- G06V 40/168 — Human faces: feature extraction; face representation
- G06V 10/82 — Image or video recognition using pattern recognition or machine learning with neural networks
- G06V 20/95 — Pattern authentication; markers therefor; forgery detection
- G06N 3/042 — Knowledge-based neural networks; logical representations of neural networks
- G06N 3/045 — Combinations of networks
- G06N 3/0464 — Convolutional networks [CNN, ConvNet]
- G06N 3/0895 — Weakly supervised learning, e.g. semi-supervised or self-supervised learning
- Y02T 10/40 — Engine management systems
Abstract
The invention discloses a deepfake detection method based on facial geometric relationship reasoning, relating to the field of passive video forensics and aimed at improving the generalization capability of deepfake detection models. First, facial key points are detected with a facial landmark detector, and an explicit face geometric relationship graph is constructed according to the internal structure of the facial features; a global feature extractor built on a Transformer produces a global feature map of the face image; on top of this global feature map, a self-supervised learning mechanism locates high-information-content regions, over which an implicit face geometric relationship graph is constructed; finally, an intra-face geometric relationship reasoning module built on graph convolutional neural networks combines the features of the explicit and implicit geometric relationship graphs and performs forgery detection on the video frame under test. The invention effectively improves the accuracy of face deepfake video detection, generalizes well across different data domains, and has practical value.
Description
Technical Field
The invention relates to the technical field of video forensics, and in particular to a deepfake detection method based on face geometric relationship reasoning.
Background
AI-generated content spreads widely through social media, and the popularization of content-generation technology has made face video manipulation increasingly accessible. Deepfakes, the earliest widely circulated face video tampering technology, have developed to the point where extremely realistic face videos can be synthesized that people struggle to distinguish by eye. If such deepfake videos are abused, they pose serious risks to privacy, politics, and national security. Deepfake video detection has therefore become an important research problem in multimedia forensics within the information security field. Because the feature differences between real and forged faces are small, existing deep learning models have difficulty identifying tamper traces. To improve detection effectiveness, existing deep-learning-based deepfake detection methods use various feature attention mechanisms to sharpen a network's discrimination of fine differences. However, the continuous evolution of deepfake methods means that different generation models leave very different tamper traces, and most existing methods handle only one or a few kinds of them, limiting generalization. Take the paper "DeepFake Detection Based on Discrepancies Between Faces and Their Context," published in the authoritative journal IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, no. 10, as an example: it achieves an AUC detection score of 99.7 on face images generated by the FaceSwap method, yet it struggles on face images generated by the improved DeepFakes method, where its AUC drops by roughly 35%. Each new deepfake technique therefore requires a newly designed feature attention module to recover detection performance. In practical application environments such as social media, where constantly updated deepfake videos must be handled, such model updates are costly; the prior art thus falls short of practical requirements, and the generalization of model designs needs improvement.
Disclosure of Invention
The invention aims to overcome these limitations and provides a deepfake detection method based on face geometric relationship reasoning that further improves generalization in deepfake video detection.
The technical scheme for realizing the purpose of the invention is as follows:
A deepfake detection method based on face geometric relationship reasoning builds a global feature extractor with a Transformer to obtain a global feature map of the face image; locates high-information-content regions through a self-supervised learning mechanism and constructs an implicit face geometric relationship graph on top of the global feature map; and builds an intra-face geometric relationship reasoning module with graph convolutional neural networks that combines the features of the explicit and implicit face geometric relationship graphs and performs forgery detection on the video frame under test, so as to improve the accuracy of face deepfake video detection. The main steps are as follows:
step 1: acquire training videos containing both real and fake samples, sample video frames at intervals, and use a face detector to extract, frame by frame, the face plus a small surrounding background region as the face image;
step 2: for each frame's face image, detect 468 three-dimensional facial key points with a facial landmark detector and construct an explicit face geometric relationship graph according to the facial layout;
step 3: build a global feature extractor and extract global features of each frame's face image;
step 4: build a high-information-content region locator based on self-supervised learning to locate feature regions with higher information content within the global features, and construct an implicit face geometric relationship graph over these regions;
step 5: build a face geometric relationship reasoning module that extracts features from the explicit and implicit face geometric relationship graphs with graph convolutional neural networks and matches graph node relationships;
step 6: input the face images from consecutive frames of the test set into the trained model to obtain predicted authenticity probability scores, and average the scores over all video frames to judge whether the video is genuine.
Further, in step 1, the specific method for acquiring the face image is:
(1) Preset a sampling interval for video frames and, starting from the first frame of the video, extract video frames at that interval;
(2) Use a RetinaFace detector to determine the face candidate box in the current video frame and establish a Cartesian coordinate system at the upper-left corner of the image; the candidate box can be expressed as (x, y, w, h), where (x, y) is the upper-left corner of the box and w and h are its width and height, respectively;
(3) Expand the candidate box by a preset ratio r, changing it to (x − r×w, y − r×h, w + r×w, h + r×h), and crop the image inside the expanded box as the face image.
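As an illustrative sketch, steps (1)–(3) above might be implemented as follows in Python (the `detect_faces` callable is a hypothetical stand-in for the RetinaFace detector; frame reading uses OpenCV):

```python
import cv2

def expand_and_crop(frame, box, r=0.1):
    # Expand the candidate box (x, y, w, h) by the preset ratio r as in step (3):
    # (x - r*w, y - r*h, w + r*w, h + r*h), then crop, clamped to the frame.
    x, y, w, h = box
    x0, y0 = max(0, int(x - r * w)), max(0, int(y - r * h))
    w2, h2 = int(w + r * w), int(h + r * h)
    return frame[y0:min(frame.shape[0], y0 + h2),
                 x0:min(frame.shape[1], x0 + w2)]

def sample_faces(video_path, detect_faces, interval=10):
    # Read every `interval`-th frame, detect faces, and crop each with context.
    cap, faces, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            faces += [expand_and_crop(frame, b) for b in detect_faces(frame)]
        idx += 1
    cap.release()
    return faces
```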
Further, in step 2, the specific method for constructing the explicit face geometric relationship graph is:
(1) Input the face image into the pre-trained three-dimensional facial landmark extractor MediaPipe to obtain 468 facial key points;
(2) Use the facial key points as the nodes V_L of the explicit geometric relationship graph. According to each key point's position on the face, connect the nodes of the eyebrows, pupils, eye sockets, lips, and facial contour in sequence to form the outer contour, then interconnect the nodes of these five parts according to the geometric structure of the face to form the explicit face geometric relationship graph G_L.
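A minimal sketch of this construction with MediaPipe FaceMesh is given below; `FACEMESH_CONTOURS` (eyes, eyebrows, lips, face oval) is used here as a stand-in for the hand-designed contour-and-organ edge set described above:

```python
import mediapipe as mp
import numpy as np

mp_mesh = mp.solutions.face_mesh

def build_explicit_graph(face_rgb):
    # face_rgb: HxWx3 RGB face crop. Returns (nodes V_L, edge index pairs).
    with mp_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as fm:
        res = fm.process(face_rgb)
    if not res.multi_face_landmarks:
        return None
    lm = res.multi_face_landmarks[0].landmark
    nodes = np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)  # (468, 3)
    # Contour edges approximate the topology of G_L.
    edges = np.array(sorted(mp_mesh.FACEMESH_CONTOURS), dtype=np.int64)  # (E, 2)
    return nodes, edges
```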
Further, in step 3, the global feature extractor is MobileViT or another backbone network based on the Vision Transformer.
Further, in step 4, the specific method of the self-supervised high-information-content region locator is:
(1) In the face global feature map, obtain M regions of interest using a region proposal network;
(2) Input the local features of each region of interest into a simple binary classifier f_p, in which a single layer of 1×1 convolution kernels reduces the local feature channels to 2, and an activation function followed by a batch normalization layer adds nonlinear expressive capability, yielding the local semantic information feature F̃_p; use global average pooling to downsample F̃_p to a length and width of 1;
(3) Use the cross-entropy loss function to compute the loss value l_p between each local feature's classification result and the authenticity of the current face;
(4) Sort all loss values obtained from each image from largest to smallest;
(5) Take the squared difference between the loss value l_p of each local feature and its corresponding region proposal score S_P, and average all squared differences as the self-supervised learning loss function L_slf.
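A minimal PyTorch sketch of steps (2)–(5), assuming the two-channel local features and proposal scores have already been computed (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def locator_self_supervised_loss(local_feats, proposal_scores, label):
    # local_feats: (M, 2, H, W) two-channel semantic features from f_p
    # proposal_scores: (M,) region proposal scores S_P
    # label: 0 (real) / 1 (fake) for the current face image
    logits = F.adaptive_avg_pool2d(local_feats, 1).flatten(1)  # pool to (M, 2)
    targets = torch.full((logits.size(0),), int(label), dtype=torch.long)
    l_p = F.cross_entropy(logits, targets, reduction="none")   # per-region loss
    return ((l_p - proposal_scores) ** 2).mean()               # L_slf
```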
Further, in step 4, the specific method for constructing the implicit face geometric relationship graph G_P is:
(1) Based on the information content scores S_P of the high-information-content region locator, select the N local features F_P of the highest-scoring high-information-content regions as a set; convert each F_P from a feature matrix of size 2×W×H into a feature vector of size T×2, where T = W×H, and aggregate the converted feature vectors into the implicit face geometric relationship graph nodes V_P;
(2) Reshape the node set from size N×T×2 to N×C_P, where C_P = T×2, and convert the node set into an attention vector V_att = softmax(V_P) using the SoftMax operation;
(3) Based on the self-attention mechanism, compute the product of V_P and the transposed attention vector V_att^T, obtaining the adjacency matrix A_P of the implicit geometric relationship graph node connections, of size N×N.
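The node assembly and self-attention adjacency of this step can be sketched as follows in PyTorch (N = 6 matches the embodiment described later):

```python
import torch

def implicit_graph(region_feats, scores, n=6):
    # region_feats: (M, 2, W, H) two-channel region features F_P
    # scores: (M,) information content scores S_P
    top = scores.topk(n).indices
    v_p = region_feats[top].flatten(1)     # (N, C_P) nodes, C_P = 2*W*H
    v_att = torch.softmax(v_p, dim=-1)     # attention vectors V_att
    a_p = v_p @ v_att.t()                  # adjacency A_P, (N, N)
    return v_p, a_p
```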
Further, in step 5, the geometric relationship reasoning module comprises an explicit face geometric feature reasoning module, an implicit face geometric feature reasoning module, a graph feature matching module, and a graph classifier:
(1) The explicit face geometric feature reasoning module uses a point cloud analysis model built on graph convolutional neural networks to derive, from the explicit face geometric relationship graph G_L, the graph feature expression G_gr of the explicit face geometric relationship;
(2) The implicit face geometric feature reasoning module uses a two-layer graph convolution network model to derive, from the implicit geometric relationship graph G_P, the graph feature expression G_ir of the implicit face geometric relationship;
(3) The graph feature matching module uses a two-layer interactive graph convolution network model to fuse the multi-view geometric relationships of G_gr and G_ir, obtaining the fused geometric relationship graph G_F;
(4) The graph classifier obtains the maximum and average of the graph node features via global max pooling and global average pooling respectively, fuses the graph representation features of these two observation angles, and classifies the graph with a multi-layer perceptron.
In implementation, the specific method of the graph feature matching module is:
(1) From the graph node feature sets V_gr and V_ir contained in the geometric relationship graphs G_gr and G_ir, use a mutual attention mechanism to compute the pairwise product of V_gr and the transpose of V_ir, obtaining the adjacency matrix A_gi connecting G_gr to G_ir, of size N×N;
(2) Likewise, use the mutual attention mechanism to compute the pairwise product of V_ir and the transpose of V_gr, obtaining the adjacency matrix A_ig connecting G_ir to G_gr, of size N×N;
(3) Based on the graph node feature sets V_gr and V_ir and the adjacency matrices A_gi and A_ig, match the graph node features with an interactive graph convolution network model and reason about the geometric anomalies present in the deepfake face image.
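In PyTorch, the two mutual-attention adjacency matrices can be sketched as plain products of the node sets (whether any normalization such as SoftMax follows is not specified here, so none is applied):

```python
import torch

def cross_adjacencies(v_gr, v_ir):
    # v_gr, v_ir: (N, C) node features of G_gr and G_ir
    a_gi = v_gr @ v_ir.t()   # A_gi: G_gr attending to G_ir, (N, N)
    a_ig = v_ir @ v_gr.t()   # A_ig: G_ir attending to G_gr, (N, N)
    return a_gi, a_ig
```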
In implementation, the interactive graph convolution network model is computed as follows:
(1) Extract the node feature expression V_gi of the implicit feature relationship graph in which G_gr attends to G_ir:
V_gi = σ(W_1 × σ(A_gi · V_gr · W_gi) + V_gr);
where W_1 and W_gi are learnable parameters of the interactive graph convolution network, and σ(·) denotes a nonlinear activation function, a ReLU or Leaky ReLU;
(2) Extract the node feature expression V_ig of the implicit feature relationship graph in which G_ir attends to G_gr:
V_ig = σ(W_2 × σ(A_ig · V_ir · W_ig) + V_ir);
where W_2 and W_ig are learnable parameters of the interactive graph convolution network, and σ(·) denotes a nonlinear activation function, a ReLU or Leaky ReLU;
(3) Concatenate the two graph node feature expressions V_gi and V_ig to obtain the node feature expression of the multi-angle geometric relationship graph.
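A PyTorch module sketching one interactive graph convolution update of the form V_gi = σ(W_1 × σ(A_gi · V_gr · W_gi) + V_gr); layer shapes are illustrative:

```python
import torch
import torch.nn as nn

class InteractiveGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_inner = nn.Linear(dim, dim, bias=False)  # plays the role of W_gi
        self.w_outer = nn.Linear(dim, dim, bias=False)  # plays the role of W_1
        self.act = nn.LeakyReLU()                       # sigma: ReLU / Leaky ReLU

    def forward(self, a, v):
        # a: (N, N) cross adjacency (e.g. A_gi); v: (N, C) source node features
        msg = self.act(a @ self.w_inner(v))             # sigma(A V W)
        return self.act(self.w_outer(msg) + v)          # residual, then sigma
```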
In implementation, the total training loss in step 5 is the sum of the binary cross-entropy classification loss L_ce, computed with the label smoothing technique, and the self-supervised loss function L_slf, where P_P and P_a are, respectively, the classification results of the local high-information-content regions and the classification result of the global geometric relationship graph node features, and S_P is the region proposal score of the high-information-content region locator.
In step 6, all video frames are predicted with the trained model, and all scores are averaged as the prediction of the authenticity of the face in the video.
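A sketch of this frame-to-video aggregation (the sigmoid on the model output is an assumed detail of how the per-frame probability is produced):

```python
import torch

@torch.no_grad()
def video_verdict(model, frames, threshold=0.5):
    # frames: iterable of preprocessed face tensors (C, H, W) from one video
    scores = [model(f.unsqueeze(0)).sigmoid().item() for f in frames]
    video_score = sum(scores) / len(scores)      # average over all frames
    return video_score, video_score > threshold  # True -> predicted fake
```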
The invention uses deep learning to safeguard the security of video content containing faces. Features are extracted from the explicit and implicit geometric relationships of the face, and anomalies under the face's inherent geometric structure are inferred to judge whether the current image has been forged.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses three-dimensional facial key points, an inherent characteristic of the face, to construct a face geometric relationship graph, and extracts relationship features with a graph convolutional neural network.
2. Global features are extracted with a Transformer network, high-information-content regions are located by self-supervised learning, and an implicit geometric relationship graph is constructed with a self-attention mechanism, independent of any particular forgery pattern.
3. The implicit and explicit geometric relationship features are matched to infer geometric anomalies of the face, effectively avoiding dependence on specific forgery traces and effectively improving the generalization capability of the deepfake detection model.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a network configuration diagram of an embodiment of the present invention.
Fig. 3 is an explicit face geometric relationship graph according to an embodiment of the present invention.
Fig. 4 compares experimental results of the present invention with existing methods in a public verification example.
Fig. 5 shows detection results according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and the specific examples.
As shown in figs. 1-2, the embodiment of the invention constructs a deepfake detection network based on face geometric relationship reasoning, comprising a Vision Transformer backbone network, a high-information-content region locator, a high-information-content region classifier, a point cloud analysis network, a graph convolution module, an interactive graph convolution module, and a graph classifier, which together form the whole model framework. FIG. 1 shows the workflow of the invention; fig. 2 shows a specific network structure diagram of the invention in one embodiment.
Step 1: acquire training videos containing both real and fake samples, sample video frames at intervals, locate the faces in the sampled frames with a face detector, and crop each face together with a small surrounding background region as the face image. In some implementations, RetinaFace with a ResNet50 backbone serves as the face detector, and the detection box is expanded at all four coordinates by 0.1 times the width and height of the detection result to give the face detection output. For each frame's face image, the key points provided by RetinaFace for the eyes, nose, and mouth corners are used to align all images to a uniform size;
Step 2: extract three-dimensional facial key points with a three-dimensional facial landmark detector, and construct, according to the facial layout, an explicit geometric relationship graph G_L containing facial-feature position information and facial contour information;
In some implementations, MediaPipe is used for face detection, yielding 468 three-dimensional facial key points;
Step 3: build a Vision Transformer-based global feature extractor and extract global face features. In some embodiments, the backbone network is MobileViT initialized with parameters pre-trained on the ImageNet dataset, retaining the model's first downsampling stage and the five subsequent feature extraction stages to extract the global features of each frame's face image;
Step 4: build a high-information-content region locator based on self-supervised learning. A region proposal network composed of convolution layers and fully connected layers provides the information content score S_P and corresponding coordinates for every region of the global feature map. Each region is fed into a simple binary classifier built from a single 1×1 convolution layer to obtain a two-channel semantic feature map containing spatial semantic information; a global average pooling layer then yields the probability that the current region belongs to a forged image. This probability and the label of the current face image are passed through a binary cross-entropy loss function to obtain the corresponding loss, and the squared error between each loss value and the information content score S_P is computed as the self-supervised learning loss function, ensuring that local regions where the simple classifier incurs higher classification loss, i.e. regions of higher uncertainty, become the detection results of the high-information-content region locator. The detected high-information region features are taken as graph representation feature nodes, and the two-channel semantic feature maps of the corresponding regions are combined with the graph adjacency matrix to form the implicit face geometric relationship graph between the feature regions.
In some implementations, the high-information region locator provides 20 candidate regions during training, pooled to a uniform size, e.g. 7×7, by region-of-interest pooling. For each candidate region, the corresponding binary cross-entropy loss is computed, the squared differences between these losses and the proposal scores of the 20 candidate regions are computed, and the module is trained by minimizing the binary cross-entropy loss together with this supervision loss. The two-channel semantic features of the 6 regions with the highest candidate scores are selected; each region feature is stretched into a vector, and the 6 feature vectors are concatenated to form the implicit face geometric relationship graph nodes V_P. Based on the self-attention mechanism, the implicit face geometric relationship graph adjacency matrix is A_P, of size 6×6;
Step 5: build the face geometric relationship reasoning module. The explicit and implicit face geometric relationship graph expressions G_L and G_P are each enhanced by a neural network composed of graph convolution layers, giving the enhanced geometric relationship graph representations G_gr and G_ir. A multi-layer interactive graph convolution network model is built to match the graph node feature relationships of G_gr and G_ir; the enhanced geometric relationship graphs and features are fused into G_F, highlighting geometrically anomalous feature nodes. A fully connected layer converts the graph node feature channels into classification channels; global average pooling and global max pooling compute, respectively, the average and the maximum binary classification prediction over all nodes; the two predictions are added and output as the final classification result, the probability that the current image is forged, and the classification loss is computed with cross entropy;
In some embodiments, the explicit face geometric relationship graph representation G_L has 3 channels and 468 nodes. Its graph feature enhancement network is a point cloud analysis network based on graph convolution networks with a CurveNet backbone; it downsamples the number of graph nodes to match G_P and deepens the feature representation dimension, i.e. strengthens the relationship representation capability, so that the feature dimension is consistent with G_P. The implicit geometric relationship graph representation G_P has 98 channels and 6 nodes; its enhancement network combines a two-layer simple graph convolution network with a nonlinear activation function and keeps the original numbers of nodes and channels.
The interactive graph convolution network is computed in the following steps: 1) input the two graph representations G_1 and G_2 to be matched; 2) compute the adjacency matrix A_{1→2} = V_1 · V_2^T connecting G_1 to G_2, where V_1 and V_2 are the node features of the corresponding graphs; 3) enhance the graph representation capability of G_1 and update its node features, V_1′ = σ(W_1 × σ(A_{1→2} · V_1 · W_{1→2}) + V_1), where W_1 and W_{1→2} are learnable parameters of the interactive graph convolution network and σ(·) denotes a nonlinear activation function; 4) compute the adjacency matrix A_{2→1} = V_2 · V_1^T connecting G_2 to G_1; 5) enhance the graph representation capability of G_2 and update its node features, V_2′ = σ(W_2 × σ(A_{2→1} · V_2 · W_{2→1}) + V_2), where W_2 and W_{2→1} are learnable parameters of the interactive graph convolution network and σ(·) denotes a nonlinear activation function;
In some embodiments, the nonlinear activation function is a ReLU or Leaky ReLU function.
In some implementations, the feature nodes enhanced by the interactive graph convolution network model can be fused by concatenation or addition.
In some implementations, the cross-entropy loss function can use a label smoothing regularization constraint, specifically:
ℓ_ce = −[ỹ·log(p̂) + (1 − ỹ)·log(1 − p̂)],
where the smoothed label ỹ ∈ {0 + α, 1 − α} replaces the true label (0 denotes a real image, 1 a forged image), α is the label smoothing parameter, and p̂ is the predicted probability value.
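A sketch of this label-smoothed binary cross entropy (the clamping constant is an implementation detail added here for numerical stability):

```python
import torch

def smoothed_bce(p_hat, y, alpha=0.1):
    # p_hat: predicted fake probability; y: 0 (real) / 1 (fake)
    y_s = y * (1 - alpha) + (1 - y) * alpha  # smoothed label in {alpha, 1 - alpha}
    p_hat = p_hat.clamp(1e-6, 1 - 1e-6)
    return -(y_s * torch.log(p_hat) + (1 - y_s) * torch.log(1 - p_hat))
```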
In some implementations, the final loss function of the network constructed in steps 1 to 5 is the sum of the binary cross-entropy classification loss L_ce, which can use the label smoothing technique described above, and the self-supervised loss function L_slf, where P_P and P_a are, respectively, the classification results of the local high-information-content regions and the classification result of the global geometric relationship graph node features, and S_P is the region proposal score of the high-information-content region locator.
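Combining the two terms, with equal weighting assumed since no balance coefficient is given, and reusing `smoothed_bce` from the sketch above:

```python
def final_loss(p_a, y, l_p, s_p, alpha=0.1):
    # p_a: global graph prediction; l_p: (M,) per-region losses; s_p: (M,) scores S_P
    ce = smoothed_bce(p_a, y, alpha)    # classification term L_ce
    l_slf = ((l_p - s_p) ** 2).mean()   # self-supervised term L_slf
    return ce + l_slf
```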
Step 6: input the face images of consecutive frames from the test set into the trained model, output the authenticity probability score of each corresponding frame, and average the probability scores over all frames of the video to judge whether the video is genuine.
Examples
The embodiment comprises the following steps:
S1: collect training samples;
S1.1: input videos; for each input video, detect the face position in each frame with a RetinaFace detector using a ResNet50 backbone, and retain L face images per video during the training stage via interval sampling;
S1.2: align each face image by the main key points provided by RetinaFace, the left and right mouth corners and the face center, to a uniform size of 380×380;
S1.3: assign each face image the class label of its video: 0 denotes a real video, 1 a fake video;
S2: construct the explicit face geometric relationship graph G_L shown in fig. 3;
S2.1: obtain the coordinates of 468 three-dimensional facial key points with MediaPipe;
S2.2: connect the key points of the eyes, lips, nose, facial contour, etc. according to their positions, and connect the parts to one another according to the facial region layout;
S3: build a Vision Transformer-based global feature extractor and extract global face features;
S3.1: the lightweight network MobileViT serves as the global feature extractor, reducing the computational cost of the algorithm and facilitating practical use. The backbone's first convolutional downsampling module and the subsequent 5 MobileViT basic modules are selected as the feature extractor;
S3.2: input a face image of size 380×380 and obtain a global feature map F_a of size 11×11;
S4: build the self-supervised high-information-content region locator, obtain the 6 regions with the highest information content, and construct the implicit face geometric relationship graph G_P.
As shown in figs. 1 and 2, the implicit face geometric relationship graph G_P is constructed in the following steps:
S4.1: use the region proposal network common in two-stage object detectors as the high-information-content region locator; based on anchor boxes, it proposes 20 region-of-interest candidate boxes in the global feature map F_a, whose candidate scores are treated as information content scores;
S4.2: unify the size of the candidate boxes' positional features to 7×7 by region-of-interest pooling;
S4.3: feed the 20 region-of-interest features into a 1×1 convolution layer, reducing each region's feature channels to 2 to obtain two-channel semantic features of size 7×7×2; downsample each feature map by global average pooling to obtain the authenticity prediction probability score;
S4.4: compute the binary cross-entropy loss between each region of interest's prediction probability score and the corresponding face image label;
S4.5: compute the squared difference between each cross-entropy loss value and the candidate box's candidate score, so that regions with higher candidate scores correspond to higher local classification loss, that is, higher uncertainty;
S4.6: select the 6 region features with the highest candidate scores as the high-information-content regions of the current image;
S4.7: stretch each region feature into a vector and concatenate the vectors to form the implicit face geometric relationship graph nodes V_P;
S4.8: compute the connection relationships between nodes based on the self-attention mechanism and construct the graph adjacency matrix A_P, of size 6×6;
S5: build the face geometric relationship reasoning module and analyze and reason about face geometric anomalies;
S5.1: use CurveNet with its classifier removed, based on graph convolution networks, as the point cloud analysis network ψ(·);
S5.2: input the explicit face geometric relationship graph G_L, analyze the explicit face geometric relationships with the point cloud analysis network, extract the 6 most important feature nodes, and output the feature-enhanced relationship graph G_gr = ψ(G_L) with output feature dimension 96;
S5.3: form a graph convolution module from a graph convolution layer together with a nonlinear activation function, and use two such modules as the implicit face geometric relationship graph enhancement network Ω(·);
S5.4: input the implicit face geometric relationship graph G_P, enhance the graph feature representation with the enhancement network, and output the enhanced relationship graph G_ir = Ω(G_P);
S5.5: form an interactive graph convolution module from an interactive graph convolution layer together with a nonlinear activation function, and use two such modules as the geometric relationship graph matching network Φ(·);
S5.6: input the geometric relationship graphs G_gr and G_ir to obtain the node matching relationship inference graphs (G_gi, G_ig) = Φ(G_gr, G_ir) and the corresponding feature nodes V_gi and V_ig;
S5.7: concatenate along the channel direction, or add, the node features V_gi and V_ig to obtain the fused node features V_F;
S5.8: convert the graph node feature channels of the fused node features V_F into classification channels with a fully connected layer;
S5.9: obtain the node class average score P_1 ∈ [0,1] with global average pooling;
S5.10: obtain the node class maximum score P_2 ∈ [0,1] with global max pooling;
S5.11: take the sum of P_1 and P_2 as the authenticity probability score obtained by geometric anomaly reasoning, and compute a binary cross-entropy classification loss against the image label;
S6: acquire the video to be tested, detect all face images it contains with the face detector, input the images into the trained model in sequence to obtain each image's authenticity probability score, and average the prediction scores of all images of the video to obtain the video's authenticity probability score;
In this embodiment, the area under the ROC curve (AUC) is used as the evaluation index; the ROC curve plots the true positive rate (TPR) on the ordinate against the false positive rate (FPR) on the abscissa. The true positive rate is the proportion of actual positive samples correctly predicted as positive; the false positive rate is the proportion of actual negative samples mispredicted as positive. The closer the AUC score is to 1, the better the model performs; since the AUC is unaffected by the classifier's threshold setting, it is a robust evaluation index.
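Video-level AUC can be computed directly with scikit-learn:

```python
from sklearn.metrics import roc_auc_score

def video_auc(labels, scores):
    # labels: 0 (real) / 1 (fake) per video; scores: averaged frame probabilities
    return roc_auc_score(labels, scores)
```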
Fig. 4 compares experimental results of the present invention with existing methods in a public verification example. This example trains the model on the high-quality (C23) data of the FaceForensics++ (FF++) dataset and tests the method's validity on the FF++ (C23) and CelebDF v2 datasets. The results show that the proposed method performs well on both public datasets: while maintaining detection performance within the source data domain, it effectively improves the detection of unknown deepfake methods and achieves a better detection effect than the comparison algorithms.
Fig. 5 shows detection results of the verification example of the present invention on different datasets, specifically the explicit geometric relationship graph, the implicit geometric relationship graph, and the Grad-CAM attention regions of the corresponding features for the proposed method.
The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention; various modifications and variations will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.
Claims (11)
1. A deepfake detection method based on face geometric relationship reasoning, which builds a global feature extractor with a Transformer to obtain a global feature map of the face image; locates high-information-content regions through a self-supervised learning mechanism and constructs an implicit face geometric relationship graph on top of the face global feature map; and builds an intra-face geometric relationship reasoning module with graph convolutional neural networks, combining the features of the explicit and implicit face geometric relationship graphs and performing forgery detection on the video frame under test so as to improve the accuracy of face deepfake video detection, the method comprising the following main steps:
step 1: acquire training videos containing both real and fake samples, sample video frames at intervals, and use a face detector to extract, frame by frame, the face plus a small surrounding background region as the face image;
step 2: for each frame's face image, detect 468 three-dimensional facial key points with a facial landmark detector and construct an explicit face geometric relationship graph according to the facial layout;
step 3: build a global feature extractor and extract global features of each frame's face image;
step 4: build a high-information-content region locator based on self-supervised learning to locate feature regions with higher information content within the global features, and construct an implicit face geometric relationship graph over these regions;
step 5: build a face geometric relationship reasoning module that extracts features from the explicit and implicit face geometric relationship graphs with graph convolutional neural networks and matches graph node relationships;
step 6: input the face images from consecutive frames of the test set into the trained model to obtain predicted authenticity probability scores, and average the scores over all video frames to judge whether the video is genuine.
2. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 1 the specific method for acquiring the face image is:
(1) Preset a sampling interval for video frames and, starting from the first frame of the video, extract video frames at that interval;
(2) Use a RetinaFace detector to determine the face candidate box in the current video frame and establish a Cartesian coordinate system at the upper-left corner of the image; the candidate box can be expressed as (x, y, w, h), where (x, y) is the upper-left corner of the box and w and h are its width and height, respectively;
(3) Expand the candidate box by a preset ratio r, changing it to (x − r×w, y − r×h, w + r×w, h + r×h), and crop the image inside the expanded box as the face image.
3. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 2 the specific method for constructing the explicit face geometric relationship graph is:
(1) Input the face image into the pre-trained three-dimensional facial landmark extractor MediaPipe to obtain 468 facial key points;
(2) Use the facial key points as the nodes V_L of the explicit geometric relationship graph. According to each key point's position on the face, connect the nodes of the eyebrows, pupils, eye sockets, lips, and facial contour in sequence to form the outer contour, then interconnect the nodes of these five parts according to the geometric structure of the face to form the explicit face geometric relationship graph G_L.
4. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 3 the global feature extractor is a backbone network based on the Vision Transformer, such as MobileViT.
5. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 4 the specific method of the self-supervised high-information-content region locator is:
(1) In the face global feature map, obtain M regions of interest using a region proposal network;
(2) Input the local features of each region of interest into a simple binary classifier f_p, in which a single layer of 1×1 convolution kernels reduces the local feature channels to 2, and an activation function followed by a batch normalization layer adds nonlinear expressive capability, yielding the local semantic information feature F̃_p; use global average pooling to downsample F̃_p to a length and width of 1;
(3) Use the cross-entropy loss function to compute the loss value l_p between each local feature's classification result and the authenticity of the current face;
(4) Sort all loss values obtained from each image from largest to smallest;
(5) Take the squared difference between the loss value l_p of each local feature and its corresponding region proposal score S_P, and average all squared differences as the self-supervised learning loss function L_slf.
6. The deepfake detection method based on face geometric relationship reasoning according to claim 1 or 5, wherein in step 4 the specific method for constructing the implicit face geometric relationship graph G_P is:
(1) Based on the information content scores S_P of the high-information-content region locator, select the N local features F_P of the highest-scoring high-information-content regions as a set; convert each F_P from a feature matrix of size 2×W×H into a feature vector of size T×2, where T = W×H, and aggregate the converted feature vectors into the implicit face geometric relationship graph nodes V_P;
(2) Reshape the node set from size N×T×2 to N×C_P, where C_P = T×2, and convert the node set into an attention vector V_att = softmax(V_P) using the SoftMax operation;
(3) Based on the self-attention mechanism, compute the product of V_P and the transposed attention vector V_att^T, obtaining the adjacency matrix A_P of the implicit geometric relationship graph node connections, of size N×N.
7. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 5 the geometric relationship reasoning module comprises an explicit face geometric feature reasoning module, an implicit face geometric feature reasoning module, a graph feature matching module, and a graph classifier:
(1) The explicit face geometric feature reasoning module uses a point cloud analysis model built on graph convolutional neural networks to derive, from the explicit face geometric relationship graph G_L, the graph feature expression G_gr of the explicit face geometric relationship;
(2) The implicit face geometric feature reasoning module uses a two-layer graph convolution network model to derive, from the implicit geometric relationship graph G_P, the graph feature expression G_ir of the implicit face geometric relationship;
(3) The graph feature matching module uses a two-layer interactive graph convolution network model to fuse the multi-view geometric relationships of G_gr and G_ir, obtaining the fused geometric relationship graph G_F;
(4) The graph classifier obtains the maximum and average of the graph node features via global max pooling and global average pooling respectively, fuses the graph representation features of these two observation angles, and classifies the graph with a multi-layer perceptron.
8. The deepfake detection method based on face geometric relationship reasoning according to claim 7, wherein the graph feature matching module specifically comprises the following steps:
(1) From the graph node feature sets V_gr and V_ir contained in the geometric relationship graphs G_gr and G_ir, use a mutual attention mechanism to compute the pairwise product of V_gr and the transpose of V_ir, obtaining the adjacency matrix A_gi connecting G_gr to G_ir, of size N×N;
(2) Likewise, use the mutual attention mechanism to compute the pairwise product of V_ir and the transpose of V_gr, obtaining the adjacency matrix A_ig connecting G_ir to G_gr, of size N×N;
(3) Based on the graph node feature sets V_gr and V_ir and the adjacency matrices A_gi and A_ig, match the graph node features with an interactive graph convolution network model and reason about the geometric anomalies present in the deepfake face image.
9. The deepfake detection method based on face geometric relationship reasoning according to claim 7, wherein the interactive graph convolution network model is computed in the following steps:
(1) Extract the node feature expression V_gi of the implicit feature relationship graph in which G_gr attends to G_ir:
V_gi = σ(W_1 × σ(A_gi · V_gr · W_gi) + V_gr);
where W_1 and W_gi are learnable parameters of the interactive graph convolution network, and σ(·) denotes a nonlinear activation function, a ReLU or Leaky ReLU;
(2) Extract the node feature expression V_ig of the implicit feature relationship graph in which G_ir attends to G_gr:
V_ig = σ(W_2 × σ(A_ig · V_ir · W_ig) + V_ir);
where W_2 and W_ig are learnable parameters of the interactive graph convolution network, and σ(·) denotes a nonlinear activation function, a ReLU or Leaky ReLU;
(3) Concatenate the two graph node feature expressions V_gi and V_ig to obtain the node feature expression of the multi-angle geometric relationship graph.
10. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 5 the total training loss is the sum of the binary cross-entropy classification loss L_ce, computed with the label smoothing technique, and the self-supervised loss function L_slf, where P_P and P_a are, respectively, the classification results of the local high-information-content regions and the classification result of the global geometric relationship graph node features, and S_P is the region proposal score of the high-information-content region locator.
11. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 6 all video frames are predicted with the trained model and all scores are averaged as the prediction of the authenticity of the face in the video.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310418813.1A | 2023-04-18 | 2023-04-18 | Deepfake detection method based on facial geometric relationship reasoning
Publications (1)

Publication Number | Publication Date
---|---
CN116758604A | 2023-09-15

Family Applications (1) (Family ID: 87952118)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310418813.1A | Deepfake detection method based on facial geometric relationship reasoning | 2023-04-18 | 2023-04-18

Country | Link
---|---
CN | CN116758604A (en)
Legal Events

Date | Code | Title
---|---|---
 | PB01 | Publication
 | SE01 | Entry into force of request for substantive examination