CN116758604A - Deepfake detection method based on facial geometric relationship reasoning - Google Patents
Deepfake detection method based on facial geometric relationship reasoning
- Publication number: CN116758604A (application CN202310418813.1A)
- Authority: CN (China)
- Prior art keywords: face, graph, feature, geometric relationship
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V 40/161 — Human faces: detection; localisation; normalisation
- G06V 40/168 — Human faces: feature extraction; face representation
- G06V 10/82 — Image or video recognition using pattern recognition or machine learning with neural networks
- G06V 20/95 — Pattern authentication; markers therefor; forgery detection
- G06N 3/042 — Knowledge-based neural networks; logical representations of neural networks
- G06N 3/045 — Combinations of networks
- G06N 3/0464 — Convolutional networks [CNN, ConvNet]
- G06N 3/0895 — Weakly supervised learning, e.g. semi-supervised or self-supervised learning
- Y02T 10/40 — Engine management systems
Abstract
The invention discloses a deepfake detection method based on facial geometric relationship reasoning, relating to the field of passive video forensics and aimed at improving the generalization capability of deepfake detection models. First, facial key points are detected with a facial landmark detector, and an explicit face geometric relationship graph is constructed according to the internal structure of the facial features; a global feature extractor built on a Transformer produces a global feature map of the face image; on top of this global feature map, a self-supervised learning mechanism locates high-information-content regions, over which an implicit face geometric relationship graph is constructed; finally, an intra-face geometric relationship reasoning module built on graph convolutional neural networks combines the features of the explicit and implicit geometric relationship graphs and performs forgery detection on the video frame under test. The invention effectively improves the accuracy of face deepfake video detection, generalizes well across different data domains, and has practical value.
Description
Technical Field
The invention relates to the technical field of video forensics, and in particular to a deepfake detection method based on face geometric relationship reasoning.
Background
AI-generated content spreads widely through social media, and the popularization of content-generation technology has made face video manipulation increasingly accessible. Deepfakes, the earliest widely circulated face video tampering technology, have developed to the point where extremely realistic face videos can be synthesized that people struggle to distinguish by eye. If such deepfake videos are abused, they pose serious risks to privacy, politics, and national security. Deepfake video detection has therefore become an important research problem in multimedia forensics within the information security field. Because the feature differences between real and forged faces are small, existing deep learning models have difficulty identifying tamper traces. To improve detection effectiveness, existing deep-learning-based deepfake detection methods use various feature attention mechanisms to sharpen a network's discrimination of fine differences. However, the continuous evolution of deepfake methods means that different generation models leave very different tamper traces, and most existing methods handle only one or a few kinds of them, limiting generalization. Take the paper "DeepFake Detection Based on Discrepancies Between Faces and Their Context," published in the authoritative journal IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, no. 10, as an example: it achieves an AUC detection score of 99.7 on face images generated by the FaceSwap method, yet it struggles on face images generated by the improved DeepFakes method, where its AUC drops by roughly 35%. Each new deepfake technique therefore requires a newly designed feature attention module to recover detection performance. In practical application environments such as social media, where constantly updated deepfake videos must be handled, such model updates are costly; the prior art thus falls short of practical requirements, and the generalization of model designs needs improvement.
Disclosure of Invention
The invention aims to overcome these limitations and provides a deepfake detection method based on face geometric relationship reasoning that further improves generalization in deepfake video detection.
The technical scheme for realizing the purpose of the invention is as follows:
A deepfake detection method based on face geometric relationship reasoning builds a global feature extractor with a Transformer to obtain a global feature map of the face image; locates high-information-content regions through a self-supervised learning mechanism and constructs an implicit face geometric relationship graph on top of the global feature map; and builds an intra-face geometric relationship reasoning module with graph convolutional neural networks that combines the features of the explicit and implicit face geometric relationship graphs and performs forgery detection on the video frame under test, so as to improve the accuracy of face deepfake video detection. The main steps are as follows:
step 1: acquire training videos containing both real and fake samples, sample video frames at intervals, and use a face detector to extract, frame by frame, the face plus a small surrounding background region as the face image;
step 2: for each frame's face image, detect 468 three-dimensional facial key points with a facial landmark detector and construct an explicit face geometric relationship graph according to the facial layout;
step 3: build a global feature extractor and extract global features of each frame's face image;
step 4: build a high-information-content region locator based on self-supervised learning to locate feature regions with higher information content within the global features, and construct an implicit face geometric relationship graph over these regions;
step 5: build a face geometric relationship reasoning module that extracts features from the explicit and implicit face geometric relationship graphs with graph convolutional neural networks and matches graph node relationships;
step 6: input the face images from consecutive frames of the test set into the trained model to obtain predicted authenticity probability scores, and average the scores over all video frames to judge whether the video is genuine.
Further, in step 1, the specific method for acquiring the face image is:
(1) Preset a sampling interval for video frames and, starting from the first frame of the video, extract video frames at that interval;
(2) Use a RetinaFace detector to determine the face candidate box in the current video frame and establish a Cartesian coordinate system at the upper-left corner of the image; the candidate box can be expressed as (x, y, w, h), where (x, y) is the upper-left corner of the box and w and h are its width and height, respectively;
(3) Expand the candidate box by a preset ratio r, changing it to (x − r×w, y − r×h, w + r×w, h + r×h), and crop the image inside the expanded box as the face image.
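As an illustrative sketch, steps (1)–(3) above might be implemented as follows in Python (the `detect_faces` callable is a hypothetical stand-in for the RetinaFace detector; frame reading uses OpenCV):

```python
import cv2

def expand_and_crop(frame, box, r=0.1):
    # Expand the candidate box (x, y, w, h) by the preset ratio r as in step (3):
    # (x - r*w, y - r*h, w + r*w, h + r*h), then crop, clamped to the frame.
    x, y, w, h = box
    x0, y0 = max(0, int(x - r * w)), max(0, int(y - r * h))
    w2, h2 = int(w + r * w), int(h + r * h)
    return frame[y0:min(frame.shape[0], y0 + h2),
                 x0:min(frame.shape[1], x0 + w2)]

def sample_faces(video_path, detect_faces, interval=10):
    # Read every `interval`-th frame, detect faces, and crop each with context.
    cap, faces, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            faces += [expand_and_crop(frame, b) for b in detect_faces(frame)]
        idx += 1
    cap.release()
    return faces
```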
Further, in step 2, the specific method for constructing the explicit face geometric relationship graph is:
(1) Input the face image into the pre-trained three-dimensional facial landmark extractor MediaPipe to obtain 468 facial key points;
(2) Use the facial key points as the nodes V_L of the explicit geometric relationship graph. According to each key point's position on the face, connect the nodes of the eyebrows, pupils, eye sockets, lips, and facial contour in sequence to form the outer contour, then interconnect the nodes of these five parts according to the geometric structure of the face to form the explicit face geometric relationship graph G_L.
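A minimal sketch of this construction with MediaPipe FaceMesh is given below; `FACEMESH_CONTOURS` (eyes, eyebrows, lips, face oval) is used here as a stand-in for the hand-designed contour-and-organ edge set described above:

```python
import mediapipe as mp
import numpy as np

mp_mesh = mp.solutions.face_mesh

def build_explicit_graph(face_rgb):
    # face_rgb: HxWx3 RGB face crop. Returns (nodes V_L, edge index pairs).
    with mp_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as fm:
        res = fm.process(face_rgb)
    if not res.multi_face_landmarks:
        return None
    lm = res.multi_face_landmarks[0].landmark
    nodes = np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)  # (468, 3)
    # Contour edges approximate the topology of G_L.
    edges = np.array(sorted(mp_mesh.FACEMESH_CONTOURS), dtype=np.int64)  # (E, 2)
    return nodes, edges
```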
Further, in step 3, the global feature extractor is MobileViT or another backbone network based on the Vision Transformer.
Further, in step 4, the specific method of the self-supervised high-information-content region locator is:
(1) In the face global feature map, obtain M regions of interest using a region proposal network;
(2) Input the local features of each region of interest into a simple binary classifier f_p, in which a single layer of 1×1 convolution kernels reduces the local feature channels to 2, and an activation function followed by a batch normalization layer adds nonlinear expressive capability, yielding the local semantic information feature F̃_p; use global average pooling to downsample F̃_p to a length and width of 1;
(3) Use the cross-entropy loss function to compute the loss value l_p between each local feature's classification result and the authenticity of the current face;
(4) Sort all loss values obtained from each image from largest to smallest;
(5) Take the squared difference between the loss value l_p of each local feature and its corresponding region proposal score S_P, and average all squared differences as the self-supervised learning loss function L_slf.
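A minimal PyTorch sketch of steps (2)–(5), assuming the two-channel local features and proposal scores have already been computed (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def locator_self_supervised_loss(local_feats, proposal_scores, label):
    # local_feats: (M, 2, H, W) two-channel semantic features from f_p
    # proposal_scores: (M,) region proposal scores S_P
    # label: 0 (real) / 1 (fake) for the current face image
    logits = F.adaptive_avg_pool2d(local_feats, 1).flatten(1)  # pool to (M, 2)
    targets = torch.full((logits.size(0),), int(label), dtype=torch.long)
    l_p = F.cross_entropy(logits, targets, reduction="none")   # per-region loss
    return ((l_p - proposal_scores) ** 2).mean()               # L_slf
```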
Further, in step 4, the specific method for constructing the implicit face geometric relationship graph G_P is:
(1) Based on the information content scores S_P of the high-information-content region locator, select the N local features F_P of the highest-scoring high-information-content regions as a set; convert each F_P from a feature matrix of size 2×W×H into a feature vector of size T×2, where T = W×H, and aggregate the converted feature vectors into the implicit face geometric relationship graph nodes V_P;
(2) Reshape the node set from size N×T×2 to N×C_P, where C_P = T×2, and convert the node set into an attention vector V_att = softmax(V_P) using the SoftMax operation;
(3) Based on the self-attention mechanism, compute the product of V_P and the transposed attention vector V_att^T, obtaining the adjacency matrix A_P of the implicit geometric relationship graph node connections, of size N×N.
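The node assembly and self-attention adjacency of this step can be sketched as follows in PyTorch (N = 6 matches the embodiment described later):

```python
import torch

def implicit_graph(region_feats, scores, n=6):
    # region_feats: (M, 2, W, H) two-channel region features F_P
    # scores: (M,) information content scores S_P
    top = scores.topk(n).indices
    v_p = region_feats[top].flatten(1)     # (N, C_P) nodes, C_P = 2*W*H
    v_att = torch.softmax(v_p, dim=-1)     # attention vectors V_att
    a_p = v_p @ v_att.t()                  # adjacency A_P, (N, N)
    return v_p, a_p
```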
Further, in step 5, the geometric relationship reasoning module comprises an explicit face geometric feature reasoning module, an implicit face geometric feature reasoning module, a graph feature matching module, and a graph classifier:
(1) The explicit face geometric feature reasoning module uses a point cloud analysis model built on graph convolutional neural networks to derive, from the explicit face geometric relationship graph G_L, the graph feature expression G_gr of the explicit face geometric relationship;
(2) The implicit face geometric feature reasoning module uses a two-layer graph convolution network model to derive, from the implicit geometric relationship graph G_P, the graph feature expression G_ir of the implicit face geometric relationship;
(3) The graph feature matching module uses a two-layer interactive graph convolution network model to fuse the multi-view geometric relationships of G_gr and G_ir, obtaining the fused geometric relationship graph G_F;
(4) The graph classifier obtains the maximum and average of the graph node features via global max pooling and global average pooling respectively, fuses the graph representation features of these two observation angles, and classifies the graph with a multi-layer perceptron.
In implementation, the specific method of the graph feature matching module is:
(1) From the graph node feature sets V_gr and V_ir contained in the geometric relationship graphs G_gr and G_ir, use a mutual attention mechanism to compute the pairwise product of V_gr and the transpose of V_ir, obtaining the adjacency matrix A_gi connecting G_gr to G_ir, of size N×N;
(2) Likewise, use the mutual attention mechanism to compute the pairwise product of V_ir and the transpose of V_gr, obtaining the adjacency matrix A_ig connecting G_ir to G_gr, of size N×N;
(3) Based on the graph node feature sets V_gr and V_ir and the adjacency matrices A_gi and A_ig, match the graph node features with an interactive graph convolution network model and reason about the geometric anomalies present in the deepfake face image.
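In PyTorch, the two mutual-attention adjacency matrices can be sketched as plain products of the node sets (whether any normalization such as SoftMax follows is not specified here, so none is applied):

```python
import torch

def cross_adjacencies(v_gr, v_ir):
    # v_gr, v_ir: (N, C) node features of G_gr and G_ir
    a_gi = v_gr @ v_ir.t()   # A_gi: G_gr attending to G_ir, (N, N)
    a_ig = v_ir @ v_gr.t()   # A_ig: G_ir attending to G_gr, (N, N)
    return a_gi, a_ig
```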
In implementation, the interactive graph convolution network model is computed as follows:
(1) Extract the node feature expression V_gi of the implicit feature relationship graph in which G_gr attends to G_ir:
V_gi = σ(W_1 × σ(A_gi · V_gr · W_gi) + V_gr);
where W_1 and W_gi are learnable parameters of the interactive graph convolution network, and σ(·) denotes a nonlinear activation function, a ReLU or Leaky ReLU;
(2) Extract the node feature expression V_ig of the implicit feature relationship graph in which G_ir attends to G_gr:
V_ig = σ(W_2 × σ(A_ig · V_ir · W_ig) + V_ir);
where W_2 and W_ig are learnable parameters of the interactive graph convolution network, and σ(·) denotes a nonlinear activation function, a ReLU or Leaky ReLU;
(3) Concatenate the two graph node feature expressions V_gi and V_ig to obtain the node feature expression of the multi-angle geometric relationship graph.
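A PyTorch module sketching one interactive graph convolution update of the form V_gi = σ(W_1 × σ(A_gi · V_gr · W_gi) + V_gr); layer shapes are illustrative:

```python
import torch
import torch.nn as nn

class InteractiveGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_inner = nn.Linear(dim, dim, bias=False)  # plays the role of W_gi
        self.w_outer = nn.Linear(dim, dim, bias=False)  # plays the role of W_1
        self.act = nn.LeakyReLU()                       # sigma: ReLU / Leaky ReLU

    def forward(self, a, v):
        # a: (N, N) cross adjacency (e.g. A_gi); v: (N, C) source node features
        msg = self.act(a @ self.w_inner(v))             # sigma(A V W)
        return self.act(self.w_outer(msg) + v)          # residual, then sigma
```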
In implementation, the total training loss in step 5 is the sum of the binary cross-entropy classification loss L_ce, computed with the label smoothing technique, and the self-supervised loss function L_slf, where P_P and P_a are, respectively, the classification results of the local high-information-content regions and the classification result of the global geometric relationship graph node features, and S_P is the region proposal score of the high-information-content region locator.
In step 6, all video frames are predicted with the trained model, and all scores are averaged as the prediction of the authenticity of the face in the video.
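A sketch of this frame-to-video aggregation (the sigmoid on the model output is an assumed detail of how the per-frame probability is produced):

```python
import torch

@torch.no_grad()
def video_verdict(model, frames, threshold=0.5):
    # frames: iterable of preprocessed face tensors (C, H, W) from one video
    scores = [model(f.unsqueeze(0)).sigmoid().item() for f in frames]
    video_score = sum(scores) / len(scores)      # average over all frames
    return video_score, video_score > threshold  # True -> predicted fake
```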
The invention uses deep learning to safeguard the security of video content containing faces. Features are extracted from the explicit and implicit geometric relationships of the face, and anomalies under the face's inherent geometric structure are inferred to judge whether the current image has been forged.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses three-dimensional facial key points, an inherent characteristic of the face, to construct a face geometric relationship graph, and extracts relationship features with a graph convolutional neural network.
2. Global features are extracted with a Transformer network, high-information-content regions are located by self-supervised learning, and an implicit geometric relationship graph is constructed with a self-attention mechanism, independent of any particular forgery pattern.
3. The implicit and explicit geometric relationship features are matched to infer geometric anomalies of the face, effectively avoiding dependence on specific forgery traces and effectively improving the generalization capability of the deepfake detection model.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a network configuration diagram of an embodiment of the present invention.
Fig. 3 is an explicit face geometric relationship graph according to an embodiment of the present invention.
Fig. 4 compares experimental results of the present invention with existing methods in a public verification example.
Fig. 5 shows detection results according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and the specific examples.
As shown in figs. 1-2, the embodiment of the invention constructs a deepfake detection network based on face geometric relationship reasoning, comprising a Vision Transformer backbone network, a high-information-content region locator, a high-information-content region classifier, a point cloud analysis network, a graph convolution module, an interactive graph convolution module, and a graph classifier, which together form the whole model framework. FIG. 1 shows the workflow of the invention; fig. 2 shows a specific network structure diagram of the invention in one embodiment.
Step 1: acquire training videos containing both real and fake samples, sample video frames at intervals, locate the faces in the sampled frames with a face detector, and crop each face together with a small surrounding background region as the face image. In some implementations, RetinaFace with a ResNet50 backbone serves as the face detector, and the detection box is expanded at all four coordinates by 0.1 times the width and height of the detection result to give the face detection output. For each frame's face image, the key points provided by RetinaFace for the eyes, nose, and mouth corners are used to align all images to a uniform size;
Step 2: extract three-dimensional facial key points with a three-dimensional facial landmark detector, and construct, according to the facial layout, an explicit geometric relationship graph G_L containing facial-feature position information and facial contour information;
In some implementations, MediaPipe is used for face detection, yielding 468 three-dimensional facial key points;
Step 3: build a Vision Transformer-based global feature extractor and extract global face features. In some embodiments, the backbone network is MobileViT initialized with parameters pre-trained on the ImageNet dataset, retaining the model's first downsampling stage and the five subsequent feature extraction stages to extract the global features of each frame's face image;
Step 4: build a high-information-content region locator based on self-supervised learning. A region proposal network composed of convolution layers and fully connected layers provides the information content score S_P and corresponding coordinates for every region of the global feature map. Each region is fed into a simple binary classifier built from a single 1×1 convolution layer to obtain a two-channel semantic feature map containing spatial semantic information; a global average pooling layer then yields the probability that the current region belongs to a forged image. This probability and the label of the current face image are passed through a binary cross-entropy loss function to obtain the corresponding loss, and the squared error between each loss value and the information content score S_P is computed as the self-supervised learning loss function, ensuring that local regions where the simple classifier incurs higher classification loss, i.e. regions of higher uncertainty, become the detection results of the high-information-content region locator. The detected high-information region features are taken as graph representation feature nodes, and the two-channel semantic feature maps of the corresponding regions are combined with the graph adjacency matrix to form the implicit face geometric relationship graph between the feature regions.
In some implementations, the high-information region locator provides 20 candidate regions during training, pooled to a uniform size, e.g. 7×7, by region-of-interest pooling. For each candidate region, the corresponding binary cross-entropy loss is computed, the squared differences between these losses and the proposal scores of the 20 candidate regions are computed, and the module is trained by minimizing the binary cross-entropy loss together with this supervision loss. The two-channel semantic features of the 6 regions with the highest candidate scores are selected; each region feature is stretched into a vector, and the 6 feature vectors are concatenated to form the implicit face geometric relationship graph nodes V_P. Based on the self-attention mechanism, the implicit face geometric relationship graph adjacency matrix is A_P, of size 6×6;
Step 5: build the face geometric relationship reasoning module. The explicit and implicit face geometric relationship graph expressions G_L and G_P are each enhanced by a neural network composed of graph convolution layers, giving the enhanced geometric relationship graph representations G_gr and G_ir. A multi-layer interactive graph convolution network model is built to match the graph node feature relationships of G_gr and G_ir; the enhanced geometric relationship graphs and features are fused into G_F, highlighting geometrically anomalous feature nodes. A fully connected layer converts the graph node feature channels into classification channels; global average pooling and global max pooling compute, respectively, the average and the maximum binary classification prediction over all nodes; the two predictions are added and output as the final classification result, the probability that the current image is forged, and the classification loss is computed with cross entropy;
In some embodiments, the explicit face geometric relationship graph representation G_L has 3 channels and 468 nodes. Its graph feature enhancement network is a point cloud analysis network based on graph convolution networks with a CurveNet backbone; it downsamples the number of graph nodes to match G_P and deepens the feature representation dimension, i.e. strengthens the relationship representation capability, so that the feature dimension is consistent with G_P. The implicit geometric relationship graph representation G_P has 98 channels and 6 nodes; its enhancement network combines a two-layer simple graph convolution network with a nonlinear activation function and keeps the original numbers of nodes and channels.
The interactive graph convolution network is computed in the following steps: 1) input the two graph representations G_1 and G_2 to be matched; 2) compute the adjacency matrix A_{1→2} = V_1 · V_2^T connecting G_1 to G_2, where V_1 and V_2 are the node features of the corresponding graphs; 3) enhance the graph representation capability of G_1 and update its node features, V_1′ = σ(W_1 × σ(A_{1→2} · V_1 · W_{1→2}) + V_1), where W_1 and W_{1→2} are learnable parameters of the interactive graph convolution network and σ(·) denotes a nonlinear activation function; 4) compute the adjacency matrix A_{2→1} = V_2 · V_1^T connecting G_2 to G_1; 5) enhance the graph representation capability of G_2 and update its node features, V_2′ = σ(W_2 × σ(A_{2→1} · V_2 · W_{2→1}) + V_2), where W_2 and W_{2→1} are learnable parameters of the interactive graph convolution network and σ(·) denotes a nonlinear activation function;
In some embodiments, the nonlinear activation function is a ReLU or Leaky ReLU function.
In some implementations, the feature nodes enhanced by the interactive graph convolution network model can be fused by concatenation or addition.
In some implementations, the cross-entropy loss function can use a label smoothing regularization constraint, specifically:
ℓ_ce = −[ỹ·log(p̂) + (1 − ỹ)·log(1 − p̂)],
where the smoothed label ỹ ∈ {0 + α, 1 − α} replaces the true label (0 denotes a real image, 1 a forged image), α is the label smoothing parameter, and p̂ is the predicted probability value.
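A sketch of this label-smoothed binary cross entropy (the clamping constant is an implementation detail added here for numerical stability):

```python
import torch

def smoothed_bce(p_hat, y, alpha=0.1):
    # p_hat: predicted fake probability; y: 0 (real) / 1 (fake)
    y_s = y * (1 - alpha) + (1 - y) * alpha  # smoothed label in {alpha, 1 - alpha}
    p_hat = p_hat.clamp(1e-6, 1 - 1e-6)
    return -(y_s * torch.log(p_hat) + (1 - y_s) * torch.log(1 - p_hat))
```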
In some implementations, the final loss function of the network constructed in steps 1 to 5 is the sum of the binary cross-entropy classification loss L_ce, which can use the label smoothing technique described above, and the self-supervised loss function L_slf, where P_P and P_a are, respectively, the classification results of the local high-information-content regions and the classification result of the global geometric relationship graph node features, and S_P is the region proposal score of the high-information-content region locator.
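Combining the two terms, with equal weighting assumed since no balance coefficient is given, and reusing `smoothed_bce` from the sketch above:

```python
def final_loss(p_a, y, l_p, s_p, alpha=0.1):
    # p_a: global graph prediction; l_p: (M,) per-region losses; s_p: (M,) scores S_P
    ce = smoothed_bce(p_a, y, alpha)    # classification term L_ce
    l_slf = ((l_p - s_p) ** 2).mean()   # self-supervised term L_slf
    return ce + l_slf
```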
Step 6: input the face images of consecutive frames from the test set into the trained model, output the authenticity probability score of each corresponding frame, and average the probability scores over all frames of the video to judge whether the video is genuine.
Examples
The embodiment comprises the following steps:
S1: collect training samples;
S1.1: input videos; for each input video, detect the face position in each frame with a RetinaFace detector using a ResNet50 backbone, and retain L face images per video during the training stage via interval sampling;
S1.2: align each face image by the main key points provided by RetinaFace, the left and right mouth corners and the face center, to a uniform size of 380×380;
S1.3: assign each face image the class label of its video: 0 denotes a real video, 1 a fake video;
S2: construct the explicit face geometric relationship graph G_L shown in fig. 3;
S2.1: obtain the coordinates of 468 three-dimensional facial key points with MediaPipe;
S2.2: connect the key points of the eyes, lips, nose, facial contour, etc. according to their positions, and connect the parts to one another according to the facial region layout;
S3: build a Vision Transformer-based global feature extractor and extract global face features;
S3.1: the lightweight network MobileViT serves as the global feature extractor, reducing the computational cost of the algorithm and facilitating practical use. The backbone's first convolutional downsampling module and the subsequent 5 MobileViT basic modules are selected as the feature extractor;
S3.2: input a face image of size 380×380 and obtain a global feature map F_a of size 11×11;
S4: build the self-supervised high-information-content region locator, obtain the 6 regions with the highest information content, and construct the implicit face geometric relationship graph G_P.
As shown in figs. 1 and 2, the implicit face geometric relationship graph G_P is constructed in the following steps:
S4.1: use the region proposal network common in two-stage object detectors as the high-information-content region locator; based on anchor boxes, it proposes 20 region-of-interest candidate boxes in the global feature map F_a, whose candidate scores are treated as information content scores;
S4.2: unify the size of the candidate boxes' positional features to 7×7 by region-of-interest pooling;
S4.3: feed the 20 region-of-interest features into a 1×1 convolution layer, reducing each region's feature channels to 2 to obtain two-channel semantic features of size 7×7×2; downsample each feature map by global average pooling to obtain the authenticity prediction probability score;
S4.4: compute the binary cross-entropy loss between each region of interest's prediction probability score and the corresponding face image label;
S4.5: compute the squared difference between each cross-entropy loss value and the candidate box's candidate score, so that regions with higher candidate scores correspond to higher local classification loss, that is, higher uncertainty;
S4.6: select the 6 region features with the highest candidate scores as the high-information-content regions of the current image;
S4.7: stretch each region feature into a vector and concatenate the vectors to form the implicit face geometric relationship graph nodes V_P;
S4.8: compute the connection relationships between nodes based on the self-attention mechanism and construct the graph adjacency matrix A_P, of size 6×6;
S5: build the face geometric relationship reasoning module and analyze and reason about face geometric anomalies;
S5.1: use CurveNet with its classifier removed, based on graph convolution networks, as the point cloud analysis network ψ(·);
S5.2: input the explicit face geometric relationship graph G_L, analyze the explicit face geometric relationships with the point cloud analysis network, extract the 6 most important feature nodes, and output the feature-enhanced relationship graph G_gr = ψ(G_L) with output feature dimension 96;
S5.3: form a graph convolution module from a graph convolution layer together with a nonlinear activation function, and use two such modules as the implicit face geometric relationship graph enhancement network Ω(·);
S5.4: input the implicit face geometric relationship graph G_P, enhance the graph feature representation with the enhancement network, and output the enhanced relationship graph G_ir = Ω(G_P);
S5.5: form an interactive graph convolution module from an interactive graph convolution layer together with a nonlinear activation function, and use two such modules as the geometric relationship graph matching network Φ(·);
S5.6: input the geometric relationship graphs G_gr and G_ir to obtain the node matching relationship inference graphs (G_gi, G_ig) = Φ(G_gr, G_ir) and the corresponding feature nodes V_gi and V_ig;
S5.7: concatenate along the channel direction, or add, the node features V_gi and V_ig to obtain the fused node features V_F;
S5.8: convert the graph node feature channels of the fused node features V_F into classification channels with a fully connected layer;
S5.9: obtain the node class average score P_1 ∈ [0,1] with global average pooling;
S5.10: obtain the node class maximum score P_2 ∈ [0,1] with global max pooling;
S5.11: take the sum of P_1 and P_2 as the authenticity probability score obtained by geometric anomaly reasoning, and compute a binary cross-entropy classification loss against the image label;
S6: acquire the video to be tested, detect all face images it contains with the face detector, input the images into the trained model in sequence to obtain each image's authenticity probability score, and average the prediction scores of all images of the video to obtain the video's authenticity probability score;
In this embodiment, the area under the ROC curve (AUC) is used as the evaluation index; the ROC curve plots the true positive rate (TPR) on the ordinate against the false positive rate (FPR) on the abscissa. The true positive rate is the proportion of actual positive samples correctly predicted as positive; the false positive rate is the proportion of actual negative samples mispredicted as positive. The closer the AUC score is to 1, the better the model performs; since the AUC is unaffected by the classifier's threshold setting, it is a robust evaluation index.
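Video-level AUC can be computed directly with scikit-learn:

```python
from sklearn.metrics import roc_auc_score

def video_auc(labels, scores):
    # labels: 0 (real) / 1 (fake) per video; scores: averaged frame probabilities
    return roc_auc_score(labels, scores)
```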
Fig. 4 compares experimental results of the present invention with existing methods in a public verification example. This example trains the model on the high-quality (C23) data of the FaceForensics++ (FF++) dataset and tests the method's validity on the FF++ (C23) and CelebDF v2 datasets. The results show that the proposed method performs well on both public datasets: while maintaining detection performance within the source data domain, it effectively improves the detection of unknown deepfake methods and achieves a better detection effect than the comparison algorithms.
Fig. 5 shows detection results of the verification example of the present invention on different datasets, specifically the explicit geometric relationship graph, the implicit geometric relationship graph, and the Grad-CAM attention regions of the corresponding features for the proposed method.
The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention; various modifications and variations will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.
Claims (11)
1. A deepfake detection method based on face geometric relationship reasoning, which builds a global feature extractor with a Transformer to obtain a global feature map of the face image; locates high-information-content regions through a self-supervised learning mechanism and constructs an implicit face geometric relationship graph on top of the face global feature map; and builds an intra-face geometric relationship reasoning module with graph convolutional neural networks, combining the features of the explicit and implicit face geometric relationship graphs and performing forgery detection on the video frame under test so as to improve the accuracy of face deepfake video detection, the method comprising the following main steps:
step 1: acquire training videos containing both real and fake samples, sample video frames at intervals, and use a face detector to extract, frame by frame, the face plus a small surrounding background region as the face image;
step 2: for each frame's face image, detect 468 three-dimensional facial key points with a facial landmark detector and construct an explicit face geometric relationship graph according to the facial layout;
step 3: build a global feature extractor and extract global features of each frame's face image;
step 4: build a high-information-content region locator based on self-supervised learning to locate feature regions with higher information content within the global features, and construct an implicit face geometric relationship graph over these regions;
step 5: build a face geometric relationship reasoning module that extracts features from the explicit and implicit face geometric relationship graphs with graph convolutional neural networks and matches graph node relationships;
step 6: input the face images from consecutive frames of the test set into the trained model to obtain predicted authenticity probability scores, and average the scores over all video frames to judge whether the video is genuine.
2. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 1 the specific method for acquiring the face image is:
(1) Preset a sampling interval for video frames and, starting from the first frame of the video, extract video frames at that interval;
(2) Use a RetinaFace detector to determine the face candidate box in the current video frame and establish a Cartesian coordinate system at the upper-left corner of the image; the candidate box can be expressed as (x, y, w, h), where (x, y) is the upper-left corner of the box and w and h are its width and height, respectively;
(3) Expand the candidate box by a preset ratio r, changing it to (x − r×w, y − r×h, w + r×w, h + r×h), and crop the image inside the expanded box as the face image.
3. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 2 the specific method for constructing the explicit face geometric relationship graph is:
(1) Input the face image into the pre-trained three-dimensional facial landmark extractor MediaPipe to obtain 468 facial key points;
(2) Use the facial key points as the nodes V_L of the explicit geometric relationship graph. According to each key point's position on the face, connect the nodes of the eyebrows, pupils, eye sockets, lips, and facial contour in sequence to form the outer contour, then interconnect the nodes of these five parts according to the geometric structure of the face to form the explicit face geometric relationship graph G_L.
4. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 3 the global feature extractor is a backbone network based on the Vision Transformer, such as MobileViT.
5. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 4 the specific method of the self-supervised high-information-content region locator is:
(1) In the face global feature map, obtain M regions of interest using a region proposal network;
(2) Input the local features of each region of interest into a simple binary classifier f_p, in which a single layer of 1×1 convolution kernels reduces the local feature channels to 2, and an activation function followed by a batch normalization layer adds nonlinear expressive capability, yielding the local semantic information feature F̃_p; use global average pooling to downsample F̃_p to a length and width of 1;
(3) Use the cross-entropy loss function to compute the loss value l_p between each local feature's classification result and the authenticity of the current face;
(4) Sort all loss values obtained from each image from largest to smallest;
(5) Take the squared difference between the loss value l_p of each local feature and its corresponding region proposal score S_P, and average all squared differences as the self-supervised learning loss function L_slf.
6. The deepfake detection method based on face geometric relationship reasoning according to claim 1 or 5, wherein in step 4 the specific method for constructing the implicit face geometric relationship graph G_P is:
(1) Based on the information content scores S_P of the high-information-content region locator, select the N local features F_P of the highest-scoring high-information-content regions as a set; convert each F_P from a feature matrix of size 2×W×H into a feature vector of size T×2, where T = W×H, and aggregate the converted feature vectors into the implicit face geometric relationship graph nodes V_P;
(2) Reshape the node set from size N×T×2 to N×C_P, where C_P = T×2, and convert the node set into an attention vector V_att = softmax(V_P) using the SoftMax operation;
(3) Based on the self-attention mechanism, compute the product of V_P and the transposed attention vector V_att^T, obtaining the adjacency matrix A_P of the implicit geometric relationship graph node connections, of size N×N.
7. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 5 the geometric relationship reasoning module comprises an explicit face geometric feature reasoning module, an implicit face geometric feature reasoning module, a graph feature matching module, and a graph classifier:
(1) The explicit face geometric feature reasoning module uses a point cloud analysis model built on graph convolutional neural networks to derive, from the explicit face geometric relationship graph G_L, the graph feature expression G_gr of the explicit face geometric relationship;
(2) The implicit face geometric feature reasoning module uses a two-layer graph convolution network model to derive, from the implicit geometric relationship graph G_P, the graph feature expression G_ir of the implicit face geometric relationship;
(3) The graph feature matching module uses a two-layer interactive graph convolution network model to fuse the multi-view geometric relationships of G_gr and G_ir, obtaining the fused geometric relationship graph G_F;
(4) The graph classifier obtains the maximum and average of the graph node features via global max pooling and global average pooling respectively, fuses the graph representation features of these two observation angles, and classifies the graph with a multi-layer perceptron.
8. The deepfake detection method based on face geometric relationship reasoning according to claim 7, wherein the graph feature matching module specifically comprises the following steps:
(1) From the graph node feature sets V_gr and V_ir contained in the geometric relationship graphs G_gr and G_ir, use a mutual attention mechanism to compute the pairwise product of V_gr and the transpose of V_ir, obtaining the adjacency matrix A_gi connecting G_gr to G_ir, of size N×N;
(2) Likewise, use the mutual attention mechanism to compute the pairwise product of V_ir and the transpose of V_gr, obtaining the adjacency matrix A_ig connecting G_ir to G_gr, of size N×N;
(3) Based on the graph node feature sets V_gr and V_ir and the adjacency matrices A_gi and A_ig, match the graph node features with an interactive graph convolution network model and reason about the geometric anomalies present in the deepfake face image.
9. The deepfake detection method based on face geometric relationship reasoning according to claim 7, wherein the interactive graph convolution network model is computed in the following steps:
(1) Extract the node feature expression V_gi of the implicit feature relationship graph in which G_gr attends to G_ir:
V_gi = σ(W_1 × σ(A_gi · V_gr · W_gi) + V_gr);
where W_1 and W_gi are learnable parameters of the interactive graph convolution network, and σ(·) denotes a nonlinear activation function, a ReLU or Leaky ReLU;
(2) Extract the node feature expression V_ig of the implicit feature relationship graph in which G_ir attends to G_gr:
V_ig = σ(W_2 × σ(A_ig · V_ir · W_ig) + V_ir);
where W_2 and W_ig are learnable parameters of the interactive graph convolution network, and σ(·) denotes a nonlinear activation function, a ReLU or Leaky ReLU;
(3) Concatenate the two graph node feature expressions V_gi and V_ig to obtain the node feature expression of the multi-angle geometric relationship graph.
10. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 5 the total training loss is the sum of the binary cross-entropy classification loss L_ce, computed with the label smoothing technique, and the self-supervised loss function L_slf, where P_P and P_a are, respectively, the classification results of the local high-information-content regions and the classification result of the global geometric relationship graph node features, and S_P is the region proposal score of the high-information-content region locator.
11. The deepfake detection method based on face geometric relationship reasoning according to claim 1, wherein in step 6 all video frames are predicted with the trained model and all scores are averaged as the prediction of the authenticity of the face in the video.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310418813.1A | 2023-04-18 | 2023-04-18 | Deepfake detection method based on facial geometric relationship reasoning
Publications (1)

Publication Number | Publication Date
---|---
CN116758604A | 2023-09-15

Family Applications (1) (Family ID: 87952118)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310418813.1A | Deepfake detection method based on facial geometric relationship reasoning | 2023-04-18 | 2023-04-18

Country | Link
---|---
CN | CN116758604A (en)
Legal Events

Date | Code | Title
---|---|---
 | PB01 | Publication
 | SE01 | Entry into force of request for substantive examination