4.2 State-of-the-Art Comparison
We compare with 12 state-of-the-art approaches. We use the Area Under Curve (AUC) score as the metric, since it is insensitive to manually chosen decision thresholds. The results are shown in Table
1. These approaches achieve different scores across datasets, which indicates that detection difficulty and forgery traces differ from dataset to dataset. Despite this, our LPS model achieves AUC scores at least 0.03%, 5.04%, and 7.92% higher than the other approaches on FaceForensics++, DFDC, and Celeb-DF(v2), respectively. The selective self-distillation fine-tuned LPS
\(_{ssd}\) model further improves the performance. These results indicate the robust representation learning ability of our approach in deepfake video detection. Our proposed methods are shown in bold in the first column, and the state-of-the-art results achieved by our models are also marked in bold in Table
1.
We also analyze our model’s ROC curve, which further removes the influence of manually chosen decision thresholds. The results of the LPS model are shown in Figure
6. The curves on both datasets nearly cover the entire coordinate plane, which means our model can identify almost all positive samples (True Positive Rate close to 1) while hardly misjudging any negative samples (False Positive Rate close to 0). In other words, almost all samples can be classified correctly. The results demonstrate that our detector achieves near-perfect discriminative ability with a proper threshold setting. Moreover, comparing the two curves, at the same false-positive rate the true-positive rates on Celeb-DF(v2) are higher than those on FaceForensics++, which is consistent with the accuracy results.
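As a concrete reference, the sketch below shows how the threshold-free AUC metric and the ROC curve can be computed with scikit-learn; the `labels` and `scores` arrays are hypothetical placeholders for per-video ground truth (1 = fake) and detector confidence scores, not the paper's actual data.

```python
# Minimal sketch of the AUC / ROC evaluation with scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # placeholder ground truth
scores = np.array([0.9, 0.2, 0.8, 0.7, 0.4, 0.1, 0.95, 0.3])   # placeholder detector outputs

auc = roc_auc_score(labels, scores)                # threshold-free metric reported in Table 1
fpr, tpr, thresholds = roc_curve(labels, scores)   # points of the ROC curve in Figure 6
print(f"AUC = {auc:.4f}")
```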
Besides, to demonstrate the effectiveness of the predictive representation learning, we further train an extra model
\({\bf LPS_{ri}}\) by randomly initializing the parameters and then learning in an end-to-end manner as in [
40]. As shown in Table
1, the predictive representation learning boosts detection performance by around 10% on average, confirming that it provides a better starting point for efficient pattern learning.
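A minimal sketch of the two initialization schemes being compared is given below; the `Encoder` class and the checkpoint path are hypothetical stand-ins for the paper's actual backbone and pretrained weights.

```python
# Hedged sketch: LPS loads weights from predictive representation learning,
# while LPS_ri keeps the default random initialization and trains end-to-end.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Hypothetical stand-in for the 2d3d-ResNet18 encoder."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(3, 64, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

def build_backbone(pretrained_ckpt=None):
    backbone = Encoder()
    if pretrained_ckpt is not None:
        # LPS: initialize from predictive representation learning (hypothetical path)
        state = torch.load(pretrained_ckpt, map_location="cpu")
        backbone.load_state_dict(state, strict=False)
    # LPS_ri: pretrained_ckpt=None keeps the random initialization
    return backbone
```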
4.3 Generalizability Analysis
To demonstrate the model’s generalization ability, we conduct cross-dataset experiments. The results are presented in Table
2. We find that using Celeb-DF(v2) as the training set generally yields better testing results, suggesting that it contains more representative features of various kinds of fake videos. Although the results vary across datasets, all AUC scores exceed 70% with LPS and 80% with LPS
\(_{ssd}\). Meanwhile, in deepfake datasets the fake samples usually outnumber the real samples by two or three times, and the forgery clues are so subtle that fake samples are easily confused with real ones [
53]. When learning on unbalanced training samples, the CNN-based detector easily falls into the trap of overfitting [
24], especially in the DFDC dataset [
10]. To evaluate the model's robustness against overfitting, we also report the Recall and Precision scores, which are defined in terms of true positives, false negatives, and false positives and thus account for the proportion of positive samples. Therefore, Recall and Precision are suitable for measuring overfitting [
7]. From Table
2, the Recall and Precision scores suggest that our model avoids overfitting well and obtains impressive performance on different datasets. We also note that the AUC on Celeb-DF(v2) reaches 84.22% with LPS and 88.52% with LPS
\(_{ssd}\) even when training on DFDC, which still exceeds many prior approaches such as Multi-task [
36] and MesoNet [
1]. We also note that the selective self-distillation fine-tuned LPS
\(_{ssd}\) consistently improves model performance. These results show the generalization ability of our approach and the strong adaptability of the learned representations.
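For reference, a minimal sketch of how the Recall and Precision scores reported in Table 2 could be computed is shown below; `y_true` and `y_pred` are hypothetical binary labels and thresholded predictions (1 = fake), not the actual evaluation data.

```python
# Recall and Precision as used to probe overfitting on imbalanced data.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 1, 0, 1, 1]   # placeholder: fakes outnumber reals
y_pred = [1, 1, 0, 0, 1, 1, 1, 1]   # placeholder thresholded predictions

print("Recall    =", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("Precision =", precision_score(y_true, y_pred))  # TP / (TP + FP)
```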
4.4 Component Analysis
Having achieved promising detection performance, we further analyze the impact of each component of our approach, including the localizer, encoder, aggregator, predictor, memory bank, classifier, and selective self-distillation fine-tuning.
Localizer. The localizer mainly performs face detection and then face alignment. Figure
7 shows some examples, where face alignment can make the facial regions more complete and improve the detection performance. To study this effect, we conduct experiments on three benchmarks by removing the face alignment function and just using the detected faces as input. The results are shown in Table
3. We can see that the detection performance drops consistently across the three benchmarks on all three metrics. For example, without face alignment, the AUC score drops by 3.27%, 5.89%, and 4.64% on FaceForensics++, DFDC, and Celeb-DF(v2), respectively. These results reveal the importance of landmark localization for capturing artifacts. The main reason is that face alignment standardizes the input faces with a unified center position, rotation angle, and scaling ratio, which makes it easier for the encoder to capture facial characteristics and further facilitates representation enhancement by the aggregator.
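A hedged sketch of such a landmark-based alignment step is given below; it assumes five detected landmarks and an illustrative canonical template, which is not necessarily the template used by our localizer.

```python
# Align a detected face by warping its landmarks onto a canonical template,
# normalizing center position, rotation angle, and scaling ratio.
import cv2
import numpy as np

# Illustrative five-point template for a 112x112 output (assumption).
TEMPLATE = np.float32([[38, 52], [74, 52], [56, 72], [42, 92], [70, 92]])

def align_face(frame, landmarks, size=112):
    """Warp `frame` so its five detected landmarks match the canonical template."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), TEMPLATE)
    return cv2.warpAffine(frame, matrix, (size, size))
```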
Encoder. The encoder serves as the basic feature extractor and is critical for video representation. To study its effect, we take the original detector as the baseline and conduct two ablation experiments: removing the encoder and enhancing it.
First, we retain the aggregator and remove the encoder from the deepfake video detector learning framework throughout training. Different from the original baseline framework, the temporal information is then captured only by the aggregator, which can perform feature extraction only to a limited degree. In this case, the spatial details extracted by the original encoder are weakened or ignored. As shown in Figure
8, compared with the baseline model, the AUC score decreases by over 18% on FaceForensics++ and by almost 10% on DFDC and Celeb-DF(v2). The main reason lies in the difficulty of deepfake video detection, in which the forgery traces are subtle and spatially scattered due to highly realistic frame synthesis.
To further study the importance of the encoder, we conduct a second experiment by modifying the encoder network. The baseline encoder is composed of 2d3d-ResNet18, a kind of 3D CNN for extracting short-term temporal information. Inspired by Reference [
54], we replace it with 3D Inception [
8] and pretrain the encoder on the deepfake detection task. Then, we train a new LPS model termed
LPS\(_e\) on Celeb-DF(v2) with the same settings and compare it with the baseline. As shown in Table
4, all metrics increase by varying degrees; for example, the AUC score, recall, and precision increase by 1.31%, 1.84%, and 2.22%, respectively. These results reflect the benefit of the enhanced encoder. As mentioned in Reference [
54], the 3D Inception network is more complex than 3D ResNet18: it contains four different branches that focus on details at different scales, resulting in a better ability to capture spatial features.
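The sketch below illustrates how such an encoder swap can be organized; torchvision's r2plus1d_18 is used only as a rough stand-in for the 2d3d-ResNet18 baseline, and the 3D Inception variant is left as a placeholder since it relies on an external implementation.

```python
# Hedged sketch of swapping the encoder backbone behind LPS_e.
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

def build_encoder(variant="baseline"):
    if variant == "baseline":
        # Rough stand-in for the 2d3d-ResNet18 short-term spatiotemporal encoder.
        encoder = r2plus1d_18()
        encoder.fc = nn.Identity()   # keep 512-d features, drop the classification head
        return encoder
    # The 3D Inception (I3D-style) encoder would be plugged in here from an
    # external implementation; it is not provided by torchvision.
    raise NotImplementedError("3D Inception variant requires an external model")
```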
Aggregator. Similar to the analysis of the encoder, we retain the encoder and remove the aggregator under the same settings used to learn the baseline detection models. Then, we train the new models on the three datasets and compare them with the baseline. The results are reported in Figure
8. As expected, the performance also drops substantially on all three benchmarks. This is because the encoder composed of 2d3d-ResNet18 can only capture short-term temporal features and is unable to effectively describe spatiotemporal context clues. Although we concatenate all the short-term outputs for classification, the resulting representations still contain so much redundant information that the long-term context information is not extracted effectively. The results show that an effective deepfake video detector needs to identify not only short-term temporal inconsistency but also long-term inconsistency, which is demonstrated again in the encoder's experiment. Thus, our approach cascades the encoder and aggregator into a unified backbone for video representations.
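A hedged sketch of the two aggregation strategies compared here is shown below; a GRU is used purely as an illustrative stand-in for our aggregator, and the tensor shapes are assumptions.

```python
# Aggregator vs. simple concatenation of short-term clip features.
import torch
import torch.nn as nn

class Aggregator(nn.Module):
    """Illustrative long-term aggregator over per-clip encoder features."""
    def __init__(self, feat_dim=512, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, clip_feats):          # (batch, num_clips, feat_dim)
        _, last_hidden = self.rnn(clip_feats)
        return last_hidden[-1]              # long-term spatiotemporal context

# Ablated alternative: torch.flatten(clip_feats, 1) keeps redundant short-term
# features and does not model long-term context explicitly.
```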
Memory Bank. Besides the predictive representation learning, which is our main idea, we also conduct an ablation experiment on its core component, the memory bank in the predictor, to prove the effectiveness of our predictor design. As introduced in Subsection
3.3, the memory bank provides more possible outcomes for prediction diversity. In practice, the memory bank is updated with new semantic features from the encoder in each epoch. We remove the memory bank and force the multi-layer perceptron
\(\phi (\cdot)\) to predict the specific future states directly. Compared with computing a hypothesis for each memory slot, this target task is more difficult. The most obvious change appears in the training process: after removing the memory bank, the training loss becomes larger and convergence is slower. In Figure
8, the performance on the downstream detection task also decreases.
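A hedged sketch of a slot-based memory predictor in the spirit of Subsection 3.3 is given below; the dimensions, the momentum-based slot update, and the softmax-weighted hypothesis combination are illustrative assumptions rather than the exact design.

```python
# Illustrative memory-bank predictor: phi produces a query over memory slots,
# and the slots are refreshed with new encoder features each epoch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryPredictor(nn.Module):
    def __init__(self, feat_dim=512, num_slots=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))
        self.register_buffer("memory", torch.randn(num_slots, feat_dim))

    def forward(self, context):                              # (batch, feat_dim)
        query = self.phi(context)
        weights = F.softmax(query @ self.memory.t(), dim=-1)  # hypothesis weights per slot
        return weights @ self.memory                           # predicted future state

    @torch.no_grad()
    def update(self, new_feats, momentum=0.9):                 # refresh slots with encoder features
        k = min(new_feats.size(0), self.memory.size(0))
        self.memory[:k] = momentum * self.memory[:k] + (1 - momentum) * new_feats[:k]
```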
Classifier. Furthermore, we analyze the classifier’s effect on our detection model. The classifier plays an important role in both the training and testing phases. In the training phase, our core objective is to learn an effective feature extractor, i.e., the backbone; however, the back-propagation algorithm drives the training process based on the outputs of the classifier, so more discriminative outputs lead to more efficient training. In the testing phase, the discrimination of the outputs is exactly our objective. Both situations rely heavily on the classifier’s mapping ability. When the boundaries between different types of feature points in the feature space are clear, traditional machine learning methods, such as the
support vector machine (SVM), which maximizes the classification margin, often perform better. To validate this, we train a new model termed
LPS\(_c\) on Celeb-DF(v2) by replacing the baseline classifier with an SVM and compare it with the baseline
LPS model. The results are reported in Table
4, where the new model
LPS\(_c\) achieves improved performance over the baseline, such as an AUC improvement of 0.6%, demonstrating the benefit of a more discriminative classifier.
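A minimal sketch of the LPS\(_c\) setup is shown below, assuming frozen backbone features are classified with scikit-learn's SVC; the feature and label arrays are random placeholders rather than real data.

```python
# Replace the baseline classifier with a max-margin SVM on backbone features.
import numpy as np
from sklearn.svm import SVC

features = np.random.randn(200, 512)          # placeholder backbone representations
labels = np.random.randint(0, 2, size=200)    # placeholder real/fake labels

svm = SVC(kernel="rbf", probability=True)     # max-margin decision surface
svm.fit(features, labels)
scores = svm.predict_proba(features)[:, 1]    # fake-class probabilities usable for AUC
```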
Selective Self-Distillation Fine-tuning. A robust deepfake video detector should stably identify a video example as real or fake. The proposed selective self-distillation fine-tuning method aims to make the models robust and generalizable. As shown in Table
1, compared with the baseline
LPS, the fine-tuned
LPS\(_{ssd}\) consistently improves the detection performance on all three benchmarks. Furthermore, as illustrated in Figure
5, the classification probabilities on video examples identified by
LPS\(_{ssd}\) are more stable, e.g., most values are around 0.5 for fake videos. The main reason is that the method inherits the advantages of knowledge distillation [
23], which boosts the robustness and generalizability. We experimentally study the effect of the hyper-parameters
\(\lambda\) and
\(T\) in Equation (
6) to find the best choice. As shown in Table
5, we first fix
\(T=1\) and then change
\(\lambda\) from 0.1 to 1.0 in steps of 0.1. We find that the performance on the training set improves continuously, while the performance on the validation set starts to degrade and falls far behind that of the training set when
\(\lambda\) is beyond 0.7. Then, we fix
\(\lambda =0.7\) and change temperature
\(T\) from 1 to 10 in steps of 1. We find that when
\(T\) gets larger, the AUC metric gets worse, since the larger
\(T\) softens the distribution, narrowing the difference between the two class outputs and blurring the decision boundary. Thus, we choose
\(\lambda =0.7\) and
\(T=1\) in our experiments.
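For concreteness, a hedged sketch of a temperature-scaled distillation objective in the spirit of Equation (6) is given below, combining the hard-label loss and the softened teacher-student term with weight \(\lambda\); the exact form of Equation (6) may differ from this standard formulation.

```python
# Standard temperature-scaled knowledge-distillation loss with weight lambda.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, lam=0.7, T=1.0):
    hard = F.cross_entropy(student_logits, targets)                     # supervised term
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),          # distillation term
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return (1 - lam) * hard + lam * soft
```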
4.5 Further Analysis
Representation Visualization. To verify the effectiveness of the learned spatiotemporal context features in identifying fake artifacts, we visualize the feature maps from our detector’s intermediate layers [
29,
52]. In our experiment, the encoder
\(f(\cdot)\) is constructed by CNNs, and we pass the final convolutional layer’s outputs through two pooling layers to obtain the desired feature map. Since our encoder mainly consists of 3D convolutional layers, the output feature map from
\(f(\cdot)\) is a 4D tensor. We therefore randomly pick a 2D feature map from the output of the last convolutional layer of the encoder
\(f(\cdot)\), as shown in the last column of Figure
9. We render the artifact regions (the bright parts) as grayscale images, placed next to the corresponding sequence of testing faces and its output feature map. Higher feature response values are observed in the brighter regions of the feature maps, corresponding to artifacts in the fake face videos across frame changes. From Figure
9, we find that the abnormal temporal changes generally fall within the bright regions of the feature maps, and the feature maps effectively highlight the fake regions, demonstrating the effectiveness of our predictive representation learning approach. Compared with the results from the pretrained FaceNet in Figure
1, our model clearly extracts more prominent feature maps with less noise. This phenomenon indicates that the attention of our model is more focused and that the extracted features are more discriminative, providing stronger generalization ability.
We further use t-SNE [
49] to visualize the representations. Figure
10 presents the visualization of the spatial semantic features and the spatiotemporal context features, showing that both clearly discriminate between fake and real samples. It also shows that the spatiotemporal context features have smaller intra-class distances with fewer outliers, which may be because they capture more subtle temporal artifacts.
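A minimal sketch of both visualization steps, under assumed tensor shapes, is shown below: a random 2D slice is taken from the encoder's 4D output for the feature-map figures, and t-SNE embeds the video representations in two dimensions; all tensors here are random placeholders.

```python
# Feature-map slicing and t-SNE embedding used for the visualizations.
import numpy as np
import torch
from sklearn.manifold import TSNE

feat_4d = torch.randn(64, 8, 14, 14)                  # (channels, time, height, width) from f(.)
channel = int(np.random.randint(feat_4d.size(0)))     # randomly chosen channel
slice_2d = feat_4d[channel, feat_4d.size(1) // 2]     # 2D feature map at the middle frame

reps = np.random.randn(300, 512)                      # placeholder video representations
embedded = TSNE(n_components=2, perplexity=30).fit_transform(reps)
```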
Efficiency Analysis. We evaluate the time cost of all modules on 100 videos, each with 180 sampled face frames, on a 2.4-GHz CPU, and use a Titan Xp GPU with 12 GB of memory for data loading and inference. The average per-video time costs are 31.7, 114.1, 24.4, and 14.7 ms for the localizer, encoder, aggregator, and classifier, respectively. Thus, it is feasible to deploy the detector in practical scenarios. The time cost of pretraining is higher because of the predictor module. We also evaluate the inference efficiency of the predictor by averaging the time cost over 100 videos: it takes 21.7 ms on average to infer a video and 13.9 ms to update the memory bank for a video.
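A minimal sketch of the per-module timing protocol is given below; the `module` and `inputs` arguments are hypothetical callables and preprocessed video batches, and wall-clock averaging is only one reasonable way to measure these costs.

```python
# Average per-video wall-clock time (in milliseconds) of a pipeline stage.
import time

def average_time_ms(module, inputs):
    start = time.perf_counter()
    for x in inputs:
        module(x)
    return (time.perf_counter() - start) / max(len(inputs), 1) * 1000.0
```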
Failure Analysis. Figure
11 shows some detection failure cases, from which we draw several meaningful findings. First, highly photo-realistic synthesis across multiple frames makes it very difficult to identify fake traces from short-term temporal clues, which often results in detection failure. This calls for more discriminative video representations that capture context clues on a longer temporal scale. Second, real and fake videos are often easily confused, especially when their image quality is poor. In this case, the binary real/fake annotation of video examples usually cannot provide sufficient prior information for model learning, limiting the ability of a simple classifier; a stronger classifier may be a possible solution. Thus, our future work is to develop a more discriminative encoder as well as a more effective solution to describe and identify fake traces at a fine-grained and long-term level.