Improved Optical Flow Estimation Method for Deepfake Videos
Figure 1. Difference between real (top) and fake (bottom) frames passed to the optical flow estimator.
Figure 2. PWC-Net network architecture.
Figure 3. Overall proposed system architecture.
Figure 4. GPU architecture of the system.
Figure 5. Four different augmentations: (a) Augmentation 1: applying the augmentation proposed by Jeon et al.; (b) Augmentation 2: training the FTT block alongside the CNN; (c) Augmentation 3: attaching a MobileNet block at the end of the CNN; (d) Augmentation 4: attaching an FTT block at the end of the CNN.
Figure 6. Basic architecture for the TPU approach.
Figure 7. Backbone CNNs' validation accuracy vs. epochs: (a) linear scale; (b) logarithmic scale.
Figure 8. Accuracy comparison between the proposed and the original model trained on different datasets: (a) linear scale; (b) logarithmic scale.
Figure 9. Comparison between the best-performing GPU model and other models trained on the TPU: (a) linear scale; (b) logarithmic scale.
Figure 10. The effect of each augmentation on validation accuracy over 25 epochs: (a) linear scale; (b) logarithmic scale.
Abstract
1. Introduction
2. Related Work
- Improved accuracy: The proposed method achieves higher overall accuracy than the original method that utilized optical flow.
- Detecting multiple deepfake techniques: Tests are conducted on several deepfake techniques, including Deepfakes and Face2Face.
- Experimenting with fine-tuning techniques: The augmentation techniques proposed by Jeon et al. [10] are applied to the proposed method in an attempt to improve its accuracy.
- Using TPU and GPU on the proposed method: The system is also trained on TPUs, and the results are compared against the GPU results.
3. Background
3.1. Deepfake
3.2. Deepfake Datasets
3.3. Optical Flow
4. Materials and Methods
4.1. Proposed Architecture
Algorithm 1. Deepfake Detection Using Optical Flow Model
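The body of Algorithm 1 did not survive extraction. As a stand-in, here is a minimal Python sketch of the pipeline the paper describes: crop faces from consecutive frames, estimate dense optical flow between the crops, and classify the resulting flow image with the CNN. OpenCV's Farneback estimator substitutes for PWC-Net, and all names here (`optical_flow_frame`, `detect_video`, `face_cropper`) are illustrative, not taken from the authors' code.

```python
import cv2
import numpy as np

def optical_flow_frame(prev_face, next_face):
    """Estimate dense optical flow between two aligned face crops and
    encode it as a color image (HSV: angle -> hue, magnitude -> value)."""
    prev_gray = cv2.cvtColor(prev_face, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_face, cv2.COLOR_BGR2GRAY)
    # Farneback stands in for PWC-Net, which the paper actually uses.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(prev_face)
    hsv[..., 0] = ang * 180 / np.pi / 2          # flow direction -> hue
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

def detect_video(frames, face_cropper, classifier):
    """Label a video real/fake by averaging per-frame-pair predictions."""
    scores = []
    for prev_frame, next_frame in zip(frames, frames[1:]):
        prev_face, next_face = face_cropper(prev_frame), face_cropper(next_frame)
        if prev_face is None or next_face is None:
            continue  # filtering step: skip pairs with no detected face
        flow_img = cv2.resize(optical_flow_frame(prev_face, next_face), (224, 224))
        scores.append(classifier.predict(flow_img[np.newaxis] / 255.0)[0])
    return np.mean(scores, axis=0)  # [p_real, p_fake] from the softmax head
```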
4.2. GPU Approach
4.3. Augmentations and Fine-Tuning Approach
4.4. TPU Approach
4.5. Extraction and Filtering
4.6. Training Stage
5. Results
5.1. GPU Results
5.2. TPU Results
6. Discussion
6.1. Augmentation Experiments
6.2. Limitations and Future Work
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- De Lima, O.; Franklin, S.; Basu, S.; Karwoski, B.; George, A. Deepfake detection using spatiotemporal convolutional networks. arXiv 2020, arXiv:2006.14749. [Google Scholar]
- Caldelli, R.; Galteri, L.; Amerini, I.; Del Bimbo, A. Optical Flow based CNN for detection of unlearnt deepfake manipulations. Pattern Recognit. Lett. 2021, 146, 31–37. [Google Scholar] [CrossRef]
- Fagni, T.; Falchi, F.; Gambini, M.; Martella, A.; Tesconi, M. TweepFake: About detecting deepfake tweets. arXiv 2020, arXiv:2008.00036. [Google Scholar]
- Tolosana, R.; Vera-Rodriguez, R.; Fierrez, J.; Morales, A.; Ortega-Garcia, J. DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. Inf. Fusion 2020, 64, 131–148. [Google Scholar]
- Khalid, H.; Woo, S.S. OC-FakeDect: Classifying Deepfakes Using One-Class Variational Autoencoder. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; Volume 2020, pp. 2794–2803. [Google Scholar]
- FaceApp—Free Neural Face Transformation Filters. Available online: https://www.faceapp.com/ (accessed on 25 February 2022).
- Qi, H.; Guo, Q.; Juefei-Xu, F.; Xie, X.; Ma, L.; Feng, W.; Liu, Y.; Zhao, J. DeepRhythm: Exposing DeepFakes with Attentional Visual Heartbeat Rhythms. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; ACM: New York, NY, USA, 2020; pp. 4318–4327. [Google Scholar]
- Verdoliva, L. Media Forensics and DeepFakes: An Overview. IEEE J. Sel. Top. Signal Process. 2020, 14, 910–932. [Google Scholar] [CrossRef]
- Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. FaceForensics++: Learning to Detect Manipulated Facial Images. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; Volume 2019, pp. 1–11. [Google Scholar]
- Jeon, H.; Bang, Y.; Woo, S.S. FDFtNet: Facing Off Fake Images Using Fake Detection Fine-Tuning Network. IFIP Adv. Inf. Commun. Technol. 2020, 580, 416–430. [Google Scholar] [CrossRef]
- Matern, F.; Riess, C.; Stamminger, M. Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 83–92. [Google Scholar]
- Shahin, I.; Nassif, A.B.; Hamsa, S. Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments. Neural Comput. Appl. 2018, 32, 2575–2587. [Google Scholar] [CrossRef] [Green Version]
- Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access 2019, 7, 19143–19165. [Google Scholar] [CrossRef]
- Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3204–3213. [Google Scholar]
- Gao, H.; Pei, J.; Huang, H. ProGAN: Network Embedding via Proximity Generative Adversarial Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; ACM: New York, NY, USA, 2019; pp. 1308–1316. [Google Scholar]
- Kingma, D.P.; Dhariwal, P. Glow: Generative Flow with Invertible 1 × 1 Convolutions. Adv. Neural Inf. Process. Syst. 2018, 2018, 10215–10224. [Google Scholar]
- Thies, J.; Zollhöfer, M.; Stamminger, M.; Theobalt, C.; Nießner, M. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. Commun. ACM 2019, 62, 96–104. [Google Scholar]
- Balakrishnan, G.; Durand, F.; Guttag, J. Detecting Pulse from Head Motions in Video. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3430–3437. [Google Scholar]
- Dolhansky, B.; Bitton, J.; Pflaum, B.; Lu, J.; Howes, R.; Wang, M.; Ferrer, C.C. The DeepFake detection challenge dataset. arXiv 2020, arXiv:2006.07397. [Google Scholar]
- Güera, D.; Delp, E.J. Deepfake Video Detection Using Recurrent Neural Networks. In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar]
- Laptev, I.; Marszałek, M.; Schmid, C.; Rozenfeld, B. Learning Realistic Human Actions from Movies. In Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 23–28 June 2008. [Google Scholar]
- Amerini, I.; Galteri, L.; Caldelli, R.; Del Bimbo, A. Deepfake Video Detection through Optical Flow Based CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
- Sun, D.; Yang, X.; Liu, M.Y.; Kautz, J. Models Matter, so Does Training: An Empirical Study of CNNs for Optical Flow Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1408–1423. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Guerra, E.; de Lara, J.; Malizia, A.; Díaz, P. Supporting user-oriented analysis for multi-view domain-specific visual languages. Inf. Softw. Technol. 2009, 51, 769–784. [Google Scholar] [CrossRef] [Green Version]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; Volume 2016, pp. 770–778. [Google Scholar]
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 2017, pp. 1800–1807. [Google Scholar]
- Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–26. [Google Scholar]
- Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face X-ray for More General Face Forgery Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5000–5009. [Google Scholar]
- Li, Y.; Chang, M.C.; Lyu, S. In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking. In Proceedings of the 10th IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, 11–13 December 2018; pp. 1–7. [Google Scholar]
- Chintha, A.; Rao, A.; Sohrawardi, S.; Bhatt, K.; Wright, M.; Ptucha, R. Leveraging Edges and Optical Flow on Faces for Deepfake Detection. In Proceedings of the 2020 IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 28 September–1 October 2020. [Google Scholar] [CrossRef]
- Dai, G.; Xie, J.; Fang, Y. Metric-Based Generative Adversarial Network. In Proceedings of the 2017 ACM Multimedia Conference, Mountain View, CA, USA, 23–27 October 2017; ACM Press: New York, NY, USA, 2017; pp. 672–680. [Google Scholar]
- Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graph. 2017, 36, 1–13. [Google Scholar] [CrossRef]
- Prenger, R.; Valle, R.; Catanzaro, B. Waveglow: A Flow-Based Generative Network for Speech Synthesis. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; Volume 2019, pp. 3617–3621. [Google Scholar]
- Zhao, W.; Xie, Q.; Ma, Y.; Liu, Y.; Xiong, S. Pose Guided Person Image Generation Based on Pose Skeleton Sequence and 3D Convolution. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1561–1565. [Google Scholar]
- Baker, S.; Roth, S.; Scharstein, D.; Black, M.J.; Lewis, J.P.; Szeliski, R. A Database and Evaluation Methodology for Optical Flow. In Proceedings of the IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar]
- Horn, B.K.P.; Schunck, B.G. Determining Optical Flow. Artif. Intell. 1981, 17, 185–203. [Google Scholar] [CrossRef] [Green Version]
- Dosovitskiy, A.; Fischery, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; Volume 2015, pp. 2758–2766. [Google Scholar]
- Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef] [Green Version]
- Vinet, L.; Zhedanov, A. A “missing” family of classical orthogonal polynomials. J. Phys. A Math. Theor. 2011, 44, 085201. [Google Scholar] [CrossRef]
- Ketkar, N. Introduction to Keras. In Deep Learning with Python; Apress: Berkeley, CA, USA, 2017; pp. 97–111. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- Jouppi, N.P. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the International Symposium on Computer Architecture, Toronto, ON, Canada, 24–28 June 2017; ACM: New York, NY, USA, 2017; pp. 1–12. [Google Scholar]
- Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 2017, pp. 1647–1655. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; Volume 2016, pp. 2818–2826. [Google Scholar]
- Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; Volume 2019, pp. 1314–1324. [Google Scholar]
- Garcia-Lamont, F.; Cervantes, J.; López, A.; Rodriguez, L. Segmentation of images by color features: A survey. Neurocomputing 2018, 292, 1–27. [Google Scholar] [CrossRef]
Research Paper | Year | Method | Domain | Datasets | Hardware | Accuracy |
---|---|---|---|---|---|---|
DeepRhythm [7] | 2020 | Heartbeat rhythms using PPG with attention network | Dual-spatial-temporal | FF++ *, DFDC | GPU | Accuracy: 98.0%
FDFtNet [10] | 2020 | Augmentation of pretrained CNN | Pixel-Level detection | PGGAN, FF++ –Deepfake, FF++ –Face2Face | GPU | AUROC: 0.994 Accuracy: 97.02% |
Face X-Ray [28] | 2020 | Detection of blending boundaries in the image | Pixel-Level detection | FF++ *, DFDC, DFD, Celeb-DF | GPU | AUC: 95.4
Visual Artifacts [11] | 2019 | Visual artifacts (eyes, teeth, nose, and face border) | Pixel-Level detection | Glow, ProGAN, CelebA | GPU | AUROC: 0.866
Optical Flow [22] | 2019 | Inter-frame correlations using optical flow | Spatio-temporal | FF++ –Face2Face | GPU | Accuracy: 81.61% |
Recurrent Neural Networks [20] | 2019 | Recurrent Neural Network | Spatio-temporal | HOHA | GPU | Accuracy: 97.1% |
FF++ –Xception [9] | 2019 | CNN-based Image classification | Pixel-Level detection | FF++ * | GPU | Accuracy: 96.36% |
Eye Blinking [29] | 2018 | Discrepancies in eye blinking across the frames | Spatio-temporal | CEW | GPU | AUROC: 0.98 |
Edges & Optical flow [30] | 2020 | Edges of optical flow images with XceptionNet | Spatio-temporal | FF++ *, DFDC-mini | GPU | Accuracy on DFDC-mini: 97.94%
Optical flow based CNN [2] | 2021 | Optical flow-based CNN | Spatio-temporal | FF++ * | GPU | Accuracy on Optical flow only: 82.99% |
This research paper | 2022 | Inter-frame correlations using optical flow | Spatio-temporal | FF++ –Deepfake, FF++ –Face2Face, Celeb-DF, DFDC | GPU, TPU | AUROC: 0.879 Accuracy: 82%
Type | Photo | Audio | Video
---|---|---|---
Description | This type includes manipulations done on images, i.e., to generate a non-existent face image. | This type includes any type of manipulation done on audio records, i.e., impersonating or changing a person’s voice. | This type includes manipulations done on videos.
Class | Face and body swapping. | |
Example | FaceApp [6]. | |
Dataset | Year | Size (Real/Fake Videos) | Techniques
---|---|---|---
FF++ [9] | 2019 | 1000/7000 (all techniques) | Deepfakes, Face2Face, FaceSwap, NeuralTextures
Celeb-DF v2 [14] | 2020 | 590/5639 | Deepfakes
DFDC [19] | 2020 | 19,154/100,000 | 8 different deepfake techniques
Dataset | Videos Used | Original Frames | Optical Flow Frames | Training/Validation/Test |
---|---|---|---|---|
FaceForensics++ –DF | 631 | 240,000 | 120,000 | 80,000/20,000/20,000 |
FaceForensics++ –F2F | 545 | 240,000 | 120,000 | 80,000/20,000/20,000 |
Celeb-DF | 1254 | 240,000 | 120,000 | 80,000/20,000/20,000 |
DFDC | 962 | 240,000 | 120,000 | 80,000/20,000/20,000 |
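Each dataset above contributes 120,000 optical flow frames split 80,000/20,000/20,000 into training, validation, and test sets. A minimal index-split sketch follows; the shuffle and seed are assumptions, since the table does not say how frames were assigned.

```python
import numpy as np

def split_indices(n_frames=120_000, n_train=80_000, n_val=20_000, seed=42):
    """Return disjoint train/validation/test index arrays (80k/20k/20k)."""
    idx = np.random.default_rng(seed).permutation(n_frames)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```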
Approach | Optimizer | Learning Rate | Loss Function | Final Dense (Units, Activation) | Epochs
---|---|---|---|---|---
GPU | Adam | 1e-4 | categorical_crossentropy | 2, softmax | 25
GPU-Original | Adam | 1e-4 | binary_crossentropy | 1, sigmoid | 25
Augmented | Adam | Default | categorical_crossentropy | 2, softmax | 25
TPU | Adamax | 1e-4 | sparse_categorical_crossentropy | 2, softmax | 25
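To make the GPU row concrete, here is a minimal Keras sketch of that configuration (VGG-16 backbone, Adam at 1e-4, categorical cross-entropy, a final two-unit softmax layer, 25 epochs). The 224 × 224 input size and the frozen ImageNet-pretrained backbone are illustrative assumptions, not details given in the table.

```python
import tensorflow as tf

# Assumption: ImageNet weights and 224 x 224 inputs; the table only fixes
# the optimizer, learning rate, loss, final dense layer, and epoch count.
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3))
base.trainable = False  # assumption: backbone frozen during training

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),  # "Final Dense": 2, softmax
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=25)
```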
Model | Time Per Epoch | Total Time | Accuracy |
---|---|---|---|
Inception V3 [44] | 800 s | 335 min | 62.1% |
ResNet 50 [25] | 77 s | 33 min | 60.64% |
ResNet 101 [25] | 1207 s | 507 min | 65.89% |
ResNet 152 [25] | 1994 s | 839 min | 65.79% |
Xception [26] | 633 s | 264 min | 52.0% |
VGG-19 | 698 s | 294 min | 80.1% |
VGG-16 Binary (Amerini’s) [22] | 446 s | 187 min | 75.27%
VGG-16 (Proposed) | 440 s | 183 min | 82.0% |
Model | Dataset | Accuracy | Overall Accuracy |
---|---|---|---|
Proposed | FaceForensics++ –DF | 82.0% | 66.780% |
FaceForensics++ –F2F | 69.67% | ||
Celeb-DF v2 | 74.24% | ||
DFDC | 61.25% | ||
Original [22] | FaceForensics++ –DF | 75.27% | 63.435% |
FaceForensics++ –F2F | 67.37% | ||
Celeb-DF v2 | 50.0% | ||
DFDC | 61.1% |
Trained \ Validation | FF++ –Deepfake | FF++ –Face2Face | DFDC | Celeb-DF | Overall Acc.
---|---|---|---|---|---
FF++ –Deepfake | AUROC: 0.878556, Acc: 0.81995 | AUROC: 0.710618, Acc: 0.6478 | AUROC: 0.521114, Acc: 0.5184 | AUROC: 0.528509, Acc: 0.5241 | 0.6276
FF++ –Face2Face | AUROC: 0.766970, Acc: 0.6913 | AUROC: 0.764427, Acc: 0.69675 | AUROC: 0.480113, Acc: 0.4859 | AUROC: 0.531422, Acc: 0.52645 | 0.6001
DFDC | AUROC: 0.519190, Acc: 0.5142 | AUROC: 0.476737, Acc: 0.48485 | AUROC: 0.650156, Acc: 0.61225 | AUROC: 0.476790, Acc: 0.4792 | 0.5226
Celeb-DF | AUROC: 0.529061, Acc: 0.525 | AUROC: 0.529152, Acc: 0.5185 | AUROC: 0.464086, Acc: 0.4742 | AUROC: 0.806833, Acc: 0.74245 | 0.5650

(Overall Acc. is the mean accuracy across the four validation sets in each row.)
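Each cell above pairs an AUROC with an accuracy for one trained/validated dataset combination. A short scikit-learn sketch of how such a pair can be computed from the model's softmax outputs follows; it is illustrative, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(model, flow_frames, labels):
    """Compute (AUROC, accuracy) from two-unit softmax outputs.
    labels: 0 = real, 1 = fake."""
    probs = model.predict(flow_frames)           # shape (n, 2)
    auroc = roc_auc_score(labels, probs[:, 1])   # score for the fake class
    acc = accuracy_score(labels, probs.argmax(axis=1))
    return auroc, acc
```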
Model | Time Per Epoch | Total Time | Accuracy |
---|---|---|---|
VGG-16-GPU | 440 s | 183 min | 82% |
VGG-16 | 52 s | 22 min | 71.34% |
VGG-19 | 57 s | 24.5 min | 63.56% |
InceptionV3 | 72 s | 30.2 min | 58.72% |
Xception | 70 s | 30 min | 52.10% |
ResNet50V2 | 55 s | 23.1 min | 68.37% |
ResNet101V2 | 85 s | 35.7 min | 69.27% |
ResNet152V2 | 110 s | 46.3 min | 70.50% |
Augmentation | Training Time | Accuracy |
---|---|---|
No Augmentations | 183 min | 82.0% |
Augmentation 1 | 672 min | 77.5% |
Augmentation 2 | 612 min | 61.5% |
Augmentation 3 | 212 min | 76.0% |
Augmentation 4 | 204 min | 75.45% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Nassif, A.B.; Nasir, Q.; Talib, M.A.; Gouda, O.M. Improved Optical Flow Estimation Method for Deepfake Videos. Sensors 2022, 22, 2500. https://doi.org/10.3390/s22072500