Abstract
A DeepFake is a manipulated video made with generative deep learning technologies, such as generative adversarial networks or autoencoders, that anyone can use. As DeepFakes have proliferated, classifiers built on convolutional neural networks (CNNs) have been actively developed to detect them. However, CNNs are prone to overfitting and cannot capture the relations between local regions as a global feature of the image, which leads to misclassification. In this paper, we propose an efficient vision transformer model for DeepFake detection that extracts both local and global features. We combine vector-concatenated CNN features with patch-based positioning so that all positions interact, which helps locate the artifact region. For the distillation token, the logit is trained using binary cross entropy through the sigmoid function. Adding this distillation improves the generalization and performance of the proposed model. In experiments, the proposed model outperforms the SOTA model by 0.006 AUC and 0.013 F1 score on the DFDC test dataset. Of 2,500 fake videos, the proposed model correctly predicts 2,313 as fake, whereas the SOTA model predicts 2,276 at its best. With the ensemble method, the proposed model outperforms the SOTA model by 0.01 AUC. On the Celeb-DF (v2) dataset, the proposed model achieves 0.993 AUC and 0.978 F1 score.
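The abstract states that the distillation token's logit is trained with binary cross entropy through the sigmoid function. A minimal sketch of what such a loss might look like follows. The function names, the use of a teacher network's probability as the target for the distillation logit, and the weighting factor `alpha` are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def bce_with_logit(logit: float, target: float) -> float:
    """Binary cross entropy of sigmoid(logit) against a target in [0, 1].

    Uses the stable identity
        BCE = max(x, 0) - x * t + log(1 + exp(-|x|)),
    which avoids overflow for large |x|.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def distillation_loss(
    class_logit: float,
    distill_logit: float,
    label: float,
    teacher_prob: float,
    alpha: float = 0.5,  # hypothetical weighting, not taken from the paper
) -> float:
    """Combine a label loss on the class token with a loss on the distillation token.

    Assumption: the distillation logit is pushed toward a teacher's
    predicted probability of "fake"; the class logit toward the true label.
    """
    label_term = bce_with_logit(class_logit, label)
    distill_term = bce_with_logit(distill_logit, teacher_prob)
    return (1.0 - alpha) * label_term + alpha * distill_term


if __name__ == "__main__":
    # One fake sample (label 1.0) with a hypothetical teacher confidence of 0.9.
    loss = distillation_loss(class_logit=2.0, distill_logit=1.5,
                             label=1.0, teacher_prob=0.9)
    print(f"loss = {loss:.4f}, P(fake) from distill token = {sigmoid(1.5):.4f}")
```

Computing the loss directly from the logit, rather than applying the sigmoid first and then taking logarithms, keeps the calculation stable when the logit is far from zero.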
Cite this article
Heo, YJ., Yeo, WH. & Kim, BG. DeepFake detection algorithm based on improved vision transformer. Appl Intell 53, 7512–7527 (2023). https://doi.org/10.1007/s10489-022-03867-9
DOI: https://doi.org/10.1007/s10489-022-03867-9