Research article · DOI: 10.1145/3664647.3688984 · ACM Conference Proceedings (MM '24)

MFMS: Learning Modality-Fused and Modality-Specific Features for Deepfake Detection and Localization Tasks

Published: 28 October 2024 Publication History

Abstract

This paper summarizes our solution to the AV-Deepfake1M competition. Deepfake technology is advancing rapidly, and increasingly realistic audio and video generation techniques have raised public concern. Against this background, the AV-Deepfake1M competition addresses the problem of audio-visual deepfakes and provides a large-scale dataset, AV-Deepfake1M, to boost research in this area. In this paper, we present our solution, which achieved top performance in the competition, and we report additional experiments demonstrating the effectiveness of the modules used in our method.



      Information & Contributors

      Information

      Published In

      MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
      October 2024
      11719 pages
      ISBN:9798400706868
      DOI:10.1145/3664647
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. deepfake video detection and localization
      2. multi-modal

      Qualifiers

      • Research-article

      Conference

MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne VIC, Australia

      Acceptance Rates

      MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


      Article Metrics

      • Total Citations: 0
      • Total Downloads: 295
      • Downloads (last 12 months): 295
      • Downloads (last 6 weeks): 117
      Reflects downloads up to 02 Mar 2025
