Spatio-temporal knowledge distilled video vision transformer (STKD-VViT) for multimodal deepfake detection

Cited by: 0
Authors
Usmani, Shaheen [1]
Kumar, Sunil [1]
Sadhya, Debanjan [2]
Affiliations
[1] ABV Indian Inst Informat Technol & Management, Dept Informat Technol, Gwalior, India
[2] ABV Indian Inst Informat Technol & Management, Dept Comp Sci & Engn, Gwalior, India
Keywords
Deepfake detection; Video vision transformer; Vision transformer; Fusion; Knowledge distillation
DOI
10.1016/j.neucom.2024.129256
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The widespread circulation of videos manipulated with deepfake techniques has raised concerns about the authenticity of multimedia content. In response, deepfake detection techniques have made significant strides in specific scenarios. However, most existing methods are unimodal and extract only traditional spatial features, so they struggle to identify modern deepfakes accurately. This work introduces the STKD-VViT model, which detects deepfakes across multiple modalities using spatiotemporal features. STKD-VViT combines a Video Vision Transformer and a Vision Transformer to process the visual and audio streams, respectively. The Video Vision Transformer employs multi-head attention and tubelet embedding to extract the video's spatial and temporal features, while the Vision Transformer extracts salient features from mel-spectrograms of the audio files. Furthermore, STKD-VViT uses knowledge distillation to reduce the model's FLOPs and parameter count. Experimental results on the benchmark FakeAVCeleb dataset show that STKD-VViT achieves a testing accuracy of 97.49% on the video stream, 98.65% on the audio stream, and 96.0% when both streams are combined using score-level fusion, surpassing other state-of-the-art methods.
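The abstract names knowledge distillation and score-level fusion but does not give their exact form. The sketch below illustrates one common reading of each: a temperature-softened teacher/student loss and a weighted average of per-stream fake probabilities. The function names, the temperature, the `alpha` and `video_weight` values, and the convention that class index 1 means "fake" are all assumptions for illustration, not details taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hard-label cross-entropy blended with a soft-target KL term.

    Assumed formulation (not confirmed by the abstract):
    alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student).
    """
    hard_probs = softmax(student_logits)
    cross_entropy = -math.log(hard_probs[label])

    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(soft_teacher, soft_student) if t > 0)

    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl


def fuse_scores(video_fake_prob, audio_fake_prob, video_weight=0.5):
    """Score-level fusion as a weighted average (assumed fusion rule)."""
    return video_weight * video_fake_prob + (1 - video_weight) * audio_fake_prob


def classify(video_logits, audio_logits, threshold=0.5):
    """Fuse the per-stream fake probabilities and threshold the result.

    Class index 1 is treated as 'fake' (an assumption of this sketch).
    """
    video_prob = softmax(video_logits)[1]
    audio_prob = softmax(audio_logits)[1]
    fused = fuse_scores(video_prob, audio_prob)
    label = "fake" if fused >= threshold else "real"
    return label, fused
```

In this reading, each stream is classified by its own transformer and only the final probabilities are combined, so either stream could be retrained or reweighted without touching the other.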
Pages: 13