Spatio-temporal knowledge distilled video vision transformer (STKD-VViT) for multimodal deepfake detection

Cited by: 0

Authors:

Usmani, Shaheen [1]

Kumar, Sunil [1]

Sadhya, Debanjan [2]

Affiliations:

[1] ABV Indian Inst Informat Technol & Management, Dept Informat Technol, Gwalior, India

[2] ABV Indian Inst Informat Technol & Management, Dept Comp Sci & Engn, Gwalior, India

Source:

NEUROCOMPUTING | 2025 / Vol. 620

Keywords:

Deepfake detection; Video vision transformer; Vision transformer; Fusion; Knowledge distillation;

DOI:

10.1016/j.neucom.2024.129256

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The widespread circulation of videos manipulated with deepfake techniques has raised concerns about the authenticity of multimedia content. In response, deepfake detection techniques have made significant strides in specific scenarios. However, most existing methods are unimodal and extract only traditional spatial features, so they struggle to accurately identify modern deepfakes. This work introduces the STKD-VViT model for detecting deepfakes across multiple modalities using spatiotemporal features. STKD-VViT combines the strengths of the Video vision transformer and the Vision transformer to process the visual and audio streams, respectively. The Video vision transformer employs a multi-head attention mechanism and tubelet embedding to extract the video's spatial and temporal features. In parallel, the Vision transformer extracts salient features from the mel-spectrograms of the audio files. Furthermore, STKD-VViT leverages knowledge distillation to reduce the number of FLOPs and model parameters. Experimental results on the benchmark FakeAVCeleb dataset demonstrate that STKD-VViT achieves a testing accuracy of 97.49% for video stream data, 98.65% for audio stream data, and 96.0% when both streams are combined using score-level fusion, surpassing other state-of-the-art methods.
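
The abstract names three techniques without giving their details: tubelet embedding, knowledge distillation, and score-level fusion. The following is a minimal Python sketch of what each could look like, not the authors' implementation. The function names, the temperature and `alpha` values, the weighted-average fusion rule, and the example clip and tubelet sizes are all assumptions made for illustration; the abstract does not state them.

```python
import math


def tubelet_count(frames, height, width, t, h, w):
    """Number of non-overlapping t x h x w tubelets (tokens) in a clip.

    Assumes the usual non-overlapping tubelet tiling; sizes are hypothetical.
    """
    return (frames // t) * (height // h) * (width // w)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Generic soft-target distillation loss (an assumed form, not the paper's).

    alpha * cross-entropy(student, label)
    + (1 - alpha) * T^2 * KL(teacher_soft || student_soft)
    """
    hard_loss = -math.log(softmax(student_logits)[true_label])
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_soft, student_soft)
        if p > 0
    )
    return alpha * hard_loss + (1 - alpha) * temperature ** 2 * soft_loss


def fuse_scores(video_fake_prob, audio_fake_prob, video_weight=0.5):
    """Score-level fusion as a weighted average of per-stream scores.

    The weighting rule is an assumption; other fusion rules are possible.
    """
    return video_weight * video_fake_prob + (1 - video_weight) * audio_fake_prob


if __name__ == "__main__":
    # Hypothetical clip: 16 frames of 224 x 224, tubelets of 2 x 16 x 16.
    print("tokens:", tubelet_count(16, 224, 224, 2, 16, 16))
    print("kd loss:", distillation_loss([1.5, -0.5], [2.0, 0.0], true_label=0))
    print("fused score:", fuse_scores(0.9, 0.7))
```

In this sketch the fusion operates on the two streams' output scores rather than on their feature vectors, which is what distinguishes score-level fusion from feature-level fusion.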

Pages: 13
