Spatio-temporal knowledge distilled video vision transformer (STKD-VViT) for multimodal deepfake detection

Cited by: 0
Authors
Usmani, Shaheen [1]
Kumar, Sunil [1]
Sadhya, Debanjan [2]
Affiliations
[1] ABV Indian Inst Informat Technol & Management, Dept Informat Technol, Gwalior, India
[2] ABV Indian Inst Informat Technol & Management, Dept Comp Sci & Engn, Gwalior, India
Keywords
Deepfake detection; Video vision transformer; Vision transformer; Fusion; Knowledge distillation
DOI
10.1016/j.neucom.2024.129256
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The widespread circulation of videos manipulated with deepfake techniques has raised concerns about the authenticity of multimedia content. In response, deepfake detection techniques have made significant strides in specific scenarios. However, most existing methods are unimodal and extract only conventional spatial features, so they struggle to identify modern deepfakes accurately. This work introduces STKD-VViT, a model that detects deepfakes across multiple modalities using spatio-temporal features. STKD-VViT combines the strengths of the Video Vision Transformer and the Vision Transformer to process the visual and audio streams. The Video Vision Transformer uses a multi-head attention mechanism and tubelet embedding to extract the video's spatial and temporal features, while the Vision Transformer extracts salient features from mel-spectrograms of the audio files. STKD-VViT further applies knowledge distillation to reduce the number of FLOPs and model parameters. Experimental results on the benchmark FakeAVCeleb dataset show that STKD-VViT achieves a testing accuracy of 97.49% on the video stream, 98.65% on the audio stream, and 96.0% when both streams are combined using score-level fusion, surpassing other state-of-the-art methods.
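The abstract names two techniques without giving their formulas: knowledge distillation and score-level fusion. The sketch below illustrates one common form of each, purely as an assumption: a temperature-scaled soft-target distillation loss and a weighted average of per-stream fake probabilities. The function names, the temperature, the `alpha` weighting, and the equal fusion weight are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Assumed soft-target distillation loss (not confirmed by the paper).

    loss = alpha * CE(student, label)
         + (1 - alpha) * T^2 * KL(teacher_T || student_T)
    """
    # Hard-label cross-entropy on the student's unscaled prediction.
    hard = -math.log(softmax(student_logits)[label])
    # KL divergence between temperature-softened teacher and student outputs.
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft) if t > 0)
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


def fuse_scores(video_fake_prob, audio_fake_prob, video_weight=0.5):
    """Assumed score-level fusion: weighted average of per-stream fake probabilities."""
    return video_weight * video_fake_prob + (1 - video_weight) * audio_fake_prob


# Hypothetical usage with made-up logits and scores (class 1 = fake).
loss = distillation_loss([0.2, 1.1], [0.1, 2.3], label=1)
fused = fuse_scores(video_fake_prob=0.9, audio_fake_prob=0.3)
is_fake = fused >= 0.5
```

A weighted average is only one of several score-level fusion rules (others include max, product, or a learned combiner); the abstract does not say which one STKD-VViT uses.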
Pages: 13