Spatio-temporal knowledge distilled video vision transformer (STKD-VViT) for multimodal deepfake detection

Cited: 0
Authors
Usmani, Shaheen [1 ]
Kumar, Sunil [1 ]
Sadhya, Debanjan [2 ]
Affiliations
[1] ABV Indian Inst Informat Technol & Management, Dept Informat Technol, Gwalior, India
[2] ABV Indian Inst Informat Technol & Management, Dept Comp Sci & Engn, Gwalior, India
Keywords
Deepfake detection; Video vision transformer; Vision transformer; Fusion; Knowledge distillation
DOI
10.1016/j.neucom.2024.129256
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The widespread circulation of videos manipulated with deepfake techniques has raised concerns about the authenticity of multimedia content. In response, deepfake detection techniques have made significant strides in specific scenarios. However, most existing methods are unimodal and extract only traditional spatial features, and consequently struggle to accurately identify modern deepfakes. This work introduces the STKD-VViT model, which detects deepfakes across multiple modalities using spatio-temporal features. STKD-VViT combines the strengths of the Video Vision Transformer and the Vision Transformer to process the visual and audio streams. The Video Vision Transformer employs a multi-head attention mechanism and tubelet embedding to extract the video's spatial and temporal features, while the Vision Transformer extracts salient features from mel-spectrograms of the audio files. Furthermore, STKD-VViT leverages knowledge distillation to reduce the model's FLOPs and parameter count. Experimental results on the benchmark FakeAVCeleb dataset demonstrate that STKD-VViT achieves a testing accuracy of 97.49% on the video stream, 98.65% on the audio stream, and 96.0% when both streams are combined using score-level fusion, surpassing other state-of-the-art methods.
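To make the pipeline described in the abstract concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: tubelet embedding for the video stream, a transformer encoder per stream, a Hinton-style knowledge-distillation loss for training a compact student against a larger teacher, and score-level fusion of the per-stream probabilities. All class names, dimensions, the temperature, and the fusion weight are illustrative assumptions.

```python
# Minimal sketch of the two-stream pipeline described in the abstract.
# Module names, dimensions, temperature, and fusion weight are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TubeletEmbedding(nn.Module):
    """Split a video clip into spatio-temporal tubelets via a 3D convolution."""

    def __init__(self, dim=256, tubelet=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                      # video: (B, 3, T, H, W)
        tokens = self.proj(video)                  # (B, dim, T', H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, num_tubelets, dim)


class StreamClassifier(nn.Module):
    """Transformer encoder over token embeddings, mean-pooled into real/fake logits."""

    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 2)

    def forward(self, tokens):                     # tokens: (B, N, dim)
        return self.head(self.encoder(tokens).mean(dim=1))


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style soft-target distillation blended with ordinary cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


def score_level_fusion(video_logits, audio_logits, w=0.5):
    """Weighted average of the per-stream class probabilities (score-level fusion)."""
    return w * F.softmax(video_logits, dim=1) + (1.0 - w) * F.softmax(audio_logits, dim=1)


if __name__ == "__main__":
    clip = torch.randn(2, 3, 8, 64, 64)        # dummy video clip (small for the demo)
    mel_tokens = torch.randn(2, 196, 256)      # dummy mel-spectrogram patch tokens
    video_branch, audio_branch = StreamClassifier(), StreamClassifier()
    video_logits = video_branch(TubeletEmbedding()(clip))
    audio_logits = audio_branch(mel_tokens)
    print(score_level_fusion(video_logits, audio_logits))  # fused real/fake scores
```

In the setting the abstract describes, each stream's compact student would be trained with a loss of this form against a larger teacher, and the fused score would be thresholded to decide real versus fake.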
Pages: 13