Spatio-temporal knowledge distilled video vision transformer (STKD-VViT) for multimodal deepfake detection

Authors
Usmani, Shaheen [1]
Kumar, Sunil [1]
Sadhya, Debanjan [2]
Affiliations
[1] ABV Indian Inst Informat Technol & Management, Dept Informat Technol, Gwalior, India
[2] ABV Indian Inst Informat Technol & Management, Dept Comp Sci & Engn, Gwalior, India
Keywords
Deepfake detection; Video vision transformer; Vision transformer; Fusion; Knowledge distillation
DOI
10.1016/j.neucom.2024.129256
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The widespread circulation of videos manipulated with deepfake techniques has raised concerns about the authenticity of multimedia content. In response, deepfake detection techniques have made significant strides in specific scenarios. However, most existing methods are unimodal and extract only traditional spatial features, so they struggle to identify modern deepfakes accurately. This work introduces the STKD-VViT model, which detects deepfakes across multiple modalities using spatiotemporal features. STKD-VViT combines the strengths of the Video Vision Transformer and the Vision Transformer to process the visual and audio streams, respectively. The Video Vision Transformer employs multi-head attention and tubelet embedding to extract the video's spatial and temporal features, while the Vision Transformer extracts salient features from the mel-spectrograms of the audio files. Furthermore, STKD-VViT uses knowledge distillation to reduce the model's FLOPs and parameter count. Experimental results on the benchmark FakeAVCeleb dataset show that STKD-VViT achieves a testing accuracy of 97.49% on the video stream, 98.65% on the audio stream, and 96.0% when both streams are combined using score-level fusion, surpassing other state-of-the-art methods.
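The abstract does not say which score-level fusion rule is used, so the following is only a minimal sketch of one common choice: a weighted average of the per-stream fake probabilities. The function name `fuse_scores`, the equal default weighting, and the 0.5 decision threshold are all assumptions made for illustration, not details taken from the paper.

```python
def fuse_scores(video_score, audio_score, video_weight=0.5, threshold=0.5):
    """Combine two per-stream fake probabilities by weighted averaging.

    Hypothetical helper: the weighting scheme and threshold are assumed,
    not taken from the STKD-VViT paper.
    """
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    for score in (video_score, audio_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be probabilities in [0, 1]")

    # Score-level fusion: each stream is classified independently and only
    # the resulting scores are merged, not the intermediate features.
    fused = video_weight * video_score + (1.0 - video_weight) * audio_score
    label = "fake" if fused >= threshold else "real"
    return fused, label


if __name__ == "__main__":
    # The two input scores here are made-up values for demonstration only.
    print(fuse_scores(0.9, 0.3))
```

Fusing at the score level keeps the two transformers independent, so either stream's model can be retrained or replaced without touching the other.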
Pages: 13