Dynamic Difference Learning With Spatio-Temporal Correlation for Deepfake Video Detection

Times cited: 19
Authors
Yin, Qilin [1, 2]
Lu, Wei [1, 2]
Li, Bin [3, 4, 5]
Huang, Jiwu [3, 4, 5]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangdong Prov Key Lab Informat Secur Technol, Minist Educ, Guangzhou 510006, Peoples R China
[2] Sun Yat Sen Univ, Key Lab Machine Intelligence & Adv Comp, Guangzhou 510006, Peoples R China
[3] Shenzhen Univ, Guangdong Key Lab Intelligent Informat Proc, Shenzhen 518060, Peoples R China
[4] Shenzhen Univ, Shenzhen Key Lab Media Secur, Shenzhen 518060, Peoples R China
[5] Shenzhen Inst Artificial Intelligence & Robot Soc, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video forensics; face forgery detection; dynamic differential learning; spatio-temporal correlation; fine-grained denoising operation
DOI
10.1109/TIFS.2023.3290752
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
With the rapid development of face forgery techniques, existing frame-based deepfake video detection methods face a dilemma: they may fail when encountering extremely realistic images. To overcome this problem, many approaches have attempted to model the spatio-temporal inconsistency of videos to distinguish real from fake videos. However, current works model spatio-temporal inconsistency by combining intra-frame and inter-frame information while ignoring the disturbance caused by facial motions, which limits further improvement in detection performance. To address this issue, we investigate long- and short-range inter-frame motions and propose a novel dynamic difference learning method that distinguishes the inter-frame differences caused by face manipulation from those caused by facial motions, in order to model precise spatio-temporal inconsistency for deepfake video detection. Moreover, we design a dynamic fine-grained difference capture module (DFDC-module) and a multi-scale spatio-temporal aggregation module (MSA-module) to collaboratively model spatio-temporal inconsistency. Specifically, the DFDC-module applies a self-attention mechanism and a fine-grained denoising operation to eliminate the differences caused by facial motions and generates long-range difference attention maps. The MSA-module aggregates multi-direction and multi-scale temporal information to model spatio-temporal inconsistency. Existing 2D CNNs can be extended into dynamic spatio-temporal inconsistency capture networks by integrating the two proposed modules. Extensive experimental results demonstrate that the proposed algorithm consistently outperforms state-of-the-art methods by a clear margin on different benchmark datasets.
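As a rough illustration of what short-range and long-range inter-frame differences mean, the following minimal Python sketch computes absolute differences between frames a fixed number of steps apart. It is not the authors' DFDC-module, which the abstract describes as using self-attention and a fine-grained denoising operation. The function names, the `stride` parameter, the fixed `threshold`, and the toy clip are all assumptions made for this example, and the thresholding step is only a crude stand-in for denoising.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame as rows of pixel intensities


def frame_difference(a: Frame, b: Frame) -> Frame:
    """Element-wise absolute difference between two equally sized frames."""
    return [[abs(pa - pb) for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def range_differences(frames: List[Frame], stride: int) -> List[Frame]:
    """Differences between frames that are `stride` steps apart.

    stride=1 gives short-range (adjacent-frame) differences;
    a larger stride gives long-range differences.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return [frame_difference(frames[t], frames[t + stride])
            for t in range(len(frames) - stride)]


def suppress_small(diff: Frame, threshold: float) -> Frame:
    """Zero out differences below a fixed threshold (hypothetical denoising stand-in)."""
    return [[v if v >= threshold else 0.0 for v in row] for row in diff]


# Toy 2x2 clip with four frames (hypothetical values, not real data).
clip = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.1, 0.0], [0.0, 0.0]],
    [[0.1, 0.0], [0.0, 0.9]],
    [[0.2, 0.0], [0.0, 0.9]],
]
short = range_differences(clip, stride=1)   # adjacent frames
long_ = range_differences(clip, stride=3)   # first vs. last frame
cleaned = [suppress_small(d, threshold=0.5) for d in short]
```

In this toy clip the small, gradual change in the top-left pixel is suppressed by the threshold while the abrupt change in the bottom-right pixel survives. A learned model would have to make that distinction from data rather than from a hand-set constant.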
Pages: 4046-4058
Number of pages: 13
Related papers
50 in total
  • [1] Towards Spatio-temporal Collaborative Learning: An End-to-End Deepfake Video Detection Framework
    Guo, Wenxuan
    Du, Shuo
    Deng, Huiyuan
    Yu, Zikang
    Feng, Lin
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023
  • [2] Spatio-Temporal Catcher: a Self-Supervised Transformer for Deepfake Video Detection
    Li, Maosen
    Li, Xurong
    Yu, Kun
    Deng, Cheng
    Huang, Heng
    Mao, Feng
    Xue, Hui
    Li, Minghao
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 8707 - 8718
  • [3] Improving Video Concept Detection Using Spatio-Temporal Correlation
    Zhu, Songhao
    Liang, Zhiwei
    Liu, Yuncai
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING-PCM 2010, PT I, 2010, 6297 : 46 - +
  • [4] Learning a spatio-temporal correlation
    Narain, D.
    Mamassian, P.
    van Beers, R. J.
    Smeets, J. B. J.
    Brenner, E.
    PERCEPTION, 2012, 41 : 58 - 58
  • [5] Salient Object Detection via Video Spatio-temporal Difference and Coherence
    Huang, Lei
    Luo, Bin
    PROCEEDINGS OF 2016 12TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS), 2016, : 218 - 222
  • [6] STEP: Spatio-Temporal Progressive Learning for Video Action Detection
    Yang, Xitong
    Yang, Xiaodong
    Liu, Ming-Yu
    Xiao, Fanyi
    Davis, Larry
    Kautz, Jan
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 264 - 272
  • [7] Attention Guided Spatio-Temporal Artifacts Extraction for Deepfake Detection
    Wang, Zhibing
    Li, Xin
    Ni, Rongrong
    Zhao, Yao
    PATTERN RECOGNITION AND COMPUTER VISION, PT IV, 2021, 13022 : 374 - 386
  • [8] Spatio-temporal knowledge distilled video vision transformer (STKD-VViT) for multimodal deepfake detection
    Usmani, Shaheen
    Kumar, Sunil
    Sadhya, Debanjan
    NEUROCOMPUTING, 2025, 620
  • [9] Interactive spatio-temporal feature learning network for video foreground detection
    Zhang, Hongrui
    Li, Huan
    COMPLEX & INTELLIGENT SYSTEMS, 2022, 8 (05) : 4251 - 4263