Dynamic Difference Learning With Spatio-Temporal Correlation for Deepfake Video Detection

Cited by: 19
Authors
Yin, Qilin [1 ,2 ]
Lu, Wei [1 ,2 ]
Li, Bin [3 ,4 ,5 ]
Huang, Jiwu [3 ,4 ,5 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangdong Prov Key Lab Informat Secur Technol, Minist Educ, Guangzhou 510006, Peoples R China
[2] Sun Yat Sen Univ, Key Lab Machine Intelligence & Adv Comp, Guangzhou 510006, Peoples R China
[3] Shenzhen Univ, Guangdong Key Lab Intelligent Informat Proc, Shenzhen 518060, Peoples R China
[4] Shenzhen Univ, Shenzhen Key Lab Media Secur, Shenzhen 518060, Peoples R China
[5] Shenzhen Inst Artificial Intelligence & Robot Soc, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video forensics; face forgery detection; dynamic differential learning; spatio-temporal correlation; fine-grained denoising operation;
DOI
10.1109/TIFS.2023.3290752
Chinese Library Classification
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
With the rapid development of face forgery techniques, existing frame-based deepfake video detection methods face a dilemma: they may fail when encountering extremely realistic forged images. To overcome this problem, many approaches attempt to model the spatio-temporal inconsistency of videos to distinguish real from fake videos. However, current works model spatio-temporal inconsistency by combining intra-frame and inter-frame information while ignoring the disturbance caused by facial motions, which limits further improvement in detection performance. To address this issue, we investigate long- and short-range inter-frame motions and propose a novel dynamic difference learning method that distinguishes the inter-frame differences caused by face manipulation from those caused by facial motions, in order to model precise spatio-temporal inconsistency for deepfake video detection. Moreover, we elaborately design a dynamic fine-grained difference capture module (DFDC-module) and a multi-scale spatio-temporal aggregation module (MSA-module) to collaboratively model spatio-temporal inconsistency. Specifically, the DFDC-module applies a self-attention mechanism and a fine-grained denoising operation to eliminate the differences caused by facial motions and generates long-range difference attention maps. The MSA-module aggregates multi-direction and multi-scale temporal information to model spatio-temporal inconsistency. Existing 2D CNNs can be extended into dynamic spatio-temporal inconsistency capture networks by integrating the two proposed modules. Extensive experimental results demonstrate that the proposed algorithm steadily outperforms state-of-the-art methods by a clear margin on different benchmark datasets.
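To make the abstract's core idea more concrete, the following is a minimal, hypothetical PyTorch sketch of dynamic difference learning: inter-frame feature differences are refined by spatial self-attention, and a simple top-k threshold stands in for the fine-grained denoising operation, suppressing responses that are more likely caused by ordinary facial motion and leaving difference attention maps. The module name, tensor shapes, the clip-average difference, and the thresholding scheme are assumptions made for illustration only; this is not the authors' DFDC-module implementation.

```python
# Illustrative sketch only: not the paper's released code.
import torch
import torch.nn as nn


class DifferenceAttentionSketch(nn.Module):
    """Turn clip features (B, T, C, H, W) into per-frame difference attention maps."""

    def __init__(self, channels: int, keep_ratio: float = 0.5):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.keep_ratio = keep_ratio  # fraction of difference responses kept after "denoising"

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = feats.shape
        # Long-range inter-frame differences: each frame minus the clip average, so that
        # persistent content cancels out and transient changes (motion or manipulation) remain.
        diff = (feats - feats.mean(dim=1, keepdim=True)).flatten(0, 1)   # (B*T, C, H, W)

        # Spatial self-attention over the difference maps.
        q = self.query(diff).flatten(2)                                  # (B*T, C/2, H*W)
        k = self.key(diff).flatten(2)                                    # (B*T, C/2, H*W)
        attn = torch.softmax(q.transpose(1, 2) @ k / q.shape[1] ** 0.5, dim=-1)
        # Attention received by each spatial position, used as a difference saliency score.
        response = attn.sum(dim=1).view(b, t, h * w)                     # (B, T, H*W)

        # Crude stand-in for the fine-grained denoising operation: keep only the strongest
        # responses per frame and suppress the rest, which are more likely due to plain motion.
        k_keep = max(1, int(h * w * self.keep_ratio))
        thresh = response.topk(k_keep, dim=-1).values[..., -1:]          # per-frame threshold
        masked = torch.where(response >= thresh, response,
                             torch.full_like(response, float("-inf")))
        return torch.softmax(masked, dim=-1).view(b, t, h, w)            # normalized attention maps


if __name__ == "__main__":
    clip_feats = torch.randn(2, 8, 64, 28, 28)         # hypothetical backbone features, 8 frames
    maps = DifferenceAttentionSketch(channels=64)(clip_feats)
    print(maps.shape)                                   # torch.Size([2, 8, 28, 28])
```

In the paper's pipeline, such attention maps would reweight the features of a 2D CNN backbone, which, together with the MSA-module's multi-direction and multi-scale temporal aggregation (not sketched here), extends the backbone into a spatio-temporal inconsistency capture network.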
Pages: 4046 - 4058
Number of pages: 13
Related Papers
50 records in total
  • [41] Learning Spatio-temporal features to detect manipulated facial videos created by the Deepfake techniques. Nguyen X.H., Tran T.S., Le V.T., Nguyen K.D., Truong D.-T. Forensic Science International: Digital Investigation, 2021, 36.
  • [42] Video super-resolution reconstruction based on correlation learning and spatio-temporal nonlocal similarity. Liang, Meiyu; Du, Junping; Li, Linghui. Multimedia Tools and Applications, 2016, 75(17): 10241-10269.
  • [43] Dynamic Spatio-Temporal Modular Network for Video Question Answering. Qian, Zi; Wang, Xin; Duan, Xuguang; Chen, Hong; Zhu, Wenwu. Proceedings of the 30th ACM International Conference on Multimedia (MM 2022), 2022: 4466-4477.
  • [44] Delving into the Local: Dynamic Inconsistency Learning for DeepFake Video Detection. Gu, Zhihao; Chen, Yang; Yao, Taiping; Ding, Shouhong; Li, Jilin; Ma, Lizhuang. Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22), 2022: 744-752.
  • [46] Learning Spatio-temporal Representation by Channel Aliasing Video Perception. Lin, Yiqi; Wang, Jinpeng; Zhang, Manlin; Ma, Andy J. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 2317-2325.
  • [47] Learning Feature Semantic Matching for Spatio-Temporal Video Grounding. Zhang, Tong; Fang, Hao; Zhang, Hao; Gao, Jialin; Lu, Xiankai; Nie, Xiushan; Yin, Yilong. IEEE Transactions on Multimedia, 2024, 26: 9268-9279.
  • [48] Learning Deep Spatio-Temporal Dependence for Semantic Video Segmentation. Qiu, Zhaofan; Yao, Ting; Mei, Tao. IEEE Transactions on Multimedia, 2018, 20(4): 939-949.
  • [49] Video copy detection using spatio-temporal sequence matching. Kim, C. Storage and Retrieval Methods and Applications for Multimedia 2004, 2004, 5307: 70-79.
  • [50] Optimal Spatio-Temporal Path Discovery for Video Event Detection. Du Tran; Yuan, Junsong. 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.