Dynamic Difference Learning With Spatio-Temporal Correlation for Deepfake Video Detection

Cited by: 19
Authors
Yin, Qilin [1, 2]
Lu, Wei [1, 2]
Li, Bin [3, 4, 5]
Huang, Jiwu [3, 4, 5]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangdong Prov Key Lab Informat Secur Technol, Minist Educ, Guangzhou 510006, Peoples R China
[2] Sun Yat Sen Univ, Key Lab Machine Intelligence & Adv Comp, Guangzhou 510006, Peoples R China
[3] Shenzhen Univ, Guangdong Key Lab Intelligent Informat Proc, Shenzhen 518060, Peoples R China
[4] Shenzhen Univ, Shenzhen Key Lab Media Secur, Shenzhen 518060, Peoples R China
[5] Shenzhen Inst Artificial Intelligence & Robot Soc, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video forensics; face forgery detection; dynamic differential learning; spatio-temporal correlation; fine-grained denoising operation;
DOI
10.1109/TIFS.2023.3290752
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
With the rapid development of face forgery techniques, existing frame-based deepfake video detection methods face a dilemma: they may fail when encountering extremely realistic images. To overcome this problem, many approaches have attempted to model the spatio-temporal inconsistency of videos to distinguish real videos from fake ones. However, current works model spatio-temporal inconsistency by combining intra-frame and inter-frame information while ignoring the disturbance caused by facial motions, which limits further improvement in detection performance. To address this issue, we investigate long- and short-range inter-frame motions and propose a novel dynamic difference learning method that distinguishes the inter-frame differences caused by face manipulation from those caused by facial motions, so as to model precise spatio-temporal inconsistency for deepfake video detection. Moreover, we design a dynamic fine-grained difference capture module (DFDC-module) and a multi-scale spatio-temporal aggregation module (MSA-module) to collaboratively model spatio-temporal inconsistency. Specifically, the DFDC-module applies a self-attention mechanism and a fine-grained denoising operation to eliminate the differences caused by facial motions and generates long-range difference attention maps. The MSA-module is devised to aggregate multi-direction and multi-scale temporal information to model spatio-temporal inconsistency. Existing 2D CNNs can be extended into dynamic spatio-temporal inconsistency capture networks by integrating the two proposed modules. Extensive experimental results demonstrate that the proposed algorithm consistently outperforms state-of-the-art methods by a clear margin on different benchmark datasets.
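The idea of short- and long-range inter-frame differences mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' DFDC-module: the function names `frame_differences` and `suppress_small_differences` are hypothetical, and a plain magnitude threshold is swapped in for the self-attention and fine-grained denoising operation, whose details this record does not give.

```python
import numpy as np


def frame_differences(frames: np.ndarray, step: int) -> np.ndarray:
    """Absolute difference between frames that are `step` frames apart.

    frames: array of shape (T, H, W); step=1 gives short-range differences,
    a larger step gives long-range differences.
    Returns an array of shape (T - step, H, W).
    """
    if not 0 < step < frames.shape[0]:
        raise ValueError("step must satisfy 0 < step < number of frames")
    return np.abs(frames[step:] - frames[:-step])


def suppress_small_differences(diff: np.ndarray, threshold: float) -> np.ndarray:
    """Zero out differences below `threshold`.

    Assumption: a crude stand-in for denoising, not the paper's operation.
    """
    return np.where(diff >= threshold, diff, 0.0)


# Toy clip: 4 frames of size 2x2 where every pixel of frame t equals t.
frames = np.arange(4, dtype=float)[:, None, None] * np.ones((4, 2, 2))

short_range = frame_differences(frames, step=1)  # adjacent frames
long_range = frame_differences(frames, step=3)   # first vs. last frame

denoised_short = suppress_small_differences(short_range, threshold=2.0)
denoised_long = suppress_small_differences(long_range, threshold=2.0)
```

In this toy clip the small adjacent-frame changes fall below the threshold and are removed, while the larger long-range change survives. A learned module would have to make that separation from data, since manipulation artifacts and facial motion are not distinguishable by magnitude alone.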
Pages: 4046 - 4058
Page count: 13
Related papers
50 in total
  • [21] Bidirectional Spatio-Temporal Feature Learning With Multiscale Evaluation for Video Anomaly Detection
    Zhong, Yuanhong
    Chen, Xia
    Hu, Yongting
    Tang, Panliang
    Ren, Fan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (12) : 8285 - 8296
  • [22] Video action detection by learning graph-based spatio-temporal interactions
    Tomei, Matteo
    Baraldi, Lorenzo
    Calderara, Simone
    Bronzin, Simone
    Cucchiara, Rita
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 206
  • [23] SFormer: An end-to-end spatio-temporal transformer architecture for deepfake detection
    Kingra, Staffy
    Aggarwal, Naveen
    Kaur, Nirmal
    FORENSIC SCIENCE INTERNATIONAL-DIGITAL INVESTIGATION, 2024, 51
  • [24] Spatio-Temporal Crop Aggregation for Video Representation Learning
    Sameni, Sepehr
    Jenni, Simon
    Favaro, Paolo
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 5641 - 5651
  • [25] Deconfounded Multimodal Learning for Spatio-temporal Video Grounding
    Wang, Jiawei
    Ma, Zhanchang
    Cao, Da
    Le, Yuquan
    Xiao, Junbin
    Chua, Tat-Seng
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7521 - 7529
  • [26] On the Importance of Spatio-Temporal Learning for Video Quality Assessment
    Fontanel, Dario
    Higham, David
    Vallade, Benoit Quentin Arthur
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW), 2023, : 481 - 487
  • [27] Video representation learning by identifying spatio-temporal transformations
    Geng, Sheng
    Zhao, Shimin
    Liu, Hu
    APPLIED INTELLIGENCE, 2022, 52 : 6613 - 6622
  • [28] Learning Spatio-Temporal Downsampling for Effective Video Upscaling
    Xiang, Xiaoyu
    Tian, Yapeng
    Rengarajan, Vijay
    Young, Lucas D.
    Zhu, Bo
    Ranjan, Rakesh
    COMPUTER VISION - ECCV 2022, PT XVIII, 2022, 13678 : 162 - 181
  • [29] Transformer with Spatio-Temporal Representation for Video Anomaly Detection
    Sun, Xiaohu
    Chen, Jinyi
    Shen, Xulin
    Li, Hongjun
    STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, S+SSPR 2022, 2022, 13813 : 213 - 222
  • [30] Spatio-Temporal United Memory for Video Anomaly Detection
    Wang, Yunlong
    Chen, Mingyi
    Li, Jiaxin
    Li, Hongjun
    STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, S+SSPR 2022, 2022, 13813 : 84 - 93