AVForensics: Audio-driven Deepfake Video Detection with Masking Strategy in Self-supervision

Cited by: 3
Authors
Zhu Yizhe [1 ]
Gao Jialin [2 ]
Zhou Xi [3 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] CloudWalk Technol, Shanghai, Peoples R China
Keywords
Deepfake detection; audio-visual; masking strategy; self-supervision;
DOI
10.1145/3591106.3592218
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Existing cross-dataset deepfake detection approaches exploit mouth-related mismatches between the auditory and visual modalities in fake videos to enhance generalisation to unseen forgeries. However, such methods inevitably suffer performance degradation when mouth motions are limited or unaltered; we argue that face forgery detection consistently benefits from high-level cues drawn from the whole face region. In this paper, we propose a two-phase audio-driven multi-modal transformer-based framework, termed AVForensics, which detects deepfake video content from an audio-visual matching view of the full face. In the first, pre-training phase, we apply a novel uniform masking strategy to model global facial features and learn temporally dense video representations in a self-supervised cross-modal manner, capturing the natural correspondence between the visual and auditory modalities without requiring large-scale labelled data or heavy memory usage. In the second phase, we fine-tune these learned representations for the downstream deepfake detection task, which encourages the model to make accurate predictions based on the captured global facial movement features. Extensive experiments and visualizations on various public datasets demonstrate the superiority of our self-supervised pre-trained method in achieving generalisable and robust deepfake video detection.
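The following is a minimal sketch, not the authors' code, of the two-phase idea described in the abstract: (1) self-supervised audio-visual pre-training with a uniform masking strategy over full-face video tokens, and (2) fine-tuning the learned encoder for binary deepfake classification. All module names, dimensions, and the choice of a symmetric InfoNCE objective as the cross-modal correspondence loss are assumptions for illustration; the paper's actual masking and training details may differ.

```python
# Hypothetical sketch of a two-phase masked audio-visual pipeline (not AVForensics itself).
import torch
import torch.nn as nn

def uniform_mask(tokens: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """Keep a uniformly sampled subset of video tokens: (B, N, D) -> (B, N*(1-r), D)."""
    B, N, D = tokens.shape
    keep = int(N * (1.0 - mask_ratio))
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

class AVEncoder(nn.Module):
    """Toy transformer encoders for the visual and auditory streams (dimensions assumed)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.video_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.audio_enc = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, v_tokens, a_tokens):
        v = self.video_enc(v_tokens).mean(dim=1)   # pooled visual embedding
        a = self.audio_enc(a_tokens).mean(dim=1)   # pooled auditory embedding
        return v, a

def pretrain_loss(v, a, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched audio-video pairs, standing in for the
    cross-modal correspondence objective sketched in the abstract."""
    v = nn.functional.normalize(v, dim=-1)
    a = nn.functional.normalize(a, dim=-1)
    logits = v @ a.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (nn.functional.cross_entropy(logits, targets)
                  + nn.functional.cross_entropy(logits.t(), targets))

# Phase 1: self-supervised pre-training on real talking-face clips (dummy tensors here).
enc = AVEncoder()
video_tokens = torch.randn(8, 64, 256)   # (batch, face-patch tokens, dim)
audio_tokens = torch.randn(8, 32, 256)   # (batch, audio-frame tokens, dim)
v_emb, a_emb = enc(uniform_mask(video_tokens), audio_tokens)
pretrain_loss(v_emb, a_emb).backward()

# Phase 2: fine-tune the visual encoder with a binary real/fake head on labelled videos.
classifier = nn.Linear(256, 2)
real_fake_logits = classifier(enc.video_enc(video_tokens).mean(dim=1))
```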
Pages: 162-171
Page count: 10
Related Papers
50 records in total
  • [1] PVASS-MDD: Predictive Visual-Audio Alignment Self-Supervision for Multimodal Deepfake Detection. Yu, Yang; Liu, Xiaolong; Ni, Rongrong; Yang, Siyuan; Zhao, Yao; Kot, Alex C. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34(08): 6926-6936.
  • [2] Photorealistic Audio-driven Video Portraits. Wen, Xin; Wang, Miao; Richardt, Christian; Chen, Ze-Yin; Hu, Shi-Min. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2020, 26(12): 3457-3466.
  • [3] Audio-Driven Emotional Video Portraits. Ji, Xinya; Zhou, Hang; Wang, Kaisiyuan; Wu, Wayne; Loy, Chen Change; Cao, Xun; Xu, Feng. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2021), 2021: 14075-14084.
  • [4] Deep Video Inpainting Guided by Audio-Visual Self-Supervision. Kim, Kyuyeon; Jung, Junsik; Kim, Woo Jae; Yoon, Sung-Eui. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 1970-1974.
  • [5] Audio-Driven Talking Video Frame Restoration. Cheng, Harry; Guo, Yangyang; Yin, Jianhua; Chen, Haonan; Wang, Jiafang; Nie, Liqiang. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 4110-4122.
  • [6] ASVFI: Audio-Driven Speaker Video Frame Interpolation. Wang, Qianrui; Li, Dengshi; Liao, Liang; Song, Hao; Li, Wei; Xiao, Jing. 2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2023: 3200-3204.
  • [7] Audio-driven Talking Face Video Generation with Emotion. Liang, Jiadong; Lu, Feng. 2024 IEEE CONFERENCE ON VIRTUAL REALITY AND 3D USER INTERFACES ABSTRACTS AND WORKSHOPS (VRW 2024), 2024: 863-864.
  • [8] Pre-Training Audio Representations With Self-Supervision. Tagliasacchi, Marco; Gfeller, Beat; Quitry, Felix de Chaumont; Roblek, Dominik. IEEE SIGNAL PROCESSING LETTERS, 2020, 27: 600-604.
  • [9] Learning to Remove Rain in Video With Self-Supervision. Yang, Wenhan; Tan, Robby T.; Wang, Shiqi; Kot, Alex C.; Liu, Jiaying. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46(03): 1378-1396.
  • [10] Audio-Driven Co-Speech Gesture Video Generation. Liu, Xian; Wu, Qianyi; Zhou, Hang; Du, Yuanqi; Wu, Wayne; Lin, Dahua; Liu, Ziwei. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022.