PVASS-MDD: Predictive Visual-Audio Alignment Self-Supervision for Multimodal Deepfake Detection

Cited by: 4
Authors
Yu, Yang [1, 2]
Liu, Xiaolong [1, 2]
Ni, Rongrong [1, 2]
Yang, Siyuan [3]
Zhao, Yao [1, 2]
Kot, Alex C. [4]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Technol, Beijing 100044, Peoples R China
[3] Nanyang Technol Univ, Interdisciplinary Grad Program, Rapid Rich Object Search Lab, Singapore 639798, Singapore
[4] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore
Keywords
Deepfakes; Visualization; Feature extraction; Forgery; Faces; Collaboration; Task analysis; Multimodal deepfake detection; visual-audio alignment; self-supervised auxiliary
DOI
10.1109/TCSVT.2023.3309899
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract
Deepfake techniques can forge the visual or audio signals in a video, which leads to inconsistencies between the visual and audio (VA) signals. Multimodal detection methods therefore expose deepfake videos by extracting VA inconsistencies. Recently, deepfake technology has begun to forge visual and audio signals collaboratively to obtain more realistic deepfake videos, which poses new challenges for extracting VA inconsistencies. Recent multimodal detection methods first extract natural VA correspondences from real videos in a self-supervised manner, and then use the learned real correspondences as targets to guide the extraction of VA inconsistencies in the subsequent deepfake detection stage. However, the inherent VA relations are difficult to extract because of the modality gap, which limits the auxiliary performance of these self-supervised methods. In this paper, we propose Predictive Visual-Audio Alignment Self-Supervision for Multimodal Deepfake Detection (PVASS-MDD), which consists of a PVASS auxiliary stage and an MDD stage. In the PVASS auxiliary stage, which operates on real videos, we first devise a three-stream network that associates two augmented visual views with the corresponding audio clues, so as to explore common VA correspondences through cross-view learning. Second, we introduce a novel cross-modal predictive align module that eliminates VA gaps to provide inherent VA correspondences. In the MDD stage, we propose an auxiliary loss that uses the frozen PVASS network to align the VA features of real videos, which better assists the multimodal deepfake detector in capturing subtle VA inconsistencies. We conduct extensive experiments on widely used and recent multimodal deepfake datasets. Our method obtains a significant performance improvement over state-of-the-art methods.
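The abstract gives no formulas, so the following is only an illustrative sketch of what a predictive VA alignment term and its use as an auxiliary loss could look like. It is not the authors' formulation. The linear predictor, the cosine-distance measure, the function names, and the `aux_weight` parameter are all assumptions introduced here for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def predictive_alignment_loss(visual_feats, audio_feats, predictor_weights):
    """Mean cosine distance between audio features predicted from the
    visual features and the actual audio features.

    Assumption: the predictor is a single linear map (hypothetical); the
    abstract does not describe the real predictive align module.
    """
    predicted_audio = visual_feats @ predictor_weights
    p = l2_normalize(predicted_audio)
    a = l2_normalize(audio_feats)
    cosine = np.sum(p * a, axis=-1)
    return float(np.mean(1.0 - cosine))


def detection_objective(classification_loss, alignment_loss_real, aux_weight=0.5):
    """Combine the detector's classification loss with an auxiliary
    alignment term computed on real videos only.

    Assumption: a simple weighted sum; aux_weight is a hypothetical knob.
    """
    return classification_loss + aux_weight * alignment_loss_real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(4, 8))   # 4 clips, 8-dim visual features
    audio = rng.normal(size=(4, 8))    # matching 8-dim audio features
    weights = np.eye(8)                # stand-in for a trained predictor
    align = predictive_alignment_loss(visual, audio, weights)
    print(detection_objective(classification_loss=0.7, alignment_loss_real=align))
```

In this sketch the alignment term is near zero when the predicted and actual audio features point in the same direction and grows toward 2 as they become opposed, so adding it to the detection objective penalizes a detector whose real-video VA features drift out of alignment.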
Pages: 6926-6936
Page count: 11