PVASS-MDD: Predictive Visual-Audio Alignment Self-Supervision for Multimodal Deepfake Detection

Cited by: 4

Authors:

Yu, Yang [1,2]

Liu, Xiaolong [1,2]

Ni, Rongrong [1,2]

Yang, Siyuan [3]

Zhao, Yao [1,2]

Kot, Alex C. [4]

Affiliations:

[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China

[2] Beijing Key Lab Adv Informat Sci & Network Technol, Beijing 100044, Peoples R China

[3] Nanyang Technol Univ, Interdisciplinary Grad Program, Rapid Rich Object Search Lab, Singapore 639798, Singapore

[4] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2024 / Vol. 34 / No. 08

Keywords:

Deepfakes; Visualization; Feature extraction; Forgery; Faces; Collaboration; Task analysis; Multimodal deepfake detection; visual-audio alignment; self-supervised auxiliary;

DOI:

10.1109/TCSVT.2023.3309899

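Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code:

0808 ; 0809 ;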

Abstract:

Deepfake techniques can forge the visual or audio signals in a video, which leads to inconsistencies between the visual and audio (VA) signals. Multimodal detection methods therefore expose deepfake videos by extracting VA inconsistencies. Recently, deepfake technology has begun to forge visual and audio signals collaboratively to obtain more realistic deepfake videos, which poses new challenges for extracting VA inconsistencies. Recent multimodal detection methods first extract natural VA correspondences from real videos in a self-supervised manner, and then use the learned real correspondences as targets to guide the extraction of VA inconsistencies in the subsequent deepfake detection stage. However, the inherent VA relations are difficult to extract because of the modality gap, which limits the auxiliary benefit of these self-supervised methods. In this paper, we propose Predictive Visual-Audio Alignment Self-Supervision for Multimodal Deepfake Detection (PVASS-MDD), which consists of a PVASS auxiliary stage and an MDD stage. In the PVASS auxiliary stage, which operates on real videos, we first devise a three-stream network that associates two augmented visual views with the corresponding audio clues, so that common VA correspondences are explored through cross-view learning. Second, we introduce a novel cross-modal predictive align module that eliminates the VA gap to provide inherent VA correspondences. In the MDD stage, we propose an auxiliary loss that uses the frozen PVASS network to align the VA features of real videos, which better assists the multimodal deepfake detector in capturing subtle VA inconsistencies. We conduct extensive experiments on widely used and recent multimodal deepfake datasets. Our method obtains a significant performance improvement over state-of-the-art methods.
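The abstract does not give the form of the auxiliary loss, so the following is only an illustrative sketch of the general idea of aligning detector features with targets from a frozen network on real videos. It assumes a cosine-distance penalty, precomputed feature vectors, and a label convention in which 0 marks a real video. The function names `cosine_similarity` and `auxiliary_alignment_loss` and the `real_label` parameter are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def auxiliary_alignment_loss(detector_feats, frozen_target_feats, labels, real_label=0):
    """Mean cosine distance between detector features and frozen-network targets.

    Assumption: only samples labeled as real contribute, mirroring the abstract's
    statement that alignment is applied to the VA features of real videos.
    Returns 0.0 when the batch contains no real samples.
    """
    distances = [
        1.0 - cosine_similarity(feat, target)
        for feat, target, label in zip(detector_feats, frozen_target_feats, labels)
        if label == real_label
    ]
    if not distances:
        return 0.0
    return sum(distances) / len(distances)
```

Under these assumptions, a real sample whose detector feature already matches its frozen target contributes nothing to the loss, while fake samples are ignored entirely, so the term only pulls real-video features toward the learned correspondences.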


Pages: 6926-6936

Page count: 11

