Dynamic Cross Attention for Audio-Visual Person Verification

Times Cited: 0
Authors
Praveen, R. Gnana [1 ]
Alam, Jahangir [1 ]
Affiliations
[1] Comp Res Inst Montreal, Montreal, PQ, Canada
Keywords
SPEAKER RECOGNITION;
DOI
10.1109/FG59268.2024.10581998
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Although person or identity verification has been predominantly explored using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they do not always strongly complement each other; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross Attention (DCA) model that can dynamically select either the cross-attended or the unattended features on the fly, depending on whether the audio and visual modalities exhibit strong or weak complementary relationships, respectively. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and choose the cross-attended features only when they exhibit strong complementary relationships, falling back to the unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves performance across multiple variants of cross-attention while outperforming state-of-the-art methods. Code is available at https://github.com/praveena2j/DCAforPersonVerification
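To make the gating idea in the abstract concrete, below is a minimal PyTorch-style sketch of a dynamic cross-attention block: each modality cross-attends over the other, and a conditional gating layer decides how much of the cross-attended features to keep versus the original, unattended features. The module and parameter names (DynamicCrossAttention, proj_a, gate_a, dim_common), the shared projection, and the use of a soft sigmoid gate are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn


class DynamicCrossAttention(nn.Module):
    """Illustrative sketch of dynamic cross-attention with conditional gating.

    Assumption: a sigmoid gate softly blends cross-attended and unattended
    features; the paper describes selecting between them on the fly.
    """

    def __init__(self, dim_audio: int, dim_visual: int, dim_common: int = 512):
        super().__init__()
        self.proj_a = nn.Linear(dim_audio, dim_common)   # audio -> shared space
        self.proj_v = nn.Linear(dim_visual, dim_common)  # visual -> shared space
        # Conditional gating layers: one scalar gate per time step and modality.
        self.gate_a = nn.Sequential(nn.Linear(2 * dim_common, 1), nn.Sigmoid())
        self.gate_v = nn.Sequential(nn.Linear(2 * dim_common, 1), nn.Sigmoid())

    def forward(self, feat_a: torch.Tensor, feat_v: torch.Tensor):
        # feat_a: (B, Ta, dim_audio) audio features
        # feat_v: (B, Tv, dim_visual) visual features
        a = self.proj_a(feat_a)
        v = self.proj_v(feat_v)
        scale = a.size(-1) ** 0.5
        # Cross-attention: each modality attends over the other.
        attn_av = torch.softmax(a @ v.transpose(1, 2) / scale, dim=-1)  # (B, Ta, Tv)
        attn_va = torch.softmax(v @ a.transpose(1, 2) / scale, dim=-1)  # (B, Tv, Ta)
        a_cross = attn_av @ v  # audio attended by video, (B, Ta, dim_common)
        v_cross = attn_va @ a  # video attended by audio, (B, Tv, dim_common)
        # Gating: weigh cross-attended features against unattended ones,
        # so weakly complementary samples fall back to the original features.
        g_a = self.gate_a(torch.cat([a, a_cross], dim=-1))  # (B, Ta, 1)
        g_v = self.gate_v(torch.cat([v, v_cross], dim=-1))  # (B, Tv, 1)
        out_a = g_a * a_cross + (1.0 - g_a) * a
        out_v = g_v * v_cross + (1.0 - g_v) * v
        return out_a, out_v
```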
Pages: 5
Related Papers
50 records in total
  • [1] Audio-Visual Fusion Based on Interactive Attention for Person Verification
    Jing, Xuebin
    He, Liang
    Song, Zhida
    Wang, Shaolei
    SENSORS, 2023, 23 (24)
  • [2] Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention
    Praveen, R. Gnana
    Alam, Jahangir
    2024 IEEE 18TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, FG 2024, 2024,
  • [3] Audio-Visual Speaker Verification via Joint Cross-Attention
    Rajasekhar, Gnana Praveen
    Alam, Jahangir
    SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 18 - 31
  • [4] Attention Fusion for Audio-Visual Person Verification Using Multi-Scale Features
    Hoermann, Stefan
    Moiz, Abdul
    Knoche, Martin
    Rigoll, Gerhard
    2020 15TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2020), 2020, : 281 - 285
  • [5] Scalability analysis of audio-visual person identity verification
    Czyz, J
    Bengio, S
    Marcel, C
    Vandendorpe, L
    AUDIO-AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, PROCEEDINGS, 2003, 2688 : 752 - 760
  • [6] Dynamic visual features for audio-visual speaker verification
    Dean, David
    Sridharan, Sridha
    COMPUTER SPEECH AND LANGUAGE, 2010, 24 (02): : 136 - 149
  • [7] Audio-Visual Deep Neural Network for Robust Person Verification
    Qian, Yanmin
    Chen, Zhengyang
    Wang, Shuai
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 1079 - 1092
  • [8] Dynamic Audio-Visual Biometric Fusion for Person Recognition
    Alsaedi, Najlaa Hindi
    Jaha, Emad Sami
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 71 (01): : 1283 - 1311
  • [9] Audio-visual dynamic remapping in an endogenous spatial attention task
    Fagioli, Sabrina
    Couyoumdjian, Alessandro
    Ferlazzo, Fabio
    BEHAVIOURAL BRAIN RESEARCH, 2006, 173 (01) : 30 - 38
  • [10] A Method of Audio-Visual Person Verification by Mining Connections between Time Series
    Sun, Peiwen
    Zhang, Shanshan
    Liu, Zishan
    Yuan, Yougen
    Zhang, Taotao
    Zhang, Honggang
    Hu, Pengfei
    INTERSPEECH 2023, 2023, : 3227 - 3231