Dynamic Cross Attention for Audio-Visual Person Verification

Cited by: 0
Authors
Praveen, R. Gnana [1 ]
Alam, Jahangir [1 ]
Affiliation
[1] Comp Res Inst Montreal, Montreal, PQ, Canada
Keywords
SPEAKER RECOGNITION;
DOI
10.1109/FG59268.2024.10581998
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Although person or identity verification has been explored predominantly with individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they do not always complement each other strongly; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross Attention (DCA) model that can dynamically select either the cross-attended or the unattended features on the fly, depending on whether the audio and visual modalities exhibit strong or weak complementary relationships, respectively. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose cross-attended features only when they exhibit strong complementary relationships, and unattended features otherwise. Extensive experiments are conducted on the Voxceleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves performance across multiple variants of cross-attention while outperforming state-of-the-art methods. Code is available at https://github.com/praveena2j/DCAforPersonVerification
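The abstract describes the core mechanism: cross-attention between audio and visual features, followed by a conditional gate that falls back to the unattended (unimodal) features when the two modalities complement each other only weakly. The PyTorch sketch below illustrates that idea only; it is not the authors' implementation (see the linked repository for that), and the layer sizes, the use of multi-head cross-attention, and the per-modality sigmoid gates are illustrative assumptions.

import torch
import torch.nn as nn

class DynamicCrossAttentionGate(nn.Module):
    """Minimal sketch of a conditional gate over cross-attended vs. unattended
    audio-visual features. All dimensions and design details are assumptions."""

    def __init__(self, dim=512, num_heads=4):
        super().__init__()
        # Cross-attention: audio queries attend to visual keys/values and vice versa.
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hypothetical conditional gating layers: one scalar gate per modality,
        # computed from the concatenation of unattended and attended features.
        self.gate_a = nn.Linear(2 * dim, 1)
        self.gate_v = nn.Linear(2 * dim, 1)

    def forward(self, audio, visual):
        # audio, visual: (batch, seq_len, dim) unimodal feature sequences
        audio_att, _ = self.a2v_attn(query=audio, key=visual, value=visual)
        visual_att, _ = self.v2a_attn(query=visual, key=audio, value=audio)

        # Sigmoid gates in (0, 1) softly select attended vs. unattended features:
        # a gate near 1 keeps the cross-attended features (strong complementarity),
        # a gate near 0 keeps the original unattended features (weak complementarity).
        g_a = torch.sigmoid(self.gate_a(torch.cat([audio, audio_att], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([visual, visual_att], dim=-1)))

        audio_out = g_a * audio_att + (1.0 - g_a) * audio
        visual_out = g_v * visual_att + (1.0 - g_v) * visual
        return audio_out, visual_out

if __name__ == "__main__":
    # Random tensors stand in for voice and face feature sequences.
    a = torch.randn(2, 10, 512)
    v = torch.randn(2, 10, 512)
    a_out, v_out = DynamicCrossAttentionGate(dim=512)(a, v)
    print(a_out.shape, v_out.shape)  # torch.Size([2, 10, 512]) each

A soft sigmoid gate is used here for differentiability; a hard (thresholded or Gumbel-style) selection between attended and unattended features would be an equally plausible reading of "dynamically select ... on the fly".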
Pages: 5
Related Papers
50 in total
  • [21] Audio-visual identity verification: An introductory overview
    Abboud, Bouchra
    Bredin, Herve
    Aversano, Guido
    Chollet, Gerard
    PROGRESS IN NONLINEAR SPEECH PROCESSING, 2007, 4391 : 118 - +
  • [22] Audio-Visual Attention Networks for Emotion Recognition
    Lee, Jiyoung
    Kim, Sunok
    Kim, Seungryong
    Sohn, Kwanghoon
    AVSU'18: PROCEEDINGS OF THE 2018 WORKSHOP ON AUDIO-VISUAL SCENE UNDERSTANDING FOR IMMERSIVE MULTIMEDIA, 2018, : 27 - 32
  • [23] VIDEO CODING BASED ON AUDIO-VISUAL ATTENTION
    Lee, Jong-Seok
    De Simone, Francesca
    Ebrahimi, Touradj
    ICME: 2009 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-3, 2009, : 57 - 60
  • [24] An audio-visual database for evaluating person tracking algorithms
    Krinidis, M
    Stamou, G
    Teutsch, H
    Spors, S
    Nikolaidis, N
    Rabenstein, R
    Pitas, I
    2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 237 - 240
  • [25] Multi-Feature Audio-Visual Person Recognition
    Das, Amitav
    Manyam, Ohil K.
    Tapaswi, Makarand
    2008 IEEE WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2008, : 227 - 232
  • [26] A Deep Neural Network for Audio-Visual Person Recognition
    Alam, Mohammad Rafiqul
    Bennamoun, Mohammed
    Togneri, Roberto
    Sohel, Ferdous
    2015 IEEE 7TH INTERNATIONAL CONFERENCE ON BIOMETRICS THEORY, APPLICATIONS AND SYSTEMS (BTAS 2015), 2015,
  • [27] Online Cross-Modal Adaptation for Audio-Visual Person Identification With Wearable Cameras
    Brutti, Alessio
    Cavallaro, Andrea
    IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS, 2017, 47 (01) : 40 - 51
  • [28] INTRODUCTION OF QUALITY MEASURES IN AUDIO-VISUAL IDENTITY VERIFICATION
    Bendris, Meriem
    Charlet, Delphine
    Chollet, Gerard
    2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGS, 2009, : 1913 - 1916
  • [29] Face Anthropometry Aware Audio-visual Age Verification
    Korshunov, Pavel
    Marcel, Sebastien
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5944 - 5951
  • [30] Multi-Attention Audio-Visual Fusion Network for Audio Spatialization
    Zhang, Wen
    Shao, Jie
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 394 - 401