Dynamic Cross Attention for Audio-Visual Person Verification

Cited by: 0
Authors
Praveen, R. Gnana [1 ]
Alam, Jahangir [1 ]
Affiliations
[1] Comp Res Inst Montreal, Montreal, PQ, Canada
Keywords
SPEAKER RECOGNITION;
DOI
10.1109/FG59268.2024.10581998
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Although person or identity verification has been explored predominantly using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they do not always strongly complement each other; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly, depending on whether the modalities exhibit strong or weak complementary relationships, respectively. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose cross-attended features only when the modalities exhibit strong complementary relationships, falling back to unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves performance across multiple variants of cross-attention while outperforming state-of-the-art methods. Code is available at https://github.com/praveena2j/DCAforPersonVerification
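The gating idea in the abstract can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the authors' implementation (see their repository for that): audio features attend over visual features via standard scaled dot-product cross-attention, and a hypothetical sigmoid gate, conditioned on the original and cross-attended features, softly selects between them per time step. All function names, weight shapes, and the soft (rather than hard) selection are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, k, v):
    # scaled dot-product cross-attention: queries from one modality,
    # keys/values from the other
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def dynamic_cross_attention(x_a, x_v, W_gate, b_gate):
    # Illustrative sketch of the DCA idea: a conditional gate, fed the
    # concatenation of the unattended and cross-attended audio features,
    # decides per time step how much of the cross-attended feature to keep.
    att_a = cross_attend(x_a, x_v, x_v)                  # audio attended by visual
    gate_in = np.concatenate([x_a, att_a], axis=-1)      # gate conditioning input
    g = 1.0 / (1.0 + np.exp(-(gate_in @ W_gate + b_gate)))  # sigmoid gate in (0, 1)
    return g * att_a + (1.0 - g) * x_a                   # soft selection
```

When the gate saturates near 1 the cross-attended (strongly complementary) features dominate; near 0 the model falls back to the unattended features, matching the weak-complementarity case described above. A symmetric path would do the same for the visual stream.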
Pages: 5
Related Papers
50 items in total
  • [41] An Efficient Reliability Estimation Technique For Audio-Visual Person Identification
    Alam, Mohammad Rafiqul
    Bennamoun, Mohammed
    Togneri, Roberto
    Sohel, Ferdous
    PROCEEDINGS OF THE 2013 IEEE 8TH CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (ICIEA), 2013, : 1631 - 1635
  • [42] Incremental Audio-Visual Fusion for Person Recognition in Earthquake Scene
    You, Sisi
    Zuo, Yukun
    Yao, Hantao
    Xu, Changsheng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (02)
  • [43] An audio-visual distance for audio-visual speech vector quantization
    Girin, L
    Foucher, E
    Feng, G
    1998 IEEE SECOND WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 1998, : 523 - 528
  • [44] Catching audio-visual mice: The extrapolation of audio-visual speed
    Hofbauer, MM
    Wuerger, SM
    Meyer, GF
    Röhrbein, F
    Schill, K
    Zetzsche, C
    PERCEPTION, 2003, 32 : 96 - 96
  • [45] Noise resistant audio-visual verification via structural constraints
    Sanderson, C
    Paliwal, KK
    2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL V, PROCEEDINGS: SENSOR ARRAY & MULTICHANNEL SIGNAL PROCESSING AUDIO AND ELECTROACOUSTICS MULTIMEDIA SIGNAL PROCESSING, 2003, : 716 - 719
  • [46] A MULTI-VIEW APPROACH TO AUDIO-VISUAL SPEAKER VERIFICATION
    Sari, Leda
    Singh, Kritika
    Zhou, Jiatong
    Torresani, Lorenzo
    Singhal, Nayan
    Saraf, Yatharth
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6194 - 6198
  • [47] Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
    Mercea, Otniel-Bogdan
    Riesch, Lukas
    Koepke, A. Sophia
    Akata, Zeynep
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10543 - 10553
  • [48] Multi-scale network with shared cross-attention for audio-visual correlation learning
    Zhang, Jiwei
    Yu, Yi
    Tang, Suhua
    Li, Wei
    Wu, Jianming
    NEURAL COMPUTING & APPLICATIONS, 2023, 35 (27): : 20173 - 20187
  • [49] Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization
    Xuan, Hanyu
    Luo, Lei
    Zhang, Zhenyu
    Yang, Jian
    Yan, Yan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7878 - 7888
  • [50] AUDIO-VISUAL SPEECH RECOGNITION WITH A HYBRID CTC/ATTENTION ARCHITECTURE
    Petridis, Stavros
    Stafylakis, Themos
    Ma, Pingchuan
    Tzimiropoulos, Georgios
    Pantic, Maja
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 513 - 520