Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning

Cited by: 14
Authors
Sun, Weixuan [1, 5]
Zhang, Jiayi [2]
Wang, Jianyuan [3]
Liu, Zheyuan [1]
Zhong, Yiran [4]
Feng, Tianpeng [5]
Guo, Yandong [5]
Zhang, Yanhao [5]
Barnes, Nick [1]
Affiliations
[1] Australian Natl Univ, Canberra, Australia
[2] Beihang Univ, Beijing, Peoples R China
[3] Univ Oxford, Oxford, England
[4] Shanghai AI Lab, Shanghai, Peoples R China
[5] OPPO Res Inst, Shenzhen, Peoples R China
DOI
10.1109/CVPR52729.2023.00621
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with contrastive learning, which assumes that only the audio and visual contents from the same video are positive samples for each other. However, this assumption suffers from false negative samples in real-world training. For example, for a given audio sample, treating frames from the same audio class as negative samples may mislead the model and harm the learned representations (e.g., the audio of a wailing siren may reasonably correspond to the ambulances in multiple images). Based on this observation, we propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the misleading effect of such false negative samples on training. Specifically, we use intra-modal similarities to identify potentially similar samples and construct corresponding adjacency matrices to guide contrastive learning. Further, we propose to strengthen the role of true negative samples by explicitly leveraging the visual features of sound sources to help differentiate authentic sounding source regions. FNAC achieves state-of-the-art performance on Flickr-SoundNet, VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in mitigating the false negative issue. The code is available at https://github.com/OpenNLPLab/FNAC_AVL.
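The idea of letting intra-modal similarities guide the contrastive objective can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes one pooled feature vector per audio clip and per frame, builds row-normalized intra-modal adjacency matrices with a softmax, and uses their average as soft targets for the cross-modal similarities in place of the usual identity targets. The function name `fn_aware_contrastive_loss`, the temperature `tau`, and the averaging of the two adjacency matrices are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def softmax_rows(x):
    # Numerically stable softmax over each row.
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def log_softmax_rows(x):
    # Numerically stable log-softmax over each row.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def fn_aware_contrastive_loss(audio, visual, tau=0.1):
    """Hypothetical sketch: cross-modal contrastive loss with soft targets.

    audio, visual: arrays of shape (batch, dim), row i of each taken
    from the same video. tau is an assumed temperature.
    """
    a = l2_normalize(audio)
    v = l2_normalize(visual)

    # Intra-modal adjacency: how similar each sample is to every other
    # sample within the same modality (rows sum to 1).
    adj_audio = softmax_rows(a @ a.T / tau)
    adj_visual = softmax_rows(v @ v.T / tau)

    # Assumption: average the two adjacency matrices to form soft targets.
    # Samples that look alike within a modality receive non-zero target
    # mass instead of being treated as hard negatives.
    target = 0.5 * (adj_audio + adj_visual)

    # Cross-modal log-probabilities (audio-to-visual direction only).
    log_prob = log_softmax_rows(a @ v.T / tau)

    # Cross-entropy between the soft targets and the cross-modal distribution.
    loss = -(target * log_prob).sum(axis=1).mean()
    return loss, target

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(4, 8))
    visual = rng.normal(size=(4, 8))
    # Make samples 0 and 1 identical to mimic two videos of the same class.
    audio[1], visual[1] = audio[0], visual[0]
    loss, target = fn_aware_contrastive_loss(audio, visual)
    print("loss:", loss)
    print("target row 0:", target[0])
```

In the demo, samples 0 and 1 share identical features, so row 0 of the target matrix assigns them equal weight, whereas a standard one-hot target would treat sample 1 as a negative for sample 0.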
Pages: 6420-6429
Number of pages: 10