Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning

Cited by: 14

Authors:

Sun, Weixuan [1,5]

Zhang, Jiayi [2]

Wang, Jianyuan [3]

Liu, Zheyuan [1]

Zhong, Yiran [4]

Feng, Tianpeng [5]

Guo, Yandong [5]

Zhang, Yanhao [5]

Barnes, Nick [1]

Affiliations:

[1] Australian Natl Univ, Canberra, Australia

[2] Beihang Univ, Beijing, Peoples R China

[3] Univ Oxford, Oxford, England

[4] Shanghai AI Lab, Shanghai, Peoples R China

[5] OPPO Res Inst, Shenzhen, Peoples R China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Keywords:

DOI:

10.1109/CVPR52729.2023.00621

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with contrastive learning, which assumes that only the audio and visual contents from the same video are positive samples for each other. However, this assumption suffers from false negative samples in real-world training. For example, for a given audio sample, treating frames from the same audio class as negative samples may mislead the model and harm the learned representations (e.g., the audio of a wailing siren may reasonably correspond to the ambulances in multiple images). Based on this observation, we propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the misleading effect of such false negative samples on training. Specifically, we use intra-modal similarities to identify potentially similar samples and construct corresponding adjacency matrices to guide contrastive learning. Further, we propose to strengthen the role of true negative samples by explicitly leveraging the visual features of sound sources to facilitate the differentiation of authentic sounding source regions. FNAC achieves state-of-the-art performance on Flickr-SoundNet, VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in mitigating the false negative issue. The code is available at https://github.com/OpenNLPLab/FNAC_AVL.
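
The idea of using intra-modal similarities to build an adjacency matrix that guides contrastive learning can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked in the abstract for that); it is one plausible reading under stated assumptions. The function names, the cosine-similarity threshold of 0.9, the temperature of 0.07, and the use of row-normalized adjacency rows as soft targets are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def adjacency_targets(feats, threshold=0.9):
    """Build soft targets from intra-modal cosine similarity.

    Assumption: samples whose similarity reaches `threshold` are treated
    as likely false negatives and share the target mass with the true pair.
    """
    f = l2_normalize(feats)
    sim = f @ f.T                                  # intra-modal similarity
    adj = (sim >= threshold).astype(np.float64)    # binary adjacency matrix
    np.fill_diagonal(adj, 1.0)                     # a sample always matches itself
    return adj / adj.sum(axis=1, keepdims=True)    # each row sums to 1

def soft_target_contrastive_loss(audio, visual, targets, temperature=0.07):
    """Cross-entropy between audio-to-visual similarities and soft targets."""
    logits = l2_normalize(audio) @ l2_normalize(visual).T / temperature
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())

# Toy batch: samples 0 and 1 carry the same sound (e.g., two siren clips).
audio = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
visual = audio.copy()
targets = adjacency_targets(audio)
loss = soft_target_contrastive_loss(audio, visual, targets)
```

With identity targets, sample 1 would be pushed away from sample 0 even though both contain the same sound; with the adjacency-based targets, the two samples share the positive mass in each other's rows instead.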


Pages: 6420-6429

Number of pages: 10
