Egocentric Audio-Visual Object Localization

被引：9

作者：

Huang, Chao ^{[1
]}

Flan, Yapeng ^{[1
]}

Kurnar, Anurag ^{[2
]}

Xu, Chenliang ^{[1
]}

机构：

[1] Univ Rochester, Rochester, NY 14627 USA

[2] Meta Real Labs Res, Redmond, WA USA

来源：

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023年

关键词：

D O I：

10.1109/CVPR52729.2023.02194

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Humans naturally perceive surrounding scenes by unifying sound and sight from a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) The out-of-view sound components can be created when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to overcome the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representation. During training, we take advantage of the naturally occurring audio-visual temporal synchronization as the "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes. Code is available at https://github.com/WikiChao/Ego-AV-Loc.

引用

页码：22910 / 22921

页数：12

共 50 条

[21] AUDIO-VISUAL SPEAKER LOCALIZATION VIA WEIGHTED CLUSTERING
Gebru, Israel D.
Alameda-Pineda, Xavier
Horaud, Radu
Forbes, Florence
2014 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2014,
[22] Tracking atoms with particles for audio-visual source localization
Monaci, Gianluca
Vandergheynst, Pierre
Maggio, Emilio
Cavallaro, Andrea
2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL II, PTS 1-3, 2007, : 753 - +
[23] Audio-visual speaker localization using graphical models
Kushal, Akash
Rahurkar, Mandar
Li Fei-Fei
Ponce, Jean
Huang, Thomas
18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2006, : 291 - +
[24] Audio-Visual Localization by Synthetic Acoustic Image Generation
Sanguineti, Valentina
Morerio, Pietro
Del Bue, Alessio
Murino, Vittorio
THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2523 - 2531
[25] The neural basis of visual dominance in the context of audio-visual object processing
Schmid, Carmen
Buechel, Christian
Rose, Michael
NEUROIMAGE, 2011, 55 (01) : 304 - 311
[26] Dual Perspective Network for Audio-Visual Event Localization
Rao, Varshanth
Khalil, Md Ibrahim
Li, Haoda
Dai, Peng
Lu, Juwei
COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 689 - 704
[27] Semantic and Relation Modulation for Audio-Visual Event Localization
Wang, Hao
Zha, Zheng-Jun
Li, Liang
Chen, Xuejin
Luo, Jiebo
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7711 - 7725
[28] Dual Attention Matching for Audio-Visual Event Localization
Wu, Yu
Zhu, Linchao
Yan, Yan
Yang, Yi
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 6301 - 6309
[29] An audio-visual distance for audio-visual speech vector quantization
Girin, L
Foucher, E
Feng, G
1998 IEEE SECOND WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 1998, : 523 - 528
[30] Catching audio-visual mice:: The extrapolation of audio-visual speed
Hofbauer, MM
Wuerger, SM
Meyer, GF
Röhrbein, F
Schill, K
Zetzsche, C
PERCEPTION, 2003, 32 : 96 - 96

← 1 2 3 4 5 →