Egocentric Audio-Visual Object Localization

Cited by: 9

Authors:

Huang, Chao ^{[1]}

Tian, Yapeng ^{[1]}

Kumar, Anurag ^{[2]}

Xu, Chenliang ^{[1]}

Affiliations:

[1] Univ Rochester, Rochester, NY 14627 USA

[2] Meta Reality Labs Res, Redmond, WA USA

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

Keywords:

DOI:

10.1109/CVPR52729.2023.02194

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Humans naturally perceive surrounding scenes by unifying sound and sight from a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) out-of-view sound components can be created when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to overcome the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representation. During training, we take advantage of the naturally occurring audio-visual temporal synchronization as the "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes. Code is available at https://github.com/WikiChao/Ego-AV-Loc.


Pages: 22910-22921

Number of pages: 12

50 items in total

[1] Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
Yun, Heeseung
Gao, Ruohan
Ananthabhotla, Ishwarya
Kumar, Anurag
Donley, Jacob
Li, Chao
Kim, Gunhee
Ithapu, Vamsi Krishna
Murdock, Calvin
COMPUTER VISION - ECCV 2024, PT XXIV, 2025, 15082 : 256 - 274
[2] Integrated audio-visual processing for object localization and tracking
Pingali, GS
MULTIMEDIA COMPUTING AND NETWORKING 1998, 1997, 3310 : 206 - 213
[3] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
Jiang, Hao
Murdock, Calvin
Ithapu, Vamsi Krishna
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10534 - 10542
[4] Binaural Audio-Visual Localization
Wu, Xinyi
Wu, Zhenyao
Ju, Lili
Wang, Song
THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2961 - 2968
[5] A JOINT AUDIO-VISUAL APPROACH TO AUDIO LOCALIZATION
Jensen, Jesper Rindom
Christensen, Mads Graesboll
2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 454 - 458
[6] Audio, Visual, and Audio-Visual Egocentric Distance Perception by Moving Subjects in Virtual Environments
Rebillat, Marc
Boutillon, Xavier
Corteel, Etienne
Katz, Brian F. G.
ACM TRANSACTIONS ON APPLIED PERCEPTION, 2012, 9 (04)
[7] Listen to Look Into the Future: Audio-Visual Egocentric Gaze Anticipation
Lai, Bolin
Ryan, Fiona
Jia, Wenqi
Liu, Miao
Rehg, James M.
COMPUTER VISION - ECCV 2024, PT IX, 2025, 15067 : 192 - 210
[8] AUDIO-VISUAL OBJECT LOCALIZATION AND SEPARATION USING LOW-RANK AND SPARSITY
Pu, Jie
Panagakis, Yannis
Petridis, Stavros
Pantic, Maja
2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 2901 - 2905
[9] LEARNING DIFFERENTIABLE SPARSE AND LOW RANK NETWORKS FOR AUDIO-VISUAL OBJECT LOCALIZATION
Pu, Jie
Panagakis, Yannis
Pantic, Maja
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8668 - 8672
[10] Object Permanence Through Audio-Visual Representations
Bu, Fanjun
Huang, Chien-Ming
IEEE ACCESS, 2021, 9 : 131574 - 131582
