Egocentric Audio-Visual Object Localization

Cited by: 9

Authors:

Huang, Chao [1]

Tian, Yapeng [1]

Kumar, Anurag [2]

Xu, Chenliang [1]

Affiliations:

[1] Univ Rochester, Rochester, NY 14627 USA

[2] Meta Real Labs Res, Redmond, WA USA

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

DOI:

10.1109/CVPR52729.2023.02194

Chinese Library Classification (CLC) code:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Humans naturally perceive surrounding scenes by unifying sound and sight from a first-person view. Likewise, machines can move closer to human intelligence by learning from multisensory inputs captured from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; and 2) out-of-view sound components can arise when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to overcome the second issue. It improves cross-modal localization robustness by disentangling visually indicated audio representations. During training, we take advantage of the naturally occurring audio-visual temporal synchronization as "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and generalizes to diverse audio-visual scenes. Code is available at https://github.com/WikiChao/Ego-AV-Loc.

Pages: 22910-22921

Page count: 12
