Egocentric Audio-Visual Object Localization

Cited by: 9

Authors:

Huang, Chao [1]

Tian, Yapeng [1]

Kumar, Anurag [2]

Xu, Chenliang [1]

Affiliations:

[1] Univ Rochester, Rochester, NY 14627 USA

[2] Meta Real Labs Res, Redmond, WA USA

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

Keywords:

DOI:

10.1109/CVPR52729.2023.02194

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Humans naturally perceive surrounding scenes by unifying sound and sight from a first-person view. Likewise, machines can move closer to human intelligence by learning from multisensory inputs in an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; and 2) out-of-view sound components can be created when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to overcome the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representation. During training, we take advantage of the naturally occurring audio-visual temporal synchronization as "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes. Code is available at https://github.com/WikiChao/Ego-AV-Loc.
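The abstract's idea of using an estimated geometric transformation to update visual representations can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the inter-frame transformation is a known 3x3 homography (the paper estimates it), it uses nearest-neighbor sampling, and it aggregates by a simple average. The names `warp_features` and `aggregate` are hypothetical.

```python
import numpy as np

def warp_features(feat, hom):
    """Warp a (C, H, W) feature map into a target frame.

    `hom` is an assumed 3x3 homography mapping target pixel
    coordinates (x, y, 1) to source pixel coordinates.
    Returns the warped map and a (H, W) mask of valid pixels.
    """
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    tgt = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # (3, H*W)
    src = hom @ tgt
    # Nearest-neighbor sampling after the perspective divide.
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros((c, h * w), dtype=feat.dtype)
    out[:, valid] = feat[:, sy[valid], sx[valid]]
    return out.reshape(c, h, w), valid.reshape(h, w)

def aggregate(feats, homs):
    """Average neighboring-frame features after aligning them to one frame."""
    total = np.zeros(feats[0].shape, dtype=float)
    count = np.zeros(feats[0].shape[1:], dtype=float)
    for feat, hom in zip(feats, homs):
        warped, valid = warp_features(feat, hom)
        total += warped
        count += valid
    # Pixels seen in no frame keep a zero feature.
    return total / np.maximum(count, 1.0)
```

With an identity homography the feature map is returned unchanged; a pure translation shifts it and marks the pixels that fall outside the source frame as invalid, so they do not dilute the average.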

Citation

Pages: 22910-22921

Page count: 12

