Egocentric Audio-Visual Object Localization

Cited by: 9
Authors
Huang, Chao [1]
Tian, Yapeng [1]
Kumar, Anurag [2]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Rochester, NY 14627 USA
[2] Meta Real Labs Res, Redmond, WA USA
Keywords
DOI
10.1109/CVPR52729.2023.02194
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Humans naturally perceive surrounding scenes by unifying sound and sight from a first-person view. Likewise, machines can move closer to human intelligence by learning from multisensory inputs in an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; and 2) out-of-view sound components can arise when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. To address the second problem, we propose a cascaded feature enhancement module, which improves cross-modal localization robustness by disentangling visually-indicated audio representations. During training, we take advantage of the naturally occurring audio-visual temporal synchronization as "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes. Code is available at https://github.com/WikiChao/Ego-AV-Loc.
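As a rough illustration of what cross-modal localization means in this setting, the sketch below scores each spatial location of a visual feature map against a single audio embedding using cosine similarity. This is a generic formulation and not the authors' implementation; the function name `localization_map`, the feature shapes, and the random stand-in features are assumptions made for the example. The abstract does not specify how the localization map is computed.

```python
import numpy as np

def localization_map(visual_feats, audio_emb, eps=1e-8):
    """Cosine similarity between an audio embedding and every spatial location.

    visual_feats: array of shape (H, W, C), hypothetical per-location visual features.
    audio_emb:    array of shape (C,), hypothetical audio embedding.
    Returns an (H, W) map; higher values mean a closer audio-visual match.
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    return v @ a

# Toy check with random stand-in features: the audio embedding is copied
# from location (2, 5), so the map should peak at that location.
rng = np.random.default_rng(0)
feats = rng.normal(size=(7, 7, 16))
audio = feats[2, 5].copy()
heat = localization_map(feats, audio)
peak = np.unravel_index(np.argmax(heat), heat.shape)
```

In a trained system the two inputs would come from learned visual and audio encoders; here they are random vectors used only to show the shape of the computation.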
Pages: 22910 - 22921
Page count: 12
Related papers
50 in total
  • [31] An Audio-Visual System for Object-Based Audio: From Recording to Listening
    Coleman, Philip
    Franck, Andreas
    Francombe, Jon
    Liu, Qingju
    de Campos, Teofilo
    Hughes, Richard J.
    Menzies, Dylan
    Galvez, Marcos F. Simon
    Tang, Yan
    Woodcock, James
    Jackson, Philip J. B.
    Melchior, Frank
    Pike, Chris
    Fazi, Filippo Maria
    Cox, Trevor J.
    Hilton, Adrian
    IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (08) : 1919 - 1931
  • [32] Omnidirectional audio-visual talker localization based on dynamic fusion of audio-visual features using validity and reliability criteria
    Denda, Yuki
    Nishiura, Takanobu
    Yamashita, Yoichi
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2008, E91D (03) : 598 - 606
  • [33] Audio-Visual Perception - The Perception of Object Material in a Virtual Environment
    Anderson, Ryan
    Arro, Joosep
    Hansen, Christian Schutt
    Serafin, Stefania
    Augmented Reality, Virtual Reality, and Computer Graphics, Pt I, 2016, 9768 : 162 - 171
  • [34] A NEW AUDIO-VISUAL CONTROL USING MESSAGE OBJECT TRANSMISSION
    HASE, T
    MATSUDA, M
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 1994, 40 (04) : 920 - 926
  • [35] AUDIO-VISUAL OBJECT CLASSIFICATION FOR HUMAN-ROBOT COLLABORATION
    Xompero, A.
    Pang, Y. L.
    Patten, T.
    Prabhakar, A.
    Calli, B.
    Cavallaro, A.
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 9137 - 9141
  • [36] Audio-visual object removal in 360-degree videos
    Shimamura, Ryo
    Feng, Qi
    Koyama, Yuki
    Nakatsuka, Takayuki
    Fukayama, Satoru
    Hamasaki, Masahiro
    Goto, Masataka
    Morishima, Shigeo
    VISUAL COMPUTER, 2020, 36 (10-12) : 2117 - 2128
  • [37] An audio-visual speech recognition with a new mandarin audio-visual database
    Liao, Wen-Yuan
    Pao, Tsang-Long
    Chen, Yu-Te
    Chang, Tsun-Wei
    INT CONF ON CYBERNETICS AND INFORMATION TECHNOLOGIES, SYSTEMS AND APPLICATIONS/INT CONF ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 1, 2007, : 19 - +
  • [38] AUDIO-VISUAL DISCREPANCY AND THE INFLUENCE ON VERTICAL SOUND SOURCE LOCALIZATION
    Werner, Stephan
    Liebetrau, Judith
    Sporer, Thomas
    2012 Fourth International Workshop on Quality of Multimedia Experience (QoMEX), 2012, : 133 - 139
  • [39] Audio-Visual Grouping Network for Sound Localization from Mixtures
    Mo, Shentong
    Tian, Yapeng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10565 - 10574
  • [40] Dynamic interactive learning network for audio-visual event localization
    Chen, Jincai
    Liang, Han
    Wang, Ruili
    Zeng, Jiangfeng
    Lu, Ping
    APPLIED INTELLIGENCE, 2023, 53 (24) : 30431 - 30442