Egocentric Audio-Visual Object Localization

Cited by: 9
Authors
Huang, Chao [1]
Tian, Yapeng [1]
Kumar, Anurag [2]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Rochester, NY 14627 USA
[2] Meta Real Labs Res, Redmond, WA USA
DOI
10.1109/CVPR52729.2023.02194
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Humans naturally perceive surrounding scenes by unifying sound and sight from a first-person view. Likewise, machines can move closer to human intelligence by learning from multisensory inputs captured from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; and 2) out-of-view sound components can arise when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to overcome the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representations. During training, we use the naturally occurring audio-visual temporal synchronization as "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes. Code is available at https://github.com/WikiChao/Ego-AV-Loc.
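The abstract gives no implementation details, so the following is only a minimal sketch of the general idea behind cross-modal localization: score each spatial location of a visual feature map by its cosine similarity to an audio embedding. It is not the authors' geometry-aware aggregation or cascaded enhancement module. The names `cosine`, `localization_map`, and `argmax_2d`, and the toy feature layout (an H x W grid of vectors), are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def localization_map(audio_emb, visual_feats):
    """Return an H x W grid of similarities between one audio embedding
    and each per-location visual feature vector (hypothetical layout)."""
    return [[cosine(audio_emb, cell) for cell in row] for row in visual_feats]


def argmax_2d(grid):
    """Return (row, col) of the largest value in a 2-D grid."""
    _, row, col = max(
        (value, r, c)
        for r, cells in enumerate(grid)
        for c, value in enumerate(cells)
    )
    return row, col
```

In such a sketch, the location with the highest similarity would be read as the most likely sounding region; how the embeddings are learned (for example from audio-visual synchronization) is outside what this snippet shows.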
Pages: 22910 - 22921
Page count: 12
Related papers
50 in total
  • [1] Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
    Yun, Heeseung
    Gao, Ruohan
    Ananthabhotla, Ishwarya
    Kumar, Anurag
    Donley, Jacob
    Li, Chao
    Kim, Gunhee
    Ithapu, Vamsi Krishna
    Murdock, Calvin
    COMPUTER VISION - ECCV 2024, PT XXIV, 2025, 15082 : 256 - 274
  • [2] Integrated audio-visual processing for object localization and tracking
    Pingali, GS
    MULTIMEDIA COMPUTING AND NETWORKING 1998, 1997, 3310 : 206 - 213
  • [3] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
    Jiang, Hao
    Murdock, Calvin
    Ithapu, Vamsi Krishna
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10534 - 10542
  • [4] Binaural Audio-Visual Localization
    Wu, Xinyi
    Wu, Zhenyao
    Ju, Lili
    Wang, Song
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2961 - 2968
  • [5] A JOINT AUDIO-VISUAL APPROACH TO AUDIO LOCALIZATION
    Jensen, Jesper Rindom
    Christensen, Mads Graesboll
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 454 - 458
  • [6] Audio, Visual, and Audio-Visual Egocentric Distance Perception by Moving Subjects in Virtual Environments
    Rebillat, Marc
    Boutillon, Xavier
    Corteel, Etienne
    Katz, Brian F. G.
    ACM TRANSACTIONS ON APPLIED PERCEPTION, 2012, 9 (04)
  • [7] Listen to Look Into the Future: Audio-Visual Egocentric Gaze Anticipation
    Lai, Bolin
    Ryan, Fiona
    Jia, Wenqi
    Liu, Miao
    Rehg, James M.
    COMPUTER VISION - ECCV 2024, PT IX, 2025, 15067 : 192 - 210
  • [8] AUDIO-VISUAL OBJECT LOCALIZATION AND SEPARATION USING LOW-RANK AND SPARSITY
    Pu, Jie
    Panagakis, Yannis
    Petridis, Stavros
    Pantic, Maja
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 2901 - 2905
  • [9] LEARNING DIFFERENTIABLE SPARSE AND LOW RANK NETWORKS FOR AUDIO-VISUAL OBJECT LOCALIZATION
    Pu, Jie
    Panagakis, Yannis
    Pantic, Maja
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8668 - 8672
  • [10] Object Permanence Through Audio-Visual Representations
    Bu, Fanjun
    Huang, Chien-Ming
    IEEE ACCESS, 2021, 9 : 131574 - 131582