Egocentric Audio-Visual Object Localization

被引:9
|
作者
Huang, Chao [1 ]
Flan, Yapeng [1 ]
Kurnar, Anurag [2 ]
Xu, Chenliang [1 ]
机构
[1] Univ Rochester, Rochester, NY 14627 USA
[2] Meta Real Labs Res, Redmond, WA USA
关键词
D O I
10.1109/CVPR52729.2023.02194
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Humans naturally perceive surrounding scenes by unifying sound and sight from a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) The out-of-view sound components can be created when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to overcome the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representation. During training, we take advantage of the naturally occurring audio-visual temporal synchronization as the "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes. Code is available at https://github.com/WikiChao/Ego-AV-Loc.
引用
收藏
页码:22910 / 22921
页数:12
相关论文
共 50 条
  • [41] Information-Driven Active Audio-Visual Source Localization
    Schult, Niclas
    Reineking, Thomas
    Kluss, Thorsten
    Zetzsche, Christoph
    PLOS ONE, 2015, 10 (09):
  • [42] Dense Modality Interaction Network for Audio-Visual Event Localization
    Liu, Shuo
    Quan, Weize
    Wang, Chaoqun
    Liu, Yuan
    Liu, Bin
    Yan, Dong-Ming
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2734 - 2748
  • [43] Dynamic interactive learning network for audio-visual event localization
    Jincai Chen
    Han Liang
    Ruili Wang
    Jiangfeng Zeng
    Ping Lu
    Applied Intelligence, 2023, 53 : 30431 - 30442
  • [44] Probabilistic speaker localization in noisy enviromments by audio-visual integration
    Choi, Jong-Suk
    Kim, Munsang
    Kim, Hyun-Don
    2006 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, VOLS 1-12, 2006, : 4704 - +
  • [45] Audio-Visual Fusion for Sound Source Localization and Improved Attention
    Lee, Byoung-gi
    Choi, JongSuk
    Yoon, SangSuk
    Choi, Mun-Taek
    Kim, Munsang
    Kim, Daijin
    TRANSACTIONS OF THE KOREAN SOCIETY OF MECHANICAL ENGINEERS A, 2011, 35 (07) : 737 - 743
  • [46] Audio-Visual Clustering for 3D Speaker Localization
    Khalidov, Vasil
    Forbes, Florence
    Hansard, Miles
    Arnaud, Elise
    Horaud, Radu
    MACHINE LEARNING FOR MULTIMODAL INTERACTION, PROCEEDINGS, 2008, 5237 : 86 - 97
  • [47] Paper: Speaker Localization Based on Audio-Visual Bimodal Fusion
    Zhu, Ying-Xin
    Jin, Hao-Ran
    JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2021, 25 (03) : 375 - 382
  • [48] Learning Event-Specific Localization Preferences for Audio-Visual Event Localization
    Ge, Shiping
    Jiang, Zhiwei
    Yin, Yafeng
    Wang, Cong
    Cheng, Zifeng
    Gu, Qing
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3446 - 3454
  • [49] AUDIO-VISUAL EDUCATION
    Brickman, William W.
    SCHOOL AND SOCIETY, 1948, 67 (1739): : 320 - 326
  • [50] Audio-Visual Objects
    Kubovy M.
    Schutz M.
    Review of Philosophy and Psychology, 2010, 1 (1) : 41 - 61