Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Cited by: 0
Authors
Yun, Heeseung [1 ,2 ]
Gao, Ruohan [2 ]
Ananthabhotla, Ishwarya [2 ]
Kumar, Anurag [2 ]
Donley, Jacob [2 ]
Li, Chao [2 ]
Kim, Gunhee [1 ]
Ithapu, Vamsi Krishna [2 ]
Murdock, Calvin [2 ]
Affiliations
[1] Seoul National University, Seoul, South Korea
[2] Meta Reality Labs Research, Redmond, WA, USA
Keywords
Egocentric Vision; Audio-Visual Learning; Head Movements; Sound
DOI
10.1007/978-3-031-72691-0_15
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.
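To make the abstract's central idea concrete, below is a minimal sketch (not the authors' implementation) of world-locking per-token directions: unit directions defined in head-locked (device) coordinates are rotated by the per-frame head orientation so they remain fixed on a world-locked sphere despite the wearer's self-motion. All names here (world_lock, token_dirs, head_to_world) are illustrative assumptions.

import numpy as np

def world_lock(token_dirs: np.ndarray, head_to_world: np.ndarray) -> np.ndarray:
    """Rotate head-locked unit directions into world coordinates.

    token_dirs:    (T, N, 3) unit vectors, one sphere direction per token per frame.
    head_to_world: (T, 3, 3) rotation matrices from the head (device) frame to the
                   world frame, e.g. derived from headset pose / IMU measurements.
    Returns (T, N, 3) directions that no longer change when the wearer turns their head.
    """
    # d_world[t, n] = R_t @ d_head[t, n] for every token in every frame.
    return np.einsum('tij,tnj->tni', head_to_world, token_dirs)

# Quick check: with the head yawed 90 degrees about the vertical (y) axis, the
# direction straight ahead of the wearer (+z in head coordinates) maps to the +x
# world direction, so a world-fixed source keeps a constant world-locked direction.
yaw90 = np.array([[0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0]])
ahead = np.array([[[0.0, 0.0, 1.0]]])        # shape (T=1, N=1, 3)
print(world_lock(ahead, yaw90[None]))        # -> [[[1. 0. 0.]]]

Because the transform only rotates directions on the sphere, it avoids the expensive projections between image and world coordinate systems mentioned in the abstract.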
Pages: 256-274 (19 pages)
Related Papers (50 in total)
  • [1] Egocentric Audio-Visual Object Localization
    Huang, Chao
    Tian, Yapeng
    Kumar, Anurag
    Xu, Chenliang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 22910 - 22921
  • [2] Audio-Visual Event Localization in Unconstrained Videos
    Tian, Yapeng
    Shi, Jing
    Li, Bochen
    Duan, Zhiyao
    Xu, Chenliang
    COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 252 - 268
  • [3] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
    Jiang, Hao
    Murdock, Calvin
    Ithapu, Vamsi Krishna
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10534 - 10542
  • [4] Binaural Audio-Visual Localization
    Wu, Xinyi
    Wu, Zhenyao
    Ju, Lili
    Wang, Song
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2961 - 2968
  • [5] A JOINT AUDIO-VISUAL APPROACH TO AUDIO LOCALIZATION
    Jensen, Jesper Rindom
    Christensen, Mads Graesboll
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 454 - 458
  • [6] Audio, Visual, and Audio-Visual Egocentric Distance Perception by Moving Subjects in Virtual Environments
    Rebillat, Marc
    Boutillon, Xavier
    Corteel, Etienne
    Katz, Brian F. G.
    ACM TRANSACTIONS ON APPLIED PERCEPTION, 2012, 9 (04)
  • [7] Listen to Look Into the Future: Audio-Visual Egocentric Gaze Anticipation
    Lai, Bolin
    Ryan, Fiona
    Jia, Wenqi
    Liu, Miao
    Rehg, James M.
    COMPUTER VISION - ECCV 2024, PT IX, 2025, 15067 : 192 - 210
  • [8] Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio
    Zhu, Dandan
    Zhang, Kaiwei
    Zhang, Nana
    Zhou, Qiangqiang
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 764 - 775
  • [9] A Novel Lightweight Audio-visual Saliency Model for Videos
    Zhu, Dandan
    Shao, Xuan
    Zhou, Qiangqiang
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (04)
  • [10] AVQA: A Dataset for Audio-Visual Question Answering on Videos
    Yang, Pinci
    Wang, Xin
    Duan, Xuguang
    Chen, Hong
    Hou, Runze
    Jin, Cong
    Zhu, Wenwu
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 3480 - 3491