Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Cited by: 0
Authors
Yun, Heeseung [1 ,2 ]
Gao, Ruohan [2 ]
Ananthabhotla, Ishwarya [2 ]
Kumar, Anurag [2 ]
Donley, Jacob [2 ]
Li, Chao [2 ]
Kim, Gunhee [1 ]
Ithapu, Vamsi Krishna [2 ]
Murdock, Calvin [2 ]
Affiliations
[1] Seoul Natl Univ, Seoul, South Korea
[2] Meta, Real Labs Res, Redmond, WA USA
Source
Keywords
Egocentric Vision; Audio-Visual Learning; HEAD MOVEMENTS; SOUND;
DOI
10.1007/978-3-031-72691-0_15
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.
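The world-locking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes head orientation is available as a head-to-world rotation (here reduced to a yaw about the vertical axis) and that each observation is a direction on the unit sphere. All function names are hypothetical and chosen for illustration only.

```python
import math

def direction_from_angles(azimuth_deg, elevation_deg):
    """Unit vector for an azimuth (about the z axis) and elevation, in degrees."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def angles_from_direction(v):
    """Inverse of direction_from_angles: returns (azimuth_deg, elevation_deg)."""
    x, y, z = v
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    return azimuth, elevation

def yaw_matrix(yaw_deg):
    """Assumed head-to-world rotation for a head turned yaw_deg about the z axis."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))

def world_lock(head_dir, head_to_world):
    """Rotate a head-locked direction into the world-locked frame."""
    return tuple(sum(row[i] * head_dir[i] for i in range(3)) for row in head_to_world)

# A static source observed at two head poses: its head-locked azimuth changes
# with head yaw, but the world-locked azimuth stays the same.
obs_a = world_lock(direction_from_angles(40.0, 0.0), yaw_matrix(0.0))
obs_b = world_lock(direction_from_angles(15.0, 0.0), yaw_matrix(25.0))
az_a, el_a = angles_from_direction(obs_a)
az_b, el_b = angles_from_direction(obs_b)
```

In this toy case both observations map to roughly the same world-locked azimuth even though the head-locked azimuths differ by the head rotation, which is the self-motion offset the abstract refers to.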
Pages: 256-274
Page count: 19