Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Cited by: 0
Authors
Yun, Heeseung [1 ,2 ]
Gao, Ruohan [2 ]
Ananthabhotla, Ishwarya [2 ]
Kumar, Anurag [2 ]
Donley, Jacob [2 ]
Li, Chao [2 ]
Kim, Gunhee [1 ]
Ithapu, Vamsi Krishna [2 ]
Murdock, Calvin [2 ]
Affiliations
[1] Seoul Natl Univ, Seoul, South Korea
[2] Meta, Real Labs Res, Redmond, WA USA
Source
Keywords
Egocentric Vision; Audio-Visual Learning; Head Movements; Sound
DOI
10.1007/978-3-031-72691-0_15
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.
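The idea of world-locking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how a direction observed in a head-locked frame can be re-expressed in a fixed world frame once the head orientation is known. The axis convention (x forward, y left, z up), the yaw-only rotation, and all function names are assumptions made for illustration.

```python
import math

def direction_from_angles(azimuth, elevation):
    """Unit vector for azimuth/elevation in radians (assumed axes: x forward, y left, z up)."""
    ce = math.cos(elevation)
    return (ce * math.cos(azimuth), ce * math.sin(azimuth), math.sin(elevation))

def angles_from_direction(v):
    """Inverse of direction_from_angles: returns (azimuth, elevation) in radians."""
    x, y, z = v
    return math.atan2(y, x), math.asin(max(-1.0, min(1.0, z)))

def yaw_matrix(yaw):
    """Head-to-world rotation for a head turned left by `yaw` radians about the z axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))

def rotate(R, v):
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def world_lock(head_direction, head_to_world):
    """Re-express a head-locked direction in the world frame."""
    return rotate(head_to_world, head_direction)

# A static source seen at -20 degrees azimuth while the head is yawed 50 degrees
# and at 0 degrees azimuth while the head is yawed 30 degrees.
for head_yaw_deg, observed_az_deg in [(50.0, -20.0), (30.0, 0.0)]:
    observed = direction_from_angles(math.radians(observed_az_deg), 0.0)
    locked = world_lock(observed, yaw_matrix(math.radians(head_yaw_deg)))
    az, el = angles_from_direction(locked)
    print(round(math.degrees(az), 6), round(math.degrees(el), 6))
```

In both cases the world-locked azimuth comes out the same, which is the sense in which such a representation offsets self-motion: a stationary source keeps a stationary position on the sphere even as the head turns.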
Pages: 256-274
Number of pages: 19
Related papers
50 items in total
  • [31] Semantic and Relation Modulation for Audio-Visual Event Localization
    Wang, Hao
    Zha, Zheng-Jun
    Li, Liang
    Chen, Xuejin
    Luo, Jiebo
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7711 - 7725
  • [32] Summarization of Multiple News Videos Considering the Consistency of Audio-Visual Contents
    Zhang, Ye
    Tanishige, Ryunosuke
    Ide, Ichiro
    Doman, Keisuke
    Kawanishi, Yasutomo
    Deguchi, Daisuke
    Murase, Hiroshi
    INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2019, 13 (01) : 135 - 155
  • [33] Omnidirectional audio-visual talker localization based on dynamic fusion of audio-visual features using validity and reliability criteria
    Denda, Yuki
    Nishiura, Takanobu
    Yamashita, Yoichi
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2008, E91D (03) : 598 - 606
  • [34] Multimodal framework based on audio-visual features for summarisation of cricket videos
    Javed, Ali
    Irtaza, Aun
    Malik, Hafiz
    Mahmood, Muhammad Tariq
    Adnan, Syed
    IET IMAGE PROCESSING, 2019, 13 (04) : 615 - 622
  • [35] AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
    Rouditchenko, Andrew
    Boggust, Angie
    Harwath, David
    Chen, Brian
    Joshi, Dhiraj
    Thomas, Samuel
    Audhkhasi, Kartik
    Kuehne, Hilde
    Panda, Rameswar
    Feris, Rogerio
    Kingsbury, Brian
    Picheny, Michael
    Torralba, Antonio
    Glass, James
    INTERSPEECH 2021, 2021, : 1584 - 1588
  • [36] AUDIO-VISUAL DISCREPANCY AND THE INFLUENCE ON VERTICAL SOUND SOURCE LOCALIZATION
    Werner, Stephan
    Liebetrau, Judith
    Sporer, Thomas
    2012 Fourth International Workshop on Quality of Multimedia Experience (QoMEX), 2012, : 133 - 139
  • [37] Audio-Visual Grouping Network for Sound Localization from Mixtures
    Mo, Shentong
    Tian, Yapeng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10565 - 10574
  • [38] Dynamic interactive learning network for audio-visual event localization
    Chen, Jincai
    Liang, Han
    Wang, Ruili
    Zeng, Jiangfeng
    Lu, Ping
    APPLIED INTELLIGENCE, 2023, 53 (24) : 30431 - 30442
  • [39] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Jiang, Bo
    Luo, Lei
    Alameda-Pineda, Xavier
    Yan, Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907
  • [40] Information-Driven Active Audio-Visual Source Localization
    Schult, Niclas
    Reineking, Thomas
    Kluss, Thorsten
    Zetzsche, Christoph
    PLOS ONE, 2015, 10 (09):