Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Cited by: 0

Authors:

Yun, Heeseung [1,2]

Gao, Ruohan [2]

Ananthabhotla, Ishwarya [2]

Kumar, Anurag [2]

Donley, Jacob [2]

Li, Chao [2]

Kim, Gunhee [1]

Ithapu, Vamsi Krishna [2]

Murdock, Calvin [2]

Affiliations:

[1] Seoul Natl Univ, Seoul, South Korea

[2] Meta, Real Labs Res, Redmond, WA USA

Source:

COMPUTER VISION - ECCV 2024, PT XXIV | 2025 / Vol. 15082

Keywords:

Egocentric Vision; Audio-Visual Learning; Head Movements; Sound

DOI:

10.1007/978-3-031-72691-0_15

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.


Pages: 256-274

Page count: 19
