Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Cited by: 0
Authors
Yun, Heeseung [1 ,2 ]
Gao, Ruohan [2 ]
Ananthabhotla, Ishwarya [2 ]
Kumar, Anurag [2 ]
Donley, Jacob [2 ]
Li, Chao [2 ]
Kim, Gunhee [1 ]
Ithapu, Vamsi Krishna [2 ]
Murdock, Calvin [2 ]
Affiliations
[1] Seoul National University, Seoul, South Korea
[2] Meta Reality Labs Research, Redmond, WA, USA
Keywords
Egocentric Vision; Audio-Visual Learning; Head Movements; Sound
DOI
10.1007/978-3-031-72691-0_15
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.
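To make the abstract's central idea concrete, below is a minimal sketch (not the authors' implementation) of world-locking per-token directions: unit directions defined in head-locked (device) coordinates are rotated by the per-frame head orientation so they remain fixed on a world-locked sphere despite the wearer's self-motion. All names here (world_lock, token_dirs, head_to_world) are illustrative assumptions.

import numpy as np

def world_lock(token_dirs: np.ndarray, head_to_world: np.ndarray) -> np.ndarray:
    """Rotate head-locked unit directions into world coordinates.

    token_dirs:    (T, N, 3) unit vectors, one sphere direction per token per frame.
    head_to_world: (T, 3, 3) rotation matrices from the head (device) frame to the
                   world frame, e.g. derived from headset pose / IMU measurements.
    Returns (T, N, 3) directions that no longer change when the wearer turns their head.
    """
    # d_world[t, n] = R_t @ d_head[t, n] for every token in every frame.
    return np.einsum('tij,tnj->tni', head_to_world, token_dirs)

# Quick check: with the head yawed 90 degrees about the vertical (y) axis, the
# direction straight ahead of the wearer (+z in head coordinates) maps to the +x
# world direction, so a world-fixed source keeps a constant world-locked direction.
yaw90 = np.array([[0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0]])
ahead = np.array([[[0.0, 0.0, 1.0]]])        # shape (T=1, N=1, 3)
print(world_lock(ahead, yaw90[None]))        # -> [[[1. 0. 0.]]]

Because the transform only rotates directions on the sphere, it avoids the expensive projections between image and world coordinate systems mentioned in the abstract.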
Pages: 256-274 (19 pages)
Related Papers (50 in total)
  • [1] Egocentric Audio-Visual Object Localization
    Huang, Chao
    Tian, Yapeng
    Kumar, Anurag
    Xu, Chenliang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 22910 - 22921
  • [2] Audio-Visual Event Localization in Unconstrained Videos
    Tian, Yapeng
    Shi, Jing
    Li, Bochen
    Duan, Zhiyao
    Xu, Chenliang
    COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 252 - 268
  • [3] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
    Jiang, Hao
    Murdock, Calvin
    Ithapu, Vamsi Krishna
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10534 - 10542
  • [4] Binaural Audio-Visual Localization
    Wu, Xinyi
    Wu, Zhenyao
    Ju, Lili
    Wang, Song
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2961 - 2968
  • [5] A JOINT AUDIO-VISUAL APPROACH TO AUDIO LOCALIZATION
    Jensen, Jesper Rindom
    Christensen, Mads Graesboll
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 454 - 458
  • [6] Audio, Visual, and Audio-Visual Egocentric Distance Perception by Moving Subjects in Virtual Environments
    Rebillat, Marc
    Boutillon, Xavier
    Corteel, Etienne
    Katz, Brian F. G.
    ACM TRANSACTIONS ON APPLIED PERCEPTION, 2012, 9 (04)
  • [7] Listen to Look Into the Future: Audio-Visual Egocentric Gaze Anticipation
    Lai, Bolin
    Ryan, Fiona
    Jia, Wenqi
    Liu, Miao
    Rehg, James M.
    COMPUTER VISION - ECCV 2024, PT IX, 2025, 15067 : 192 - 210
  • [8] Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio
    Zhu, Dandan
    Zhang, Kaiwei
    Zhang, Nana
    Zhou, Qiangqiang
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 764 - 775
  • [9] A Novel Lightweight Audio-visual Saliency Model for Videos
    Zhu, Dandan
    Shao, Xuan
    Zhou, Qiangqiang
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (04)
  • [10] AVQA: A Dataset for Audio-Visual Question Answering on Videos
    Yang, Pinci
    Wang, Xin
    Duan, Xuguang
    Chen, Hong
    Hou, Runze
    Jin, Cong
    Zhu, Wenwu
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 3480 - 3491