Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Cited by: 0

Authors:

Yun, Heeseung [1,2]

Gao, Ruohan [2]

Ananthabhotla, Ishwarya [2]

Kumar, Anurag [2]

Donley, Jacob [2]

Li, Chao [2]

Kim, Gunhee [1]

Ithapu, Vamsi Krishna [2]

Murdock, Calvin [2]

Affiliations:

[1] Seoul Natl Univ, Seoul, South Korea

[2] Meta, Real Labs Res, Redmond, WA USA

Source:

COMPUTER VISION - ECCV 2024, PT XXIV | 2025 / Vol. 15082

Keywords:

Egocentric Vision; Audio-Visual Learning; Head Movements; Sound

DOI:

10.1007/978-3-031-72691-0_15

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.
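The abstract does not spell out how streams are transformed with respect to head orientation, so the following is only a minimal geometric sketch of the general idea of world-locking, not the authors' method. It assumes a head that turns only about the vertical axis (yaw), directions given as unit vectors, and hypothetical helper names (`direction`, `head_locked_view`, `world_lock`) chosen for illustration. It shows that a static source appears at different head-locked directions as the head turns, while its world-locked direction stays fixed.

```python
import math

def direction(azimuth_deg, elevation_deg):
    """Unit vector on the sphere for a given azimuth and elevation (degrees)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def rotate_yaw(vec, yaw_deg):
    """Rotate a 3D vector about the vertical (z) axis by yaw_deg degrees."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    x, y, z = vec
    return (c * x - s * y, s * x + c * y, z)

def head_locked_view(source_world, head_yaw_deg):
    """Direction of a world-fixed source as seen from the rotated head
    (assumption: the head pose is a pure yaw rotation)."""
    return rotate_yaw(source_world, -head_yaw_deg)

def world_lock(direction_head, head_yaw_deg):
    """Undo the head rotation: map a head-locked direction back to the world frame."""
    return rotate_yaw(direction_head, head_yaw_deg)

def azimuth_deg(vec):
    """Azimuth of a direction vector, in degrees."""
    return math.degrees(math.atan2(vec[1], vec[0]))

# A static source at azimuth 30 degrees, elevation 10 degrees in the world frame.
source = direction(30.0, 10.0)
for yaw in (0.0, 20.0, 45.0):
    seen = head_locked_view(source, yaw)   # shifts as the head turns
    locked = world_lock(seen, yaw)         # stays at the source's world direction
    print(yaw, round(azimuth_deg(seen), 3), round(azimuth_deg(locked), 3))
```

A full head pose would use a 3D rotation (for example a rotation matrix or quaternion from an orientation sensor) in place of the single yaw angle assumed here.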

Pages: 256-274

Number of pages: 19
