Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Cited by: 0
Authors
Yun, Heeseung [1 ,2 ]
Gao, Ruohan [2 ]
Ananthabhotla, Ishwarya [2 ]
Kumar, Anurag [2 ]
Donley, Jacob [2 ]
Li, Chao [2 ]
Kim, Gunhee [1 ]
Ithapu, Vamsi Krishna [2 ]
Murdock, Calvin [2 ]
Affiliations
[1] Seoul Natl Univ, Seoul, South Korea
[2] Meta, Reality Labs Research, Redmond, WA, USA
Source
Keywords
Egocentric Vision; Audio-Visual Learning; Head Movements; Sound
DOI
10.1007/978-3-031-72691-0_15
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.
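The core idea described in the abstract can be illustrated with a minimal sketch (this is an illustration of the general world-locking principle, not the authors' implementation; the rotation parameterization and function names below are assumptions): head-locked direction vectors are rotated by the measured head orientation, so that a direction stays fixed in world coordinates even as the wearer's head moves.

```python
# Minimal sketch of spherical world-locking: rotate head-locked directions
# by the measured head orientation so they stay fixed in world coordinates.
import numpy as np

def rotation_from_yaw_pitch(yaw, pitch):
    """Hypothetical head-orientation measurement as a rotation matrix
    (yaw about z, then pitch about y; angles in radians)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    R_yaw = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    R_pitch = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    return R_yaw @ R_pitch

def world_lock(directions, R_head):
    """Map head-locked unit direction vectors (N, 3) into world coordinates
    by applying the head rotation to each vector."""
    return directions @ R_head.T

# A source directly ahead of the wearer in world coordinates (+x forward):
ahead = np.array([[1.0, 0.0, 0.0]])
# After the wearer turns the head 90 degrees left (+y is the wearer's left),
# the same world point appears to the wearer's right in head-locked
# coordinates, but its world-locked coordinate is unchanged:
R = rotation_from_yaw_pitch(np.pi / 2, 0.0)
to_right = np.array([[0.0, -1.0, 0.0]])  # head-locked view after the turn
print(np.allclose(world_lock(to_right, R), ahead))  # True
```

Because the mapping is a pure rotation on the sphere, spatial cues from different modalities (e.g. visual detections and auditory directions of arrival) can be aligned in a common world-locked frame without projecting between image and world coordinate systems.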
Pages: 256-274
Page count: 19
Related papers
50 records in total
  • [41] Dense Modality Interaction Network for Audio-Visual Event Localization
    Liu, Shuo
    Quan, Weize
    Wang, Chaoqun
    Liu, Yuan
    Liu, Bin
    Yan, Dong-Ming
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2734 - 2748
  • [42] Dynamic interactive learning network for audio-visual event localization
    Chen, Jincai
    Liang, Han
    Wang, Ruili
    Zeng, Jiangfeng
    Lu, Ping
    Applied Intelligence, 2023, 53 : 30431 - 30442
  • [43] Probabilistic speaker localization in noisy environments by audio-visual integration
    Choi, Jong-Suk
    Kim, Munsang
    Kim, Hyun-Don
    2006 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, VOLS 1-12, 2006, : 4704 - +
  • [44] Audio-Visual Fusion for Sound Source Localization and Improved Attention
    Lee, Byoung-gi
    Choi, JongSuk
    Yoon, SangSuk
    Choi, Mun-Taek
    Kim, Munsang
    Kim, Daijin
    TRANSACTIONS OF THE KOREAN SOCIETY OF MECHANICAL ENGINEERS A, 2011, 35 (07) : 737 - 743
  • [45] Audio-Visual Clustering for 3D Speaker Localization
    Khalidov, Vasil
    Forbes, Florence
    Hansard, Miles
    Arnaud, Elise
    Horaud, Radu
    MACHINE LEARNING FOR MULTIMODAL INTERACTION, PROCEEDINGS, 2008, 5237 : 86 - 97
  • [46] Paper: Speaker Localization Based on Audio-Visual Bimodal Fusion
    Zhu, Ying-Xin
    Jin, Hao-Ran
    JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2021, 25 (03) : 375 - 382
  • [47] Learning Event-Specific Localization Preferences for Audio-Visual Event Localization
    Ge, Shiping
    Jiang, Zhiwei
    Yin, Yafeng
    Wang, Cong
    Cheng, Zifeng
    Gu, Qing
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3446 - 3454
  • [48] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
    Feng, Zishun
    Tu, Ming
    Xia, Rui
    Wang, Yuxuan
    Krishnamurthy, Ashok
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
  • [49] Audio-Visual Model for Generating Eating Sounds Using Food ASMR Videos
    Uchiyama, Kodai
    Kawamoto, Kazuhiko
    IEEE ACCESS, 2021, 9 : 50106 - 50111
  • [50] TIME-DOMAIN AUDIO-VISUAL SPEECH SEPARATION ON LOW QUALITY VIDEOS
    Wu, Yifei
    Li, Chenda
    Bai, Jinfeng
    Wu, Zhongqin
    Qian, Yanmin
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 256 - 260