Visual-guided scene-aware audio generation method based on hierarchical feature codec and rendering decision

Cited by: 0

Authors:

Wang, Ruiqi ^{[1]}

Cheng, Haonan ^{[2]}

Ye, Long ^{[2]}

Zhang, Qin ^{[2]}

Affiliations:

[1] Commun Univ China, Key Lab Media Audio & Video, Minist Educ, Beijing 100024, Peoples R China

[2] Commun Univ China, State Key Lab Media Convergency & Commun, Beijing 100024, Peoples R China

Source:

DISPLAYS | 2024 / Vol. 83

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China;

Keywords:

Visual guide spatial sound generation; Cinematic audiovisual language; Hierarchical audiovisual codec;

DOI:

10.1016/j.displa.2024.102708

Chinese Library Classification (CLC) number:

TP3 [Computing Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Visually guided spatial sound generation (VGSSG) is a well-suited multimodal learning method for dealing with recorded videos. However, existing methods are difficult to be directly applied to spatial sound generation for movie clips. This is mainly due to (1) the existence of Cinematic Audiovisual Language (CAL) in movies, which makes it difficult to construct spatial sound mapping models directly through data-driven based methods. (2) The problem of the inadequate model performance, which is caused by the excessive heterogeneous gap between audiovisual modal information. To solve the aforementioned problems, we propose a VGSSG method based on CAL decision-making and hierarchical feature coding and decoding, which effectively accomplishes spatial sound generation based on the CAL of movies. Specifically, to solve the problem of CAL modeling, a multimodal information-guided movie audio rendering decision maker is established, which can decide the rendering strategy based on the CAL of the current clip. To narrow the heterogeneous gap that hinders the fusion between audiovisual modal data, we propose a codec structure based on hierarchical fusion of audiovisual features and full-scale skip-connections, which improves the efficiency of the comprehensive utilization of audiovisual modal data, and demonstrates the effectiveness of adopting shallow features in VGSSG task. We integrate both 2-channel and 6-channel spatial audio generation into a unified framework. In addition, we establish a movie audiovisual bimodal dataset with hand-crafted CAL annotations. Experimentally, we demonstrate that compared with the existing methods, our method has higher performance in terms of reducing generation distortion.


Pages: 12

5 items in total

[1] Visually Guided Binaural Audio Generation Method Based on Hierarchical Feature Encoding and Decoding
Wang R.-Q.
Cheng H.-N.
Ye L.
Ruan Jian Xue Bao/Journal of Software, 2024, 35 (05): 2165-2175
[2] QUALIFIER: Question-Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog
Ye, Muchao
You, Quanzeng
Ma, Fenglong
2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022: 2503-2511
[3] TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
Li, Wubo
Jiang, Dongwei
Zou, Wei
Li, Xiangang
INTERSPEECH 2020, 2020: 3501-3505
[4] Hierarchical multimodal attention for end-to-end audio-visual scene-aware dialogue response generation
Le, Hung
Sahoo, Doyen
Chen, Nancy F.
Hoi, Steven C. H.
COMPUTER SPEECH AND LANGUAGE, 2020, 63
[5] END-TO-END AUDIO VISUAL SCENE-AWARE DIALOG USING MULTIMODAL ATTENTION-BASED VIDEO FEATURES
Hori, Chiori
Alamri, Huda
Wang, Jue
Wichern, Gordon
Hori, Takaaki
Cherian, Anoop
Marks, Tim K.
Cartillier, Vincent
Lopes, Raphael Gontijo
Das, Abhishek
Essa, Irfan
Batra, Dhruv
Parikh, Devi
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 2352-2356
