Visual-guided scene-aware audio generation method based on hierarchical feature codec and rendering decision

Cited by: 0
Authors
Wang, Ruiqi [1 ]
Cheng, Haonan [2 ]
Ye, Long [2 ]
Zhang, Qin [2 ]
Affiliations
[1] Commun Univ China, Key Lab Media Audio & Video, Minist Educ, Beijing 100024, Peoples R China
[2] Commun Univ China, State Key Lab Media Convergency & Commun, Beijing 100024, Peoples R China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Visually guided spatial sound generation; Cinematic audiovisual language; Hierarchical audiovisual codec;
DOI
10.1016/j.displa.2024.102708
CLC number
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
Visually guided spatial sound generation (VGSSG) is a well-suited multimodal learning method for processing recorded videos. However, existing methods are difficult to apply directly to spatial sound generation for movie clips, mainly because (1) movies employ Cinematic Audiovisual Language (CAL), which makes it hard to construct spatial sound mapping models through purely data-driven approaches, and (2) the large heterogeneity gap between the audio and visual modalities limits model performance. To solve these problems, we propose a VGSSG method based on CAL decision-making and hierarchical feature encoding and decoding, which effectively accomplishes spatial sound generation guided by the CAL of movies. Specifically, to model CAL, we establish a multimodal information-guided movie audio rendering decision maker, which selects the rendering strategy according to the CAL of the current clip. To narrow the heterogeneity gap that hinders the fusion of audiovisual data, we propose a codec structure based on hierarchical fusion of audiovisual features and full-scale skip connections, which improves the efficiency with which both modalities are exploited and demonstrates the effectiveness of shallow features in the VGSSG task. We integrate both 2-channel and 6-channel spatial audio generation into a unified framework. In addition, we build a movie audiovisual bimodal dataset with hand-crafted CAL annotations. Experiments demonstrate that, compared with existing methods, our method achieves higher performance in terms of reducing generation distortion.
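The abstract gives no implementation details, but the architecture it outlines (per-scale injection of a visual embedding into an audio codec, full-scale skip connections in the decoder, and a CAL rendering-decision head) can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed settings; every module name, tensor dimension, and the specific fusion and decision scheme (e.g. `HierarchicalAVCodec`, additive per-scale fusion, a 3-way strategy classifier) is a hypothetical stand-in, not the authors' implementation.

```python
# Minimal, illustrative sketch of a hierarchical audiovisual codec with
# full-scale skip connections and a CAL rendering-decision head.
# All names, sizes, and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch):
    """3x3 conv + BN + ReLU, the basic unit of encoder/decoder stages."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class HierarchicalAVCodec(nn.Module):
    """Encode a mono spectrogram at several scales, inject a visual embedding
    at every scale (hierarchical fusion), and decode with full-scale skip
    connections into multi-channel spectrogram masks."""

    def __init__(self, vis_dim=512, base_ch=32, out_channels=6, num_strategies=3):
        super().__init__()
        chs = [base_ch, base_ch * 2, base_ch * 4, base_ch * 8]
        # Audio encoder stages (each followed by 2x downsampling in forward()).
        self.enc = nn.ModuleList()
        in_ch = 1
        for c in chs:
            self.enc.append(conv_block(in_ch, c))
            in_ch = c
        # Per-scale visual projections used for hierarchical fusion.
        self.vis_proj = nn.ModuleList([nn.Linear(vis_dim, c) for c in chs])
        # Decoder: each stage sees skip features from *all* encoder scales
        # (full-scale skip connections), resampled to its resolution.
        self.dec = nn.ModuleList(
            [conv_block(sum(chs), c) for c in reversed(chs[:-1])]
        )
        self.head = nn.Conv2d(chs[0], out_channels, 1)  # e.g. 2- or 6-channel masks
        # Rendering-decision head: classifies the cinematic audiovisual
        # language of the clip from the deepest fused feature.
        self.decision = nn.Linear(chs[-1], num_strategies)

    def forward(self, spec, vis_emb):
        # spec: (B, 1, F, T) mono log-spectrogram; vis_emb: (B, vis_dim) clip feature.
        feats, x = [], spec
        for stage, proj in zip(self.enc, self.vis_proj):
            x = stage(x)
            # Hierarchical fusion: add the projected visual feature at this scale.
            x = x + proj(vis_emb)[:, :, None, None]
            feats.append(x)
            x = F.max_pool2d(x, 2)
        strategy_logits = self.decision(x.mean(dim=(2, 3)))
        # Decode with full-scale skips: gather every encoder scale at the
        # current target resolution before each decoder stage.
        y = feats[-1]
        for stage in self.dec:
            target = (y.shape[2] * 2, y.shape[3] * 2)
            skips = [F.interpolate(f, size=target, mode="bilinear",
                                   align_corners=False) for f in feats]
            y = stage(torch.cat(skips, dim=1))
        masks = torch.sigmoid(self.head(y))   # per-channel spectrogram masks
        multi_ch = masks * spec                # broadcast mono spec over channels
        return multi_ch, strategy_logits


if __name__ == "__main__":
    model = HierarchicalAVCodec(out_channels=6)
    spec = torch.randn(2, 1, 128, 64)   # (batch, 1, freq bins, frames)
    vis = torch.randn(2, 512)            # pooled visual clip embedding
    out, logits = model(spec, vis)
    print(out.shape, logits.shape)        # (2, 6, 128, 64) and (2, 3)
```

The toy driver at the bottom only checks tensor shapes for a 6-channel output; the same module with `out_channels=2` would cover the binaural case, which is how a single framework could serve both 2-channel and 6-channel generation.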
Pages: 12