Deep Cross-Modal Audio-Visual Generation

Cited by: 299
Authors
Chen, Lele [1]
Srivastava, Sudhanshu [1]
Duan, Zhiyao [2]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Comp Sci, Rochester, NY 14627 USA
[2] Univ Rochester, Elect & Comp Engn, Rochester, NY 14627 USA
Keywords
cross-modal generation; audio-visual; generative adversarial networks; perception
DOI
10.1145/3126686.3126723
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.
Pages: 349-357
Number of pages: 9