Deep Cross-Modal Audio-Visual Generation

Cited by: 299
Authors
Chen, Lele [1]
Srivastava, Sudhanshu [1]
Duan, Zhiyao [2]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Comp Sci, Rochester, NY 14627 USA
[2] Univ Rochester, Elect & Comp Engn, Rochester, NY 14627 USA
Keywords
cross-modal generation; audio-visual; generative adversarial networks; perception
DOI
10.1145/3126686.3126723
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline code
0812
Abstract
Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.
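The abstract describes conditioning generative adversarial training on an encoding of the other modality. As a rough illustration of that idea, here is a minimal conditional-GAN sketch in PyTorch for the sound-to-image direction: a generator maps noise plus an audio encoding to an image, and a discriminator scores (image, audio-encoding) pairs. All dimensions, layer sizes, and the stand-in data below are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal conditional-GAN sketch (illustrative only; sizes are assumed,
# not taken from the paper).
import torch
import torch.nn as nn

AUDIO_DIM, NOISE_DIM, IMG_PIXELS = 128, 100, 64 * 64 * 3  # assumed sizes

class Generator(nn.Module):
    """Maps (noise, audio encoding) -> flattened RGB image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + AUDIO_DIM, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, IMG_PIXELS), nn.Tanh(),  # pixels in [-1, 1]
        )
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=1))

class Discriminator(nn.Module):
    """Scores (image, audio encoding) pairs as real (1) or generated (0)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_PIXELS + AUDIO_DIM, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),
        )
    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

# One illustrative training step on random stand-in data.
real_imgs = torch.rand(16, IMG_PIXELS) * 2 - 1  # stand-in for real frames
audio_enc = torch.randn(16, AUDIO_DIM)          # stand-in audio encodings
ones, zeros = torch.ones(16, 1), torch.zeros(16, 1)

# Discriminator step: real pairs -> 1, generated pairs -> 0.
fake_imgs = G(torch.randn(16, NOISE_DIM), audio_enc)
loss_d = bce(D(real_imgs, audio_enc), ones) + \
         bce(D(fake_imgs.detach(), audio_enc), zeros)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: make conditioned samples fool the discriminator.
loss_g = bce(D(fake_imgs, audio_enc), ones)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The visual-to-audio direction follows the same pattern with the roles of the two modalities swapped: the generator would be conditioned on an image encoding and emit an audio representation (e.g., a spectrogram) instead.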
Pages: 349-357 (9 pages)
Related papers (50 records)
  • [41] Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
    Mercea, Otniel-Bogdan
    Riesch, Lukas
    Koepke, A. Sophia
    Akata, Zeynep
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022: 10543-10553
  • [42] Auditory cross-modal reorganization in cochlear implant users indicates audio-visual integration
    Stropahl, Maren
    Debener, Stefan
    NEUROIMAGE: CLINICAL, 2017, 16: 514-523
  • [43] Learning Explicit and Implicit Dual Common Subspaces for Audio-visual Cross-modal Retrieval
    Zeng, Donghuo
    Wu, Jianming
    Hattori, Gen
    Xu, Rong
    Yu, Yi
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [44] Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
    Takashima, Akihiko
    Masumura, Ryo
    Ando, Atsushi
    Yamazaki, Yoshihiro
    Uchida, Mihiro
    Orihashi, Shota
    INTERSPEECH 2022, 2022: 4740-4744
  • [45] Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation
    Lee, Jiyoung
    Chung, Soo-Whan
    Kim, Sunok
    Kang, Hong-Goo
    Sohn, Kwanghoon
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2021: 1336-1345
  • [46] Modeling implicit learning in a cross-modal audio-visual serial reaction time task
    Taesler, Philipp
    Jablonowski, Julia
    Fu, Qiufang
    Rose, Michael
    COGNITIVE SYSTEMS RESEARCH, 2019, 54: 154-164
  • [47] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
    Sarkar, Pritam
    Etemad, Ali
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37, NO 8, 2023: 9723-9732
  • [48] Audio-to-Image Cross-Modal Generation
    Zelaszczyk, Maciej
    Mandziuk, Jacek
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [49] Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals
    Nawaz, Shah
    Janjua, Muhammad Kamran
    Gallo, Ignazio
    Mahmood, Arif
    Calefati, Alessandro
    2019 DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA), 2019: 83-89
  • [50] VAG: A Uniform Model for Cross-Modal Visual-Audio Mutual Generation
    Hao, Wangli
    Guan, He
    Zhang, Zhaoxiang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 36 (03): 4196-4208