Deep Cross-Modal Audio-Visual Generation

Cited by: 299
Authors
Chen, Lele [1]
Srivastava, Sudhanshu [1]
Duan, Zhiyao [2]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Comp Sci, Rochester, NY 14627 USA
[2] Univ Rochester, Elect & Comp Engn, Rochester, NY 14627 USA
Keywords
cross-modal generation; audio-visual; generative adversarial networks; perception
DOI
10.1145/3126686.3126723
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline code
0812
Abstract
Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.
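The abstract describes conditioning generative adversarial training on an encoding of the other modality. As a rough illustration of that idea, here is a minimal conditional-GAN sketch in PyTorch for the sound-to-image direction: a generator maps noise plus an audio encoding to an image, and a discriminator scores (image, audio-encoding) pairs. All dimensions, layer sizes, and the stand-in data below are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal conditional-GAN sketch (illustrative only; sizes are assumed,
# not taken from the paper).
import torch
import torch.nn as nn

AUDIO_DIM, NOISE_DIM, IMG_PIXELS = 128, 100, 64 * 64 * 3  # assumed sizes

class Generator(nn.Module):
    """Maps (noise, audio encoding) -> flattened RGB image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + AUDIO_DIM, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, IMG_PIXELS), nn.Tanh(),  # pixels in [-1, 1]
        )
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=1))

class Discriminator(nn.Module):
    """Scores (image, audio encoding) pairs as real (1) or generated (0)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_PIXELS + AUDIO_DIM, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),
        )
    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

# One illustrative training step on random stand-in data.
real_imgs = torch.rand(16, IMG_PIXELS) * 2 - 1  # stand-in for real frames
audio_enc = torch.randn(16, AUDIO_DIM)          # stand-in audio encodings
ones, zeros = torch.ones(16, 1), torch.zeros(16, 1)

# Discriminator step: real pairs -> 1, generated pairs -> 0.
fake_imgs = G(torch.randn(16, NOISE_DIM), audio_enc)
loss_d = bce(D(real_imgs, audio_enc), ones) + \
         bce(D(fake_imgs.detach(), audio_enc), zeros)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: make conditioned samples fool the discriminator.
loss_g = bce(D(fake_imgs, audio_enc), ones)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The visual-to-audio direction follows the same pattern with the roles of the two modalities swapped: the generator would be conditioned on an image encoding and emit an audio representation (e.g., a spectrogram) instead.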
Pages: 349-357 (9 pages)
Related papers (50 records)
  • [41] Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
    Mercea, Otniel-Bogdan
    Riesch, Lukas
    Koepke, A. Sophia
    Akata, Zeynep
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022: 10543-10553
  • [42] Auditory cross-modal reorganization in cochlear implant users indicates audio-visual integration
    Stropahl, Maren
    Debener, Stefan
    NEUROIMAGE: CLINICAL, 2017, 16: 514-523
  • [43] Learning Explicit and Implicit Dual Common Subspaces for Audio-visual Cross-modal Retrieval
    Zeng, Donghuo
    Wu, Jianming
    Hattori, Gen
    Xu, Rong
    Yu, Yi
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [44] Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
    Takashima, Akihiko
    Masumura, Ryo
    Ando, Atsushi
    Yamazaki, Yoshihiro
    Uchida, Mihiro
    Orihashi, Shota
    INTERSPEECH 2022, 2022: 4740-4744
  • [45] Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation
    Lee, Jiyoung
    Chung, Soo-Whan
    Kim, Sunok
    Kang, Hong-Goo
    Sohn, Kwanghoon
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2021: 1336-1345
  • [46] Modeling implicit learning in a cross-modal audio-visual serial reaction time task
    Taesler, Philipp
    Jablonowski, Julia
    Fu, Qiufang
    Rose, Michael
    COGNITIVE SYSTEMS RESEARCH, 2019, 54: 154-164
  • [47] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
    Sarkar, Pritam
    Etemad, Ali
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37, NO 8, 2023: 9723-9732
  • [48] Audio-to-Image Cross-Modal Generation
    Zelaszczyk, Maciej
    Mandziuk, Jacek
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [49] Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals
    Nawaz, Shah
    Janjua, Muhammad Kamran
    Gallo, Ignazio
    Mahmood, Arif
    Calefati, Alessandro
    2019 DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA), 2019: 83-89
  • [50] VAG: A Uniform Model for Cross-Modal Visual-Audio Mutual Generation
    Hao, Wangli
    Guan, He
    Zhang, Zhaoxiang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 36 (03): 4196-4208