A Survey of Cross-Modal Visual Content Generation

Cited by: 3
Authors
Nazarieh, Fatemeh [1, 2]
Feng, Zhenhua [1, 2]
Awais, Muhammad [3]
Wang, Wenwu [3]
Kittler, Josef [3]
Affiliations
[1] Univ Surrey, Sch Comp Sci & Elect Engn, Guildford GU2 7XH, England
[2] Univ Surrey, Nat Inspired Comp & Engn NICE Res Grp, Guildford GU2 7XH, England
[3] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Visualization; Surveys; Data models; Task analysis; Measurement; Training; Generative adversarial networks; Generative models; Cross-modal; Visual content generation
DOI
10.1109/TCSVT.2024.3351601
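Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract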
Cross-modal content generation has become very popular in recent years. To generate high-quality and realistic content, a variety of methods have been proposed. Among these approaches, visual content generation has attracted significant attention from academia and industry due to its vast potential in various applications. This survey provides an overview of recent advances in visual content generation conditioned on other modalities, such as text, audio, speech, and music, with a focus on their key contributions to the community. In addition, we summarize the existing publicly available datasets that can be used for training and benchmarking cross-modal visual content generation models. We provide an in-depth exploration of the datasets used for audio-to-visual content generation, filling a gap in the existing literature. Various evaluation metrics are also introduced along with the datasets. Furthermore, we discuss the challenges and limitations encountered in the area, such as modality alignment and semantic coherence. Last, we outline possible future directions for synthesizing visual content from other modalities including the exploration of new modalities, and the development of multi-task multi-modal networks. This survey serves as a resource for researchers interested in quickly gaining insights into this burgeoning field.
Pages: 6814-6832
Number of pages: 19