A Survey of Cross-Modal Visual Content Generation

Cited: 3
Authors
Nazarieh, Fatemeh [1 ,2 ]
Feng, Zhenhua [1 ,2 ]
Awais, Muhammad [3 ]
Wang, Wenwu [3 ]
Kittler, Josef [3 ]
Affiliations
[1] Univ Surrey, Sch Comp Sci & Elect Engn, Guildford GU2 7XH, England
[2] Univ Surrey, Nat Inspired Comp & Engn NICE Res Grp, Guildford GU2 7XH, England
[3] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Visualization; Surveys; Data models; Task analysis; Measurement; Training; Generative adversarial networks; Generative models; cross-modal; visual content generation;
DOI
10.1109/TCSVT.2024.3351601
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology];
Subject Classification
0808; 0809;
Abstract
Cross-modal content generation has attracted growing interest in recent years, and a variety of methods have been proposed to generate high-quality, realistic content. Among these approaches, visual content generation has drawn significant attention from academia and industry due to its vast potential in various applications. This survey provides an overview of recent advances in visual content generation conditioned on other modalities, such as text, audio, speech, and music, with a focus on their key contributions to the community. In addition, we summarize the existing publicly available datasets that can be used for training and benchmarking cross-modal visual content generation models. We provide an in-depth exploration of the datasets used for audio-to-visual content generation, filling a gap in the existing literature. Various evaluation metrics are also introduced along with the datasets. Furthermore, we discuss the challenges and limitations encountered in the area, such as modality alignment and semantic coherence. Finally, we outline possible future directions for synthesizing visual content from other modalities, including the exploration of new modalities and the development of multi-task multi-modal networks. This survey serves as a resource for researchers interested in quickly gaining insights into this burgeoning field.
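The abstract notes that evaluation metrics and modality alignment are covered in the survey. One common family of alignment metrics (e.g., CLIP score for text-to-image generation) reduces to a cosine similarity between an embedding of the conditioning input and an embedding of the generated visual content. A minimal sketch, assuming the embeddings have already been computed by some encoder (the `alignment_score` helper and the toy vectors below are purely illustrative, not from the survey):

```python
import numpy as np

def alignment_score(cond_emb, gen_emb):
    """Cosine similarity between a conditioning embedding (e.g., a text
    or audio embedding) and the embedding of the generated visual content.
    Returns a value in [-1, 1]; higher means better cross-modal alignment."""
    cond = np.asarray(cond_emb, dtype=float)
    gen = np.asarray(gen_emb, dtype=float)
    return float(cond @ gen / (np.linalg.norm(cond) * np.linalg.norm(gen)))

# Toy example with hypothetical 4-d embeddings: identical vectors align perfectly.
score = alignment_score([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0])
```

In practice the embeddings would come from a jointly trained multi-modal encoder, and the score is typically averaged over a benchmark dataset such as those surveyed in the paper.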
Pages: 6814-6832 (19 pages)
Related Papers (50 records)
  • [31] Active Visual-Tactile Cross-Modal Matching
    Liu, Huaping
    Wang, Feng
    Sun, Fuchun
    Zhang, Xinyu
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2019, 11 (02) : 176 - 187
  • [32] Cross-modal integration of simple auditory and visual events
    Geoffrey R. Patching
    Philip T. Quinlan
    Perception & Psychophysics, 2004, 66 : 131 - 140
  • [34] Visual and tactile cross-modal mere exposure effects
    Suzuki, Miho
    Gyoba, Jiro
    COGNITION & EMOTION, 2008, 22 (01) : 147 - 154
  • [35] Cross-modal processing in auditory and visual working memory
    Suchan, B
    Linnewerth, B
    Köster, O
    Daum, I
    Schmid, G
    NEUROIMAGE, 2006, 29 (03) : 853 - 858
  • [36] Cross-modal transfer in visual and haptic object categorization
    Gaissert, N.
    Waterkamp, S.
    Van Dam, L.
    Buelthoff, I.
    PERCEPTION, 2011, 40 : 134 - 134
  • [37] Cross-modal integration of simple auditory and visual events
    Patching, GR
    Quinlan, PT
    PERCEPTION & PSYCHOPHYSICS, 2004, 66 (01): : 131 - 140
  • [38] Cross-modal transfer in visual and nonvisual cues in bumblebees
    Michael J. M. Harrap
    David A. Lawson
    Heather M. Whitney
    Sean A. Rands
    Journal of Comparative Physiology A, 2019, 205 : 427 - 437
  • [39] Cross-modal prediction in audio-visual communication
    Rao, RR
    Chen, TH
    1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 2056 - 2059
  • [40] Cross-modal learning with prior visual relation knowledge
    Yu, Jing
    Zhang, Weifeng
    Yang, Zhuoqian
    Qin, Zengchang
    Hu, Yue
    KNOWLEDGE-BASED SYSTEMS, 2020, 203 (203)