A Survey of Cross-Modal Visual Content Generation

Cited: 3
Authors
Nazarieh, Fatemeh [1 ,2 ]
Feng, Zhenhua [1 ,2 ]
Awais, Muhammad [3 ]
Wang, Wenwu [3 ]
Kittler, Josef [3 ]
Affiliations
[1] Univ Surrey, Sch Comp Sci & Elect Engn, Guildford GU2 7XH, England
[2] Univ Surrey, Nat Inspired Comp & Engn NICE Res Grp, Guildford GU2 7XH, England
[3] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Visualization; Surveys; Data models; Task analysis; Measurement; Training; Generative adversarial networks; Generative models; cross-modal; visual content generation;
DOI
10.1109/TCSVT.2024.3351601
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology];
Subject Classification
0808; 0809;
Abstract
Cross-modal content generation has attracted growing interest in recent years, and a variety of methods have been proposed to generate high-quality, realistic content. Among these approaches, visual content generation has drawn significant attention from academia and industry due to its vast potential in various applications. This survey provides an overview of recent advances in visual content generation conditioned on other modalities, such as text, audio, speech, and music, with a focus on their key contributions to the community. In addition, we summarize the existing publicly available datasets that can be used for training and benchmarking cross-modal visual content generation models. We provide an in-depth exploration of the datasets used for audio-to-visual content generation, filling a gap in the existing literature. Various evaluation metrics are also introduced along with the datasets. Furthermore, we discuss the challenges and limitations encountered in the area, such as modality alignment and semantic coherence. Finally, we outline possible future directions for synthesizing visual content from other modalities, including the exploration of new modalities and the development of multi-task multi-modal networks. This survey serves as a resource for researchers interested in quickly gaining insights into this burgeoning field.
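The abstract notes that evaluation metrics and modality alignment are covered in the survey. One common family of alignment metrics (e.g., CLIP score for text-to-image generation) reduces to a cosine similarity between an embedding of the conditioning input and an embedding of the generated visual content. A minimal sketch, assuming the embeddings have already been computed by some encoder (the `alignment_score` helper and the toy vectors below are purely illustrative, not from the survey):

```python
import numpy as np

def alignment_score(cond_emb, gen_emb):
    """Cosine similarity between a conditioning embedding (e.g., a text
    or audio embedding) and the embedding of the generated visual content.
    Returns a value in [-1, 1]; higher means better cross-modal alignment."""
    cond = np.asarray(cond_emb, dtype=float)
    gen = np.asarray(gen_emb, dtype=float)
    return float(cond @ gen / (np.linalg.norm(cond) * np.linalg.norm(gen)))

# Toy example with hypothetical 4-d embeddings: identical vectors align perfectly.
score = alignment_score([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0])
```

In practice the embeddings would come from a jointly trained multi-modal encoder, and the score is typically averaged over a benchmark dataset such as those surveyed in the paper.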
Pages: 6814-6832 (19 pages)
Related Papers (50 records)
  • [31] Active Visual-Tactile Cross-Modal Matching
    Liu, Huaping
    Wang, Feng
    Sun, Fuchun
    Zhang, Xinyu
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2019, 11 (02) : 176 - 187
  • [32] Cross-modal integration of simple auditory and visual events
    Geoffrey R. Patching
    Philip T. Quinlan
    Perception & Psychophysics, 2004, 66 : 131 - 140
  • [34] Visual and tactile cross-modal mere exposure effects
    Suzuki, Miho
    Gyoba, Jiro
    COGNITION & EMOTION, 2008, 22 (01) : 147 - 154
  • [35] Cross-modal processing in auditory and visual working memory
    Suchan, B
    Linnewerth, B
    Köster, O
    Daum, I
    Schmid, G
    NEUROIMAGE, 2006, 29 (03) : 853 - 858
  • [36] Cross-modal transfer in visual and haptic object categorization
    Gaissert, N.
    Waterkamp, S.
    Van Dam, L.
    Buelthoff, I.
    PERCEPTION, 2011, 40 : 134 - 134
  • [37] Cross-modal integration of simple auditory and visual events
    Patching, GR
    Quinlan, PT
    PERCEPTION & PSYCHOPHYSICS, 2004, 66 (01): : 131 - 140
  • [38] Cross-modal transfer in visual and nonvisual cues in bumblebees
    Michael J. M. Harrap
    David A. Lawson
    Heather M. Whitney
    Sean A. Rands
    Journal of Comparative Physiology A, 2019, 205 : 427 - 437
  • [39] Cross-modal prediction in audio-visual communication
    Rao, RR
    Chen, TH
    1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 2056 - 2059
  • [40] Cross-modal learning with prior visual relation knowledge
    Yu, Jing
    Zhang, Weifeng
    Yang, Zhuoqian
    Qin, Zengchang
    Hu, Yue
    KNOWLEDGE-BASED SYSTEMS, 2020, 203 (203)