CogView: Mastering Text-to-Image Generation via Transformers

Authors
Ding, Ming [1]
Yang, Zhuoyi [1]
Hong, Wenyi [1]
Zheng, Wendi [1]
Zhou, Chang [2]
Yin, Da [1]
Lin, Junyang [2]
Zou, Xu [1]
Shao, Zhou [3]
Yang, Hongxia [2]
Tang, Jie [1, 3]
Affiliations
[1] Tsinghua University, Beijing, People's Republic of China
[2] DAMO Academy, Alibaba Group, Hangzhou, People's Republic of China
[3] BAAI, Beijing, People's Republic of China
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text-to-image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking, and fashion design, as well as methods to stabilize pretraining, e.g., eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and the recent similar work DALL-E.
Pages: 14