CogView: Mastering Text-to-Image Generation via Transformers

Cited by: 0
Authors
Ding, Ming [1]
Yang, Zhuoyi [1]
Hong, Wenyi [1]
Zheng, Wendi [1]
Zhou, Chang [2]
Yin, Da [1]
Lin, Junyang [2]
Zou, Xu [1]
Shao, Zhou [3]
Yang, Hongxia [2]
Tang, Jie [1, 3]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] DAMO Acad, Alibaba Grp, Hangzhou, Peoples R China
[3] BAAI, Beijing, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E.
Pages: 14
Related papers
50 in total
  • [31] Improving text-to-image generation with object layout guidance
    Jezia Zakraoui
    Moutaz Saleh
    Somaya Al-Maadeed
    Jihad Mohammed Jaam
    Multimedia Tools and Applications, 2021, 80 : 27423 - 27443
  • [32] Variational Distribution Learning for Unsupervised Text-to-Image Generation
    Kang, Minsoo
    Lee, Doyup
    Kim, Jiseob
    Kim, Saehoon
    Han, Bohyung
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23380 - 23389
  • [33] HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
    Narasimhaswamy, Supreeth
    Bhattacharya, Uttaran
    Chen, Xiang
    Dasgupta, Ishita
    Mitra, Saayan
    Hoai, Minh
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 2468 - 2479
  • [34] Attribute-Centric Compositional Text-to-Image Generation
    Cong, Yuren
    Min, Martin Renqiang
    Li, Li Erran
    Rosenhahn, Bodo
    Yang, Michael Ying
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025,
  • [35] Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
    Chen, Zhennan
    Li, Yajie
    Wang, Haofan
    Chen, Zhibo
    Jiang, Zhengkai
    Li, Jun
    Wang, Qian
    Yang, Jian
    Tai, Ying
    arXiv,
  • [36] Using text-to-image generation for architectural design ideation
    Paananen, Ville
    Oppenlaender, Jonas
    Visuri, Aku
    INTERNATIONAL JOURNAL OF ARCHITECTURAL COMPUTING, 2024, 22 (03) : 458 - 474
  • [37] No-reference Quality Assessment of Text-to-Image Generation
    Huang, Haitao
    Jia, Rongli
    Zhang, Yuhong
    Xie, Rong
    Song, Li
    Li, Lin
    Feng, Yanan
    19TH IEEE INTERNATIONAL SYMPOSIUM ON BROADBAND MULTIMEDIA SYSTEMS AND BROADCASTING, BMSB 2024, 2024, : 357 - 362
  • [38] Latent Guard: A Safety Framework for Text-to-Image Generation
    Liu, Runtao
    Khakzar, Ashkan
    Gu, Jindong
    Chen, Qifeng
    Torr, Philip
    Pizzati, Fabio
    COMPUTER VISION - ECCV 2024, PT XXVI, 2025, 15084 : 93 - 109
  • [39] Improving text-to-image generation with object layout guidance
    Zakraoui, Jezia
    Saleh, Moutaz
    Al-Maadeed, Somaya
    Jaam, Jihad Mohammed
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (18) : 27423 - 27443
  • [40] ReCo: Region-Controlled Text-to-Image Generation
    Yang, Zhengyuan
    Wang, Jianfeng
    Gan, Zhe
    Li, Linjie
    Lin, Kevin
    Wu, Chenfei
    Duan, Nan
    Liu, Zicheng
    Liu, Ce
    Zeng, Michael
    Wang, Lijuan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14246 - 14255