EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

Cited by: 0
Authors
Zhao, Xiangyu [1 ]
Liu, Bo [1 ]
Liu, Qijiong [1 ]
Shi, Guangyuan [1 ]
Wu, Xiao-Ming [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.
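The abstract describes two lightweight trainable bridges: a projection layer that maps BiDiffuser's image features into the LLM's text embedding space (enabling text generation), and an adapter that maps LLM representations back into BiDiffuser's image space (enabling image generation). As a rough illustration of this idea only — not the paper's actual implementation — both bridges can be sketched as linear maps; all dimensions (768 for BiDiffuser features, 4096 for the LLM) and function names below are assumptions.

```python
import numpy as np

# Assumed dimensions, illustrative only (not taken from the paper):
IMG_FEAT_DIM = 768    # hypothetical BiDiffuser image-feature dimension
LLM_EMBED_DIM = 4096  # hypothetical LLM token-embedding dimension

rng = np.random.default_rng(0)

# Projection layer: BiDiffuser image features -> LLM text space.
W_proj = rng.standard_normal((IMG_FEAT_DIM, LLM_EMBED_DIM)) * 0.02
b_proj = np.zeros(LLM_EMBED_DIM)

def project_to_llm(image_feats: np.ndarray) -> np.ndarray:
    """Map a batch of image features into the LLM's embedding space."""
    return image_feats @ W_proj + b_proj

# Adapter: LLM representations -> BiDiffuser image space.
W_adapt = rng.standard_normal((LLM_EMBED_DIM, IMG_FEAT_DIM)) * 0.02
b_adapt = np.zeros(IMG_FEAT_DIM)

def adapt_to_diffuser(llm_hidden: np.ndarray) -> np.ndarray:
    """Map LLM representations back into the diffusion conditioning space."""
    return llm_hidden @ W_adapt + b_adapt

# A batch of 4 "image feature" vectors passes through both bridges.
feats = rng.standard_normal((4, IMG_FEAT_DIM))
llm_tokens = project_to_llm(feats)      # shape (4, 4096)
cond = adapt_to_diffuser(llm_tokens)    # shape (4, 768)
print(llm_tokens.shape, cond.shape)
```

In the actual system these maps would be trained (the projection on image-to-text data, the adapter on text-to-image data) while the diffusion model and LLM stay largely frozen; the sketch only shows where each bridge sits in the pipeline.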
Pages: 1351-1370
Page count: 20
Related Papers
50 records in total
  • [1] SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
    Yu, Lijun
    Cheng, Yong
    Wang, Zhiruo
    Kumar, Vivek
    Macherey, Wolfgang
    Huang, Yanping
    Ross, David A.
    Essa, Irfan
    Bisk, Yonatan
    Yang, Ming-Hsuan
    Murphy, Kevin
    Hauptmann, Alexander G.
    Jiang, Lu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [2] What is the limitation of multimodal LLMs? A deeper look into multimodal LLMs through prompt probing
    Qi, Shuhan
    Cao, Zhengying
    Rao, Jun
    Wang, Lei
    Xiao, Jing
    Wang, Xuan
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (06)
  • [3] Easing attosecond generation
    Thomas Pfeifer
    Nature Photonics, 2010, 4 : 417 - 418
  • [4] Merlin: Empowering Multimodal LLMs with Foresight Minds
    Yu, En
    Zhao, Liang
    Wei, Yana
    Yang, Jinrong
    Wu, Dongming
    Kong, Lingyu
    Wei, Haoran
    Wang, Tiancai
    Ge, Zheng
    Zhang, Xiangyu
    Tao, Wenbing
    COMPUTER VISION-ECCV 2024, PT IV, 2025, 15062 : 425 - 443
  • [5] Multimodal AI & LLMs for Peacekeeping and Emergency Response
    Jaimes, Alejandro
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3 - 4
  • [6] Stage Wizard: Enhancing Tangible Storytelling with Multimodal LLMs
    Han, Kuntong
    Tang, Keyang
    Wang, Meng
    PROCEEDINGS OF THE NINETEENTH INTERNATIONAL CONFERENCE ON TANGIBLE, EMBEDDED AND EMBODIED INTERACTION, TEI 2025, 2025,
  • [7] Next Generation Vulnerability Detection with LLMs
    Dalla Preda, Mila
    Marastoni, Niccolo
    Paci, Federica
    ERCIM NEWS, 2024, (139):
  • [8] FlowMind: Automatic Workflow Generation with LLMs
    Zeng, Zhen
    Watson, William
    Cho, Nicole
    Rahimi, Saba
    Reynolds, Shayleen
    Balch, Tucker
    Veloso, Manuela
    PROCEEDINGS OF THE 4TH ACM INTERNATIONAL CONFERENCE ON AI IN FINANCE, ICAIF 2023, 2023, : 73 - 81
  • [9] Evaluating LLMs for visualization generation and understanding
    Saadiq Rauf Khan
    Vinit Chandak
    Sougata Mukherjea
    Discover Data, 3 (1):
  • [10] The next generation of experimental research with LLMs
    Charness, Gary
    Jabarian, Brian
    List, John A.
    NATURE HUMAN BEHAVIOUR, 2025,