EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

Cited by: 0
Authors
Zhao, Xiangyu [1 ]
Liu, Bo [1 ]
Liu, Qijiong [1 ]
Shi, Guangyuan [1 ]
Wu, Xiao-Ming [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.
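The abstract describes two lightweight trainable bridges: a projection layer that maps BiDiffuser's image features into the LLM's text embedding space (enabling text generation), and an adapter that maps LLM representations back into BiDiffuser's image space (enabling image generation). As a rough illustration of this idea only — not the paper's actual implementation — both bridges can be sketched as linear maps; all dimensions (768 for BiDiffuser features, 4096 for the LLM) and function names below are assumptions.

```python
import numpy as np

# Assumed dimensions, illustrative only (not taken from the paper):
IMG_FEAT_DIM = 768    # hypothetical BiDiffuser image-feature dimension
LLM_EMBED_DIM = 4096  # hypothetical LLM token-embedding dimension

rng = np.random.default_rng(0)

# Projection layer: BiDiffuser image features -> LLM text space.
W_proj = rng.standard_normal((IMG_FEAT_DIM, LLM_EMBED_DIM)) * 0.02
b_proj = np.zeros(LLM_EMBED_DIM)

def project_to_llm(image_feats: np.ndarray) -> np.ndarray:
    """Map a batch of image features into the LLM's embedding space."""
    return image_feats @ W_proj + b_proj

# Adapter: LLM representations -> BiDiffuser image space.
W_adapt = rng.standard_normal((LLM_EMBED_DIM, IMG_FEAT_DIM)) * 0.02
b_adapt = np.zeros(IMG_FEAT_DIM)

def adapt_to_diffuser(llm_hidden: np.ndarray) -> np.ndarray:
    """Map LLM representations back into the diffusion conditioning space."""
    return llm_hidden @ W_adapt + b_adapt

# A batch of 4 "image feature" vectors passes through both bridges.
feats = rng.standard_normal((4, IMG_FEAT_DIM))
llm_tokens = project_to_llm(feats)      # shape (4, 4096)
cond = adapt_to_diffuser(llm_tokens)    # shape (4, 768)
print(llm_tokens.shape, cond.shape)
```

In the actual system these maps would be trained (the projection on image-to-text data, the adapter on text-to-image data) while the diffusion model and LLM stay largely frozen; the sketch only shows where each bridge sits in the pipeline.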
Pages: 1351-1370
Page count: 20
Related Papers
50 records in total
  • [1] SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
    Yu, Lijun
    Cheng, Yong
    Wang, Zhiruo
    Kumar, Vivek
    Macherey, Wolfgang
    Huang, Yanping
    Ross, David A.
    Essa, Irfan
    Bisk, Yonatan
    Yang, Ming-Hsuan
    Murphy, Kevin
    Hauptmann, Alexander G.
    Jiang, Lu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [2] What is the limitation of multimodal LLMs? A deeper look into multimodal LLMs through prompt probing
    Qi, Shuhan
    Cao, Zhengying
    Rao, Jun
    Wang, Lei
    Xiao, Jing
    Wang, Xuan
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (06)
  • [3] Easing attosecond generation
    Thomas Pfeifer
    Nature Photonics, 2010, 4 : 417 - 418
  • [4] Merlin: Empowering Multimodal LLMs with Foresight Minds
    Yu, En
    Zhao, Liang
    Wei, Yana
    Yang, Jinrong
    Wu, Dongming
    Kong, Lingyu
    Wei, Haoran
    Wang, Tiancai
    Ge, Zheng
    Zhang, Xiangyu
    Tao, Wenbing
    COMPUTER VISION-ECCV 2024, PT IV, 2025, 15062 : 425 - 443
  • [5] Multimodal AI & LLMs for Peacekeeping and Emergency Response
    Jaimes, Alejandro
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3 - 4
  • [6] Stage Wizard: Enhancing Tangible Storytelling with Multimodal LLMs
    Han, Kuntong
    Tang, Keyang
    Wang, Meng
    PROCEEDINGS OF THE NINETEENTH INTERNATIONAL CONFERENCE ON TANGIBLE, EMBEDDED AND EMBODIED INTERACTION, TEI 2025, 2025,
  • [7] Next Generation Vulnerability Detection with LLMs
    Dalla Preda, Mila
    Marastoni, Niccolo
    Paci, Federica
    ERCIM NEWS, 2024, (139):
  • [8] FlowMind: Automatic Workflow Generation with LLMs
    Zeng, Zhen
    Watson, William
    Cho, Nicole
    Rahimi, Saba
    Reynolds, Shayleen
    Balch, Tucker
    Veloso, Manuela
    PROCEEDINGS OF THE 4TH ACM INTERNATIONAL CONFERENCE ON AI IN FINANCE, ICAIF 2023, 2023, : 73 - 81
  • [9] Evaluating LLMs for visualization generation and understanding
    Saadiq Rauf Khan
    Vinit Chandak
    Sougata Mukherjea
    Discover Data, 3 (1):
  • [10] The next generation of experimental research with LLMs
    Charness, Gary
    Jabarian, Brian
    List, John A.
    NATURE HUMAN BEHAVIOUR, 2025,