MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Cited by: 0

Authors:

Bellagente, Marco [4]

Brack, Manuel [2,3]

Teufel, Hannah [1]

Friedrich, Felix [3,6]

Deiseroth, Bjoern [1,3,6]

Eichenberg, Constantin [1]

Dai, Andrew [1]

Baldock, Robert J. N. [1]

Nanda, Souradeep [5]

Oostermeijer, Koen [1]

Cruz-Salinas, Andres Felipe [1]

Schramowski, Patrick [2,3,6,8]

Kersting, Kristian [2,3,6,7]

Weinbach, Samuel [1]

Affiliations:

[1] Aleph Alpha, Heidelberg, Germany

[2] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany

[3] Tech Univ Darmstadt, Comp Sci Dept, Darmstadt, Germany

[4] Stabil AI, London, England

[5] Univ Texas Dallas, Dallas, TX USA

[6] Hessian AI, Darmstadt, Germany

[7] Tech Univ Darmstadt, Ctr Cognit Sci, Darmstadt, Germany

[8] LAION, Hamburg, Germany

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Funding:

EU Horizon 2020

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MULTIFUSION, which allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MULTIFUSION leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.


Pages: 20
