Factorizing Text-to-Video Generation by Explicit Image Conditioning

Cited by: 0

Authors:

Girdhar, Rohit [1]

Singh, Mannat [1]

Brown, Andrew [1]

Duval, Quentin [1]

Azadi, Samaneh [1]

Rambhatla, Sai Saketh [1]

Shah, Akbar [1]

Yin, Xi [1]

Parikh, Devi [1]

Misra, Ishan [1]

Affiliations:

[1] Meta, GenAI, New York, NY 10003 USA

Source:

COMPUTER VISION - ECCV 2024, PT LXII | 2025 / Vol. 15120

DOI:

10.1007/978-3-031-73033-7_12

Chinese Library Classification (CLC):

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion, and multi-stage training) that enable us to directly generate high-quality and high-resolution videos, without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality compared to all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.

Pages: 205-224

Page count: 20
