Factorizing Text-to-Video Generation by Explicit Image Conditioning

Cited by: 0
Authors
Girdhar, Rohit [1]
Singh, Mannat [1]
Brown, Andrew [1]
Duval, Quentin [1]
Azadi, Samaneh [1]
Rambhatla, Sai Saketh [1]
Shah, Akbar [1]
Yin, Xi [1]
Parikh, Devi [1]
Misra, Ishan [1]
Affiliations
[1] Meta, GenAI, New York, NY 10003 USA
DOI
10.1007/978-3-031-73033-7_12
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion, and multi-stage training) that enable us to directly generate high quality and high resolution videos, without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality compared to all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.
Pages: 205-224
Page count: 20
Related papers
50 items in total
  • [31] Text-to-Video: Story Illustration from Online Photo Collections
    Schwarz, Katharina
    Rojtberg, Pavel
    Caspar, Joachim
    Gurevych, Iryna
    Goesele, Michael
    Lensch, Hendrik P. A.
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT IV, 2010, 6279 : 402 - +
  • [32] ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation
    Liu, Jiawei
    Wang, Weining
    Liu, Wei
    He, Qian
    Liu, Jing
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [33] Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval
    Hu, Fan
    Chen, Aozhu
    Wang, Ziyue
    Zhou, Fangming
    Dong, Jianfeng
    Li, Xirong
    COMPUTER VISION - ECCV 2022, PT XIV, 2022, 13674 : 444 - 461
  • [34] Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos
    Ma, Yue
    He, Yingqing
    Cun, Xiaodong
    Wang, Xintao
    Chen, Siran
    Li, Xiu
    Chen, Qifeng
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5, 2024, : 4117 - 4125
  • [35] Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
    Ibrahimi, Sarah
    Sun, Xiaohang
    Wang, Pichao
    Garg, Amanmeet
    Sanan, Ashutosh
    Omar, Mohamed
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 12020 - 12030
  • [36] Multi-Conditional Generative Adversarial Network for Text-to-Video Synthesis
    Zhou R.
    Jiang C.
    Xu Q.
    Li Y.
    Zhang C.
    Song Y.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2022, 34 (10): : 1567 - 1579
  • [37] Write What You Want: Applying Text-to-Video Retrieval to Audiovisual Archives
    Yang, Yuchen
    ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE, 2023, 16 (04):
  • [38] Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks
    Rodriguez, Pedro
    Azab, Mahmoud
    Silvert, Becka
    Sanchez, Renato
    Labson, Linzy
    Shah, Hardik
    Moon, Seungwhan
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 47 - 68
  • [39] Text-to-Video Models and Sora in Plastic Surgery: Pearls, Pitfalls, and Prospectives
    Kang, Yuanbo
    Wang, Sifan
    Zhu, Lin
    AESTHETIC PLASTIC SURGERY, 2024,
  • [40] Relation Triplet Construction for Cross-modal Text-to-Video Retrieval
    Song, Xue
    Chen, Jingjing
    Jiang, Yu-Gang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4759 - 4767