Factorizing Text-to-Video Generation by Explicit Image Conditioning

Cited: 0
Authors
Girdhar, Rohit [1 ]
Singh, Mannat [1 ]
Brown, Andrew [1 ]
Duval, Quentin [1 ]
Azadi, Samaneh [1 ]
Rambhatla, Sai Saketh [1 ]
Shah, Akbar [1 ]
Yin, Xi [1 ]
Parikh, Devi [1 ]
Misra, Ishan [1 ]
Affiliations
[1] Meta, GenAI, New York, NY 10003 USA
Source
Keywords
DOI
10.1007/978-3-031-73033-7_12
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present Emu Video, a text-to-video generation model that factorizes generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion, and multi-stage training) that enable us to directly generate high-quality, high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorized approach naturally lends itself to animating images from a user's text prompt, where our generations are preferred 96% of the time over prior work.
Pages: 205 - 224
Number of pages: 20
Related Papers
50 items in total
  • [41] HOW TEXT-TO-VIDEO TOOL SORA COULD SHAPE SCIENCE - AND SOCIETY
    O'Callaghan, Jonathan
    NATURE, 2024, 627 (8004) : 475 - 476
  • [42] HARIVO: Harnessing Text-to-Image Models for Video Generation
    Kwon, Mingi
    Oh, Seoung Wug
    Zhou, Yang
    Liu, Difan
    Lee, Joon-Young
    Cai, Haoran
    Liu, Baqiao
    Liu, Feng
    Uh, Youngjung
    COMPUTER VISION - ECCV 2024, PT LIII, 2025, 15111 : 19 - 36
  • [43] Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
    Zhu, Zixin
    Feng, Xuelu
    Chen, Dongdong
    Yuan, Junsong
    Qiao, Chunming
    Hua, Gang
    COMPUTER VISION - ECCV 2024, PT XII, 2025, 15070 : 452 - 469
  • [44] Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models
    Ren, Yixuan
    Zhou, Yang
    Yang, Jimei
    Shi, Jing
    Liu, Difan
    Liu, Feng
    Kwon, Mingi
    Shrivastava, Abhinav
    COMPUTER VISION - ECCV 2024, PT LXXXIX, 2025, 15147 : 332 - 349
  • [45] Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary
    Hao, Jiachang
    Sun, Haifeng
    Ren, Pengfei
    Zhong, Yiming
    Wang, Jingyu
    Qi, Qi
    Liao, Jianxin
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (05)
  • [46] Commentary on: "Text-to-Video Models and Sora in Plastic Surgery: Pearls, Pitfalls, and Prospectives"
    Najafali, Daniel
    Galbraith, Logan G.
    Mehrzad, Raman
    AESTHETIC PLASTIC SURGERY, 2025,
  • [47] Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval
    Yin, Shukang
    Zhao, Sirui
    Wang, Hao
    Xu, Tong
    Chen, Enhong
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (10)
  • [48] Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
    Yuan, Xin
    Baek, Jinoo
    Xu, Keyang
    Tov, Omer
    Fei, Hongliang
    2024 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS, WACVW 2024, 2024, : 489 - 496
  • [49] Reading-Strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
    Dong, Jianfeng
    Wang, Yabing
    Chen, Xianke
    Qu, Xiaoye
    Li, Xirong
    He, Yuan
    Wang, Xun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (08) : 5680 - 5694
  • [50] T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
    Miao, Yibo
    Zhu, Yifan
    Dong, Yinpeng
    Yu, Lijia
    Zhu, Jun
    Gao, Xiao-Shan
    arXiv