Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Cited by: 4
Authors
Shi, Xiaoyu [1]
Huang, Zhaoyang [2]
Wang, Fu-Yun [1]
Bian, Weikang [1]
Li, Dasong [1]
Zhang, Yi [1]
Zhang, Manyuan [1]
Cheung, Ka Chun [3]
See, Simon [3]
Qin, Hongwei [4]
Dai, Jifeng [5]
Li, Hongsheng [1,6]
Affiliations
[1] CUHK, Multimedia Lab, Shenzhen, Peoples R China
[2] Avolution AI, London, England
[3] NVIDIA AI Technol Ctr, Shenzhen, Peoples R China
[4] SenseTime, Hong Kong, Peoples R China
[5] Tsinghua Univ, Beijing, Peoples R China
[6] Shanghai AI Lab, Ctr Perceptual & Interact Intelligence CPII, Shanghai, Peoples R China
Source
PROCEEDINGS OF SIGGRAPH 2024 CONFERENCE PAPERS | 2024
Funding
National Key R&D Program of China
Keywords
Diffusion models; Image animation
DOI
10.1145/3641519.3657497
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We introduce Motion-I2V, a novel framework for consistent and controllable text-guided image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate reference image features to synthesized frames under the guidance of the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V generates more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V enables users to precisely control motion trajectories and motion regions with sparse trajectory and region annotations, offering finer control over the I2V process than textual instructions alone. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at https://xiaoyushi97.github.io/Motion-I2V/.
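To make the second stage concrete, below is a minimal PyTorch sketch of motion-guided feature propagation, under stated assumptions: the helper names warp_by_flow and motion_augmented_attention are hypothetical, and the paper's actual module augments the 1-D temporal attention of a video latent diffusion model by gathering features along predicted trajectories, whereas this sketch simplifies that to cross-attention over a single flow-warped reference feature map.

import torch
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    # Backward-warp reference features by a predicted motion field.
    # feat: (B, C, H, W) reference-image features
    # flow: (B, 2, H, W) per-pixel displacement (x, y) in pixels
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat.device),
        torch.arange(W, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float()            # (2, H, W), pixel coordinates
    coords = grid.unsqueeze(0) + flow                      # displaced sampling positions
    # Normalize to [-1, 1] as required by grid_sample.
    cx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    cy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    norm_grid = torch.stack((cx, cy), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(feat, norm_grid, align_corners=True)

def motion_augmented_attention(frame_feat, ref_feat, flow, to_q, to_k, to_v):
    # Each synthesized frame attends to reference features that have been
    # warped along the stage-1 trajectories, so appearance follows the motion.
    # frame_feat: (B, C, H, W) features of the frame being denoised
    # ref_feat:   (B, C, H, W) features of the reference image
    # to_q/to_k/to_v: torch.nn.Linear(C, C) projections
    B, C, H, W = frame_feat.shape
    warped = warp_by_flow(ref_feat, flow)                  # align reference to this frame
    q = to_q(frame_feat.flatten(2).transpose(1, 2))        # (B, HW, C)
    k = to_k(warped.flatten(2).transpose(1, 2))
    v = to_v(warped.flatten(2).transpose(1, 2))
    out = F.scaled_dot_product_attention(q, k, v)          # (B, HW, C)
    return out.transpose(1, 2).reshape(B, C, H, W)

Warping the reference features before attending is the design point the abstract emphasizes: it aligns appearance with the predicted trajectories, which is what lets consistency survive large motion and viewpoint changes.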
Pages: 11
Related Papers
42 in total
  • [1] Decouple Content and Motion for Conditional Image-to-Video Generation
    Shen, Cuifeng
    Gan, Yulu
    Chen, Chen
    Zhu, Xiongwei
    Cheng, Lele
    Gao, Tingting
    Wang, Jinzhi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5, 2024, : 4757 - 4765
  • [2] Learning to Forecast and Refine Residual Motion for Image-to-Video Generation
    Zhao, Long
    Peng, Xi
    Tian, Yu
    Kapadia, Mubbasir
    Metaxas, Dimitris
    COMPUTER VISION - ECCV 2018, PT 15, 2018, 11219 : 403 - 419
  • [3] A Benchmark for Controllable Text-Image-to-Video Generation
    Hu, Yaosi
    Luo, Chong
    Chen, Zhenzhong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1706 - 1719
  • [4] Make It Move: Controllable Image-to-Video Generation with Text Descriptions
    Hu, Yaosi
    Luo, Chong
    Chen, Zhenzhong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18198 - 18207
  • [5] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
    Hu, Li
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 8153 - 8163
  • [6] Activity Image-to-Video Retrieval by Disentangling Appearance and Motion
    Liu, Liu
    Li, Jiangtong
    Niu, Li
    Xu, Ruicong
    Zhang, Liqing
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2145 - 2153
  • [7] Controllable Image-to-Video Translation: A Case Study on Facial Expression Generation
    Fan, Lijie
    Huang, Wenbing
    Gan, Chuang
    Huang, Junzhou
    Gong, Boqing
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 3510 - 3517
  • [8] I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models
    Guo, Xun
    Zheng, Mingwu
    Hou, Liang
    Gao, Yuan
    Deng, Yufan
    Wan, Pengfei
    Zhang, Di
    Liu, Yufan
    Hu, Weiming
    Zha, Zhengjun
    Huang, Haibin
    Ma, Chongyang
    PROCEEDINGS OF SIGGRAPH 2024 CONFERENCE PAPERS, 2024,
  • [9] Dual-MTGAN: Stochastic and Deterministic Motion Transfer for Image-to-Video Synthesis
    Yang, Fu-En
    Chang, Jing-Cheng
    Lee, Yuan-Hao
    Wang, Yu-Chiang Frank
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 6764 - 6771
  • [10] Controllable Video Generation Through Global and Local Motion Dynamics
    Davtyan, Aram
    Favaro, Paolo
    COMPUTER VISION - ECCV 2022, PT XVII, 2022, 13677 : 68 - 84