Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Cited by: 4
Authors
Shi, Xiaoyu [1 ]
Huang, Zhaoyang [2 ]
Wang, Fu-Yun [1 ]
Bian, Weikang [1 ]
Li, Dasong [1 ]
Zhang, Yi [1 ]
Zhang, Manyuan [1 ]
Cheung, Ka Chun [3 ]
See, Simon [3 ]
Qin, Hongwei [4 ]
Dai, Jifeng [5 ]
Li, Hongsheng [1 ,6 ]
Affiliations
[1] CUHK, Multimedia Lab, Shenzhen, Peoples R China
[2] Avolution AI, London, England
[3] NVIDIA AI Technol Ctr, Shenzhen, Peoples R China
[4] SenseTime, Hong Kong, Peoples R China
[5] Tsinghua Univ, Beijing, Peoples R China
[6] Shanghai AI Lab, Ctr Perceptual & Interact Intelligence CPII, Shanghai, Peoples R China
Source
PROCEEDINGS OF SIGGRAPH 2024 CONFERENCE PAPERS | 2024
Funding
National Key R&D Program of China
Keywords
Diffusion models; Image animation;
DOI
10.1145/3641519.3657497
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We introduce Motion-I2V, a novel framework for consistent and controllable text-guided image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate reference image features to synthesized frames under the guidance of trajectories predicted in the first stage. Compared with existing methods, Motion-I2V generates more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V allows users to precisely control motion trajectories and motion regions with sparse trajectory and region annotations, offering finer control of the I2V process than textual instructions alone. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at https://xiaoyushi97.github.io/Motion-I2V/.
Pages: 11