Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Cited by: 4

Authors:

Shi, Xiaoyu [1]

Huang, Zhaoyang [2]

Wang, Fu-Yun [1]

Bian, Weikang [1]

Li, Dasong [1]

Zhang, Yi [1]

Zhang, Manyuan [1]

Cheung, Ka Chun [3]

See, Simon [3]

Qin, Hongwei [4]

Dai, Jifeng [5]

Li, Hongsheng [1,6]

Affiliations:

[1] CUHK, Multimedia Lab, Shenzhen, Peoples R China

[2] Avolution AI, London, England

[3] NVIDIA AI Technol Ctr, Shenzhen, Peoples R China

[4] SenseTime, Hong Kong, Peoples R China

[5] Tsinghua Univ, Beijing, Peoples R China

[6] Shanghai AI Lab, Ctr Perceptual & Interact Intelligence CPII, Shanghai, Peoples R China

Source:

PROCEEDINGS OF SIGGRAPH 2024 CONFERENCE PAPERS | 2024

Funding:

National Key R&D Program of China;

Keywords:

Diffusion models; Image animation;

DOI:

10.1145/3641519.3657497

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We introduce Motion-I2V, a novel framework for consistent and controllable text-guided image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate reference image features to synthesized frames with the guidance of predicted trajectories from the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V can support users to precisely control motion trajectories and motion regions with sparse trajectory and region. This offers more controllability of the I2V process than solely relying on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at https://xiaoyushi97.github.io/Motion-I2V/.

Pages: 11
