TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models

Cited by: 2
|
Authors
Zhang, Zhongwei [1 ]
Long, Fuchen [2 ]
Pan, Yingwei [2 ]
Qiu, Zhaofan [2 ]
Yao, Ting [2 ]
Cao, Yang [1 ]
Mei, Tao [2 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Peoples R China
[2] HiDream AI Inc, Futian, Peoples R China
DOI
10.1109/CVPR52733.2024.00828
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when shaping diffusion models to animate a static image (i.e., image-to-video generation). The difficulty originates from the fact that the diffusion process of subsequent animated frames should not only preserve faithful alignment with the given image but also pursue temporal coherence among adjacent frames. To alleviate this, we present TRIP, a new recipe for the image-to-video diffusion paradigm that pivots on an image noise prior derived from the static image to jointly trigger inter-frame relational reasoning and ease coherent temporal modeling via temporal residual learning. Technically, the image noise prior is first attained through a one-step backward diffusion process based on both the static image and noised video latent codes. Next, TRIP executes a residual-like dual-path scheme for noise prediction: 1) a shortcut path that directly takes the image noise prior as the reference noise of each frame to amplify the alignment between the first frame and subsequent frames; 2) a residual path that employs a 3D-UNet over the noised video and static image latent codes to enable inter-frame relational reasoning, thereby easing the learning of the residual noise for each frame. Furthermore, the reference and residual noise of each frame are dynamically merged via an attention mechanism for final video generation. Extensive experiments on the WebVid-10M, DTDB and MSR-VTT datasets demonstrate the effectiveness of TRIP for image-to-video generation. Please see our project page at https://trip-i2v.github.io/TRIP/.
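The "one-step backward diffusion" mentioned in the abstract can be illustrated with a minimal sketch. Assuming the standard DDPM forward process z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, one can substitute the static-image latent for z_0 and solve for eps in closed form, yielding a reference-noise estimate for each frame. The function name and shapes below are illustrative, not the paper's actual implementation.

```python
import numpy as np

def image_noise_prior(noised_frame_latent, image_latent, alpha_bar_t):
    """One-step backward estimate of a frame's reference noise.

    Inverts the DDPM forward step
        z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
    with the static-image latent standing in for z_0, so the recovered
    noise ties each animated frame back to the given first image.
    """
    return (noised_frame_latent - np.sqrt(alpha_bar_t) * image_latent) \
        / np.sqrt(1.0 - alpha_bar_t)

# Toy sanity check: if a frame is noised exactly from the image latent,
# the recovered prior equals the injected Gaussian noise.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))    # static-image latent (C, H, W)
eps = rng.standard_normal((4, 8, 8))   # injected noise
alpha_bar_t = 0.7
zt = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
prior = image_noise_prior(zt, z0, alpha_bar_t)
print(np.allclose(prior, eps))  # → True
```

In TRIP this prior serves only as the shortcut-path reference; the residual path's 3D-UNet then predicts a per-frame correction on top of it.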
Pages: 8671-8681
Page count: 11
Related Papers
50 records in total
  • [1] TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
    Zhang, Zhongwei
    Long, Fuchen
    Pan, Yingwei
    Qiu, Zhaofan
    Yao, Ting
    Cao, Yang
    Mei, Tao
    arXiv
  • [2] Learning to Forecast and Refine Residual Motion for Image-to-Video Generation
    Zhao, Long
    Peng, Xi
    Tian, Yu
    Kapadia, Mubbasir
    Metaxas, Dimitris
    COMPUTER VISION - ECCV 2018, PT 15, 2018, 11219 : 403 - 419
  • [3] Conditional Image-to-Video Generation with Latent Flow Diffusion Models
    Ni, Haomiao
    Shi, Changhao
    Li, Kai
    Huang, Sharon X.
    Min, Martin Renqiang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18444 - 18455
  • [4] Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
    Qing, Zhiwu
    Zhang, Shiwei
    Huang, Ziyuan
    Zhang, Yingya
    Gao, Changxin
    Zhao, Deli
    Sang, Nong
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 13888 - 13898
  • [5] Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models
    Chen, Tingxiu
    Shi, Yilei
    Zheng, Zixuan
    Yan, Bingcong
    Hu, Jingliang
    Zhu, Xiao Xiang
    Mou, Lichao
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT IV, 2024, 15004 : 764 - 774
  • [6] R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
    Liu, Ye
    He, Jixuan
    Li, Wanhua
    Kim, Junsik
    Wei, Donglai
    Pfister, Hanspeter
    Chen, Chang Wen
    COMPUTER VISION - ECCV 2024, PT XLI, 2025, 15099 : 421 - 438
  • [7] I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models
    Guo, Xun
    Zheng, Mingwu
    Hou, Liang
    Gao, Yuan
    Deng, Yufan
    Wan, Pengfei
    Zhang, Di
    Liu, Yufan
    Hu, Weiming
    Zha, Zhengjun
    Huang, Haibin
    Ma, Chongyang
    PROCEEDINGS OF SIGGRAPH 2024 CONFERENCE PAPERS, 2024,
  • [8] DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion
    Karras, Johanna
    Holynski, Aleksander
    Wang, Ting-Chun
    Kemelmacher-Shlizerman, Ira
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22623 - 22633
  • [9] Temporal Complementarity-Guided Reinforcement Learning for Image-to-Video Person Re-Identification
    Wu, Wei
    Liu, Jiawei
    Zheng, Kecheng
    Sun, Qibin
    Zha, Zheng-Jun
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 7309 - 7318
  • [10] Spatial-temporal Causal Inference for Partial Image-to-video Adaptation
    Chen, Jin
    Wu, Xinxiao
    Hu, Yao
    Luo, Jiebo
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 1027 - 1035