TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models

Cited by: 2
Authors
Zhang, Zhongwei [1]
Long, Fuchen [2]
Pan, Yingwei [2]
Qiu, Zhaofan [2]
Yao, Ting [2]
Cao, Yang [1]
Mei, Tao [2]
Affiliations
[1] Univ Sci & Technol China, Hefei, Peoples R China
[2] HiDream AI Inc, Futian, Peoples R China
DOI
10.1109/CVPR52733.2024.00828
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when shaping diffusion models to animate a static image (i.e., image-to-video generation). The difficulty stems from the fact that the diffusion process of subsequent animated frames should not only preserve faithful alignment with the given image but also pursue temporal coherence among adjacent frames. To alleviate this, we present TRIP, a new recipe for the image-to-video diffusion paradigm that pivots on an image noise prior derived from the static image to jointly trigger inter-frame relational reasoning and ease coherent temporal modeling via temporal residual learning. Technically, the image noise prior is first attained through a one-step backward diffusion process based on both the static image and the noised video latent codes. Next, TRIP executes a residual-like dual-path scheme for noise prediction: 1) a shortcut path that directly takes the image noise prior as the reference noise of each frame to amplify the alignment between the first frame and subsequent frames; 2) a residual path that employs a 3D-UNet over the noised video and static image latent codes to enable inter-frame relational reasoning, thereby easing the learning of the residual noise for each frame. Furthermore, the reference and residual noise of each frame are dynamically merged via an attention mechanism for final video generation. Extensive experiments on the WebVid-10M, DTDB and MSR-VTT datasets demonstrate the effectiveness of TRIP for image-to-video generation. Please see our project page at https://trip-i2v.github.io/TRIP/.
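The dual-path idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard DDPM forward process z_t = sqrt(a_t) * z_0 + sqrt(1 - a_t) * eps, represents each frame latent as a flat list of floats, and replaces the learned attention-based merge with a single scalar weight. The function names (add_noise, image_noise_prior, merge_noise) and the weight parameter are hypothetical, and the residual noise that a 3D-UNet would predict is simply passed in as an argument.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion per frame: z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [[a * z + b * e for z, e in zip(fz, fe)] for fz, fe in zip(z0, eps)]


def image_noise_prior(z_t, z0_first, alpha_bar_t):
    """One-step backward estimate of each frame's noise.

    Assumption: the first-frame latent stands in for every frame's clean
    latent, so the estimate is (z_t - sqrt(a) * z_0_first) / sqrt(1 - a).
    """
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [[(zt - a * z1) / b for zt, z1 in zip(frame, z0_first)] for frame in z_t]


def merge_noise(prior, residual, weight):
    """Blend reference and residual noise per frame.

    Assumption: a fixed scalar weight replaces the learned, attention-based
    merge described in the abstract.
    """
    return [
        [weight * p + (1.0 - weight) * r for p, r in zip(fp, fr)]
        for fp, fr in zip(prior, residual)
    ]


# Toy example: a 3-frame clip whose frames all equal the first frame.
random.seed(0)
first_frame = [random.gauss(0.0, 1.0) for _ in range(4)]
video = [list(first_frame) for _ in range(3)]
eps = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]

z_t = add_noise(video, eps, alpha_bar_t=0.5)
prior = image_noise_prior(z_t, first_frame, alpha_bar_t=0.5)
# Placeholder for the residual path's output (here: all zeros).
residual = [[0.0] * 4 for _ in range(3)]
merged = merge_noise(prior, residual, weight=0.7)
```

For a perfectly static clip the prior recovers the injected noise up to floating-point error, which suggests why it can serve as a reference: the residual path then only has to account for how each frame departs from the first one.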
Pages: 8671-8681
Page count: 11