WAVE: Warping DDIM Inversion Features for Zero-Shot Text-to-Video Editing

Cited by: 0
Authors
Feng, Yutang [1 ,5 ]
Gao, Sicheng [1 ,3 ]
Bao, Yuxiang [1 ]
Wang, Xiaodi [2 ]
Han, Shumin [1 ,2 ]
Zhang, Juan [1 ]
Zhang, Baochang [1 ,4 ]
Yao, Angela [3 ]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Baidu VIS, Beijing, Peoples R China
[3] Natl Univ Singapore, Singapore, Singapore
[4] Zhongguancun Lab, Beijing, Peoples R China
[5] Baidu, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation; National Research Foundation of Singapore
Keywords
Text-to-video editing; DDIM inversion; Flow-guided warping
DOI
10.1007/978-3-031-73116-7_3
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text-driven video editing has emerged as a prominent application built on the breakthroughs of image diffusion models. Existing state-of-the-art methods focus on zero-shot frameworks due to limited training data and computing resources. To preserve structural consistency, previous frameworks usually employ Denoising Diffusion Implicit Model (DDIM) inversion to provide inverted noise latents as guidance. The key challenge lies in limiting the errors caused by randomness and inaccuracy at each step of the naive DDIM inversion process, which can lead to temporal inconsistency in video editing. We observe that incorporating temporal keyframe information can alleviate this accumulated error during inversion. In this paper, we propose an effective warping strategy in the feature domain to obtain high-quality DDIM-inverted noise latents. Specifically, we shuffle the editing frames randomly at each timestep and use optical flow extracted from the source video to propagate the latent features of the first keyframe to subsequent keyframes. Moreover, we develop a comprehensive zero-shot framework that adapts this strategy to both the inversion and denoising processes, thereby facilitating the generation of consistent edited videos. We compare our method with state-of-the-art text-driven editing methods on various real-world videos with different forms of motion. The project page is available at https://ree1s.github.io/wave/.
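The flow-guided propagation described in the abstract amounts to backward warping in the latent feature domain: the current frame's latent is reconstructed by sampling the keyframe's latent at positions displaced by the optical flow. The sketch below is a minimal NumPy illustration of that operation, not the paper's implementation; the function name `warp_latent`, the occlusion-mask fallback, and the assumption that flow has been resized to latent resolution are all hypothetical details added for clarity.

```python
import numpy as np

def warp_latent(z_key, flow, z_own=None, valid=None):
    """Backward-warp a keyframe latent to the current frame (illustrative sketch).

    z_key : (C, H, W) latent features of the keyframe.
    flow  : (2, H, W) flow mapping current-frame positions back to
            keyframe coordinates (dx, dy), assumed resized to latent
            resolution (hypothetical pipeline detail).
    z_own : optional (C, H, W) latent of the current frame, used as a
            fallback where `valid` is False (e.g. occluded regions).
    valid : optional (H, W) boolean mask of reliable flow.
    """
    C, H, W = z_key.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # Positions to sample in the keyframe, clamped to the latent grid.
    sx = np.clip(xs + flow[0], 0, W - 1)
    sy = np.clip(ys + flow[1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    # Bilinear interpolation over the four neighbouring latent cells.
    out = (z_key[:, y0, x0] * (1 - wx) * (1 - wy)
           + z_key[:, y0, x1] * wx * (1 - wy)
           + z_key[:, y1, x0] * (1 - wx) * wy
           + z_key[:, y1, x1] * wx * wy)
    if z_own is not None and valid is not None:
        # Keep the frame's own inverted latent where the flow is unreliable.
        out = np.where(valid[None], out, z_own)
    return out
```

With zero flow the warp is the identity, and an integer horizontal flow simply shifts the latent columns, which is the behaviour one would expect from backward warping.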
Pages: 38-55 (18 pages)
Related Papers (50 total, first 10 shown)
  • [1] Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
    Huang, Hanzhuo
    Feng, Yufan
    Shi, Cheng
    Xu, Lan
    Yu, Jingyi
    Yang, Sibei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [2] FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
    Qi, Chenyang
    Cun, Xiaodong
    Zhang, Yong
    Lei, Chenyang
    Wang, Xintao
    Shan, Ying
    Chen, Qifeng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15886 - 15896
  • [3] VidToMe: Video Token Merging for Zero-Shot Video Editing
    Li, Xirui
    Ma, Chao
    Yang, Xiaokang
    Yang, Ming-Hsuan
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 7486 - 7495
  • [4] A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
    Li, Maomao
    Li, Yu
    Yang, Tianyu
    Liu, Yunfei
    Yue, Dongxu
    Lin, Zhihui
    Xu, Dong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 7528 - 7537
  • [5] INFUSION: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing
    Khandelwal, Anant
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 3009 - 3018
  • [6] VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild
    Peng, Puyuan
    Huang, Po-Yao
    Li, Shang-Wen
    Mohamed, Abdelrahman
    Harwath, David
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 12442 - 12462
  • [7] Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
    Khachatryan, Levon
    Movsisyan, Andranik
    Tadevosyan, Vahram
    Henschel, Roberto
    Wang, Zhangyang
    Navasardyan, Shant
    Shi, Humphrey
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15908 - 15918
  • [8] Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
    Yang, Shuai
    Zhou, Yifan
    Liu, Ziwei
    Loy, Chen Change
    PROCEEDINGS OF THE SIGGRAPH ASIA 2023 CONFERENCE PAPERS, 2023,
  • [9] Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings
    Jiang, Xun
    Xu, Xing
    Zhou, Zailei
    Yang, Yang
    Shen, Fumin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9657 - 9670
  • [10] Zero-Shot Turkish Text Classification
    Birim, Ahmet
    Erden, Mustafa
    Arslan, Levent M.
    29TH IEEE CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS (SIU 2021), 2021,