WAVE: Warping DDIM Inversion Features for Zero-Shot Text-to-Video Editing

Cited by: 0
Authors
Feng, Yutang [1 ,5 ]
Gao, Sicheng [1 ,3 ]
Bao, Yuxiang [1 ]
Wang, Xiaodi [2 ]
Han, Shumin [1 ,2 ]
Zhang, Juan [1 ]
Zhang, Baochang [1 ,4 ]
Yao, Angela [3 ]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Baidu VIS, Beijing, Peoples R China
[3] Natl Univ Singapore, Singapore, Singapore
[4] Zhongguancun Lab, Beijing, Peoples R China
[5] Baidu, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation; National Research Foundation of Singapore
Keywords
Text-to-video editing; DDIM inversion; Flow-guided warping
DOI
10.1007/978-3-031-73116-7_3
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Discipline Classification Code
081104; 0812; 0835; 1405
Abstract
Text-driven video editing has emerged as a prominent application built on the breakthroughs of image diffusion models. Existing state-of-the-art methods focus on zero-shot frameworks due to limited training data and computing resources. To preserve structural consistency, previous frameworks usually employ Denoising Diffusion Implicit Model (DDIM) inversion to provide inverted noise latents as guidance. The key challenge lies in limiting errors caused by the randomness and inaccuracy in each step of the naive DDIM inversion process, which can lead to temporal inconsistency in video editing tasks. We observe that incorporating temporal keyframe information can alleviate the error accumulated during inversion. In this paper, we propose an effective warping strategy in the feature domain to obtain high-quality DDIM-inverted noise latents. Specifically, at each timestep we randomly shuffle the editing frames and use optical flow extracted from the source video to propagate the latent features of the first keyframe to subsequent keyframes. Moreover, we develop a comprehensive zero-shot framework that adapts this strategy to both the inversion and denoising processes, thereby facilitating the generation of consistent edited videos. We compare our method with state-of-the-art text-driven editing methods on various real-world videos with different forms of motion. The project page is available at https://ree1s.github.io/wave/.
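To make the flow-guided propagation step concrete, the following is a minimal sketch (not the authors' released code) of how a first-keyframe latent could be warped to a subsequent keyframe with a dense optical flow field. It assumes latents of shape (B, C, H, W), a backward flow field of shape (B, 2, H, W) from the target keyframe to the first keyframe that has already been estimated from the source video and resized to the latent resolution; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def warp_latent_with_flow(keyframe_latent: torch.Tensor,
                          flow: torch.Tensor) -> torch.Tensor:
    """Warp a first-keyframe latent to a target frame via a dense flow field.

    keyframe_latent: (B, C, H, W) DDIM-inverted latent of the first keyframe.
    flow:            (B, 2, H, W) backward flow (target -> first keyframe),
                     in pixel units at the latent resolution (assumption).
    """
    b, _, h, w = keyframe_latent.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device, dtype=flow.dtype),
        torch.arange(w, device=flow.device, dtype=flow.dtype),
        indexing="ij")
    # Displace each target-pixel coordinate by the flow, then normalize to
    # the [-1, 1] range expected by grid_sample (x first, then y).
    x = (xs.unsqueeze(0) + flow[:, 0]) * 2.0 / (w - 1) - 1.0
    y = (ys.unsqueeze(0) + flow[:, 1]) * 2.0 / (h - 1) - 1.0
    grid = torch.stack((x, y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(keyframe_latent, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

In the paper's pipeline this warped latent would serve as keyframe guidance during inversion and denoising; the sketch only covers the warping operation itself.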
Pages: 38-55
Number of pages: 18