WAVE: Warping DDIM Inversion Features for Zero-Shot Text-to-Video Editing

Times cited: 0

Authors:

Feng, Yutang [1,5]

Gao, Sicheng [1,3]

Bao, Yuxiang [1]

Wang, Xiaodi [2]

Han, Shumin [1,2]

Zhang, Juan [1]

Zhang, Baochang [1,4]

Yao, Angela [3]

Affiliations:

[1] Beihang University, Beijing, People's Republic of China

[2] Baidu VIS, Beijing, People's Republic of China

[3] National University of Singapore, Singapore, Singapore

[4] Zhongguancun Laboratory, Beijing, People's Republic of China

[5] Baidu, Beijing, People's Republic of China

Source:

COMPUTER VISION - ECCV 2024, PT LXXVI | 2025 / Vol. 15134

Funding:

National Natural Science Foundation of China; Beijing Natural Science Foundation; National Research Foundation, Singapore

Keywords:

Text to video editing; DDIM inversion; Flow-guided warping

DOI:

10.1007/978-3-031-73116-7_3

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Text-driven video editing has emerged as a prominent application based on the breakthroughs of image diffusion models. Existing state-of-the-art methods focus on zero-shot frameworks due to limited training data and computing resources. To preserve structure consistency, previous frameworks usually employ Denoising Diffusion Implicit Model (DDIM) inversion to provide inverted noise latents as guidance. The key challenge lies in limiting errors caused by the randomness and inaccuracy in each step of the naive DDIM inversion process, which can lead to temporal inconsistency in video editing tasks. Our observation indicates that incorporating temporal keyframe information can alleviate the accumulated error during inversion. In this paper, we propose an effective warping strategy in the feature domain to obtain high-quality DDIM inverted noise latents. Specifically, we shuffle the editing frames randomly in each timestep and use optical flow extracted from the source video to propagate the latent features of the first keyframe to subsequent keyframes. Moreover, we develop a comprehensive zero-shot framework that adapts to this strategy in both the inversion and denoising processes, thereby facilitating the generation of consistent edited videos. We compare our method with state-of-the-art text-driven editing methods on various real-world videos with different forms of motion. The project page is available at https://ree1s.github.io/wave/.
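
As a rough illustration of the two ingredients named in the abstract, the Python sketch below shows a generic deterministic DDIM inversion step and a nearest-neighbour flow warp of a 2D feature map. It is not the authors' implementation: the function names, the list-based toy data layout, the border clamping, and the flow convention (each target position stores an offset to its source position in the keyframe) are all assumptions made for illustration. The abstract does not specify how WAVE combines the two steps, so they are shown here as independent building blocks.

```python
import math


def ddim_inversion_step(x_t, eps, alpha_t, alpha_next):
    """One deterministic DDIM inversion step on a flat list of latent values.

    alpha_t and alpha_next are cumulative noise-schedule products for the
    current and the next (noisier) timestep. eps is the noise predicted at
    the current step; here the caller supplies it instead of a real model.
    """
    out = []
    for x, e in zip(x_t, eps):
        # Estimate the clean latent from the current noisy latent.
        x0_pred = (x - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
        # Re-noise the estimate to the next timestep with the same eps.
        out.append(math.sqrt(alpha_next) * x0_pred
                   + math.sqrt(1.0 - alpha_next) * e)
    return out


def warp_features(feat, flow):
    """Backward-warp a 2D keyframe feature map to a target frame.

    feat[y][x] is the keyframe feature value. flow[y][x] = (dx, dy) is the
    offset (assumed convention) from target position (x, y) to the matching
    source position in the keyframe. Sampling is nearest-neighbour, and
    source coordinates are clamped to the map borders.
    """
    h, w = len(feat), len(feat[0])
    warped = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            warped[y][x] = feat[sy][sx]
    return warped


if __name__ == "__main__":
    # Toy keyframe features and a flow that samples one column to the right.
    keyframe = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    flow = [[(1.0, 0.0)] * 3 for _ in range(2)]
    print(warp_features(keyframe, flow))
    print(ddim_inversion_step([1.0], [0.0], 1.0, 0.25))
```

In practice these operations would run on batched latent tensors, typically with bilinear sampling and an occlusion mask; the pure-Python lists here only keep the sketch dependency-free.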

Citation

Pages: 38-55

Number of pages: 18
