VidToMe: Video Token Merging for Zero-Shot Video Editing

Authors
Li, Xirui [1]
Ma, Chao [1]
Yang, Xiaokang [1]
Yang, Ming-Hsuan [2]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai, Peoples R China
[2] UC Merced, Merced, CA USA
DOI
10.1109/CVPR52733.2024.00715
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, rendering favorable results in temporal consistency over state-of-the-art methods.
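The core idea in the abstract, matching temporally redundant tokens across frames and merging them, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `merge_frames` and `unmerge`, the use of cosine similarity, the averaging rule, and the merge budget `r` are all assumptions made for illustration, and the sketch covers a single pair of frames only, not the intra-chunk and inter-chunk scheme the abstract describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_frames(src, dst, r):
    """Merge the r source-frame tokens that best match a destination-frame token.

    Hypothetical helper: src and dst are non-empty lists of equal-length vectors.
    Returns (kept_src, merged_dst, mapping), where mapping sends each merged
    source index to the destination index it was merged into.
    """
    # For every source token, find its most similar destination token.
    matches = []
    for i, s in enumerate(src):
        j, sim = max(((j, cosine(s, d)) for j, d in enumerate(dst)),
                     key=lambda pair: pair[1])
        matches.append((sim, i, j))

    # Keep only the r most similar pairs; these are treated as redundant.
    matches.sort(reverse=True)
    mapping = {i: j for _, i, j in matches[:r]}

    # Average each destination token with the source tokens merged into it.
    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    for i, j in mapping.items():
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1
    merged_dst = [[v / c for v in s] for s, c in zip(sums, counts)]

    # Source tokens that were not merged are passed through unchanged.
    kept_src = [(i, s) for i, s in enumerate(src) if i not in mapping]
    return kept_src, merged_dst, mapping


def unmerge(kept_src, merged_dst, mapping, n_src):
    """Restore a full source frame by copying merged tokens back from dst."""
    restored = [None] * n_src
    for i, s in kept_src:
        restored[i] = list(s)
    for i, j in mapping.items():
        restored[i] = list(merged_dst[j])
    return restored


if __name__ == "__main__":
    src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    dst = [[1.0, 0.1], [-1.0, 0.0]]
    kept, merged, mapping = merge_frames(src, dst, r=2)
    print(len(src) + len(dst), "tokens before,", len(kept) + len(merged), "after")
    print(unmerge(kept, merged, mapping, len(src)))
```

In this toy setup, self-attention would run on the reduced token set, which is where the memory saving would come from, and the merged source tokens would afterwards share the value of their destination token, which is one plausible reading of how merging encourages consistent content across frames.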
Pages: 7486-7495
Number of pages: 10