DLFormer: Discrete Latent Transformer for Video Inpainting

Citations: 20
Authors
Ren, Jingjing [1 ,2 ]
Zheng, Qingqing [3 ]
Zhao, Yuanyuan [2 ]
Xu, Xuemiao [1 ]
Li, Chen [2 ]
Affiliations
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Tencent Inc, WeChat, Shenzhen, Peoples R China
[3] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Peoples R China
DOI
10.1109/CVPR52688.2022.00350
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video inpainting, the task of filling unknown regions of video frames with plausible and coherent content, remains challenging despite the prevalence of data-driven methods. Although various transformer-based architectures yield promising results for this task, they still suffer from hallucinated blurry content and long-term spatial-temporal inconsistency. Noting the capability of discrete representations for complex reasoning and predictive learning, we propose a novel Discrete Latent Transformer (DLFormer) that reformulates video inpainting in a discrete latent space rather than the previous continuous feature space. Specifically, we first learn a compact discrete codebook and a corresponding autoencoder to represent the target video. Built upon the representative discrete codes obtained from the entire target video, the subsequent discrete latent transformer infers proper codes for unknown areas via a self-attention mechanism, and thus produces fine-grained content with long-term spatial-temporal consistency. Moreover, we explicitly enforce short-term consistency via a temporal aggregation block over adjacent frames to relieve temporal visual jitter. Comprehensive quantitative and qualitative evaluations demonstrate that our method significantly outperforms other state-of-the-art approaches in reconstructing visually plausible and spatial-temporally coherent content with fine-grained details. Code is available at https://github.com/JingjingRenabc/diformer.
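The core step the abstract describes, representing video content with a compact discrete codebook, is in VQ-VAE-style methods a nearest-neighbor lookup: each continuous encoder feature is replaced by the index of its closest codebook entry, and the transformer then reasons over these indices. The sketch below illustrates only that quantization step with NumPy; the function and array names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature vector (N, D) to the index of its
    nearest codebook entry (K, D) under squared Euclidean distance,
    as in VQ-VAE-style discrete latent representations."""
    # (N, 1, D) - (1, K, D) -> pairwise squared distances of shape (N, K)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = d.argmin(axis=1)           # one discrete code per feature
    return codes, codebook[codes]      # indices and their quantized vectors

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))     # K=8 entries of dimension D=4
# two features sitting close to codebook entries 2 and 5
features = codebook[[2, 5]] + 0.01 * rng.normal(size=(2, 4))
codes, quantized = quantize(features, codebook)
print(codes.tolist())  # -> [2, 5]
```

In the full method, a transformer would predict such codes for masked regions by attending over the visible codes of the whole video, after which the decoder maps the completed code map back to pixels.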
Pages: 3501 - 3510
Page count: 10
Related Papers
(50 total)
  • [21] Local flow propagation and global multi-scale dilated Transformer for video inpainting
    Zuo, Yuting
    Chen, Jing
    Wang, Kaixing
    Lin, Qi
    Zeng, Huanqiang
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2025, 107
  • [22] Inpainting Transformer for Anomaly Detection
    Pirnay, Jonathan
    Chai, Keng
    IMAGE ANALYSIS AND PROCESSING, ICIAP 2022, PT II, 2022, 13232 : 394 - 406
  • [23] Deep Video Inpainting
    Kim, Dahun
    Woo, Sanghyun
    Lee, Joon-Young
    Kweon, In So
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 5785 - 5794
  • [24] Stereo Video Inpainting
    Raimbault, Felix
    Kokaram, Anil
    STEREOSCOPIC DISPLAYS AND APPLICATIONS XXII, 2011, 7863
  • [25] Video-rate Video Inpainting
    Murase, Rito
    Zhang, Yan
    Okatani, Takayuki
    2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 1553 - 1561
  • [26] SwinVI: 3D Swin Transformer Model with U-net for Video Inpainting
    Zhang, Wei
    Cao, Yang
    Zhai, Junhai
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [27] FVIFormer: Flow-Guided Global-Local Aggregation Transformer Network for Video Inpainting
    Yan, Weiqing
    Sun, Yiqiu
    Yue, Guanghui
    Zhou, Wei
    Liu, Hantao
    IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2024, 14 (02) : 235 - 244
  • [28] Transformer with Convolution for Irregular Image Inpainting
    Xie, Donglin
    Wang, Lingfeng
    2022 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, COMPUTER VISION AND MACHINE LEARNING (ICICML), 2022, : 35 - 38
  • [29] Continuously Masked Transformer for Image Inpainting
    Ko, Keunsoo
    Kim, Chang-Su
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 13123 - 13132
  • [30] Depth Inpainting via Vision Transformer
    Makarov, Ilya
    Borisenko, Gleb
    2021 IEEE INTERNATIONAL SYMPOSIUM ON MIXED AND AUGMENTED REALITY ADJUNCT PROCEEDINGS (ISMAR-ADJUNCT 2021), 2021, : 286 - 291