DLFormer: Discrete Latent Transformer for Video Inpainting

Citations: 20
Authors
Ren, Jingjing [1 ,2 ]
Zheng, Qingqing [3 ]
Zhao, Yuanyuan [2 ]
Xu, Xuemiao [1 ]
Li, Chen [2 ]
Affiliations
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Tencent Inc, WeChat, Shenzhen, Peoples R China
[3] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Peoples R China
DOI
10.1109/CVPR52688.2022.00350
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video inpainting, the task of filling unknown regions of video frames with plausible and coherent content, remains challenging despite the prevalence of data-driven methods. Although various transformer-based architectures yield promising results for this task, they still suffer from hallucinated blurry content and long-term spatial-temporal inconsistency. Noting the capability of discrete representations for complex reasoning and predictive learning, we propose a novel Discrete Latent Transformer (DLFormer) that reformulates video inpainting in a discrete latent space rather than the previous continuous feature space. Specifically, we first learn a compact discrete codebook and a corresponding autoencoder to represent the target video. Built upon the representative discrete codes obtained from the entire target video, the subsequent discrete latent transformer infers proper codes for unknown areas via a self-attention mechanism, and thus produces fine-grained content with long-term spatial-temporal consistency. Moreover, we explicitly enforce short-term consistency via a temporal aggregation block over adjacent frames to relieve temporal visual jitter. Comprehensive quantitative and qualitative evaluations demonstrate that our method significantly outperforms other state-of-the-art approaches in reconstructing visually plausible and spatial-temporally coherent content with fine-grained details. Code is available at https://github.com/JingjingRenabc/diformer.
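The core step the abstract describes, representing video content with a compact discrete codebook, is in VQ-VAE-style methods a nearest-neighbor lookup: each continuous encoder feature is replaced by the index of its closest codebook entry, and the transformer then reasons over these indices. The sketch below illustrates only that quantization step with NumPy; the function and array names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature vector (N, D) to the index of its
    nearest codebook entry (K, D) under squared Euclidean distance,
    as in VQ-VAE-style discrete latent representations."""
    # (N, 1, D) - (1, K, D) -> pairwise squared distances of shape (N, K)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = d.argmin(axis=1)           # one discrete code per feature
    return codes, codebook[codes]      # indices and their quantized vectors

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))     # K=8 entries of dimension D=4
# two features sitting close to codebook entries 2 and 5
features = codebook[[2, 5]] + 0.01 * rng.normal(size=(2, 4))
codes, quantized = quantize(features, codebook)
print(codes.tolist())  # -> [2, 5]
```

In the full method, a transformer would predict such codes for masked regions by attending over the visible codes of the whole video, after which the decoder maps the completed code map back to pixels.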
Pages: 3501 - 3510
Page count: 10
Related Papers
(50 total)
  • [21] Local flow propagation and global multi-scale dilated Transformer for video inpainting
    Zuo, Yuting
    Chen, Jing
    Wang, Kaixing
    Lin, Qi
    Zeng, Huanqiang
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2025, 107
  • [22] Inpainting Transformer for Anomaly Detection
    Pirnay, Jonathan
    Chai, Keng
    IMAGE ANALYSIS AND PROCESSING, ICIAP 2022, PT II, 2022, 13232 : 394 - 406
  • [23] Deep Video Inpainting
    Kim, Dahun
    Woo, Sanghyun
    Lee, Joon-Young
    Kweon, In So
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 5785 - 5794
  • [24] Stereo Video Inpainting
    Raimbault, Felix
    Kokaram, Anil
    STEREOSCOPIC DISPLAYS AND APPLICATIONS XXII, 2011, 7863
  • [25] Video-rate Video Inpainting
    Murase, Rito
    Zhang, Yan
    Okatani, Takayuki
    2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 1553 - 1561
  • [26] SwinVI: 3D Swin Transformer Model with U-net for Video Inpainting
    Zhang, Wei
    Cao, Yang
    Zhai, Junhai
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [27] FVIFormer: Flow-Guided Global-Local Aggregation Transformer Network for Video Inpainting
    Yan, Weiqing
    Sun, Yiqiu
    Yue, Guanghui
    Zhou, Wei
    Liu, Hantao
    IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2024, 14 (02) : 235 - 244
  • [28] Transformer with Convolution for Irregular Image Inpainting
    Xie, Donglin
    Wang, Lingfeng
    2022 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, COMPUTER VISION AND MACHINE LEARNING (ICICML), 2022, : 35 - 38
  • [29] Continuously Masked Transformer for Image Inpainting
    Ko, Keunsoo
    Kim, Chang-Su
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 13123 - 13132
  • [30] Depth Inpainting via Vision Transformer
    Makarov, Ilya
    Borisenko, Gleb
    2021 IEEE INTERNATIONAL SYMPOSIUM ON MIXED AND AUGMENTED REALITY ADJUNCT PROCEEDINGS (ISMAR-ADJUNCT 2021), 2021, : 286 - 291