DLFormer: Discrete Latent Transformer for Video Inpainting

Cited by: 20
Authors:
Ren, Jingjing [1 ,2 ]
Zheng, Qingqing [3 ]
Zhao, Yuanyuan [2 ]
Xu, Xuemiao [1 ]
Li, Chen [2 ]
Affiliations:
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Tencent Inc, WeChat, Shenzhen, Peoples R China
[3] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Peoples R China
DOI:
10.1109/CVPR52688.2022.00350
CLC number: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Despite the prevalence of data-driven methods, video inpainting, i.e., filling the unknown areas of video frames with plausible and coherent content, remains a challenging problem. Although various transformer-based architectures yield promising results for this task, they still suffer from hallucinated blurry content and long-term spatial-temporal inconsistency. Noting the capability of discrete representations for complex reasoning and predictive learning, we propose a novel Discrete Latent Transformer (DLFormer) that reformulates video inpainting in a discrete latent space rather than the continuous feature space used by previous methods. Specifically, we first learn a compact discrete codebook and the corresponding autoencoder to represent the target video. Built upon the representative discrete codes obtained from the entire target video, the subsequent discrete latent transformer infers proper codes for unknown areas via a self-attention mechanism, and thus produces fine-grained content with long-term spatial-temporal consistency. Moreover, we explicitly enforce short-term consistency to relieve temporal visual jitter via a temporal aggregation block over adjacent frames. Comprehensive quantitative and qualitative evaluations demonstrate that our method significantly outperforms other state-of-the-art approaches in reconstructing visually plausible and spatial-temporally coherent content with fine-grained details. Code is available at https://github.com/JingjingRenabc/diformer.
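For a concrete picture of the pipeline the abstract describes, below is a minimal sketch in PyTorch, assuming a VQ-VAE-style codebook: every class name, size, and the [MASK]-token scheme here is a hypothetical illustration, not the authors' released implementation (see the linked repository for that).

    # Minimal sketch (hypothetical names/sizes): a discrete codebook lookup plus
    # a transformer that predicts codebook indices for masked (unknown) positions.
    import torch
    import torch.nn as nn

    class Quantizer(nn.Module):
        """Map encoder features to their nearest entries in a learned codebook."""
        def __init__(self, num_codes=512, dim=256):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, feats):  # feats: (B, N, dim) continuous features
            # Nearest-neighbour lookup (straight-through training trick omitted).
            dists = torch.cdist(feats, self.codebook.weight.expand(feats.size(0), -1, -1))
            idx = dists.argmin(dim=-1)            # (B, N) discrete code indices
            return self.codebook(idx), idx

    class CodeTransformer(nn.Module):
        """Infer codebook indices for unknown positions via self-attention."""
        def __init__(self, num_codes=512, dim=256, depth=6, heads=8):
            super().__init__()
            self.tok = nn.Embedding(num_codes + 1, dim)  # extra id = [MASK] token
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(dim, num_codes)

        def forward(self, idx, mask):  # idx: (B, N) long, mask: (B, N) bool
            inp = idx.masked_fill(mask, self.head.out_features)  # insert [MASK]
            logits = self.head(self.encoder(self.tok(inp)))      # (B, N, num_codes)
            # Keep known codes; take argmax predictions where the region is unknown.
            return torch.where(mask, logits.argmax(dim=-1), idx)

    # Usage: quantize features of all frames, then complete the masked region.
    quant, tform = Quantizer(), CodeTransformer()
    feats = torch.randn(1, 16, 256)                # stand-in for encoder features
    _, idx = quant(feats)
    mask = torch.zeros(1, 16, dtype=torch.bool)
    mask[:, 4:8] = True                            # mark an unknown region
    completed = tform(idx, mask)                   # codes ready for the decoder

In this sketch the completed code indices would be mapped back to pixels by the autoencoder's decoder; positional embeddings and the temporal aggregation block that enforces short-term consistency are omitted for brevity.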
Pages: 3501-3510
Page count: 10
Related papers (50 total):
  • [41] Chen, Bo-Wei; Liu, Tsung-Jung; Liu, Kuan-Hsien. Image Inpainting by MSCSwin Transformer Adversarial Autoencoder. 2023 IEEE International Conference on Image Processing (ICIP), 2023: 2040-2044.
  • [42] Lin, Jiayu; Wang, Yuan-gen. TSFormer: Tracking Structure Transformer for Image Inpainting. ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20(12).
  • [43] Liang, Huining; Kambhamettu, Chandra. Edge-Guided Image Inpainting with Transformer. Advances in Visual Computing (ISVC 2023), Part II, 2023, 14362: 285-296.
  • [44] Deng, Ye; Hui, Siqi; Zhou, Sanping; Meng, Deyu; Wang, Jinjun. Learning Contextual Transformer Network for Image Inpainting. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 2529-2538.
  • [45] Chang, Rong-Chi; Tang, Nick C.; Chao, Chia Cheng. Application of Inpainting Technology to Video Restoration. 2008 First IEEE International Conference on Ubi-Media Computing and Workshops, Proceedings, 2008: 359-364.
  • [46] Jia, Lili; Tao, Junjie; You, Ying. Character Superimposition Inpainting in Surveillance Video. International Conference on Optoelectronics and Microelectronics Technology and Application, 2017, 10244.
  • [47] March, Riccardo; Riey, Giuseppe. Properties of a Variational Model for Video Inpainting. Networks & Spatial Economics, 2022, 22(2): 315-326.
  • [48] Lou, Zijie; Cao, Gang; Lin, Man. Video Inpainting Localization With Contrastive Learning. IEEE Signal Processing Letters, 2025, 32: 611-615.
  • [49] Le, Thuc Trinh; Almansa, Andres; Gousseau, Yann; Masnou, Simon. Motion-Consistent Video Inpainting. 2017 24th IEEE International Conference on Image Processing (ICIP), 2017: 2094-2098.
  • [50] Tsai, Joseph C.; Shih, Timothy K.; Wattanachote, Kanoksak; Li, Kuan-Ching. Video Editing Using Motion Inpainting. 2012 IEEE 26th International Conference on Advanced Information Networking and Applications (AINA), 2012: 649-654.