Neural Video Compression with Spatio-Temporal Cross-Covariance Transformers

Cited by: 3

Authors:

Chen, Zhenghao ^{[1,3]}

Relic, Lucas ^{[2]}

Azevedo, Roberto ^{[3]}

Zhang, Yang ^{[3]}

Gross, Markus ^{[2]}

Xu, Dong ^{[4]}

Zhou, Luping ^{[1]}

Schroers, Christopher ^{[3]}

Affiliations:

[1] Univ Sydney, Sydney, NSW, Australia

[2] Swiss Fed Inst Technol, Zurich, Switzerland

[3] DisneyRes Studios, Zurich, Switzerland

[4] Univ Hong Kong, Hong Kong, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Keywords:

Video compression; neural network; transformer

DOI:

10.1145/3581783.3611960

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Although existing neural video compression (NVC) methods have achieved significant success, most of them focus on improving either temporal or spatial information separately. They generally use simple operations such as concatenation or subtraction to utilize this information, and such operations only partially exploit spatio-temporal redundancies. This work aims to effectively and jointly leverage robust temporal and spatial information by proposing a new 3D-based transformer module: the Spatio-Temporal Cross-Covariance Transformer (ST-XCT). The ST-XCT module combines two individually extracted features into a joint spatio-temporal feature, followed by 3D convolutional operations and a novel spatio-temporal-aware cross-covariance attention mechanism. Unlike conventional transformers, the cross-covariance attention mechanism is applied across the feature channels without breaking down the spatio-temporal features into local tokens. This design allows for modeling global cross-channel correlations of the spatio-temporal context while lowering the computational requirement. Based on ST-XCT, we introduce a novel transformer-based end-to-end optimized NVC framework. ST-XCT-based modules are integrated into various key coding components of NVC, such as feature extraction, frame reconstruction, and entropy modeling, demonstrating the module's generalizability. Extensive experiments show that our ST-XCT-based NVC proposal achieves state-of-the-art compression performance on various standard video benchmark datasets.
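
The abstract's central idea, attention computed across feature channels rather than across spatial tokens, can be illustrated with a minimal NumPy sketch. This is a generic cross-covariance attention in the style of earlier cross-covariance transformers, not the authors' ST-XCT module: the function name, the projection matrices `w_q`, `w_k`, `w_v`, the temperature `tau`, and the (T, H, W, C) layout are all assumptions made for illustration, and the 3D convolutions and feature fusion described in the abstract are omitted.

```python
import numpy as np

def cross_covariance_attention(x, w_q, w_k, w_v, tau=1.0):
    """Hypothetical sketch: channel-wise attention over a (T, H, W, C) feature."""
    t, h, w, c = x.shape
    tokens = x.reshape(t * h * w, c)            # flatten space-time: (N, C)

    # Project, then put channels first so each row is one channel: (C, N)
    q = (tokens @ w_q).T
    k = (tokens @ w_k).T
    v = (tokens @ w_v).T

    # L2-normalise each channel over all space-time positions
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)

    # Channel-to-channel attention map: (C, C), independent of N
    logits = (q @ k.T) / tau
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over last axis

    out = attn @ v                                 # mix channels: (C, N)
    return out.T.reshape(t, h, w, c), attn

# Usage with random data (shapes chosen arbitrarily for the example)
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 4, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = cross_covariance_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

In this sketch the attention map has size C x C regardless of how many space-time positions N = T*H*W the feature contains, whereas token-based self-attention would build an N x N map. That is one way to read the abstract's claim about modeling global cross-channel correlations at a lower computational cost.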


Pages: 8543-8551

Page count: 9
