Neural Video Compression with Spatio-Temporal Cross-Covariance Transformers

Cited by: 3
Authors
Chen, Zhenghao [1 ,3 ]
Relic, Lucas [2 ]
Azevedo, Roberto [3 ]
Zhang, Yang [3 ]
Gross, Markus [2 ]
Xu, Dong [4 ]
Zhou, Luping [1 ]
Schroers, Christopher [3 ]
Affiliations
[1] Univ Sydney, Sydney, NSW, Australia
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] DisneyRes Studios, Zurich, Switzerland
[4] Univ Hong Kong, Hong Kong, Peoples R China
Keywords
Video compression; neural network; transformer
DOI
10.1145/3581783.3611960
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Although existing neural video compression (NVC) methods have achieved significant success, most of them focus on improving either temporal or spatial information separately. They generally use simple operations such as concatenation or subtraction to utilize this information, and such operations only partially exploit spatio-temporal redundancies. This work aims to leverage robust temporal and spatial information jointly and effectively by proposing a new 3D-based transformer module: the Spatio-Temporal Cross-Covariance Transformer (ST-XCT). The ST-XCT module combines two individually extracted features into a joint spatio-temporal feature, followed by 3D convolutional operations and a novel spatio-temporal-aware cross-covariance attention mechanism. Unlike conventional transformers, the cross-covariance attention mechanism is applied across the feature channels without breaking down the spatio-temporal features into local tokens. This design allows for modeling global cross-channel correlations of the spatio-temporal context while lowering the computational requirement. Based on ST-XCT, we introduce a novel transformer-based end-to-end optimized NVC framework. ST-XCT-based modules are integrated into various key coding components of NVC, such as feature extraction, frame reconstruction, and entropy modeling, demonstrating the generalizability of the module. Extensive experiments show that our ST-XCT-based NVC proposal achieves state-of-the-art compression performance on various standard video benchmark datasets.
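The channel-wise attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single-head, XCiT-style cross-covariance attention, and the function name `cross_covariance_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the `temperature` parameter, and the tensor shapes are all hypothetical choices made for illustration. The feature fusion and 3D convolutions of ST-XCT are not reproduced.

```python
import numpy as np

def cross_covariance_attention(x, w_q, w_k, w_v, temperature=1.0):
    """Single-head cross-covariance attention (illustrative sketch).

    x: (N, C) array, a joint spatio-temporal feature flattened over
       N = T * H * W positions with C channels.
    Returns the attended feature (N, C) and the (C, C) attention map.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # each (N, C)
    # L2-normalise every channel over the N positions (XCiT-style assumption).
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-6)
    # Attention is computed between channels, so the map is C x C
    # regardless of how many spatio-temporal positions N there are.
    logits = (k.T @ q) / temperature                  # (C, C)
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)     # each column sums to 1
    return v @ attn, attn                             # (N, C), (C, C)

# Toy joint feature: T=2 frames of 4x4 positions, C=8 channels (made-up sizes).
rng = np.random.default_rng(0)
T, H, W, C = 2, 4, 4, 8
feature = rng.standard_normal((T, H, W, C))
x = feature.reshape(T * H * W, C)
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
out, attn = cross_covariance_attention(x, w_q, w_k, w_v)
```

Because the attention map has size C x C rather than N x N, its cost grows linearly with the number of spatio-temporal positions, which matches the abstract's point about avoiding local tokens while keeping the computation low.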
Pages: 8543-8551
Number of pages: 9