Neural Video Compression with Spatio-Temporal Cross-Covariance Transformers

被引:3
|
作者
Chen, Zhenghao [1 ,3 ]
Relic, Lucas [2 ]
Azevedo, Roberto [3 ]
Zhang, Yang [3 ]
Gross, Markus [2 ]
Xu, Dong [4 ]
Zhou, Luping [1 ]
Schroers, Christopher [3 ]
机构
[1] Univ Sydney, Sydney, NSW, Australia
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] DisneyRes Studios, Zurich, Switzerland
[4] Univ Hong Kong, Hong Kong, Peoples R China
关键词
Video compression; neural network; transformer;
D O I
10.1145/3581783.3611960
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Although existing neural video compression (NVC) methods have achieved significant success, most of them focus on improving either temporal or spatial information separately. They generally use simple operations such as concatenation or subtraction to utilize this information, while such operations only partially exploit spatio-temporal redundancies. This work aims to effectively and jointly leverage robust temporal and spatial information by proposing a new 3D-based transformer module: Spatio-Temporal Cross-Covariance Transformer (ST-XCT). The ST-XCT module combines two individual extracted features into a joint spatio-temporal feature, followed by 3D convolutional operations and a novel spatio-temporal-aware cross-covariance attention mechanism. Unlike conventional transformers, the cross-covariance attention mechanism is applied across the feature channels without breaking down the spatio-temporal features into local tokens. Such design allows for modeling global cross-channel correlations of the spatio-temporal context while lowering the computational requirement. Based on ST-XCT, we introduce a novel transformer-based end-to-end optimized NVC framework. ST-XCT-based modules are integrated into various key coding components of NVC, such as feature extraction, frame reconstruction, and entropy modeling, demonstrating its generalizability. Extensive experiments show that our ST-XCT-based NVC proposal achieves state-of-the-art compression performances on various standard video benchmark datasets.
引用
收藏
页码:8543 / 8551
页数:9
相关论文
共 50 条
  • [41] Video Segmentation with Spatio-Temporal Tubes
    Trichet, Remi
    Nevatia, Ramakant
    2013 10TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS 2013), 2013, : 330 - 335
  • [42] Spatio-temporal segmentation for video surveillance
    Sun, HZ
    Tan, TN
    ELECTRONICS LETTERS, 2001, 37 (01) : 20 - 21
  • [43] Spatio-temporal segmentation for video surveillance
    Sun, HZ
    Feng, T
    Tan, TN
    15TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS: COMPUTER VISION AND IMAGE ANALYSIS, 2000, : 843 - 846
  • [44] VideoZoom Spatio-Temporal Video Browser
    Smith, John R.
    IEEE TRANSACTIONS ON MULTIMEDIA, 1999, 1 (02) : 157 - 171
  • [45] Spatio-temporal video contrast enhancement
    Celik, Turgay
    IET IMAGE PROCESSING, 2013, 7 (06) : 543 - 555
  • [46] Spatio-Temporal Perturbations for Video Attribution
    Li, Zhenqiang
    Wang, Weimin
    Li, Zuoyue
    Huang, Yifei
    Sato, Yoichi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (04) : 2043 - 2056
  • [47] Spatio-temporal querying in video databases
    Köprülü, M
    Çiçekli, NK
    Yazici, A
    FLEXIBLE QUERY ANSWERING SYSTEMS, PROCEEDINGS, 2002, 2522 : 251 - 262
  • [48] Spatio-temporal querying in video databases
    Koprulu, M
    Cicekli, NK
    Yazici, A
    INFORMATION SCIENCES, 2004, 160 (1-4) : 131 - 152
  • [49] Spatio-Temporal Covariance Descriptors for Action and Gesture Recognition
    Sanin, Andres
    Sanderson, Conrad
    Harandi, Mehrtash T.
    Lovell, Brian C.
    2013 IEEE WORKSHOP ON APPLICATIONS OF COMPUTER VISION (WACV), 2013, : 103 - 110
  • [50] Recent developments on the construction of spatio-temporal covariance models
    Ma, Chunsheng
    STOCHASTIC ENVIRONMENTAL RESEARCH AND RISK ASSESSMENT, 2008, 22 (Suppl 1) : S39 - S47