Temporally Efficient Vision Transformer for Video Instance Segmentation

Cited by: 38
Authors
Yang, Shusheng [1, 3]
Wang, Xinggang [1]
Li, Yu [3, 4]
Fang, Yuxin [1]
Fang, Jiemin [1, 2]
Liu, Wenyu [1]
Zhao, Xun [3]
Shan, Ying [3]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch EIC, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Inst Artificial Intelligence, Wuhan, Hubei, Peoples R China
[3] Tencent PCG, Appl Res Ctr ARC, London, England
[4] Int Digital Econ Acad IDEA, Shenzhen, Peoples R China
DOI
10.1109/CVPR52688.2022.00290
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vision transformers have recently achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Unlike previous transformer-based VIS methods, TeViT is nearly convolution-free, consisting of a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build the one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
Pages: 2875-2885
Page count: 11
Related papers
50 in total
  • [41] MaskRNN: Instance Level Video Object Segmentation
    Hu, Yuan-Ting
    Huang, Jia-Bin
    Schwing, Alexander G.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [42] DVIS: Decoupled Video Instance Segmentation Framework
    Zhang, Tao
    Tian, Xingye
    Wu, Yu
    Ji, Shunping
    Wang, Xuebo
    Zhang, Yuan
    Wan, Pengfei
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 1282 - 1291
  • [43] INSTANCE SEGMENTATION OF LIDAR DATA WITH VISION TRANSFORMER MODEL IN SUPPORT INUNDATION MAPPING UNDER FOREST CANOPY ENVIRONMENT
    Yang, Jian
    El Mendili, Lamiae
    Khayer, Yasmin
    McArdle, Steven
    Beni, Leila Hashemi
    GEOSPATIAL WEEK 2023, VOL. 48-1, 2023, : 203 - 208
  • [44] AuxAdapt: Stable and Efficient Test-Time Adaptation for Temporally Consistent Video Semantic Segmentation
    Zhang, Yizhe
    Borse, Shubhankar
    Cai, Hong
    Porikli, Fatih
    2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 2633 - 2642
  • [45] Video Summarization With Spatiotemporal Vision Transformer
    Hsu, Tzu-Chun
    Liao, Yi-Sheng
    Huang, Chun-Rong
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3013 - 3026
  • [46] Radar Instance Transformer: Reliable Moving Instance Segmentation in Sparse Radar Point Clouds
    Zeller, Matthias
    Sandhu, Vardeep S.
    Mersch, Benedikt
    Behley, Jens
    Heidingsfeld, Michael
    Stachniss, Cyrill
    IEEE TRANSACTIONS ON ROBOTICS, 2024, 40 : 2357 - 2372
  • [47] Spatial-channel transformer network based on mask-RCNN for efficient mushroom instance segmentation
    Wang, Jiaoling
    Song, Weidong
    Zheng, Wengang
    Feng, Qingchun
    Wang, Mingfei
    Zhao, Chunjiang
    INTERNATIONAL JOURNAL OF AGRICULTURAL AND BIOLOGICAL ENGINEERING, 2024, 17 (04) : 227 - 235
  • [48] Hybrid Instance-Aware Temporal Fusion for Online Video Instance Segmentation
    Li, Xiang
    Wang, Jinglu
    Li, Xiao
    Lu, Yan
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1429 - 1437
  • [49] TEMPORALLY CONSISTENT VIDEO MATTING BASED ON BILAYER SEGMENTATION
    Tang, Zhen
    Miao, Zhenjiang
    Wan, Yanli
    2010 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME 2010), 2010, : 370 - 375
  • [50] Video Mask Transfiner for High-Quality Video Instance Segmentation
    Ke, Lei
    Ding, Henghui
    Danelljan, Martin
    Tai, Yu-Wing
    Tang, Chi-Keung
    Yu, Fisher
    COMPUTER VISION - ECCV 2022, PT XXVIII, 2022, 13688 : 731 - 747