Temporally Efficient Vision Transformer for Video Instance Segmentation

Cited by: 38
Authors
Yang, Shusheng [1, 3]
Wang, Xinggang [1]
Li, Yu [3, 4]
Fang, Yuxin [1]
Fang, Jiemin [1, 2]
Liu, Wenyu [1]
Zhao, Xun [3]
Shan, Ying [3]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch EIC, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Inst Artificial Intelligence, Wuhan, Hubei, Peoples R China
[3] Tencent PCG, Appl Res Ctr ARC, London, England
[4] Int Digital Econ Acad IDEA, Shenzhen, Peoples R China
DOI
10.1109/CVPR52688.2022.00290
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To model the crucial temporal information within a video clip effectively and efficiently, we propose the Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Unlike previous transformer-based VIS methods, TeViT is nearly convolution-free: it consists of a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build a one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results while maintaining high inference speed, e.g., 46.6 AP at 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
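The abstract names the messenger shift mechanism without describing it. One way to picture such a mechanism is the sketch below, which assumes that each frame carries a small set of extra "messenger" tokens and that a fraction of them is exchanged with neighboring frames by a cyclic shift along the time axis. The function name, the `shift_ratio` parameter, the tensor layout, and the cyclic wrap-around are illustrative assumptions, not details taken from the TeViT code.

```python
import numpy as np


def shift_messengers(messengers: np.ndarray, shift_ratio: float = 0.25) -> np.ndarray:
    """Exchange messenger tokens between neighboring frames (hypothetical sketch).

    messengers: array of shape (T, M, C) = (frames, messenger tokens, channels).
    shift_ratio: assumed fraction of tokens moved in each temporal direction.
    """
    _, num_tokens, _ = messengers.shape
    group = int(num_tokens * shift_ratio)
    shifted = messengers.copy()
    # First group: frame t receives the tokens of frame t-1 (cyclic wrap is an assumption).
    shifted[:, :group] = np.roll(messengers[:, :group], shift=1, axis=0)
    # Second group: frame t receives the tokens of frame t+1.
    shifted[:, group:2 * group] = np.roll(messengers[:, group:2 * group], shift=-1, axis=0)
    # Remaining tokens stay with their own frame; no learnable parameters are involved.
    return shifted


# Toy input: 3 frames, 4 messenger tokens, 1 channel; value = 10 * frame + token index.
frames = (np.arange(3)[:, None, None] * 10 + np.arange(4)[None, :, None]).astype(float)
mixed = shift_messengers(frames, shift_ratio=0.25)
print(mixed[:, :, 0])
```

Because the operation only rearranges existing tokens, it adds no weights, which is consistent with the "nearly parameter-free" description in the abstract.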
Pages: 2875-2885
Number of pages: 11