Temporally Efficient Vision Transformer for Video Instance Segmentation

Cited by: 38
Authors
Yang, Shusheng [1, 3]
Wang, Xinggang [1]
Li, Yu [3, 4]
Fang, Yuxin [1]
Fang, Jiemin [1, 2]
Liu, Wenyu [1]
Zhao, Xun [3]
Shan, Ying [3]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch EIC, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Inst Artificial Intelligence, Wuhan, Hubei, Peoples R China
[3] Tencent PCG, Appl Res Ctr ARC, London, England
[4] Int Digital Econ Acad IDEA, Shenzhen, Peoples R China
DOI
10.1109/CVPR52688.2022.00290
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recently, the vision transformer has achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Different from previous transformer-based VIS methods, TeViT is nearly convolution-free and contains a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build the one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
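
The abstract describes the messenger shift only at a high level, so the following is a minimal sketch of one way such a parameter-free temporal shift could look, not the authors' implementation (see the linked repository for that). It assumes each frame carries a small set of messenger tokens, that these tokens are split into equal groups, and that each group is cyclically rolled along the time axis by its own offset. The function name `messenger_shift`, the grouping scheme, and the cyclic wrap-around are all illustrative assumptions.

```python
from typing import List

Token = List[float]


def messenger_shift(
    messengers: List[List[Token]], offsets: List[int]
) -> List[List[Token]]:
    """Roll groups of messenger tokens along the temporal axis.

    messengers[t][m] is messenger token m of frame t.
    offsets[g] is the number of frames by which group g is shifted.
    Hypothetical sketch: no learned weights are involved, and frames
    wrap around cyclically (an assumption, not taken from the paper).
    """
    num_frames = len(messengers)
    num_tokens = len(messengers[0])
    num_groups = len(offsets)
    if num_tokens % num_groups != 0:
        raise ValueError("token count must be divisible by group count")
    group_size = num_tokens // num_groups

    shifted: List[List[Token]] = []
    for t in range(num_frames):
        frame_tokens: List[Token] = []
        for m in range(num_tokens):
            offset = offsets[m // group_size]
            # Frame t receives token m from the frame `offset` steps earlier.
            source_frame = (t - offset) % num_frames
            frame_tokens.append(list(messengers[source_frame][m]))
        shifted.append(frame_tokens)
    return shifted


# Three frames, two one-dimensional messenger tokens per frame.
# Each token's value is the index of the frame it started in.
frames = [[[float(t)], [float(t)]] for t in range(3)]
mixed = messenger_shift(frames, offsets=[0, 1])
```

After the call, the first token of every frame is unchanged while the second token of each frame originates from the previous frame, so subsequent per-frame attention layers can see information from a neighboring frame at no extra parameter cost.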
Pages: 2875-2885
Number of pages: 11