Temporally Efficient Vision Transformer for Video Instance Segmentation

Cited by: 38
Authors
Yang, Shusheng [1 ,3 ]
Wang, Xinggang [1 ]
Li, Yu [3 ,4 ]
Fang, Yuxin [1 ]
Fang, Jiemin [1 ,2 ]
Liu, Wenyu [1 ]
Zhao, Xun [3 ]
Shan, Ying [3 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch EIC, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Inst Artificial Intelligence, Wuhan, Hubei, Peoples R China
[3] Tencent PCG, Appl Res Ctr ARC, London, England
[4] Int Digital Econ Acad IDEA, Shenzhen, Peoples R China
DOI
10.1109/CVPR52688.2022.00290
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose the Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Unlike previous transformer-based VIS methods, TeViT is nearly convolution-free: it consists of a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build a one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
Pages: 2875-2885
Page count: 11