TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

Cited: 57
Authors
Zhou, Qianyu [1 ]
Li, Xiangtai [2 ]
He, Lu [1 ]
Yang, Yibo [3 ]
Cheng, Guangliang [4 ]
Tong, Yunhai [2 ]
Ma, Lizhuang [1 ]
Tao, Dacheng [3 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Peking Univ, Sch Artificial Intelligence, Beijing 100871, Peoples R China
[3] JD Explore Acad, Beijing 100176, Peoples R China
[4] SenseTime Res, Beijing 100080, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Object detection; Pipelines; Detectors; Streaming media; Fuses; Task analysis; Video object detection; vision transformers; scene understanding; video understanding;
DOI
10.1109/TPAMI.2022.3223955
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating performance comparable to previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the current VOD pipeline, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. Moreover, benefiting from the object query design in DETR, our method does not need post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results of the current frame. These designs boost the strong deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset, and TransVOD yields competitive performance on this benchmark. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experimental section. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed-accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU. Code and models are available at https://github.com/SJTU-LuHe/TransVOD.
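To make the TQE idea concrete, the sketch below shows one plausible way to fuse DETR-style object queries across frames with self-attention and read off the current frame's fused queries. It is a minimal illustration assuming per-frame queries of shape (frames, queries, dim); the module name, layer counts, and dimensions are hypothetical and are not the authors' implementation (see the repository above for that).

```python
import torch
import torch.nn as nn

class TemporalQueryFusion(nn.Module):
    """Hypothetical TQE-style sketch: object queries from a window of
    frames attend to each other, and the current (last) frame's fused
    queries are returned. Dimensions and depth are illustrative."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, queries_per_frame: torch.Tensor) -> torch.Tensor:
        # queries_per_frame: (batch, n_frames, n_queries, d_model)
        b, t, q, d = queries_per_frame.shape
        # Flatten the temporal axis so attention can mix queries across frames.
        fused = self.encoder(queries_per_frame.reshape(b, t * q, d))
        # Keep only the slice corresponding to the current (last) frame.
        return fused.reshape(b, t, q, d)[:, -1]

# Toy usage: 2 clips, 4 frames each, 100 object queries of width 256.
queries = torch.randn(2, 4, 100, 256)
current = TemporalQueryFusion()(queries)
print(current.shape)  # torch.Size([2, 100, 256])
```

In the paper's design, such fused queries would then feed a temporal decoder (the TDTD) that cross-attends to aggregated frame memories to produce the current-frame detections.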
Pages: 7853-7869
Number of pages: 17
Related Papers
50 records in total
  • [1] End-to-End Video Object Detection with Spatial-Temporal Transformers
    He, Lu
    Zhou, Qianyu
    Li, Xiangtai
    Niu, Li
    Cheng, Guangliang
    Li, Xiao
    Liu, Wenxuan
    Tong, Yunhai
    Ma, Lizhuang
    Zhang, Liqing
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1507 - 1516
  • [2] LEARNING-BASED END-TO-END VIDEO COMPRESSION WITH SPATIAL-TEMPORAL ADAPTATION
    Zhang, Zhaobin
    Li, Yue
    Zhang, Kai
    Zhang, Li
    He, Yuwen
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2821 - 2825
  • [3] End-to-End Referring Video Object Segmentation with Multimodal Transformers
    Botach, Adam
    Zheltonozhskii, Evgenii
    Baskin, Chaim
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 4975 - 4985
  • [4] End-to-End Video Instance Segmentation via Spatial-Temporal Graph Neural Networks
    Wang, Tao
    Xu, Ning
    Chen, Kean
    Lin, Weiyao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 10777 - 10786
  • [5] Building an End-to-End Spatial-Temporal Convolutional Network for Video Super-Resolution
    Guo, Jun
    Chao, Hongyang
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 4053 - 4060
  • [6] Deeply Tensor Compressed Transformers for End-to-End Object Detection
    Zhen, Peining
    Gao, Ziyang
    Hou, Tianshu
    Cheng, Yuan
    Chen, Hai-Bao
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 4716 - 4724
  • [7] VRDFormer: End-to-End Video Visual Relation Detection with Transformers
    Zheng, Sipeng
    Chen, Shizhe
    Jin, Qin
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18814 - 18824
  • [8] End-to-End Spatio-Temporal Action Localisation with Video Transformers
    Gritsenko, Alexey A.
    Xiong, Xuehan
    Djolonga, Josip
    Dehghani, Mostafa
    Sun, Chen
    Lucic, Mario
    Schmid, Cordelia
    Arnab, Anurag
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 18373 - 18383
  • [9] Spatial-temporal transformer for end-to-end sign language recognition
    Cui, Zhenchao
    Zhang, Wenbo
    Li, Zhaoxin
    Wang, Zhaoqi
    COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (04) : 4645 - 4656
  • [10] Study and Generalization on an End-to-End Spatial-temporal Driving Model
    Yao, Tingting
    Chen, Xin
    Yuan, Sheng
    Wang, Huaying
    Guo, Lili
    Tian, Bin
    Ai, Yunfeng
    2019 CHINESE AUTOMATION CONGRESS (CAC2019), 2019, : 4755 - 4760