TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

Cited by: 57
Authors
Zhou, Qianyu [1]
Li, Xiangtai [2]
He, Lu [1]
Yang, Yibo [3]
Cheng, Guangliang [4]
Tong, Yunhai [2]
Ma, Lizhuang [1]
Tao, Dacheng [3]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Peking Univ, Sch Artificial Intelligence, Beijing 100871, Peoples R China
[3] JD Explore Acad, Beijing 100176, Peoples R China
[4] SenseTime Res, Beijing 100080, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Object detection; Pipelines; Detectors; Streaming media; Fuses; Task analysis; Video object detection; vision transformers; scene understanding; video understanding;
DOI
10.1109/TPAMI.2022.3223955
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as earlier, more complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the current VOD pipeline, removing the need for many hand-crafted feature aggregation components such as optical flow models and relation networks. In addition, thanks to the object query design in DETR, our method needs no post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer that aggregates both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results for the current frame. These designs improve the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models an entire video clip as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state-of-the-art record in accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed-accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU.
Code and models are available at https://github.com/SJTU-LuHe/TransVOD.
Pages: 7853-7869
Page count: 17