TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

Cited by: 57
|
Authors
Zhou, Qianyu [1 ]
Li, Xiangtai [2 ]
He, Lu [1 ]
Yang, Yibo [3 ]
Cheng, Guangliang [4 ]
Tong, Yunhai [2 ]
Ma, Lizhuang [1 ]
Tao, Dacheng [3 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Peking Univ, Sch Artificial Intelligence, Beijing 100871, Peoples R China
[3] JD Explore Acad, Beijing 100176, Peoples R China
[4] SenseTime Res, Beijing 100080, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Object detection; Pipelines; Detectors; Streaming media; Fuses; Task analysis; Video object detection; vision transformers; scene understanding; video understanding;
DOI
10.1109/TPAMI.2022.3223955
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Detection Transformer (DETR) and Deformable DETR eliminate the need for many hand-designed components in object detection while matching the performance of previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the current VOD pipeline, removing the need for many hand-crafted feature-aggregation components, e.g., optical flow models and relation networks. Moreover, benefiting from the object query design in DETR, our method does not need post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer that aggregates both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the current-frame detection results. These designs boost the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset, and TransVOD yields comparable performance on the ImageNet VID benchmark. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state-of-the-art accuracy record on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed-accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single V100 GPU.
Code and models are available at https://github.com/SJTU-LuHe/TransVOD.
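The abstract's core idea, fusing per-frame object queries with a temporal attention module, can be sketched in a heavily simplified form. The sketch below is an illustrative stand-in for the paper's Temporal Query Encoder (TQE), not the authors' implementation: it uses plain single-head scaled dot-product attention in NumPy, where the current frame's queries attend to the queries of all frames in the clip; the function name, shapes, and residual fusion are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_query_fusion(frame_queries):
    """Fuse object queries across frames (simplified TQE stand-in).

    frame_queries: list of T arrays, each (num_queries, d);
    the last entry is taken as the current frame.
    Returns fused queries of shape (num_queries, d).
    """
    cur = frame_queries[-1]                       # current-frame queries
    mem = np.concatenate(frame_queries, axis=0)   # (T * num_queries, d)
    d = cur.shape[-1]
    attn = softmax(cur @ mem.T / np.sqrt(d))      # attend over all frames
    return cur + attn @ mem                       # residual fusion

rng = np.random.default_rng(0)
T, num_queries, d = 4, 100, 256
queries = [rng.standard_normal((num_queries, d)) for _ in range(T)]
fused = temporal_query_fusion(queries)
print(fused.shape)  # (100, 256)
```

In the actual model, the fused queries would then be fed to a decoder (the paper's TDTD) over the current frame's features to produce detections; that stage is omitted here.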
Pages: 7853-7869
Page count: 17
Related Papers
50 records in total
  • [41] A Generative Appearance Model for End-to-end Video Object Segmentation
    Johnander, Joakim
    Danelljan, Martin
    Brissman, Emil
    Khan, Fahad Shahbaz
    Felsberg, Michael
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 8945 - 8954
  • [42] RVOS: End-to-End Recurrent Network for Video Object Segmentation
    Ventura, Carles
    Bellver, Miriam
    Girbau, Andreu
    Salvador, Amaia
    Marques, Ferran
    Giro-i-Nieto, Xavier
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 5272 - 5281
  • [43] End-to-End Object Detection with Fully Convolutional Network
    Wang, Jianfeng
    Song, Lin
    Li, Zeming
    Sun, Hongbin
    Sun, Jian
    Zheng, Nanning
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 15844 - 15853
  • [44] SRDD: a lightweight end-to-end object detection with transformer
    Zhu, Yuan
    Xia, Qingyuan
    Jin, Wen
    CONNECTION SCIENCE, 2022, 34 (01) : 2448 - 2465
  • [45] Progressive End-to-End Object Detection in Crowded Scenes
    Zheng, Anlin
    Zhang, Yuang
    Zhang, Xiangyu
    Qi, Xiaojuan
    Sun, Jian
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 847 - 856
  • [46] Toward End-to-End Object Detection and Tracking on the Edge
    Tabkhi, Hamed
    SEC 2017: 2017 THE SECOND ACM/IEEE SYMPOSIUM ON EDGE COMPUTING (SEC'17), 2017,
  • [47] Dense Distinct Query for End-to-End Object Detection
    Zhang, Shilong
    Wang, Xinjiang
    Wang, Jiaqi
    Pang, Jiangmiao
    Lyu, Chengqi
    Zhang, Wenwei
    Luo, Ping
    Chen, Kai
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 7329 - 7338
  • [48] End-to-End Edge Neuromorphic Object Detection System
    Silva, D. A.
    Shymyrbay, A.
    Smagulova, K.
    Elsheikh, A.
    Fouda, M. E.
    Eltawil, A. M.
    2024 IEEE 6TH INTERNATIONAL CONFERENCE ON AI CIRCUITS AND SYSTEMS, AICAS 2024, 2024, : 194 - 198
  • [49] Efficient Video Transformers with Spatial-Temporal Token Selection
    Wang, Junke
    Yang, Xitong
    Li, Hengduo
    Liu, Li
    Wu, Zuxuan
    Jiang, Yu-Gang
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 69 - 86
  • [50] End-to-end Multi-modal Video Temporal Grounding
    Chen, Yi-Wen
    Tsai, Yi-Hsuan
    Yang, Ming-Hsuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34