Joint Spatio-Temporal Similarity and Discrimination Learning for Visual Tracking

Authors
Liang, Yanjie [1]
Chen, Haosheng [2]
Wu, Qiangqiang [3]
Xia, Changqun [1]
Li, Jia [4]
Affiliations
[1] Peng Cheng Lab, Shenzhen 518000, Peoples R China
[2] Chongqing Univ Posts & Telecommun, Coll Comp Sci & Technol, Chongqing Key Lab Image Cognit, Chongqing 400065, Peoples R China
[3] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Target tracking; Location awareness; Correlation; Visualization; Learning systems; Circuits and systems; Transformers; Video object tracking; joint learning; spatio-temporal similarity; spatio-temporal discrimination; adaptive response map fusion;
DOI
10.1109/TCSVT.2024.3377379
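Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code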
0808 ; 0809 ;
Abstract
Visual tracking is the task of continuously localizing a target throughout a video, given only the initial target state in the first frame. This limited target information makes the problem extremely challenging. Existing tracking methods perform either matching-based similarity learning or optimization-based discrimination reasoning. However, the former are ineffective at distinguishing target objects from background distractors, while the latter are insufficient at maintaining spatio-temporal consistency among successive frames. In this paper, we design a joint spatio-temporal similarity and discrimination learning (STSDL) framework for accurate and robust tracking. The framework is composed of two complementary branches: a similarity learning branch and a discrimination learning branch. The similarity learning branch uses a transformer encoder-decoder to gather rich spatio-temporal context information and generate a similarity map. In parallel, the discrimination learning branch uses an efficient model predictor to train a target model that produces a discriminative map. Finally, the similarity map and the discriminative map are adaptively fused for accurate and robust target localization. Experimental results on six prevalent datasets demonstrate that the proposed STSDL obtains satisfactory results while retaining a real-time tracking speed of 50 FPS on a single GPU.
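The abstract does not say how the two maps are fused, so the following is only a minimal sketch of one plausible form of adaptive response map fusion, not the STSDL method itself. It assumes both maps have the same size and weights each map by a hypothetical confidence proxy (peak value minus mean value); the names `peak_sharpness` and `fuse_response_maps` are illustrative and do not come from the paper.

```python
from typing import List, Tuple

Map = List[List[float]]


def peak_sharpness(response: Map) -> float:
    """Hypothetical confidence proxy: peak value minus mean value of the map."""
    values = [v for row in response for v in row]
    return max(values) - sum(values) / len(values)


def fuse_response_maps(similarity: Map, discriminative: Map) -> Tuple[Map, Tuple[int, int]]:
    """Blend two same-sized response maps using confidence-derived weights.

    Assumption: the sharper map (higher peak relative to its mean) gets the
    larger weight. The paper's actual fusion rule may differ.
    """
    c_sim = peak_sharpness(similarity)
    c_dis = peak_sharpness(discriminative)
    total = c_sim + c_dis
    # Fall back to equal weights when both maps are completely flat.
    w_sim = c_sim / total if total > 0 else 0.5
    fused = [
        [w_sim * s + (1.0 - w_sim) * d for s, d in zip(s_row, d_row)]
        for s_row, d_row in zip(similarity, discriminative)
    ]
    # The estimated target location is the (row, col) of the fused peak.
    best = max(
        ((r, c) for r in range(len(fused)) for c in range(len(fused[0]))),
        key=lambda rc: fused[rc[0]][rc[1]],
    )
    return fused, best


if __name__ == "__main__":
    sim_map = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.3], [0.1, 0.2, 0.1]]
    dis_map = [[0.0, 0.1, 0.0], [0.1, 0.5, 0.6], [0.0, 0.1, 0.0]]
    _, location = fuse_response_maps(sim_map, dis_map)
    print("fused peak at (row, col):", location)
```

In this toy input the two maps disagree on the peak cell, and the fused peak follows the map with the sharper response, which is the kind of behaviour an adaptive weighting is meant to provide.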
Pages: 7284-7300
Page count: 17
Related papers
50 in total
  • [21] Online visual tracking by integrating spatio-temporal cues
    He, Yang
    Pei, Mingtao
    Yang, Min
    Wu, Yuwei
    Jia, Yunde
    IET COMPUTER VISION, 2015, 9 (01) : 124 - 137
  • [22] HUMAN TRACKING & VISUAL SPATIO-TEMPORAL STATISTICAL ANALYSIS
    Ioannidis, D.
    Krinidis, S.
    Tzovaras, D.
    Likothanassis, S.
    2014 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2014, : 3417 - 3419
  • [23] STRUCTURAL SPATIO-TEMPORAL TRANSFORM FOR ROBUST VISUAL TRACKING
    Tang, Yazhe
    Lao, Mingjie
    Lin, Feng
    Wu, Denglu
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 1105 - 1109
  • [24] Spatio-Temporal Trajectory Similarity Learning in Road Networks
    Fang, Ziquan
    Du, Yuntao
    Zhu, Xinjun
    Hu, Danlei
    Chen, Lu
    Gao, Yunjun
    Jensen, Christian S.
    PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022, 2022, : 347 - 356
  • [25] Learning spatio-temporal properties of visual cells
    Burkitt, Anthony
    Lian, Yanbo
    Ruslim, Marko
    JOURNAL OF COMPUTATIONAL NEUROSCIENCE, 2024, 52 : S116 - S116
  • [27] Spatio-temporal video segmentation using a joint similarity measure
    Choi, JG
    Lee, SW
    Kim, SD
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 1997, 7 (02) : 279 - 286
  • [28] Oversaturated part-based visual tracking via spatio-temporal context learning
    Liu, Wei
    Li, Jicheng
    Shi, Zhiguang
    Chen, Xiaotian
    Chen, Xiao
    APPLIED OPTICS, 2016, 55 (25) : 6960 - 6968
  • [29] Memory Network With Pixel-Level Spatio-Temporal Learning for Visual Object Tracking
    Zhou, Zechu
    Zhou, Xinyu
    Chen, Zhaoyu
    Guo, Pinxue
    Liu, Qian-Yu
    Zhang, Wenqiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (11) : 6897 - 6911
  • [30] Learning spatio-temporal discriminative model for affine subspace based visual object tracking
    Tianyang Xu
    Xue-Feng Zhu
    Xiao-Jun Wu
    Visual Intelligence, 1 (1):