Mixing Tokens from Target and Search Regions for Visual Object Tracking

Cited by: 0
Authors
Wanli X. [1 ]
Zhibin Z. [1 ]
Shenglei P. [2 ]
Kaihua Z. [3 ]
Shengyong C. [1 ]
Affiliations
[1] School of Computer Science and Engineering, Tianjin University of Technology, Tianjin
[2] College of Physics and Electronic Information Engineering, Qinghai Minzu University, Xining
[3] School of Computer Science, Nanjing University of Information Science & Technology, Nanjing
Funding
National Natural Science Foundation of China
Keywords
fast Fourier transform; feature extraction; feature fusion; Transformer; object tracking
DOI
10.7544/issn1000-1239.202220698
Abstract
Mainstream Transformer-based tracking frameworks suffer from three problems in feature extraction and fusion: (1) feature extraction and feature fusion are performed by two separate modules, which easily leads to sub-optimal training results; (2) self-attention has O(N²) computational complexity, which reduces tracking efficiency; and (3) the target-template selection strategy is simplistic and struggles to adapt to drastic changes in target appearance during tracking. We propose a novel Transformer tracking framework that mixes target tokens and search-region tokens with the fast Fourier transform (FFT). For problem (1), an efficient end-to-end approach extracts and fuses features in a unified learning process to obtain an optimal model. For problem (2), the FFT achieves complete information interaction between the target tokens and search-region tokens with O(N log(N)) complexity, greatly improving tracking efficiency. For problem (3), a template memory mechanism based on quality assessment adapts quickly to drastic changes in target appearance. Compared with current state-of-the-art algorithms on the LaSOT, OTB100, and UAV123 datasets, our tracker achieves better performance in both efficiency and accuracy. © 2024 Science Press. All rights reserved.
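The FFT-based token mixing that the abstract describes for problem (2) can be sketched as below. This is a minimal illustration in the spirit of global-filter-style frequency mixing, not the authors' implementation; the function name `fft_token_mixing`, the token shapes, and the identity default filter are assumptions for illustration (in a real model the frequency filter would be a learnable parameter).

```python
import numpy as np

def fft_token_mixing(target_tokens, search_tokens, freq_filter=None):
    """Mix target and search-region tokens jointly via an FFT along the
    token axis (O(N log N) per channel, vs. O(N^2) for self-attention)."""
    tokens = np.concatenate([target_tokens, search_tokens], axis=0)  # (N, C)
    n = tokens.shape[0]
    freq = np.fft.rfft(tokens, axis=0)       # to the frequency domain over tokens
    if freq_filter is None:
        freq_filter = np.ones_like(freq)     # identity filter; learnable in practice
    mixed = np.fft.irfft(freq * freq_filter, n=n, axis=0)  # back to token domain
    n_t = target_tokens.shape[0]
    return mixed[:n_t], mixed[n_t:]          # split back into the two regions

# toy usage: 4 target tokens and 12 search-region tokens, 8 channels each
rng = np.random.default_rng(0)
t = rng.standard_normal((4, 8))
s = rng.standard_normal((12, 8))
mixed_t, mixed_s = fft_token_mixing(t, s)
```

Because every frequency component depends on all N tokens, a single filtered FFT round trip lets each target token interact with every search-region token, which is what replaces the quadratic attention interaction here.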
Pages: 460-469
Number of pages: 9