Transformer RGBT Tracking With Spatio-Temporal Multimodal Tokens

Cited by: 3
Authors
Sun, Dengdi [1, 2, 3]
Pan, Yajie [4]
Lu, Andong [4]
Li, Chenglong [5]
Luo, Bin [4]
Affiliations
[1] Anhui Univ, Sch Artificial Intelligence, Key Lab Intelligent Comp & Signal Proc, Minist Educ, Hefei 230601, Peoples R China
[2] Jianghuai Adv Technol Ctr, Hefei 230000, Peoples R China
[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230026, Peoples R China
[4] Anhui Univ, Sch Comp Sci & Technol, Anhui Prov Key Lab Multimodal Cognit Computat, Hefei 230601, Peoples R China
[5] Anhui Univ, Sch Artificial Intelligence, Key Lab Intelligent Comp & Signal Proc, Minist Educ, Anhui Prov Key Lab Secur Artificial I, Hefei 230601, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGBT tracking; transformer; cross-modal interaction; spatio-temporal multimodal tokens; NETWORK;
DOI
10.1109/TCSVT.2024.3425455
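Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification code
0808; 0809;
Abstract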
Much RGBT tracking research focuses primarily on modal fusion design while overlooking the effective handling of target appearance changes. Although some approaches introduce historical frames, or fuse and replace initial templates, to incorporate temporal information, they risk disrupting the original target appearance and accumulating errors over time. To alleviate these limitations, we propose a novel Transformer RGBT tracking approach that mixes spatio-temporal multimodal tokens from the static multimodal templates and multimodal search regions in the Transformer to handle target appearance changes for robust RGBT tracking. We introduce independent dynamic template tokens that interact with the search region, embedding temporal information to address appearance changes, while retaining the initial static template tokens in the joint feature extraction process. This preserves the original, reliable target appearance information and prevents the deviations from the target appearance that traditional temporal updates cause. We also use attention mechanisms to enhance the target features of the multimodal template tokens by incorporating supplementary modal cues, and let the multimodal search region tokens interact with the multimodal dynamic template tokens via attention, which facilitates the conveyance of multimodal-enhanced target change information. Our module is inserted into the Transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared with other state-of-the-art tracking algorithms while running at 39.1 FPS. The project-related materials are available at: https://github.com/yinghaidada/STMT.
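The token mixing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`attend`, `mix_tokens`), the identity query/key/value projections, the residual additions, and the two-modality dictionary layout are all assumptions made for illustration. It only shows the general idea that dynamic template tokens are enhanced across modalities, that search tokens attend to them, and that the static template tokens take part in joint attention without being overwritten.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections
    (an assumption; a real tracker would use learned projections)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                    for i in range(dim)])
    return out

def add(a, b):
    """Element-wise sum of two equally shaped token lists (residual add)."""
    return [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]

def mix_tokens(static_tpl, dynamic_tpl, search):
    """Hypothetical mixing step. Each argument maps a modality name to a
    list of token vectors; at least two modalities are assumed."""
    modalities = list(search)
    # 1) Enhance each modality's dynamic template with cues from the others.
    enhanced = {}
    for m in modalities:
        others = [t for o in modalities if o != m for t in dynamic_tpl[o]]
        enhanced[m] = add(dynamic_tpl[m], attend(dynamic_tpl[m], others))
    all_dynamic = [t for m in modalities for t in enhanced[m]]
    # 2) Search tokens attend jointly to the static template and themselves,
    #    then to the multimodal dynamic templates. The static template is
    #    only read here, never updated.
    mixed = {}
    for m in modalities:
        joint = attend(search[m], static_tpl[m] + search[m])
        temporal = attend(search[m], all_dynamic)
        mixed[m] = add(add(search[m], joint), temporal)
    return mixed

# Toy inputs: 2-dimensional tokens for an RGB and a thermal (TIR) modality.
static_tpl = {"rgb": [[1.0, 0.0]], "tir": [[0.0, 1.0]]}
dynamic_tpl = {"rgb": [[0.5, 0.5]], "tir": [[0.2, 0.8]]}
search = {"rgb": [[1.0, 1.0], [0.0, 0.0]], "tir": [[0.3, 0.7], [0.9, 0.1]]}
mixed = mix_tokens(static_tpl, dynamic_tpl, search)
```

In this sketch the static template dictionary is left unchanged after the call, which mirrors the abstract's point that the initial template is retained rather than replaced by temporal updates.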
Pages: 12059-12072
Page count: 14
Related papers
50 in total
  • [41] Deconfounded Multimodal Learning for Spatio-temporal Video Grounding
    Wang, Jiawei
    Ma, Zhanchang
    Cao, Da
    Le, Yuquan
    Xiao, Junbin
    Chua, Tat-Seng
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7521 - 7529
  • [42] Multimodal Spatio-Temporal Prediction with Stochastic Adversarial Networks
    Saxena, Divya
    Cao, Jiannong
    ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2022, 13 (02)
  • [43] Deep spatio-temporal features for multimodal emotion recognition
    Nguyen, Dung
    Nguyen, Kien
    Sridharan, Sridha
    Ghasemi, Afsane
    Dean, David
    Fookes, Clinton
    2017 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2017), 2017, : 1215 - 1223
  • [44] Spatio-Temporal Inference Transformer Network for Video Inpainting
    Tudavekar, Gajanan
    Saraf, Santosh S.
    Patil, Sanjay R.
    INTERNATIONAL JOURNAL OF IMAGE AND GRAPHICS, 2023, 23 (01)
  • [45] Shifted Chunk Transformer for Spatio-Temporal Representational Learning
    Zha, Xuefan
    Zhu, Wentao
    Lv, Tingxun
    Yang, Sen
    Liu, Ji
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [46] Multimodal Spatio-Temporal Theme Modeling for Landmark Analysis
    Min, Weiqing
    Bao, Bing-Kun
    Xu, Changsheng
    IEEE MULTIMEDIA, 2014, 21 (03) : 20 - 29
  • [47] Video Text Tracking With a Spatio-Temporal Complementary Model
    Gao, Yuzhe
    Li, Xing
    Zhang, Jiajian
    Zhou, Yu
    Jin, Dian
    Wang, Jing
    Zhu, Shenggao
    Bai, Xiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 9321 - 9331
  • [48] Online hash tracking with spatio-temporal saliency auxiliary
    Fang, Jianwu
    Xu, Hongke
    Wang, Qi
    Wu, Tianjun
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2017, 160 : 57 - 72
  • [49] Spatio-temporal cell segmentation and tracking for automated screening
    Padfield, Dirk
    Rittscher, Jens
    Roysam, Badrinath
    2008 IEEE INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING: FROM NANO TO MACRO, VOLS 1-4, 2008, : 376 - +
  • [50] Anomaly Detection with a Spatio-Temporal Tracking of the Laser Spot
    Atienza, David
    Bielza, Concha
    Diaz, Javier
    Larranaga, Pedro
    PROCEEDINGS OF THE EIGHTH EUROPEAN STARTING AI RESEARCHER SYMPOSIUM (STAIRS 2016), 2016, 284 : 137 - 142