Multi-modal visual tracking based on textual generation

Times cited: 1
Authors
Wang, Jiahao [1,2]
Liu, Fang [1,2]
Jiao, Licheng [1,2]
Wang, Hao [1,2]
Li, Shuo [1,2]
Li, Lingling [1,2]
Chen, Puhua [1,2]
Liu, Xu [1,2]
Affiliations
[1] Xidian Univ, Int Res Ctr Intelligent Percept & Computat, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Shaanxi Province, Peoples R China
[2] Xidian Univ, Sch Artificial Intelligence, Joint Int Res Lab Intelligent Percept & Computat, Xian 710071, Shaanxi Province, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Multi-modal tracking; Image descriptions; Visual and language modalities; Prompt learning; Fusion
DOI
10.1016/j.inffus.2024.102531
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Multi-modal tracking has attracted significant attention because of its wide range of potential applications. Existing multi-modal tracking approaches typically fuse data from different visual modalities on top of RGB tracking. However, relying solely on the visual modality is insufficient, given the scarcity of tracking data. Inspired by the recent success of large models, this paper introduces Multi-modal Visual Tracking Based on Textual Generation (MVTTG), an approach that addresses two limitations of visual tracking: it lacks language information, and it overlooks the semantic relationships between the target and the search area. To this end, we use large models to generate image descriptions, which provide complementary information about the target's appearance and motion. Furthermore, to improve the consistency between the visual and language modalities, we employ prompt learning and design a Visual-Language Interaction Prompt Manager (V-L PM) that enables collaborative learning across the visual and language domains. Experiments with MVTTG on multiple benchmark datasets demonstrate the effectiveness and potential of incorporating image descriptions into multi-modal visual tracking.
Pages: 13