Multi-modal visual tracking based on textual generation

Cited by: 1
Authors
Wang, Jiahao [1 ,2 ]
Liu, Fang [1 ,2 ]
Jiao, Licheng [1 ,2 ]
Wang, Hao [1 ,2 ]
Li, Shuo [1 ,2 ]
Li, Lingling [1 ,2 ]
Chen, Puhua [1 ,2 ]
Liu, Xu [1 ,2 ]
Affiliations
[1] Xidian Univ, Int Res Ctr Intelligent Percept & Computat, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Shaanxi Province, Peoples R China
[2] Xidian Univ, Sch Artificial Intelligence, Joint Int Res Lab Intelligent Percept & Computat, Xian 710071, Shaanxi Province, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Multi-modal tracking; Image descriptions; Visual and language modalities; Prompt learning; FUSION;
DOI
10.1016/j.inffus.2024.102531
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multi-modal tracking has garnered significant attention due to its wide range of potential applications. Existing multi-modal tracking approaches typically merge data from additional visual modalities on top of RGB tracking. However, focusing solely on the visual modality is insufficient due to the scarcity of tracking data. Inspired by the recent success of large models, this paper introduces Multi-modal Visual Tracking Based on Textual Generation (MVTTG), an approach that addresses the limitations of purely visual tracking, which lacks language information and overlooks semantic relationships between the target and the search area. To achieve this, we leverage large models to generate image descriptions that provide complementary information about the target's appearance and movement. Furthermore, to enhance consistency between the visual and language modalities, we employ prompt learning and design a Visual-Language Interaction Prompt Manager (V-L PM) to facilitate collaborative learning between the visual and language domains. Experiments with MVTTG on multiple benchmark datasets confirm the effectiveness and potential of incorporating image descriptions in multi-modal visual tracking.
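The record contains no code, so the PyTorch sketch below is only a hedged illustration of the idea described in the abstract: a small prompt-manager module that fuses caption-derived text embeddings with visual search-region features. The class name VLPromptManager, the dimensions, and the two cross-attention steps are assumptions made for illustration, not the authors' actual V-L PM design; in practice the image descriptions would come from a large captioning model (e.g., BLIP) and be embedded by a text encoder before reaching this module.

import torch
import torch.nn as nn

class VLPromptManager(nn.Module):
    """Hypothetical stand-in for the paper's V-L PM: a small set of learnable
    prompt vectors reads the language description via cross-attention and then
    conditions the visual search-region tokens. All design details are assumed."""

    def __init__(self, embed_dim: int = 256, num_prompts: int = 8, num_heads: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
        self.text_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, D) search-region patch features from the tracker backbone
        # text_tokens:   (B, Nt, D) embedded tokens of the generated image description
        batch = visual_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)   # (B, P, D)
        # Learnable prompts attend to the language description ...
        lang_prompts, _ = self.text_attn(prompts, text_tokens, text_tokens)
        # ... and the resulting language prompts condition the visual stream.
        fused, _ = self.vis_attn(visual_tokens, lang_prompts, lang_prompts)
        return self.norm(visual_tokens + fused)                     # residual fusion

# Toy usage with random tensors standing in for backbone and text-encoder outputs.
if __name__ == "__main__":
    vlpm = VLPromptManager()
    vis = torch.randn(2, 196, 256)   # e.g. 14x14 search-region patch tokens
    txt = torch.randn(2, 20, 256)    # e.g. 20 embedded caption tokens
    print(vlpm(vis, txt).shape)      # torch.Size([2, 196, 256])

The residual fusion keeps the visual tokens usable even when a generated description is uninformative, which is one plausible reading of how prompt learning keeps the two modalities consistent; the actual mechanism is specified in the paper itself.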
Pages: 13