Multi-modal visual tracking based on textual generation

Cited by: 1

Authors:

Wang, Jiahao ^{[1,2]}

Liu, Fang ^{[1,2]}

Jiao, Licheng ^{[1,2]}

Wang, Hao ^{[1,2]}

Li, Shuo ^{[1,2]}

Li, Lingling ^{[1,2]}

Chen, Puhua ^{[1,2]}

Liu, Xu ^{[1,2]}

Affiliations:

[1] Xidian Univ, Int Res Ctr Intelligent Percept & Computat, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Shaanxi Provinc, Peoples R China

[2] Xidian Univ, Sch Artificial Intelligence, Joint Int Res Lab Intelligent Percept & Computat, Xian 710071, Shaanxi Provinc, Peoples R China

Source:

INFORMATION FUSION | 2024 / Vol. 112

Funding:

National Natural Science Foundation of China; China Postdoctoral Science Foundation;

Keywords:

Multi-modal tracking; Image descriptions; Visual and language modalities; Prompt learning; FUSION;

DOI:

10.1016/j.inffus.2024.102531

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Multi-modal tracking has garnered significant attention due to its wide range of potential applications. Existing multi-modal tracking approaches typically merge data from different visual modalities on top of RGB tracking. However, focusing solely on the visual modality is insufficient due to the scarcity of tracking data. Inspired by the recent success of large models, this paper introduces a Multi-modal Visual Tracking Based on Textual Generation (MVTTG) approach to address the limitations of visual tracking, which lacks language information and overlooks semantic relationships between the target and the search area. To achieve this, we leverage large models to generate image descriptions, using these descriptions to provide complementary information about the target's appearance and movement. Furthermore, to enhance the consistency between visual and language modalities, we employ prompt learning and design a Visual-Language Interaction Prompt Manager (V-L PM) to facilitate collaborative learning between visual and language domains. Experiments conducted with MVTTG on multiple benchmark datasets confirm the effectiveness and potential of incorporating image descriptions in multi-modal visual tracking.

Pages: 13
