Cross-Modal Object Tracking via Modality-Aware Fusion Network and a Large-Scale Dataset

Cited by: 1
Authors
Liu, Lei [1 ]
Zhang, Mengya [1 ]
Li, Cheng [2 ]
Li, Chenglong [3 ]
Tang, Jin [1 ]
Affiliations
[1] Anhui Univ, Informat Mat & Intelligent Sensing Lab Anhui Prov, Anhui Prov Key Lab Multimodal Cognit Computat, Key Lab Intelligent Comp & Signal Proc, Minist Educ, Hefei 230601, Peoples R China
[2] Anhui Univ, Sch Comp Sci & Technol, Hefei 230601, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Informat Mat & Intelligent Sensing Lab Anhui Prov, Anhui Prov Key Lab Secur Artificial Intelligence, Hefei 230601, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Object tracking; Target tracking; Training; Lighting; Visualization; Switches; Sensors; Cross-modal object tracking; dataset; modality-aware fusion network (MAFNet); SIAMESE NETWORKS; BENCHMARK;
DOI
10.1109/TNNLS.2024.3406189
CLC classification number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual object tracking often faces challenges such as invalid targets and decreased performance in low-light conditions when relying solely on RGB image sequences. While incorporating additional modalities like depth and infrared data has proven effective, existing multimodal imaging platforms are complex and lack real-world applicability. In contrast, near-infrared (NIR) imaging, commonly used in surveillance cameras, can switch between RGB and NIR based on light intensity. However, tracking objects across these heterogeneous modalities poses significant challenges, particularly due to the absence of modality switch signals during tracking. To address these challenges, we propose an adaptive cross-modal object tracking algorithm called modality-aware fusion network (MAFNet). MAFNet efficiently integrates information from both RGB and NIR modalities using an adaptive weighting mechanism, effectively bridging the appearance gap and enabling a modality-aware target representation. It consists of two key components: an adaptive weighting module and a modality-specific representation module. The adaptive weighting module predicts fusion weights to dynamically adjust the contribution of each modality, while the modality-specific representation module captures discriminative features specific to RGB and NIR modalities. MAFNet offers great flexibility as it can effortlessly integrate into diverse tracking frameworks. With its simplicity, effectiveness, and efficiency, MAFNet outperforms state-of-the-art methods in cross-modal object tracking. To validate the effectiveness of our algorithm and overcome the scarcity of data in this field, we introduce CMOTB, a comprehensive and extensive benchmark dataset for cross-modal object tracking. CMOTB consists of 61 categories and 1000 video sequences, comprising a total of over 799K frames. We believe that our proposed method and dataset offer a strong foundation for advancing cross-modal object-tracking research. The dataset, toolkit, experimental data, and source code will be publicly available at: https://github.com/mmic-lcl/Datasets-and-benchmark-code.
Pages: 1-14
Number of pages: 14
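The abstract above describes MAFNet as two cooperating parts: an adaptive weighting module that predicts per-modality fusion weights from the current frame (since no modality-switch signal is available at test time), and a modality-specific representation module with separate RGB and NIR branches. As a reading aid, the following is a minimal PyTorch-style sketch of how such a fusion block could be wired; the class name `ModalityAwareFusion`, the 1x1-convolution branches, the channel width, and the softmax weight head are all illustrative assumptions and not the authors' released implementation.

```python
# Minimal sketch of a modality-aware fusion block (illustrative assumptions,
# not the paper's released code).
import torch
import torch.nn as nn


class ModalityAwareFusion(nn.Module):
    """Fuse RGB- and NIR-specific features with predicted per-modality weights."""

    def __init__(self, in_channels: int = 256):
        super().__init__()
        # Modality-specific representation branches (hypothetical 1x1 conv blocks).
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.nir_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # Adaptive weighting head: predicts two fusion weights from the shared
        # backbone feature, because no modality-switch signal exists at test time.
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, 2),
            nn.Softmax(dim=1),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: backbone feature of the current frame, shape (B, C, H, W).
        w = self.weight_head(feat)        # (B, 2) per-modality weights
        rgb_feat = self.rgb_branch(feat)  # RGB-specific representation
        nir_feat = self.nir_branch(feat)  # NIR-specific representation
        # Weighted sum yields a modality-aware target representation.
        fused = (w[:, 0].view(-1, 1, 1, 1) * rgb_feat
                 + w[:, 1].view(-1, 1, 1, 1) * nir_feat)
        return fused


if __name__ == "__main__":
    block = ModalityAwareFusion(in_channels=256)
    dummy = torch.randn(2, 256, 31, 31)   # e.g. a Siamese search-region feature
    print(block(dummy).shape)             # torch.Size([2, 256, 31, 31])
```

Predicting the weights from the input feature itself is what keeps such a block agnostic to which modality the camera has switched to; in a full tracker the fused feature would simply replace the backbone feature fed to an existing Siamese- or discriminative-correlation-style head, which is consistent with the paper's claim that MAFNet can be integrated into diverse tracking frameworks.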