Cross-Modal Object Tracking via Modality-Aware Fusion Network and a Large-Scale Dataset

Cited: 1
Authors
Liu, Lei [1 ]
Zhang, Mengya [1 ]
Li, Cheng [2 ]
Li, Chenglong [3 ]
Tang, Jin [1 ]
Affiliations
[1] Anhui Univ, Informat Mat & Intelligent Sensing Lab Anhui Prov, Anhui Prov Key Lab Multimodal Cognit Computat, Key Lab Intelligent Comp & Signal Proc,Minist Educ, Hefei 230601, Peoples R China
[2] Anhui Univ, Sch Comp Sci & Technol, Hefei 230601, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Informat Mat & Intelligent Sensing Lab Anhui Prov, Anhui Prov Key Lab Secur Artificial Intelligence, Hefei 230601, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Object tracking; Target tracking; Training; Lighting; Visualization; Switches; Sensors; Cross-modal object tracking; Dataset; Modality-aware fusion network (MAFNet); Siamese networks; Benchmark
DOI
10.1109/TNNLS.2024.3406189
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Visual object tracking often faces challenges such as invalid targets and decreased performance in low-light conditions when relying solely on RGB image sequences. While incorporating additional modalities like depth and infrared data has proven effective, existing multimodal imaging platforms are complex and lack real-world applicability. In contrast, near-infrared (NIR) imaging, commonly used in surveillance cameras, can switch between RGB and NIR based on light intensity. However, tracking objects across these heterogeneous modalities poses significant challenges, particularly due to the absence of modality switch signals during tracking. To address these challenges, we propose an adaptive cross-modal object tracking algorithm called modality-aware fusion network (MAFNet). MAFNet efficiently integrates information from both RGB and NIR modalities using an adaptive weighting mechanism, effectively bridging the appearance gap and enabling a modality-aware target representation. It consists of two key components: an adaptive weighting module and a modality-specific representation module. The adaptive weighting module predicts fusion weights to dynamically adjust the contribution of each modality, while the modality-specific representation module captures discriminative features specific to RGB and NIR modalities. MAFNet offers great flexibility as it can effortlessly integrate into diverse tracking frameworks. With its simplicity, effectiveness, and efficiency, MAFNet outperforms state-of-the-art methods in cross-modal object tracking. To validate the effectiveness of our algorithm and overcome the scarcity of data in this field, we introduce CMOTB, a comprehensive and extensive benchmark dataset for cross-modal object tracking. CMOTB consists of 61 categories and 1000 video sequences, comprising a total of over 799K frames. We believe that our proposed method and dataset offer a strong foundation for advancing cross-modal object tracking research. The dataset, toolkit, experimental data, and source code will be publicly available at: https://github.com/mmic-lcl/Datasets-and-benchmark-code.
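The adaptive weighting idea in the abstract lends itself to a short illustration. Below is a minimal PyTorch sketch of how per-frame fusion weights could blend RGB-specific and NIR-specific feature branches when the input modality is unknown; the module names (AdaptiveWeightingFusion, rgb_branch, nir_branch, weight_head) and all layer choices are hypothetical assumptions for illustration, not the authors' MAFNet implementation.

```python
import torch
import torch.nn as nn


class AdaptiveWeightingFusion(nn.Module):
    """Hypothetical sketch of modality-aware fusion (not the authors' MAFNet).

    A lightweight head predicts two fusion weights from the input frame;
    the weights blend RGB-specific and NIR-specific feature branches into
    a single modality-aware representation, so no explicit RGB/NIR switch
    signal is required at tracking time.
    """

    def __init__(self, in_channels: int = 3, feat_channels: int = 64):
        super().__init__()
        # Modality-specific branches: each learns features discriminative
        # for one modality's appearance statistics.
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.nir_branch = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Adaptive weighting module: global-pool the frame, predict two
        # scores, and softmax them into weights that sum to one.
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, 2),
            nn.Softmax(dim=1),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        w = self.weight_head(frame)        # (B, 2) per-frame modality weights
        f_rgb = self.rgb_branch(frame)     # (B, C, H, W)
        f_nir = self.nir_branch(frame)     # (B, C, H, W)
        # Broadcast the scalar weights over the feature maps and blend.
        return (w[:, 0, None, None, None] * f_rgb
                + w[:, 1, None, None, None] * f_nir)


if __name__ == "__main__":
    fusion = AdaptiveWeightingFusion()
    frames = torch.randn(2, 3, 128, 128)   # batch of search-region crops
    print(fusion(frames).shape)            # torch.Size([2, 64, 128, 128])
```

Because the fusion weights are predicted from the frame itself, the sketch needs no modality switch signal, which mirrors the difficulty the abstract highlights; in a real tracker the fused features would feed the matching head of whatever framework the fusion block is plugged into.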
Pages: 1-14
Page count: 14
Related Papers
50 records in total
  • [11] TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild
    Mueller, Matthias
    Bibi, Adel
    Giancola, Silvio
    Alsubaihi, Salman
    Ghanem, Bernard
    COMPUTER VISION - ECCV 2018, PT I, 2018, 11205 : 310 - 327
  • [12] Multispectral Object Detection via Cross-Modal Conflict-Aware Learning
    He, Xiao
    Tang, Chang
    Zou, Xin
    Zhang, Wei
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 1465 - 1474
  • [13] SiamSMN: Siamese Cross-Modality Fusion Network for Object Tracking
    Han, Shuo
    Gao, Lisha
    Wu, Yue
    Wei, Tian
    Wang, Manyu
    Cheng, Xu
    INFORMATION, 2024, 15 (07)
  • [14] Attention-aware Cross-modal Cross-level Fusion Network for RGB-D Salient Object Detection
    Chen, Hao
    Li, You-Fu
    Su, Dan
    2018 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2018, : 6821 - 6826
  • [15] Large-Scale Cross-Modal Hashing with Unified Learning and Multi-Object Regional Correlation Reasoning
    Li, Bo
    Li, Zhixin
    NEURAL NETWORKS, 2024, 171 : 276 - 292
  • [16] SEMI-SUPERVISED GRAPH CONVOLUTIONAL HASHING NETWORK FOR LARGE-SCALE CROSS-MODAL RETRIEVAL
    Shen, Zhanjian
    Zhai, Deming
    Liu, Xianming
    Jiang, Junjun
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 2366 - 2370
  • [17] CSFNet: Cross-Modal Semantic Focus Network for Semantic Segmentation of Large-Scale Point Clouds
    Luo, Yang
    Han, Ting
    Liu, Yujun
    Su, Jinhe
    Chen, Yiping
    Li, Jinyuan
    Wu, Yundong
    Cai, Guorong
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [18] Cross-Modal Image Registration via Rasterized Parameter Prediction for Object Tracking
    Zhang, Qing
    Xiang, Wei
    APPLIED SCIENCES-BASEL, 2023, 13 (09)
  • [19] Visible-thermal multiple object tracking: Large-scale video dataset and progressive fusion approach
    Zhu, Yabin
    Wang, Qianwu
    Li, Chenglong
    Tang, Jin
    Gu, Chengjie
    Huang, Zhixiang
    PATTERN RECOGNITION, 2025, 161
  • [20] Online Adaptive Supervised Hashing for Large-Scale Cross-Modal Retrieval
    Su, Ruoqi
    Wang, Di
    Huang, Zhen
    Liu, Yuan
    An, Yaqiang
    IEEE ACCESS, 2020, 8 : 206360 - 206370