Cross-Modal Object Tracking via Modality-Aware Fusion Network and a Large-Scale Dataset

Cited by: 1
Authors
Liu, Lei [1 ]
Zhang, Mengya [1 ]
Li, Cheng [2 ]
Li, Chenglong [3 ]
Tang, Jin [1 ]
Affiliations
[1] Anhui Univ, Informat Mat & Intelligent Sensing Lab Anhui Prov, Anhui Prov Key Lab Multimodal Cognit Computat, Key Lab Intelligent Comp & Signal Proc,Minist Educ, Hefei 230601, Peoples R China
[2] Anhui Univ, Sch Comp Sci & Technol, Hefei 230601, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Informat Mat & Intelligent Sensing Lab Anhui Prov, Anhui Prov Key Lab Secur Artificial Intelligence, Hefei 230601, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Object tracking; Target tracking; Training; Lighting; Visualization; Switches; Sensors; Cross-modal object tracking; dataset; modality-aware fusion network (MAFNet); SIAMESE NETWORKS; BENCHMARK;
DOI
10.1109/TNNLS.2024.3406189
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual object tracking often faces challenges such as invalid targets and decreased performance in low-light conditions when relying solely on RGB image sequences. While incorporating additional modalities like depth and infrared data has proven effective, existing multimodal imaging platforms are complex and lack real-world applicability. In contrast, near-infrared (NIR) imaging, commonly used in surveillance cameras, can switch between RGB and NIR based on light intensity. However, tracking objects across these heterogeneous modalities poses significant challenges, particularly due to the absence of modality switch signals during tracking. To address these challenges, we propose an adaptive cross-modal object tracking algorithm called modality-aware fusion network (MAFNet). MAFNet efficiently integrates information from both RGB and NIR modalities using an adaptive weighting mechanism, effectively bridging the appearance gap and enabling a modality-aware target representation. It consists of two key components: an adaptive weighting module and a modality-specific representation module. The adaptive weighting module predicts fusion weights to dynamically adjust the contribution of each modality, while the modality-specific representation module captures discriminative features specific to RGB and NIR modalities. MAFNet offers great flexibility as it can effortlessly integrate into diverse tracking frameworks. With its simplicity, effectiveness, and efficiency, MAFNet outperforms state-of-the-art methods in cross-modal object tracking. To validate the effectiveness of our algorithm and overcome the scarcity of data in this field, we introduce CMOTB, a comprehensive and extensive benchmark dataset for cross-modal object tracking. CMOTB consists of 61 categories and 1000 video sequences, comprising a total of over 799K frames. We believe that our proposed method and dataset offer a strong foundation for advancing cross-modal object-tracking research. The dataset, toolkit, experimental data, and source code will be publicly available at: https://github.com/mmic-lcl/Datasets-and-benchmark-code.
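The abstract describes MAFNet's two components only in prose. As a rough illustration of the idea, the sketch below shows how an adaptive weighting module and two modality-specific branches might be wired together in PyTorch. It is a minimal sketch under our own assumptions: the names (AdaptiveWeighting, ModalityFusion), layer sizes, and branch designs are hypothetical and are not the authors' released code; the actual MAFNet implementation is at the repository linked above.

```python
# Hypothetical sketch of adaptive-weighted RGB/NIR fusion, not the authors' code.
import torch
import torch.nn as nn


class AdaptiveWeighting(nn.Module):
    """Predicts one fusion weight per modality from a shared feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # global context vector
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 2),         # one logit per modality
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        logits = self.fc(self.pool(feat).flatten(1))   # (B, 2)
        return torch.softmax(logits, dim=1)            # weights sum to 1


class ModalityFusion(nn.Module):
    """Fuses RGB- and NIR-specific features with the predicted weights."""

    def __init__(self, channels: int):
        super().__init__()
        # Modality-specific branches capture features unique to each modality.
        self.rgb_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.nir_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.weighting = AdaptiveWeighting(channels)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # No modality switch signal is available at test time, so both
        # branches run and the predicted weights arbitrate between them.
        rgb = self.rgb_branch(feat)
        nir = self.nir_branch(feat)
        w = self.weighting(feat)                       # (B, 2)
        w_rgb = w[:, 0].view(-1, 1, 1, 1)
        w_nir = w[:, 1].view(-1, 1, 1, 1)
        return w_rgb * rgb + w_nir * nir


if __name__ == "__main__":
    x = torch.randn(2, 256, 16, 16)                    # dummy backbone features
    fused = ModalityFusion(256)(x)
    print(fused.shape)                                 # torch.Size([2, 256, 16, 16])
```

Because the weights are predicted from the input itself, the fusion degrades gracefully when the camera silently switches between RGB and NIR, which is the scenario the paper targets.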
Pages: 1 - 14
Number of pages: 14
Related papers
50 records in total
  • [1] Cross-Modal Object Tracking: Modality-Aware Representations and A Unified Benchmark
    Li, Chenglong
    Zhu, Tianhao
    Liu, Lei
    Si, Xiaonan
    Fan, Zilin
    Zhai, Sulan
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1289 - 1296
  • [2] Tiny Object Tracking: A Large-Scale Dataset and a Baseline
    Zhu, Yabin
    Li, Chenglong
    Liu, Yao
    Wang, Xiao
    Tang, Jin
    Luo, Bin
    Huang, Zhixiang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (08) : 10273 - 10287
  • [3] MCCN: Multimodal Coordinated Clustering Network for Large-Scale Cross-modal Retrieval
    Zeng, Zhixiong
    Sun, Ying
    Mao, Wenji
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5427 - 5435
  • [4] Unsupervised Deep Cross-Modal Hashing by Knowledge Distillation for Large-scale Cross-modal Retrieval
    Li, Mingyong
    Wang, Hongya
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 183 - 191
  • [5] Large-Scale Supervised Hashing for Cross-Modal Retrieval
    Karbil, Loubna
    Daoudi, Imane
    2017 IEEE/ACS 14TH INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS (AICCSA), 2017, : 803 - 808
  • [6] CCMB: A Large-scale Chinese Cross-modal Benchmark
    Xie, Chunyu
    Cai, Heng
    Li, Jincheng
    Kong, Fanjing
    Wu, Xiaoyu
    Song, Jianfei
    Morimitsu, Henrique
    Yao, Lin
    Wang, Dexin
    Zhang, Xiangzheng
    Leng, Dawei
    Zhang, Baochang
    Ji, Xiangyang
    Deng, Yafeng
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4219 - 4227
  • [7] UAV Cross-Modal Image Registration: Large-Scale Dataset and Transformer-Based Approach
    Xiao, Yun
    Liu, Fei
    Zhu, Yabin
    Li, Chenglong
    Wang, Futian
    Tang, Jin
    ADVANCES IN BRAIN INSPIRED COGNITIVE SYSTEMS, BICS 2023, 2024, 14374 : 166 - 176
  • [8] RGBD Salient Object Detection via Disentangled Cross-Modal Fusion
    Chen, Hao
    Deng, Yongjian
    Li, Youfu
    Hung, Tzu-Yi
    Lin, Guosheng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 8407 - 8416
  • [9] CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval
    Li, Mingyong
    Li, Yewen
    Ge, Mingyuan
    Ma, Longfei
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12 (01)