Cross-Modal Object Tracking via Modality-Aware Fusion Network and a Large-Scale Dataset

Cited by: 1
Authors
Liu, Lei [1 ]
Zhang, Mengya [1 ]
Li, Cheng [2 ]
Li, Chenglong [3 ]
Tang, Jin [1 ]
Affiliations
[1] Anhui Univ, Informat Mat & Intelligent Sensing Lab Anhui Prov, Anhui Prov Key Lab Multimodal Cognit Computat, Key Lab Intelligent Comp & Signal Proc, Minist Educ, Hefei 230601, Peoples R China
[2] Anhui Univ, Sch Comp Sci & Technol, Hefei 230601, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Informat Mat & Intelligent Sensing Lab Anhui Prov, Anhui Prov Key Lab Secur Artificial Intelligence, Hefei 230601, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Object tracking; Target tracking; Training; Lighting; Visualization; Switches; Sensors; Cross-modal object tracking; dataset; modality-aware fusion network (MAFNet); SIAMESE NETWORKS; BENCHMARK;
DOI
10.1109/TNNLS.2024.3406189
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual object tracking often suffers from invalid targets and degraded performance in low-light conditions when it relies solely on RGB image sequences. Incorporating additional modalities such as depth and infrared data has proven effective, but existing multimodal imaging platforms are complex and lack real-world applicability. In contrast, near-infrared (NIR) imaging, commonly used in surveillance cameras, switches between RGB and NIR according to light intensity. However, tracking objects across these heterogeneous modalities poses significant challenges, particularly because no modality switch signal is available during tracking. To address these challenges, we propose an adaptive cross-modal object tracking algorithm called the modality-aware fusion network (MAFNet). MAFNet efficiently integrates information from both RGB and NIR modalities using an adaptive weighting mechanism, effectively bridging the appearance gap and enabling a modality-aware target representation. It consists of two key components: an adaptive weighting module and a modality-specific representation module. The adaptive weighting module predicts fusion weights to dynamically adjust the contribution of each modality, while the modality-specific representation module captures discriminative features specific to the RGB and NIR modalities. MAFNet offers great flexibility, as it can be easily integrated into diverse tracking frameworks. With its simplicity, effectiveness, and efficiency, MAFNet outperforms state-of-the-art methods in cross-modal object tracking. To validate the effectiveness of our algorithm and overcome the scarcity of data in this field, we introduce CMOTB, a comprehensive and large-scale benchmark dataset for cross-modal object tracking. CMOTB consists of 61 categories and 1000 video sequences, comprising more than 799K frames in total. We believe that our proposed method and dataset offer a strong foundation for advancing cross-modal object tracking research. The dataset, toolkit, experimental data, and source code will be publicly available at: https://github.com/mmic-lcl/Datasets-and-benchmark-code.
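To make the fusion scheme described in the abstract concrete, the following is a minimal PyTorch sketch of the general idea: a shared backbone feature is passed through two modality-specific branches (RGB and NIR), and an adaptive weighting module predicts per-modality fusion weights that combine the branch outputs into a modality-aware representation. All module names, layer sizes, and design details below are illustrative assumptions for exposition, not the authors' actual MAFNet implementation.

```python
# Hypothetical sketch of adaptive, modality-aware feature fusion.
# Assumed names (AdaptiveWeighting, ModalityAwareFusion) and layer choices
# are illustrative only; see the authors' released code for the real model.
import torch
import torch.nn as nn


class AdaptiveWeighting(nn.Module):
    """Predicts softmax-normalized fusion weights for the RGB and NIR branches."""

    def __init__(self, channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # global context of the frame
            nn.Flatten(),
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 2),       # one logit per modality (RGB, NIR)
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.head(feat), dim=1)   # shape (B, 2)


class ModalityAwareFusion(nn.Module):
    """Fuses RGB- and NIR-specific representations using the predicted weights."""

    def __init__(self, channels: int):
        super().__init__()
        self.rgb_branch = nn.Conv2d(channels, channels, 3, padding=1)  # RGB-specific
        self.nir_branch = nn.Conv2d(channels, channels, 3, padding=1)  # NIR-specific
        self.weighting = AdaptiveWeighting(channels)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        w = self.weighting(feat)                        # (B, 2) fusion weights
        rgb = self.rgb_branch(feat)
        nir = self.nir_branch(feat)
        # Broadcast the scalar weights over channel and spatial dims, then fuse.
        return w[:, 0, None, None, None] * rgb + w[:, 1, None, None, None] * nir


if __name__ == "__main__":
    # Toy usage: backbone features for 2 frames, 256 channels, 16x16 resolution.
    feats = torch.randn(2, 256, 16, 16)
    fused = ModalityAwareFusion(256)(feats)
    print(fused.shape)   # torch.Size([2, 256, 16, 16])
```

Because the weights are predicted from the input feature itself, no explicit modality switch signal is needed: frames that look NIR-like can lean on the NIR branch, and RGB-like frames on the RGB branch, which is the behavior the abstract attributes to the adaptive weighting mechanism.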
Pages: 1-14
Number of Pages: 14