TFIV: Multigrained Token Fusion for Infrared and Visible Image via Transformer
Cited by: 9
Authors:
Li, Jing [1]; Yang, Bin [2]; Bai, Lu [3,4]; Dou, Hao [5]; Li, Chang [6]; Ma, Lingfei [7]
Affiliations:
[1] Cent Univ Finance & Econ, Sch Informat, Beijing 102206, Peoples R China
[2] Hunan Univ, Coll Elect & Informat Engn, Changsha 410082, Peoples R China
[3] Beijing Normal Univ, Sch Artificial Intelligence, Beijing 100875, Peoples R China
[4] Cent Univ Finance & Econ, Beijing 100081, Peoples R China
[5] China Elect Technol Grp Corp, Res Inst 38, Hefei 230088, Peoples R China
[6] Hefei Univ Technol, Dept Biomed Engn, Hefei 230009, Peoples R China
[7] Cent Univ Finance & Econ, Sch Stat & Math, Beijing 102206, Peoples R China
Funding:
National Natural Science Foundation of China;
Keywords:
Image fusion;
infrared image;
transformer;
visible image;
MULTI-FOCUS;
NETWORK;
FRAMEWORK;
DOI:
10.1109/TIM.2023.3312755
CLC Classification Number:
TM [Electrical Technology];
TN [Electronic Technology, Communication Technology];
Subject Classification Number:
0808 ;
0809 ;
Abstract:
Existing transformer-based infrared and visible image fusion methods mainly focus on the intra-modal self-attention correlations within each image; they neglect the inter-modal discrepancies at the same position of the two source images, where the information carried by the infrared token and the visible token is unbalanced. We therefore develop a pure transformer fusion model that reconstructs the fused image in the token dimension, which not only perceives intra-modal long-range dependencies through the transformer's self-attention mechanism, but also captures inter-modal attentive correlations in token space. Moreover, to enhance and balance the interaction of inter-modal tokens when fusing corresponding infrared and visible tokens, learnable attentive weights are applied to dynamically measure the correlation of inter-modal tokens at the same position. Concretely, infrared and visible tokens are first computed by two independent transformers, owing to their modal difference, to extract intra-modal long-range dependencies. Then, the corresponding inter-modal infrared and visible tokens are fused in token space to reconstruct the fused image. In addition, to comprehensively extract multiscale long-range dependencies and capture the attentive correlations of corresponding multimodal tokens at different token sizes, we extend the fusion to multigrained token-based fusion. Ablation studies and extensive experiments illustrate the effectiveness and superiority of our model compared with nine state-of-the-art methods.
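For illustration, below is a minimal PyTorch sketch of the fusion idea the abstract describes: two independent transformer encoders model intra-modal long-range dependencies, and learnable attentive weights balance the corresponding infrared and visible tokens at each position. All module names, dimensions, and the exact form of the weighting are assumptions made for this sketch; it is not the authors' released TFIV implementation.

# Sketch of position-wise token fusion for infrared/visible tokens.
# Hypothetical module; hyperparameters chosen arbitrarily for illustration.
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """Fuse infrared and visible tokens at the same spatial position."""

    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()

        def encoder():
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)

        # Two independent transformers extract intra-modal
        # long-range dependencies, reflecting the modal difference.
        self.ir_encoder = encoder()
        self.vis_encoder = encoder()
        # Learnable attentive weights: score both modalities per position.
        self.weight_head = nn.Linear(2 * dim, 2)

    def forward(self, ir_tokens: torch.Tensor, vis_tokens: torch.Tensor):
        # ir_tokens, vis_tokens: (batch, num_tokens, dim)
        ir = self.ir_encoder(ir_tokens)
        vis = self.vis_encoder(vis_tokens)
        # Per-position softmax weights balance the unbalanced modalities.
        w = torch.softmax(self.weight_head(torch.cat([ir, vis], -1)), -1)
        fused = w[..., 0:1] * ir + w[..., 1:2] * vis
        return fused  # fused tokens, to be reconstructed into an image

ir = torch.randn(1, 196, 256)   # e.g., 14x14 patch tokens, infrared image
vis = torch.randn(1, 196, 256)  # corresponding visible-image tokens
print(TokenFusion()(ir, vis).shape)  # torch.Size([1, 196, 256])

The multigrained extension described in the abstract would presumably apply this same fusion at several token (patch) sizes and merge the results; that stage is omitted from the sketch.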
Pages: 14