A Novel Cross-Fusion Method of Different Types of Features for Image Captioning

Cited by: 0

Authors:

Lou, Liangshan [1]

Lu, Ke [1,2]

Xue, Jian [1]

Affiliations:

[1] Univ Chinese Acad Sci, Sch Engn Sci, Beijing 100049, Peoples R China

[2] Peng Cheng Lab, Shenzhen 518055, Peoples R China

Source:

2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023

Funding:

National Natural Science Foundation of China; Beijing Natural Science Foundation;

Keywords:

Image Captioning; Region Features; Grid Features; Transformer; Cross-Fusion; LANGUAGE;

DOI:

10.1109/IJCNN54540.2023.10191971

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Multi-modal tasks, including image captioning, are receiving increasing attention. Building on X-Linear attention, we simultaneously introduce grid features and region features extracted by Faster R-CNN. We obtain a global feature vector for each feature type by mean pooling the original features. The two feature types are encoded by two parallel encoders. Each encoder takes two inputs: a set of feature vectors (region or grid) and the corresponding global feature vector. Each encoding layer outputs an encoded global feature vector and a set of encoded feature vectors. We cross-fuse the global feature vector output by each encoding layer for region features with the set of encoded feature vectors for grid features. In the same way, we cross-fuse the global feature vector for grid features with the set of encoded feature vectors for region features. Finally, we fuse the two global feature vectors output by the two encoders into the final global features, and the two sets of encoded feature vectors output by the two encoders into the final visual features. Experimental results on the COCO dataset show that our model achieves new state-of-the-art (SOTA) performance of BLEU-1 81.5%, BLEU-4 40.5%, METEOR 29.6%, and ROUGE 59.5% on the Karpathy test split.
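
The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoders and X-Linear attention are omitted entirely, and because the abstract does not state which fusion operators are used, element-wise addition, averaging, and set concatenation serve here only as placeholder assumptions. The names `mean_pool` and `cross_fuse` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_pool(features: List[Vector]) -> Vector:
    """Average a set of feature vectors into a single global vector."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def cross_fuse(global_vec: Vector, features: List[Vector]) -> List[Vector]:
    """Placeholder fusion (assumption): add the other branch's global
    vector to every feature vector. The real operator is not given
    in the abstract."""
    return [[f + g for f, g in zip(vec, global_vec)] for vec in features]


# Toy inputs: 3 region vectors and 4 grid vectors, each of dimension 2.
region = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grid = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [1.0, 1.0]]

# One global vector per feature type, obtained by mean pooling.
region_global = mean_pool(region)
grid_global = mean_pool(grid)

# Cross-fusion: the region global vector meets the grid features,
# and the grid global vector meets the region features.
grid_fused = cross_fuse(region_global, grid)
region_fused = cross_fuse(grid_global, region)

# Final fusion (assumption): average the two global vectors and
# concatenate the two sets of feature vectors.
final_global = mean_pool([region_global, grid_global])
final_visual = region_fused + grid_fused
```

In this sketch the fusion happens once on raw features; in the paper it is applied to the outputs of each encoding layer.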

Pages: 8

50 items in total

[1] Lightweight Image Super-resolution Reconstruction Based on Cross-fusion of Spatial Features
Zhao X.
Cheng W.
Binggong Xuebao/Acta Armamentarii, 2024, 45 (04): 1273-1284
[2] Multiscale cross-fusion network for hyperspectral image classification
Pan, Haizhu
Zhu, Yuexia
Ge, Haimiao
Liu, Moqi
Shi, Cuiping
EGYPTIAN JOURNAL OF REMOTE SENSING AND SPACE SCIENCES, 2023, 26 (03): 839-850
[3] ACFNet: An adaptive cross-fusion network for infrared and visible image fusion
Chen, Xiaoxuan
Xu, Shuwen
Hu, Shaohai
Ma, Xiaole
PATTERN RECOGNITION, 2025, 159
[4] Cross on Cross Attention: Deep Fusion Transformer for Image Captioning
Zhang, Jing
Xie, Yingshuai
Ding, Weichao
Wang, Zhe
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08): 4257-4268
[5] CCFNet: Collaborative Cross-Fusion Network for Medical Image Segmentation
Chen, Jialu
Yuan, Baohua
ALGORITHMS, 2024, 17 (04)
[6] HYPERSPECTRAL IMAGE DENOISING BASED ON PARALLEL CROSS-FUSION NETWORK
Gong, Zhuoran
Gao, Feng
Dong, Junyu
Qi, Lin
2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022: 1528-1531
[7] A New Cross-Fusion Method to Automatically Determine the Optimal Input Image Pairs for NDVI Spatiotemporal Data Fusion
Chen, Yang
Cao, Ruyin
Chen, Jin
Zhu, Xiaolin
Zhou, Ji
Wang, Guangpeng
Shen, Miaogen
Chen, Xuehong
Yang, Wei
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2020, 58 (07): 5179-5194
[8] Pairwise Guided Multilayer Cross-Fusion Network for Bird Image Recognition
Lei, Jingsheng
Jin, Yao
Huang, Liya
Ji, Yuan
Yang, Shengying
ELECTRONICS, 2023, 12 (18)
[9] Pulmonary Nodule Computed Tomography Image Classification Method Based on Dual-Path Cross-Fusion Network
Yang Ping
Zhang Xin
Wen Fan
Tian Ji
He Ning
LASER & OPTOELECTRONICS PROGRESS, 2024, 61 (08)
[10] Triplet Cross-Fusion Learning for Unpaired Image Denoising in Optical Coherence Tomography
Geng, Mufeng
Meng, Xiangxi
Zhu, Lei
Jiang, Zhe
Gao, Mengdi
Huang, Zhiyu
Qiu, Bin
Hu, Yicheng
Zhang, Yibao
Ren, Qiushi
Lu, Yanye
IEEE TRANSACTIONS ON MEDICAL IMAGING, 2022, 41 (11): 3357-3372
