From grids to pseudo-regions: Dynamic memory augmented image captioning with dual relation transformer
Cited by: 0
Authors: Zhou, Wei [1]; Jiang, Weitao [1]; Zheng, Zhijie [1]; Li, Jianchao [1]; Su, Tao [1]; Hu, Haifeng [1]
Affiliations: [1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Guangdong, Peoples R China
Funding: National Natural Science Foundation of China;
Keywords: Image captioning; Pseudo-region; Dynamic memory; Cross-modal attention fusion; Transformer
DOI: 10.1016/j.eswa.2025.126850
CLC Number: TP18 [Artificial Intelligence Theory];
Discipline Codes: 081104; 0812; 0835; 1405
Abstract:
Image captioning aims to automatically generate a natural-language description for a given image. Existing methods typically exploit grid-level or region-level features to encode visual information. However, extracting region features with an object detector is computationally expensive and inflexible, and region features are criticized for lacking fine-grained details and background information. Moreover, current Transformer-based captioning models focus only on the pairwise similarity of token features, which makes it difficult to fully understand the complex scene relationships in images. To address these issues, we introduce a novel Dual Relation Transformer (DRTran) model that can be trained end-to-end. Concretely, in the encoding phase, we first adopt a clustering algorithm to generate pseudo-region features, which requires no additional expensive annotations to train an object detector. Then, to combine the advantages of grid and pseudo-region features, we design a new dual relation enhancement (DRE) encoder that captures the correlations between objects from the two kinds of visual features. Furthermore, we devise a novel dynamic memory (DM) module that learns prior knowledge with external dynamic memory vectors. By injecting this prior knowledge into visual relationship modeling, the model learns complex scene representations that improve caption accuracy. During the decoding stage, we design a new cross-modal attention fusion (CAF) module in the language decoder to adaptively determine the attention weights of the enhanced grid and pseudo-region features at each time step. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that our DRTran model outperforms current image captioning methods.
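A rough illustration of the pseudo-region idea described in the abstract: clustering grid features and pooling each cluster into a detector-free region-like feature. This is a minimal sketch, not the authors' implementation; the abstract does not specify the clustering algorithm, so the use of k-means, the feature shapes, and all names below are assumptions for illustration only.

# Illustrative sketch only (assumptions: k-means clustering, a CNN grid
# of 7x7 cells with 2048-d features). Not the paper's code.
import torch
from sklearn.cluster import KMeans

def pseudo_region_features(grid_feats: torch.Tensor, num_regions: int = 10) -> torch.Tensor:
    """Cluster N flattened grid features (N, D) into num_regions pseudo-regions.

    Returns a (num_regions, D) tensor: each row is the mean of the grid
    cells assigned to one cluster, serving as a detector-free
    "pseudo-region" feature.
    """
    labels = KMeans(n_clusters=num_regions, n_init=10).fit_predict(
        grid_feats.detach().cpu().numpy()
    )
    labels = torch.as_tensor(labels, device=grid_feats.device)
    # Average-pool the grid cells belonging to each cluster.
    return torch.stack(
        [grid_feats[labels == k].mean(dim=0) for k in range(num_regions)]
    )

# Example: a 7x7 grid of 2048-d features -> 10 pseudo-region features.
grid = torch.randn(49, 2048)
print(pseudo_region_features(grid, num_regions=10).shape)  # torch.Size([10, 2048])

Average-pooling each cluster is one simple way to realize the property the abstract emphasizes: region-like features obtained without training or annotating an object detector.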
Pages: 13