From grids to pseudo-regions: Dynamic memory augmented image captioning with dual relation transformer
Cited by: 0
Authors: Zhou, Wei [1]; Jiang, Weitao [1]; Zheng, Zhijie [1]; Li, Jianchao [1]; Su, Tao [1]; Hu, Haifeng [1]
Affiliations: [1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Guangdong, Peoples R China
Funding: National Natural Science Foundation of China;
Keywords: Image captioning; Pseudo-region; Dynamic memory; Cross-modal attention fusion; Transformer
DOI: 10.1016/j.eswa.2025.126850
CLC Number: TP18 [Artificial Intelligence Theory];
Discipline Codes: 081104; 0812; 0835; 1405
Abstract:
Image captioning aims to automatically generate a natural-language description for a given image. Existing methods typically exploit grid-level or region-level features to encode visual information. However, extracting region features with an object detector is computationally expensive and inflexible, and region features are criticized for lacking fine-grained details and background information. Moreover, current Transformer-based captioning models focus only on the pairwise similarity of token features, which makes it difficult to fully understand the complex scene relationships in images. To address these issues, we introduce a novel Dual Relation Transformer (DRTran) model that can be trained end-to-end. Concretely, in the encoding phase, we first adopt a clustering algorithm to generate pseudo-region features, which requires no additional expensive annotations to train an object detector. Then, to combine the advantages of grid and pseudo-region features, we design a new dual relation enhancement (DRE) encoder that captures the correlations between objects from the two kinds of visual features. Furthermore, we devise a novel dynamic memory (DM) module that learns prior knowledge with external dynamic memory vectors. By injecting this prior knowledge into visual relationship modeling, the model learns complex scene representations that improve caption accuracy. During the decoding stage, we design a new cross-modal attention fusion (CAF) module in the language decoder to adaptively determine the attention weights of the enhanced grid and pseudo-region features at each time step. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that our DRTran model outperforms current image captioning methods.
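A rough illustration of the pseudo-region idea described in the abstract: clustering grid features and pooling each cluster into a detector-free region-like feature. This is a minimal sketch, not the authors' implementation; the abstract does not specify the clustering algorithm, so the use of k-means, the feature shapes, and all names below are assumptions for illustration only.

# Illustrative sketch only (assumptions: k-means clustering, a CNN grid
# of 7x7 cells with 2048-d features). Not the paper's code.
import torch
from sklearn.cluster import KMeans

def pseudo_region_features(grid_feats: torch.Tensor, num_regions: int = 10) -> torch.Tensor:
    """Cluster N flattened grid features (N, D) into num_regions pseudo-regions.

    Returns a (num_regions, D) tensor: each row is the mean of the grid
    cells assigned to one cluster, serving as a detector-free
    "pseudo-region" feature.
    """
    labels = KMeans(n_clusters=num_regions, n_init=10).fit_predict(
        grid_feats.detach().cpu().numpy()
    )
    labels = torch.as_tensor(labels, device=grid_feats.device)
    # Average-pool the grid cells belonging to each cluster.
    return torch.stack(
        [grid_feats[labels == k].mean(dim=0) for k in range(num_regions)]
    )

# Example: a 7x7 grid of 2048-d features -> 10 pseudo-region features.
grid = torch.randn(49, 2048)
print(pseudo_region_features(grid, num_regions=10).shape)  # torch.Size([10, 2048])

Average-pooling each cluster is one simple way to realize the property the abstract emphasizes: region-like features obtained without training or annotating an object detector.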
Pages: 13