ragBERT: Relationship-aligned and grammar-wise BERT model for image captioning

Cited by: 0
Authors
Wang, Hengyou [1 ]
Song, Kani [1 ]
Jiang, Xiang [2 ]
He, Zhiquan [3 ]
Affiliations
[1] Beijing Univ Civil Engn & Architecture, Sch Sci, Beijing 100044, Peoples R China
[2] Jiangsu Normal Univ, Sch Comp Sci & Technol, Xuzhou 221116, Peoples R China
[3] Shenzhen Univ, Guangdong Multimedia Informat Serv Engn Technol, Shenzhen 518060, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Relationship tags; Grammar; BERT;
DOI
10.1016/j.imavis.2024.105105
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Image captioning has become one of the most popular research problems in the field of artificial intelligence. Although many studies have achieved excellent results, there are still some challenges: for example, cross-modal feature alignment lacks explicit guidance, and model-generated sentences may contain grammatical errors. In this paper, we propose a relationship-aligned and grammar-wise BERT model, which integrates a relationship exploration module and a grammar enhancement module into a BERT-based model. Specifically, in the relationship exploration module, we design a network that computes the cosine similarity between visual features and word vectors, using relationship tags as anchors to guide semantic alignment. The grammar enhancement module is built on a second BERT, so our framework uses two BERT modules: the first is the main model for generating captions, and the second is an auxiliary model that judges whether the syntax of a generated caption is correct. To validate the performance of our proposed model, we conduct extensive experiments on the MSCOCO, Flickr30k, and Flickr8k datasets. Experimental results show that our proposed method outperforms state-of-the-art approaches.
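The abstract describes relationship tags acting as anchors, with a cosine-similarity score between visual features and tag word vectors guiding cross-modal alignment. The sketch below is not the authors' code; it is a minimal illustration of that idea, where the feature dimensions, the projection layers, and the module name RelationshipAlignment are assumptions made for the example.

```python
# Hypothetical sketch of cosine-similarity alignment between image region features
# and relationship-tag word embeddings, as described in the abstract.
# Dimensions and projections are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationshipAlignment(nn.Module):
    def __init__(self, visual_dim=2048, word_dim=768, shared_dim=512):
        super().__init__()
        # project both modalities into a shared space before comparing them
        self.visual_proj = nn.Linear(visual_dim, shared_dim)
        self.word_proj = nn.Linear(word_dim, shared_dim)

    def forward(self, region_feats, tag_embeddings):
        # region_feats:   (batch, num_regions, visual_dim)  image region features
        # tag_embeddings: (batch, num_tags, word_dim)       relationship-tag word vectors
        v = F.normalize(self.visual_proj(region_feats), dim=-1)
        t = F.normalize(self.word_proj(tag_embeddings), dim=-1)
        # cosine similarity between every region and every relationship tag
        sim = torch.bmm(v, t.transpose(1, 2))  # (batch, num_regions, num_tags)
        return sim

# usage with random tensors standing in for detector and BERT outputs
align = RelationshipAlignment()
regions = torch.randn(2, 36, 2048)   # e.g. 36 detected regions per image
tags = torch.randn(2, 5, 768)        # e.g. 5 relationship-tag embeddings
scores = align(regions, tags)
print(scores.shape)                  # torch.Size([2, 36, 5])
```

Normalizing both projections before the batched matrix product makes each entry of the score matrix a cosine similarity, which can then be used to weight or supervise the alignment between regions and relationship tags.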
Pages: 10