Vision-Enhanced and Consensus-Aware Transformer for Image Captioning

Cited: 28
Authors
Cao, Shan [1 ,2 ]
An, Gaoyun [1 ,2 ]
Zheng, Zhenxing [1 ,2 ]
Wang, Zhiyong [3 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
[3] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Visualization; Decoding; Semantics; Task analysis; Convolution; Visual perception; Image captioning; vision-enhanced encoder; consensus-aware decoder; consensus knowledge;
DOI
10.1109/TCSVT.2022.3178844
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification Codes
0808 ; 0809 ;
Abstract
Image captioning generates descriptions in a natural language for a given image. Due to its great potential for a wide range of applications, many deep learning-based methods have been proposed. The co-occurrence of words, such as mouse and keyboard, constitutes commonsense knowledge, which is referred to as consensus. However, it is challenging to exploit such commonsense knowledge to produce captions with rich, natural, and meaningful semantics. In this paper, a Vision-enhanced and Consensus-aware Transformer (VCT) is proposed to exploit both visual information and consensus knowledge for image captioning with three key components: a vision-enhanced encoder, a consensus-aware knowledge representation generator, and a consensus-aware decoder. The vision-enhanced encoder extends the vanilla self-attention module with a memory-based attention module and a visual perception module to learn a better visual representation of an image. Specifically, the relationships between regions in an image and the image's global context are leveraged with scene memory in the memory-based attention module. The visual perception module further enhances the correlation among neighboring tokens in both the spatial and channel-wise dimensions. To learn consensus-aware representations, a word correlation graph is constructed by computing the statistical co-occurrence between semantic concepts. Consensus knowledge is then acquired by applying a graph convolutional network in the consensus-aware knowledge representation generator. Finally, such consensus knowledge is integrated into the consensus-aware decoder through consensus memory and a knowledge-based control module to produce a caption. Experimental results on two popular benchmark datasets (MSCOCO and Flickr30k) demonstrate that our proposed model achieves state-of-the-art performance. Extensive ablation studies also validate the effectiveness of each component.
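The consensus-knowledge step described in the abstract, a word correlation graph built from statistical co-occurrence and passed through a graph convolutional network, can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the toy captions, embedding dimension, and the single Kipf-and-Welling-style GCN propagation layer are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy corpus; the paper builds its word correlation graph from
# statistical co-occurrence of semantic concepts in the training captions.
captions = [
    ["mouse", "keyboard", "desk"],
    ["keyboard", "monitor", "desk"],
    ["mouse", "desk"],
]
vocab = sorted({w for cap in captions for w in cap})
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# Symmetric co-occurrence counts: concepts appearing in the same caption.
A = np.zeros((n, n))
for cap in captions:
    for i, u in enumerate(cap):
        for v in cap[i + 1:]:
            A[idx[u], idx[v]] += 1
            A[idx[v], idx[u]] += 1

# Standard GCN propagation: H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)
A_hat = A + np.eye(n)                      # add self-loops
d = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetrically normalized adjacency

rng = np.random.default_rng(0)
X = rng.normal(size=(n, 8))                # initial concept embeddings (toy dim)
W = rng.normal(size=(8, 8))                # learnable weight (random here)
H = np.maximum(A_norm @ X @ W, 0.0)        # consensus-aware concept features
print(H.shape)  # (4, 8)
```

Each row of `H` mixes a concept's embedding with those of its frequent co-occurrence neighbors, which is the sense in which the resulting representations encode consensus.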
Pages: 7005 - 7018
Page count: 14
Related Papers
50 records in total
  • [31] VTOUCH: VISION-ENHANCED INTERACTION FOR LARGE TOUCH DISPLAYS
    Chen, Yinpeng
    Liu, Zicheng
    Chou, Phil
    Zhang, Zhengyou
    2015 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME), 2015,
  • [32] Reinforced Transformer for Medical Image Captioning
    Xiong, Yuxuan
    Du, Bo
    Yan, Pingkun
    MACHINE LEARNING IN MEDICAL IMAGING (MLMI 2019), 2019, 11861 : 673 - 680
  • [33] Transformer with a Parallel Decoder for Image Captioning
    Wei, Peilang
    Liu, Xu
    Luo, Jun
    Pu, Huayan
    Huang, Xiaoxu
    Wang, Shilong
    Cao, Huajun
    Yang, Shouhong
    Zhuang, Xu
    Wang, Jason
    Yue, Hong
    Ji, Cheng
    Zhou, Mingliang
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2024, 38 (01)
  • [34] ReFormer: The Relational Transformer for Image Captioning
    Yang, Xuewen
    Liu, Yingru
    Wang, Xin
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5398 - 5406
  • [35] Image captioning with transformer and knowledge graph
    Zhang, Yu
    Shi, Xinyu
    Mi, Siya
    Yang, Xu
    PATTERN RECOGNITION LETTERS, 2021, 143 : 43 - 49
  • [36] Complementary Shifted Transformer for Image Captioning
    Yanbo Liu
    You Yang
    Ruoyu Xiang
    Jixin Ma
    Neural Processing Letters, 2023, 55 : 8339 - 8363
  • [37] Direction Relation Transformer for Image Captioning
    Song, Zeliang
    Zhou, Xiaofei
    Dong, Linhua
    Tan, Jianlong
    Guo, Li
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5056 - 5064
  • [38] ETransCap: efficient transformer for image captioning
    Mundu, Albert
    Singh, Satish Kumar
    Dubey, Shiv Ram
    APPLIED INTELLIGENCE, 2024, 54 (21) : 10748 - 10762
  • [39] Recurrent fusion transformer for image captioning
    Mou, Zhenping
    Yuan, Qiao
    Song, Tianqi
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [40] Uncertainty-Aware Image Captioning
    Fei, Zhengcong
    Fan, Mingyuan
    Zhu, Li
    Huang, Junshi
    Wei, Xiaoming
    Wei, Xiaolin
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 614 - 622