Relation constraint self-attention for image captioning

Cited by: 16
Authors
Ji, Junzhong [1 ,2 ]
Wang, Mingzhan [1 ,2 ]
Zhang, Xiaodan [1 ,2 ]
Lei, Minglong [1 ,2 ]
Qu, Liangqiong [3 ]
Institutions
[1] Beijing Univ Technol, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligen, Beijing 100124, Peoples R China
[2] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Beijing 100124, Peoples R China
[3] Stanford Univ, Dept Biomed Data Sci, Palo Alto, CA 94304 USA
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Relation constraint self-attention; Scene graph; Transformer;
DOI
10.1016/j.neucom.2022.06.062
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The self-attention based Transformer has been successfully introduced into the encoder-decoder framework of image captioning, where it excels at modeling the inner relations of inputs, i.e., image regions or semantic words. However, the relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attentions. Meanwhile, prior relations, e.g., the visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. Thus, the relation learning of self-attention in image captioning is biased, diluting the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model to enhance the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors. It then builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information into self-attention in the encoder, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of self-attention in the decoder with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of the generated semantic words. Intuitively, RCSA-E endows the model with the ability to decide which regions to omit and which to focus on using visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RCSA. (c) 2022 Elsevier B.V. All rights reserved.
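For intuition, the two constraints described in the abstract can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the authors' implementation: the function names, tensor shapes, the binary relation mask, and the single-head formulation are all assumptions, whereas the paper's RCSA-E and RCSA-D operate inside a full multi-head Transformer encoder-decoder.

```python
# Illustrative sketch only (not the paper's code): single-head versions of
# the two RCSA constraints. Shapes and the binary-mask formulation are
# assumptions for demonstration.
import torch
import torch.nn.functional as F


def rcsa_e_attention(q, k, v, relation_mask):
    """RCSA-E idea: sparsify the attention map with a prior visual
    relation mask derived from the scene graph.

    q, k, v:       (batch, n_regions, d) region features
    relation_mask: (batch, n_regions, n_regions) bool; True where two
                   regions are visually related
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    # Drop attention weights between unrelated regions entirely,
    # yielding a sparse attention map over relevant regions.
    scores = scores.masked_fill(~relation_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def rcsa_d_attention(q, k, v, rel_k, rel_v):
    """RCSA-D idea: extend keys/values with embeddings of prior semantic
    relations so decoding can attend to them directly.

    rel_k, rel_v: (batch, n_relations, d) semantic relation embeddings
    """
    d = q.size(-1)
    k_ext = torch.cat([k, rel_k], dim=1)  # (batch, n + n_relations, d)
    v_ext = torch.cat([v, rel_v], dim=1)
    scores = q @ k_ext.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v_ext


# Toy usage with random tensors.
b, n, m, d = 2, 5, 3, 64
x = torch.randn(b, n, d)
# Random relation mask; self-links kept so no softmax row is all -inf.
mask = (torch.rand(b, n, n) > 0.5) | torch.eye(n, dtype=torch.bool)
print(rcsa_e_attention(x, x, x, mask).shape)            # torch.Size([2, 5, 64])
print(rcsa_d_attention(x, x, x, torch.randn(b, m, d),
                       torch.randn(b, m, d)).shape)     # torch.Size([2, 5, 64])
```

The masking step mirrors the sparse-attention intuition (irrelevant region pairs receive zero weight after the softmax), while the key/value concatenation lets decoder queries attend to prior relation embeddings alongside ordinary tokens.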
Pages: 778-789
Number of pages: 12
Related Papers
50 items in total
  • [21] Context-Aware Group Captioning via Self-Attention and Contrastive Features
    Li, Zhuowan
    Tran, Quan
    Mai, Long
    Lin, Zhe
    Yuille, Alan L.
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 3437 - 3447
  • [22] Multilevel attention and relation network based image captioning model
    Sharma, Himanshu
    Srivastava, Swati
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (07) : 10981 - 11003
  • [23] HIGSA: Human image generation with self-attention
    Wu, Haoran
    He, Fazhi
    Si, Tongzhen
    Duan, Yansong
    Yan, Xiaohu
    ADVANCED ENGINEERING INFORMATICS, 2023, 55
  • [24] A self-attention model for viewport prediction based on distance constraint
    Lan, ChengDong
    Qiu, Xu
    Miao, Chenqi
    Zheng, MengTing
    VISUAL COMPUTER, 2024, 40 (09): : 5997 - 6014
  • [25] Improving Rumor Detection by Image Captioning and Multi-Cell Bi-RNN With Self-Attention in Social Networks
    Wang, Jenq-Haur
    Huang, Chin-Wei
    Norouzi, Mehdi
    INTERNATIONAL JOURNAL OF DATA WAREHOUSING AND MINING, 2022, 18 (01) : 1 - 17
  • [26] Unsupervised Image-to-Image Translation with Self-Attention Networks
    Kang, Taewon
    Lee, Kwang Hee
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP 2020), 2020, : 102 - 108
  • [27] Attention as Relation: Learning Supervised Multi-head Self-Attention for Relation Extraction
    Liu, Jie
    Chen, Shaowei
    Wang, Bingquan
    Zhang, Jiaxin
    Li, Na
    Xu, Tong
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 3787 - 3793
  • [28] Spatial self-attention network with self-attention distillation for fine-grained image recognition
    Baffour, Adu Asare
    Qin, Zhen
    Wang, Yong
    Qin, Zhiguang
    Choo, Kim-Kwang Raymond
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2021, 81
  • [29] A Framework for Image Captioning Based on Relation Network and Multilevel Attention Mechanism
    Sharma, Himanshu
    Srivastava, Swati
    NEURAL PROCESSING LETTERS, 2023, 55 : 5693 - 5715
  • [30] Self-Attention Underwater Image Enhancement by Data Augmentation
    Gao, Yu
    Luo, Huifu
    Zhu, Wei
    Ma, Feng
    Zhao, Jiang
    Qin, Kailin
    PROCEEDINGS OF 2020 3RD INTERNATIONAL CONFERENCE ON UNMANNED SYSTEMS (ICUS), 2020, : 991 - 995