Relation constraint self-attention for image captioning

Cited by: 16
Authors
Ji, Junzhong [1 ,2 ]
Wang, Mingzhan [1 ,2 ]
Zhang, Xiaodan [1 ,2 ]
Lei, Minglong [1 ,2 ]
Qu, Liangqiong [3 ]
Institutions
[1] Beijing Univ Technol, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligen, Beijing 100124, Peoples R China
[2] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Beijing 100124, Peoples R China
[3] Stanford Univ, Dept Biomed Data Sci, Palo Alto, CA 94304 USA
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Relation constraint self-attention; Scene graph; Transformer;
DOI
10.1016/j.neucom.2022.06.062
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The self-attention based Transformer has been successfully introduced into the encoder-decoder framework of image captioning, where it excels at modeling the inner relations of inputs, i.e., image regions or semantic words. However, the relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attentions. Meanwhile, prior relations, e.g., the visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. Thus, the relation learning of self-attention in image captioning is biased, diluting the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model to enhance the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors. It then builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information into self-attention in the encoder, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of self-attention in the decoder with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of the generated semantic words. Intuitively, RCSA-E endows the model with the ability to decide which regions to omit and which to focus on using visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RCSA. (c) 2022 Elsevier B.V. All rights reserved.
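For intuition, the two constraints described in the abstract can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the authors' implementation: the function names, tensor shapes, the binary relation mask, and the single-head formulation are all assumptions, whereas the paper's RCSA-E and RCSA-D operate inside a full multi-head Transformer encoder-decoder.

```python
# Illustrative sketch only (not the paper's code): single-head versions of
# the two RCSA constraints. Shapes and the binary-mask formulation are
# assumptions for demonstration.
import torch
import torch.nn.functional as F


def rcsa_e_attention(q, k, v, relation_mask):
    """RCSA-E idea: sparsify the attention map with a prior visual
    relation mask derived from the scene graph.

    q, k, v:       (batch, n_regions, d) region features
    relation_mask: (batch, n_regions, n_regions) bool; True where two
                   regions are visually related
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    # Drop attention weights between unrelated regions entirely,
    # yielding a sparse attention map over relevant regions.
    scores = scores.masked_fill(~relation_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def rcsa_d_attention(q, k, v, rel_k, rel_v):
    """RCSA-D idea: extend keys/values with embeddings of prior semantic
    relations so decoding can attend to them directly.

    rel_k, rel_v: (batch, n_relations, d) semantic relation embeddings
    """
    d = q.size(-1)
    k_ext = torch.cat([k, rel_k], dim=1)  # (batch, n + n_relations, d)
    v_ext = torch.cat([v, rel_v], dim=1)
    scores = q @ k_ext.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v_ext


# Toy usage with random tensors.
b, n, m, d = 2, 5, 3, 64
x = torch.randn(b, n, d)
# Random relation mask; self-links kept so no softmax row is all -inf.
mask = (torch.rand(b, n, n) > 0.5) | torch.eye(n, dtype=torch.bool)
print(rcsa_e_attention(x, x, x, mask).shape)            # torch.Size([2, 5, 64])
print(rcsa_d_attention(x, x, x, torch.randn(b, m, d),
                       torch.randn(b, m, d)).shape)     # torch.Size([2, 5, 64])
```

The masking step mirrors the sparse-attention intuition (irrelevant region pairs receive zero weight after the softmax), while the key/value concatenation lets decoder queries attend to prior relation embeddings alongside ordinary tokens.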
Pages: 778-789
Number of pages: 12
Related Papers
50 items in total
  • [21] Context-Aware Group Captioning via Self-Attention and Contrastive Features
    Li, Zhuowan
    Tran, Quan
    Mai, Long
    Lin, Zhe
    Yuille, Alan L.
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 3437 - 3447
  • [22] Multilevel attention and relation network based image captioning model
    Sharma, Himanshu
    Srivastava, Swati
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (07) : 10981 - 11003
  • [23] HIGSA: Human image generation with self-attention
    Wu, Haoran
    He, Fazhi
    Si, Tongzhen
    Duan, Yansong
    Yan, Xiaohu
    ADVANCED ENGINEERING INFORMATICS, 2023, 55
  • [24] A self-attention model for viewport prediction based on distance constraint
    Lan, ChengDong
    Qiu, Xu
    Miao, Chenqi
    Zheng, MengTing
    VISUAL COMPUTER, 2024, 40 (09): : 5997 - 6014
  • [25] Improving Rumor Detection by Image Captioning and Multi-Cell Bi-RNN With Self-Attention in Social Networks
    Wang, Jenq-Haur
    Huang, Chin-Wei
    Norouzi, Mehdi
    INTERNATIONAL JOURNAL OF DATA WAREHOUSING AND MINING, 2022, 18 (01) : 1 - 17
  • [26] Unsupervised Image-to-Image Translation with Self-Attention Networks
    Kang, Taewon
    Lee, Kwang Hee
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP 2020), 2020, : 102 - 108
  • [27] Attention as Relation: Learning Supervised Multi-head Self-Attention for Relation Extraction
    Liu, Jie
    Chen, Shaowei
    Wang, Bingquan
    Zhang, Jiaxin
    Li, Na
    Xu, Tong
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 3787 - 3793
  • [28] Spatial self-attention network with self-attention distillation for fine-grained image recognition
    Baffour, Adu Asare
    Qin, Zhen
    Wang, Yong
    Qin, Zhiguang
    Choo, Kim-Kwang Raymond
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2021, 81
  • [29] A Framework for Image Captioning Based on Relation Network and Multilevel Attention Mechanism
    Sharma, Himanshu
    Srivastava, Swati
    NEURAL PROCESSING LETTERS, 2023, 55 : 5693 - 5715
  • [30] Self-Attention Underwater Image Enhancement by Data Augmentation
    Gao, Yu
    Luo, Huifu
    Zhu, Wei
    Ma, Feng
    Zhao, Jiang
    Qin, Kailin
    PROCEEDINGS OF 2020 3RD INTERNATIONAL CONFERENCE ON UNMANNED SYSTEMS (ICUS), 2020, : 991 - 995