Exploring region relationships implicitly: Image captioning with visual relationship attention

Cited by: 31
Authors
Zhang, Zongjian [1 ]
Wu, Qiang [1 ]
Wang, Yang [1 ]
Chen, Fang [1 ]
Affiliations
[1] Univ Technol Sydney, 15 Broadway, Sydney, NSW, Australia
Keywords
Image captioning; Visual relationship attention; Relationship-level attention; Parallel attention mechanism; Learned spatial constraint
DOI
10.1016/j.imavis.2021.104146
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The visual attention mechanism has been widely used by image captioning models to dynamically attend to related visual regions based on the given language information. This capability allows a trained model to carry out fine-grained image understanding and reasoning. However, existing visual attention models focus only on individual visual regions in the image and on the alignment between the language representation and those individual regions. They do not fully explore the relationships/interactions between visual regions. Furthermore, they do not analyze or explore the alignment for related words/phrases (e.g., verbs or phrasal verbs), which may best describe the relationships/interactions between these visual regions. This causes current image captioning models to produce inaccurate or inappropriate descriptions. Instead of the visual region attention commonly addressed by existing visual attention mechanisms, this paper proposes a novel visual relationship attention via contextualized embeddings of individual regions. It can dynamically explore a related visual relationship existing between multiple regions when generating interaction words. Such a relationship-exploring process is constrained by spatial relationships and driven by the linguistic context of the language decoder. In this work, the new visual relationship attention is designed through a parallel attention mechanism under a learned spatial constraint, in order to map visual relationship information more precisely to the semantic description of that relationship in language. Unlike existing methods for exploring visual relationships, it is trained implicitly through an unsupervised approach, without using any explicit visual relationship annotations. By integrating the newly proposed visual relationship attention with existing visual region attention, our image captioning model can generate high-quality captions. Solid experiments on the MSCOCO dataset demonstrate that the proposed visual relationship attention can effectively boost captioning performance by capturing related visual relationships and generating accurate interaction descriptions. (c) 2021 Elsevier B.V. All rights reserved.
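Working only from the description in the abstract, the core mechanism can be illustrated with a minimal PyTorch sketch: pairwise region features are scored against the decoder's current linguistic context, and a gate learned from pairwise box geometry acts as the soft spatial constraint on which relationships can be attended. All names (RelationshipAttention, spatial_gate, the geometry feature layout) and design choices here are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of relationship-level attention under a learned spatial
# constraint, assembled from the abstract's description only. Every module
# name and feature layout below is an assumption for illustration.
import torch
import torch.nn as nn

class RelationshipAttention(nn.Module):
    def __init__(self, region_dim=2048, hidden_dim=512, geom_dim=8):
        super().__init__()
        self.pair_proj = nn.Linear(2 * region_dim, hidden_dim)   # r_ij from [v_i; v_j]
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)      # decoder context h_t
        self.score = nn.Linear(hidden_dim, 1)
        self.spatial_gate = nn.Sequential(                       # learned spatial constraint
            nn.Linear(geom_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, regions, geometry, h_t):
        # regions:  (N, region_dim) detected region features (e.g. Faster R-CNN)
        # geometry: (N, N, geom_dim) pairwise box features (relative offsets, IoU, ...)
        # h_t:      (hidden_dim,) current hidden state of the language decoder
        n = regions.size(0)
        pairs = torch.cat([regions.unsqueeze(1).expand(n, n, -1),
                           regions.unsqueeze(0).expand(n, n, -1)], dim=-1)
        r = torch.tanh(self.pair_proj(pairs) + self.query_proj(h_t))  # (N, N, H)
        logits = self.score(r).squeeze(-1)                            # (N, N)
        gate = self.spatial_gate(geometry).squeeze(-1)                # (N, N) in (0, 1)
        alpha = torch.softmax((logits * gate).flatten(), dim=0).view(n, n)
        # Weighted sum over all region pairs -> relationship context vector.
        return (alpha.unsqueeze(-1) * self.pair_proj(pairs)).sum(dim=(0, 1))

# Usage with dummy inputs (36 regions, 8-d pairwise geometry, 512-d decoder state):
att = RelationshipAttention()
ctx = att(torch.randn(36, 2048), torch.randn(36, 36, 8), torch.randn(512))
```

Per the abstract, this relationship branch would run in parallel with conventional region-level attention, with both context vectors fed to the language decoder; the sketch covers only the relationship branch.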
Pages: 10
Related Papers (50 in total)
  • [31] Exploring Pairwise Relationships Adaptively From Linguistic Context in Image Captioning
    Zhang, Zongjian
    Wu, Qiang
    Wang, Yang
    Chen, Fang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 3101 - 3113
  • [32] Areas of Attention for Image Captioning
    Pedersoli, Marco
    Lucas, Thomas
    Schmid, Cordelia
    Verbeek, Jakob
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1251 - 1259
  • [33] Image Captioning with Semantic Attention
    You, Quanzeng
    Jin, Hailin
    Wang, Zhaowen
    Fang, Chen
    Luo, Jiebo
    2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 4651 - 4659
  • [34] Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning
    Wang, Jing
    Tang, Jinhui
    Luo, Jiebo
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4346 - 4354
  • [35] Image Captioning Using Region-Based Attention Joint with Time-Varying Attention
    Wang, Weixuan
    Hu, Haifeng
    NEURAL PROCESSING LETTERS, 2019, 50 (01) : 1005 - 1017
  • [36] A Novelty Framework in Image-Captioning with Visual Attention-Based Refined Visual Features
    Thobhani, Alaa
    Zou, Beiji
    Kui, Xiaoyan
    Abdussalam, Amr
    Asim, Muhammad
    Elaffendi, Mohammed
    Shah, Sajid
CMC-COMPUTERS MATERIALS & CONTINUA, 2025, 82 (03) : 3943 - 3964
  • [38] Fine-grained and Semantic-guided Visual Attention for Image Captioning
    Zhang, Zongjian
    Wu, Qiang
    Wang, Yang
    Chen, Fang
    2018 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2018), 2018, : 1709 - 1717
  • [39] Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
    Lu, Jiasen
    Xiong, Caiming
    Parikh, Devi
    Socher, Richard
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3242 - 3250
  • [40] Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system
    Yang, Liang
    Hu, Haifeng
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 189