Exploring region relationships implicitly: Image captioning with visual relationship attention

Cited by: 31
Authors
Zhang, Zongjian [1 ]
Wu, Qiang [1 ]
Wang, Yang [1 ]
Chen, Fang [1 ]
Affiliations
[1] University of Technology Sydney, 15 Broadway, Sydney, NSW, Australia
Keywords
Image captioning; Visual relationship attention; Relationship-level attention; Parallel attention mechanism; Learned spatial constraint
DOI
10.1016/j.imavis.2021.104146
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual attention mechanisms have been widely used in image captioning models to dynamically attend to related visual regions based on the given language information. This capability allows a trained model to carry out fine-grained image understanding and reasoning. However, existing visual attention models focus only on individual visual regions in the image and on the alignment between the language representation and those individual regions. They do not fully explore the relationships/interactions between visual regions, nor do they analyze or explore the alignment for the related words/phrases (e.g. verbs or phrasal verbs) that may best describe those relationships/interactions. This limitation leads current image captioning models to produce inaccurate or inappropriate descriptions. Instead of the visual region attention commonly addressed by existing visual attention mechanisms, this paper proposes a novel visual relationship attention via contextualized embeddings for individual regions. It can dynamically explore a related visual relationship existing between multiple regions when generating interaction words. This relationship-exploring process is constrained by spatial relationships and driven by the linguistic context of the language decoder. In this work, the new visual relationship attention is designed as a parallel attention mechanism under a learned spatial constraint, in order to map visual relationship information more precisely to the semantic description of that relationship in language. Unlike existing methods for exploring visual relationships, it is trained implicitly in an unsupervised manner, without using any explicit visual relationship annotations. By integrating the newly proposed visual relationship attention with existing visual region attention, our image captioning model can generate high-quality captions. Extensive experiments on the MSCOCO dataset demonstrate that the proposed visual relationship attention can effectively boost captioning performance by capturing related visual relationships for generating accurate interaction descriptions. (c) 2021 Elsevier B.V. All rights reserved.
Pages: 10
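Since the abstract only describes the mechanism at a high level, the following is a minimal, hypothetical PyTorch sketch of what relationship-level attention over region pairs under a spatial constraint, driven by the language decoder's context, could look like. All module names, dimensions, and the pairwise geometry feature are assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualRelationshipAttention(nn.Module):
    """Sketch: attend over ordered region pairs (candidate relationships),
    with scores modulated by a learned spatial term and the decoder state."""

    def __init__(self, region_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.pair_proj = nn.Linear(2 * region_dim, attn_dim)  # contextualized pair embedding
        self.ctx_proj = nn.Linear(hidden_dim, attn_dim)       # decoder linguistic context
        self.spatial_proj = nn.Linear(8, attn_dim)            # learned spatial constraint (assumed form)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, boxes, h_t):
        # regions: (B, N, region_dim) detected region features
        # boxes:   (B, N, 4) normalized region boxes (x1, y1, x2, y2)
        # h_t:     (B, hidden_dim) current decoder hidden state
        B, N, D = regions.shape
        # Build all N*N ordered (subject, object) region pairs.
        subj = regions.unsqueeze(2).expand(B, N, N, D)
        obj = regions.unsqueeze(1).expand(B, N, N, D)
        pairs = torch.cat([subj, obj], dim=-1).view(B, N * N, 2 * D)
        # Simple pairwise geometry: concatenated subject/object boxes.
        geo = torch.cat([boxes.unsqueeze(2).expand(B, N, N, 4),
                         boxes.unsqueeze(1).expand(B, N, N, 4)], dim=-1)
        geo = geo.view(B, N * N, 8)
        # Additive attention over pairs, under the spatial term,
        # conditioned on the decoder context.
        e = self.score(torch.tanh(self.pair_proj(pairs)
                                  + self.spatial_proj(geo)
                                  + self.ctx_proj(h_t).unsqueeze(1)))  # (B, N*N, 1)
        alpha = F.softmax(e, dim=1)
        # Attended relationship feature to feed back to the decoder.
        return (alpha * self.pair_proj(pairs)).sum(dim=1)  # (B, attn_dim)

# Example usage with random inputs (36 regions per image):
attn = VisualRelationshipAttention()
out = attn(torch.randn(2, 36, 2048), torch.rand(2, 36, 4), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 512])
```

Per the abstract, a full captioning model would run such a relationship-attention branch in parallel with standard region-level attention and fuse both attended features at each decoding step; no explicit relationship annotations are needed, since the pair attention is learned implicitly from the captioning objective.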