GVA: guided visual attention approach for automatic image caption generation

Cited by: 10
|
Authors
Hossen, Md. Bipul [1 ]
Ye, Zhongfu [1 ]
Abdussalam, Amr [1 ]
Hossain, Md. Imran [2 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Pabna Univ Sci & Technol, Dept ICE, Pabna 6600, Bangladesh
Keywords
Image captioning; Faster R-CNN; LSTM; Up-down model; Encoder-decoder framework;
DOI
10.1007/s00530-023-01249-w
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Automated image caption generation with attention mechanisms focuses on visual features of an image, including objects, attributes, actions, and scenes, to understand the content and produce more detailed captions, and it has attracted considerable attention in the multimedia field. However, deciding which aspects of an image to highlight for better captioning remains a challenge. Most advanced captioning models use only a single attention module to assign weights to the visual feature vectors, which may not be enough to generate an informative caption. To tackle this issue, we propose a Guided Visual Attention (GVA) approach that incorporates an additional attention mechanism to re-adjust the attention weights over the visual feature vectors and feeds the resulting context vector to the language LSTM. Using the first-level attention module as guidance for the GVA module and re-weighting the attention weights significantly enhances caption quality. The approach builds on the encoder-decoder architecture with visual attention enabled by deep neural networks: Faster R-CNN extracts region features in the encoder, and a visual attention-based LSTM generates words in the decoder. Extensive experiments were conducted on the MS-COCO and Flickr30k benchmark datasets. Under cross-entropy optimization, our approach achieved an average improvement over state-of-the-art methods of 2.4% on BLEU@1 and 13.24% on CIDEr for MS-COCO, and 4.6% on BLEU@1 and 12.48% on CIDEr for Flickr30k. These results demonstrate the clear superiority of the proposed approach over existing methods on standard evaluation metrics. The implementation code is available at https://github.com/mdbipu/GVA.
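For illustration, below is a minimal PyTorch sketch of a two-level, guided attention module in the spirit of the abstract: a first additive attention over the Faster R-CNN region features produces a context vector, which then guides a second attention pass whose re-weighted context is fed to the language LSTM. The class names, dimensions, and the exact way the guidance signal is combined are assumptions for this sketch, not the authors' implementation (see their repository for the actual code).

# Minimal sketch (not the authors' code): two-level guided attention
# in the spirit of GVA, built on an up-down style captioning decoder.
# Names, dimensions, and the guidance scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Standard soft (additive) attention over region features V given a query h."""
    def __init__(self, feat_dim, query_dim, att_dim):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, att_dim)
        self.w_h = nn.Linear(query_dim, att_dim)
        self.w_a = nn.Linear(att_dim, 1)

    def forward(self, V, h):
        # V: (batch, num_regions, feat_dim), h: (batch, query_dim)
        scores = self.w_a(torch.tanh(self.w_v(V) + self.w_h(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)          # (batch, num_regions, 1)
        context = (alpha * V).sum(dim=1)          # (batch, feat_dim)
        return context, alpha

class GuidedVisualAttention(nn.Module):
    """First-level attention guides a second pass that re-weights the region
    features before the language LSTM (hypothetical layout)."""
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.first_att = AdditiveAttention(feat_dim, hid_dim, att_dim)
        self.guided_att = AdditiveAttention(feat_dim, hid_dim + feat_dim, att_dim)

    def forward(self, V, h_att):
        # h_att: hidden state of the attention LSTM in an up-down decoder
        ctx1, _ = self.first_att(V, h_att)
        # Concatenate the first-level context with the query as guidance,
        # so the second pass re-adjusts the attention weights over V.
        guide = torch.cat([h_att, ctx1], dim=-1)
        ctx2, alpha2 = self.guided_att(V, guide)
        return ctx2, alpha2                        # ctx2 goes to the language LSTM

In an up-down style decoder, ctx2 would take the place of the single-attention context vector that is normally concatenated with the attention-LSTM state as input to the language LSTM; this is one plausible reading of how the guided re-weighting fits into the pipeline described above.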
Pages: 16