Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

Cited by: 12
Authors
Zhang, Huawei [1]
Ma, Chengbo [1]
Jiang, Zhanjun [1]
Lian, Jing [1]
Affiliation
[1] Lanzhou Jiaotong Univ, Elect & Informat Engn, Lanzhou 730000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Semantics; Visualization; Data mining; Decoding; Task analysis; Logic gates; Bi-LSTM; image caption generation; semantic fusion; semantic similarity
DOI
10.1109/ACCESS.2022.3232508
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Image caption generation requires describing image content in accurate natural language. In the existing encoder-decoder structure, the decoder generates words one by one in front-to-back order and cannot analyze the full contextual information. This paper employs a Bi-LSTM (Bi-directional Long Short-Term Memory) structure, which draws on both past and subsequent information, so that image content is predicted from the context clues. The visual information is fed separately into the F-LSTM decoder (forward LSTM decoder) and the B-LSTM decoder (backward LSTM decoder) to extract semantic information, and the two semantic outputs complement each other. Specifically, a subsidiary attention mechanism, S-Att, acts between F-LSTM and B-LSTM, and the semantic information of B-LSTM and F-LSTM is extracted using this attention mechanism. The semantic interaction is then extracted according to similarity while the hidden states are aligned, yielding the fused semantic output. The resulting Bi-LSTM-s model extracts contextual information and achieves finer-grained image captioning. Our model improves on the original LSTM by 9.7%. It also addresses the problem of inconsistent semantic information between the forward and backward directions at the same position in the sequence, and achieves a BLEU-4 score of 37.5. Experiments on the MSCOCO dataset demonstrate the superiority of this approach.
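The idea of aligning hidden states and fusing them according to similarity can be illustrated with a minimal sketch. This is not the paper's S-Att mechanism: the cosine-similarity weighting, the function names `cosine` and `fuse_states`, and the rule of falling back to the forward state when the two directions disagree are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fuse_states(forward_states, backward_states):
    """Fuse per-step hidden states from a forward and a backward decoder.

    backward_states is assumed to be in generation order (last word first),
    so it is reversed to align each state with the same word position.
    """
    if len(forward_states) != len(backward_states):
        raise ValueError("both decoders must cover the same number of steps")
    aligned = list(reversed(backward_states))
    fused = []
    for h_f, h_b in zip(forward_states, aligned):
        # Map similarity from [-1, 1] to an agreement score in [0, 1].
        agreement = (1.0 + cosine(h_f, h_b)) / 2.0
        # Hypothetical rule: average when the directions agree,
        # keep the forward state when they disagree.
        alpha = 0.5 * agreement
        fused.append([(1.0 - alpha) * a + alpha * b for a, b in zip(h_f, h_b)])
    return fused


if __name__ == "__main__":
    forward = [[1.0, 0.0], [0.0, 1.0]]
    backward = [[0.0, 1.0], [1.0, 0.0]]  # generation order, last position first
    print(fuse_states(forward, backward))
```

In a trained model the weighting would be learned rather than fixed; the sketch only shows the align-then-weight structure on plain Python lists.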
Pages: 134-143
Number of pages: 10