Overview of Image Captions Based on Deep Learning

Cited by: 0
Authors
Shi Y.-L. [1 ]
Yang W.-Z. [2 ]
Du H.-X. [1 ]
Wang L.-H. [1 ]
Wang T. [1 ]
Li S.-S. [1 ]
Affiliations
[1] Key Laboratory of Software Engineering Technology, Xinjiang University, Urumqi
[2] School of Information Science and Engineering, Xinjiang University, Urumqi
Keywords
Attention mechanism; Encoder-decoder framework; Intelligence-image understanding; Reinforcement learning
DOI
10.12263/DZXB.20200669
Abstract
Image captioning aims to extract features from an image and feed them to a language-generation model that outputs a description of the image, addressing image understanding at the intersection of natural language processing and computer vision in artificial intelligence. This survey summarizes and analyzes representative image-captioning papers from 2015 to 2020. Taking the core technique as the classification criterion, the work can be roughly divided into five categories: image captioning based on the encoder-decoder framework, on attention mechanisms, on reinforcement learning, on generative adversarial networks, and on newly fused data sets. Experiments with three models (NIC, Hard-Attention, and NeuralTalk) are conducted on the real-world MS-COCO data set, and their average BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores are compared to show the effects of the three models. The article also points out future development trends of image captioning, the challenges it will face, and research directions that remain to be explored. © 2021, Chinese Institute of Electronics. All rights reserved.
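The BLEU-1 through BLEU-4 metrics mentioned in the abstract are built on modified n-gram precision between a candidate caption and a reference caption. As a minimal sketch (the full BLEU score additionally combines these precisions via a geometric mean and applies a brevity penalty; the example sentences below are illustrative, not from the paper):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision: candidate n-gram counts are clipped
    by the counts observed in the reference caption."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count so repeated words cannot inflate the score.
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return clipped / total

# Hypothetical caption pair for illustration.
candidate = "a dog runs on the grass".split()
reference = "a dog is running on the grass".split()

for n in range(1, 5):  # the n = 1..4 precisions underlying BLEU-1..BLEU-4
    print(f"BLEU-{n} precision: {ngram_precision(candidate, reference, n):.3f}")
```

In practice, survey experiments such as those described above use the standard MS-COCO evaluation tooling rather than a hand-rolled metric; this sketch only shows why higher-order BLEU scores drop sharply when longer word sequences fail to match.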
Pages: 2048-2060
Page count: 12