Improved Image Captioning Using GAN and ViT

被引:0
|
作者
Rao, Vrushank D. [1 ]
Shashank, B. N. [1 ]
Bhattu, S. Nagesh [1 ]
机构
[1] Natl Inst Technol Andhra Pradesh, Dept Comp Sci & Engn, Tadepalligudem, India
来源
COMPUTER VISION AND IMAGE PROCESSING, CVIP 2023, PT III | 2024年 / 2011卷
关键词
Vision Transformers; Data2Vec; Image Captioning;
D O I
10.1007/978-3-031-58535-7_31
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Encoder-decoder architectures are widely used in solving image captioning applications. Convolutional encoders and recurrent decoders are prominently used for such applications. Recent advances in transformer-based designs have made SOTA performances in solving various language and vision tasks. This work inspects the research question of using transformer-based encoder and decoder in building an effective pipeline for image captioning. An adversarial objective using a Generative Adversarial Network is used to improve the diversity of the captions generated. The generator component of our model utilizes a ViT encoder and a transformer decoder to generate semantically meaningful captions for a given image. To enhance the quality and authenticity of the generated captions, we introduce a discriminator component built using a transformer decoder. The discriminator evaluates the captions by considering both the image and the caption generated by the generator. By training this architecture, we aim to ensure that the generator produces captions that are indistinguishable from real captions, increasing the overall quality of the generated outputs. Through extensive experimentation, we demonstrate the effectiveness of our approach in generating diverse and contextually appropriate captions for various images. We evaluate our model on benchmark datasets and compare its performance against existing state-of-the-art image captioning methods. The proposed approach has achieved superior results compared to previous methods, as demonstrated by improved caption accuracy metrics such as BLEU-3, BLEU-4, and other relevant accuracy measures.
引用
收藏
页码:375 / 385
页数:11
相关论文
共 50 条
  • [41] Bridging the Gap between Vision and Language Domains for Improved Image Captioning
    Liu, Fenglin
    Wu, Xian
    Ge, Shen
    Zhang, Xiaoyu
    Fan, Wei
    Zou, Yuexian
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4153 - 4161
  • [42] News Image Captioning Based On Text Summarization Using Image As Query
    Chen, Jingqiang
    Hai Zhuge
    2019 15TH INTERNATIONAL CONFERENCE ON SEMANTICS, KNOWLEDGE AND GRIDS (SKG 2019), 2019, : 123 - 126
  • [43] Image and Video Captioning for Apparels Using Deep Learning
    Agarwal, Govind
    Jindal, Kritika
    Chowdhury, Abishi
    Singh, Vishal K.
    Pal, Amrit
    IEEE ACCESS, 2024, 12 : 113138 - 113150
  • [44] Image captioning in Bengali language using visual attention
    Masud, Adiba
    Hosen, Md. Biplob
    Habibullah, Md.
    Anannya, Mehrin
    Kaiser, M. Shamim
    PLOS ONE, 2025, 20 (02):
  • [45] Image Captioning using Reinforcement Learning with BLUDEr Optimization
    Devi, P. R.
    Thrivikraman, V
    Kashyap, D.
    Shylaja, S. S.
    PATTERN RECOGNITION AND IMAGE ANALYSIS, 2020, 30 (04) : 607 - 613
  • [46] Generative image captioning in Urdu using deep learning
    Afzal M.K.
    Shardlow M.
    Tuarob S.
    Zaman F.
    Sarwar R.
    Ali M.
    Aljohani N.R.
    Lytras M.D.
    Nawaz R.
    Hassan S.-U.
    Journal of Ambient Intelligence and Humanized Computing, 2023, 14 (06) : 7719 - 7731
  • [47] Image Captioning using Adversarial Networks and Reinforcement Learning
    Yan, Shiyang
    Wu, Fangyu
    Smith, Jeremy S.
    Lu, Wenjin
    Zhang, Bailing
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 248 - 253
  • [48] Image Captioning using Reinforcement Learning with BLUDEr Optimization
    P. R. Devi
    V. Thrivikraman
    D. Kashyap
    S. S. Shylaja
    Pattern Recognition and Image Analysis, 2020, 30 : 607 - 613
  • [49] Image captioning in Hindi language using transformer networks
    Mishra, Santosh Kumar
    Dhir, Rijul
    Saha, Sriparna
    Bhattacharyya, Pushpak
    Singh, Amit Kumar
    COMPUTERS & ELECTRICAL ENGINEERING, 2021, 92
  • [50] Image captioning using DenseNet network and adaptive attention
    Deng, Zhenrong
    Jiang, Zhouqin
    Lan, Rushi
    Huang, Wenming
    Luo, Xiaonan
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2020, 85