Improved Image Captioning Using GAN and ViT

Cited: 0
Authors
Rao, Vrushank D. [1 ]
Shashank, B. N. [1 ]
Bhattu, S. Nagesh [1 ]
Affiliations
[1] Natl Inst Technol Andhra Pradesh, Dept Comp Sci & Engn, Tadepalligudem, India
Source
COMPUTER VISION AND IMAGE PROCESSING, CVIP 2023, PT III | 2024, Vol. 2011
Keywords
Vision Transformers; Data2Vec; Image Captioning;
DOI
10.1007/978-3-031-58535-7_31
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Encoder-decoder architectures are widely used for image captioning, most prominently with convolutional encoders and recurrent decoders. Recent transformer-based designs have achieved state-of-the-art (SOTA) performance on a range of language and vision tasks. This work investigates whether a transformer-based encoder and decoder can form an effective image-captioning pipeline. An adversarial objective, implemented with a Generative Adversarial Network (GAN), is used to improve the diversity of the generated captions. The generator component of our model combines a ViT encoder with a transformer decoder to generate semantically meaningful captions for a given image. To enhance the quality and authenticity of the generated captions, we introduce a discriminator built from a transformer decoder; it evaluates each caption jointly with the image it describes. By training this architecture adversarially, we push the generator to produce captions that are indistinguishable from real captions, improving the overall quality of its outputs. Through extensive experimentation, we demonstrate the effectiveness of our approach in generating diverse and contextually appropriate captions for a variety of images. We evaluate our model on benchmark datasets against existing state-of-the-art image captioning methods; the proposed approach achieves superior results, as shown by improved scores on BLEU-3, BLEU-4, and other caption-accuracy metrics.
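To make the described architecture concrete, the following is a minimal PyTorch sketch of a ViT-encoder/transformer-decoder generator paired with a transformer-decoder discriminator that scores (image, caption) pairs. All module names (ViTEncoder, CaptionGenerator, CaptionDiscriminator, train_step), sizes, and training details are illustrative assumptions, not the authors' released code; in practice a pretrained ViT backbone would replace the stand-in encoder, and the generator's adversarial update would use REINFORCE or a Gumbel-softmax relaxation, since sampling discrete tokens is non-differentiable.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTEncoder(nn.Module):
    # Stand-in ViT: patchify with a strided conv, then transformer layers.
    # A pretrained backbone (e.g. timm's vit_base_patch16_224 via
    # forward_features) would normally replace this.
    def __init__(self, d_model=256, patch=16, img=224, layers=4, heads=8):
        super().__init__()
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        n_patches = (img // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, images):                                # (B, 3, H, W)
        x = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, d)
        return self.encoder(x + self.pos)                     # patch-token memory

class CaptionGenerator(nn.Module):
    # ViT encoder + transformer decoder producing per-token vocabulary logits.
    def __init__(self, vocab, d_model=256, layers=4, heads=8, max_len=30):
        super().__init__()
        self.vit = ViTEncoder(d_model)
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, images, captions):                      # captions: (B, T) ids
        memory = self.vit(images)
        T = captions.size(1)
        tgt = self.embed(captions) + self.pos[:, :T]
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(captions.device)
        h = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(h)                                   # (B, T, vocab)

class CaptionDiscriminator(nn.Module):
    # Transformer decoder that cross-attends to the image's patch tokens and
    # pools caption states into one real/fake logit for the (image, caption) pair.
    def __init__(self, vocab, d_model=256, layers=2, heads=8):
        super().__init__()
        self.vit = ViTEncoder(d_model)
        self.embed = nn.Embedding(vocab, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.score = nn.Linear(d_model, 1)

    def forward(self, images, captions):
        memory = self.vit(images)
        h = self.decoder(self.embed(captions), memory)
        return self.score(h.mean(dim=1))                      # (B, 1) logit

def train_step(G, D, opt_g, opt_d, images, real_caps):
    # One illustrative step: the generator is updated only with the usual
    # teacher-forcing cross-entropy loss; its adversarial update (e.g.
    # REINFORCE on D's score) is omitted because greedy/sampled tokens do not
    # carry gradients. The discriminator is trained to separate real captions
    # from greedy decodes of the generator.
    logits = G(images, real_caps[:, :-1])                     # predict next tokens
    mle = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          real_caps[:, 1:].reshape(-1))
    fake_caps = logits.argmax(-1).detach()                    # "generated" captions
    ones = torch.ones(images.size(0), 1, device=images.device)
    zeros = torch.zeros(images.size(0), 1, device=images.device)
    d_loss = (F.binary_cross_entropy_with_logits(D(images, real_caps[:, 1:]), ones) +
              F.binary_cross_entropy_with_logits(D(images, fake_caps), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    opt_g.zero_grad(); mle.backward(); opt_g.step()

# Illustrative wiring (hypothetical vocabulary size):
# G = CaptionGenerator(vocab=10000); D = CaptionDiscriminator(vocab=10000)
# train_step(G, D, torch.optim.Adam(G.parameters()),
#            torch.optim.Adam(D.parameters()), images, caps)

Conditioning the discriminator on the same patch-token memory that the generator attends to mirrors the abstract's claim that captions are judged "considering both the image and the caption", rather than on the caption text alone.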
Pages: 375-385
Page count: 11
Related papers (50 total)
  • [31] Shah, Parth; Bakrola, Vishvajit; Pati, Supriya. Image Captioning using Deep Neural Architectures. 2017 INTERNATIONAL CONFERENCE ON INNOVATIONS IN INFORMATION, EMBEDDED AND COMMUNICATION SYSTEMS (ICIIECS), 2017.
  • [32] Suzuki, Taku; Sato, Daisuke; Sugaya, Yoshihiro; Miyazaki, Tomo; Omachi, Shinichiro. Important Region Estimation Using Image Captioning. IEEE ACCESS, 2022, 10: 105546-105555.
  • [33] Nezami, O. M.; Dras, M.; Wan, S.; Paris, C. Image captioning using facial expression and attention. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2020, 68: 661-689.
  • [34] Elbedwehy, Samar; Medhat, T. Improved Arabic image captioning model using feature concatenation with pre-trained word embedding. NEURAL COMPUTING & APPLICATIONS, 2023, 35 (26): 19051-19067.
  • [36] Zhu, Hegui; Wang, Ru; Zhang, Xiangde. Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module. NEURAL PROCESSING LETTERS, 2021, 53 (02): 1101-1118.
  • [37] Wu, Bicheng; Wo, Yan. Incorporating semantic consistency for improved semi-supervised image captioning. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (17): 52931-52955.
  • [38] Santiesteban, Sergio Sanchez; Atito, Sara; Awais, Muhammad; Song, Yi-Zhe; Kittler, Josef. Improved Image Captioning via Knowledge Graph-Augmented Models. 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 2024: 4290-4294.