Improved Image Captioning Using GAN and ViT

Times cited: 0
Authors
Rao, Vrushank D. [1 ]
Shashank, B. N. [1 ]
Bhattu, S. Nagesh [1 ]
Affiliations
[1] Natl Inst Technol Andhra Pradesh, Dept Comp Sci & Engn, Tadepalligudem, India
Source
COMPUTER VISION AND IMAGE PROCESSING, CVIP 2023, PT III | 2024, Vol. 2011
Keywords
Vision Transformers; Data2Vec; Image Captioning;
DOI
10.1007/978-3-031-58535-7_31
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Encoder-decoder architectures are widely used for image captioning, with convolutional encoders and recurrent decoders being the prominent choices. Recent transformer-based designs have achieved state-of-the-art performance on a range of language and vision tasks. This work investigates the research question of whether a transformer-based encoder and decoder can form an effective pipeline for image captioning. An adversarial objective, implemented with a Generative Adversarial Network (GAN), is used to improve the diversity of the generated captions. The generator component of our model combines a ViT encoder with a transformer decoder to produce semantically meaningful captions for a given image. To enhance the quality and authenticity of the generated captions, we introduce a discriminator component built from a transformer decoder, which evaluates each caption jointly with its image. By training this architecture adversarially, we aim to ensure that the generator produces captions indistinguishable from real ones, raising the overall quality of the generated outputs. Through extensive experimentation, we demonstrate the effectiveness of our approach in generating diverse and contextually appropriate captions for a variety of images. We evaluate the model on benchmark datasets and compare its performance against existing state-of-the-art image captioning methods; the proposed approach achieves superior results, as shown by improved caption accuracy metrics such as BLEU-3, BLEU-4, and other relevant measures.
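The abstract's architecture can be sketched in a few modules: a ViT-style patch encoder feeding a transformer decoder as the caption generator, and a second transformer decoder that scores (image, caption) pairs as the discriminator. The following is a minimal illustrative sketch in PyTorch; all dimensions, layer counts, and class names are assumptions for clarity, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT front end: split the image into patches, project each to d_model."""
    def __init__(self, patch=16, d_model=256):
        super().__init__()
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, x):                       # (B, 3, H, W)
        x = self.proj(x)                        # (B, d_model, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, d_model)

class Generator(nn.Module):
    """ViT encoder + transformer decoder emitting caption-token logits."""
    def __init__(self, vocab=1000, d_model=256):
        super().__init__()
        self.patches = PatchEmbed(d_model=d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.embed = nn.Embedding(vocab, d_model)
        dec = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, images, tokens):
        memory = self.encoder(self.patches(images))   # image features
        tgt = self.embed(tokens)
        # causal mask so each position only attends to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                         # (B, T, vocab)

class Discriminator(nn.Module):
    """Transformer decoder that cross-attends from caption tokens to image
    patches and outputs one real/fake logit per (image, caption) pair."""
    def __init__(self, vocab=1000, d_model=256):
        super().__init__()
        self.patches = PatchEmbed(d_model=d_model)
        self.embed = nn.Embedding(vocab, d_model)
        dec = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.score = nn.Linear(d_model, 1)

    def forward(self, images, tokens):
        out = self.decoder(self.embed(tokens), self.patches(images))
        return self.score(out.mean(dim=1))            # (B, 1) logit
```

Under the adversarial objective, the discriminator's logit on generated captions supplies the generator's extra loss term alongside the usual token-level cross-entropy; for a batch of 224x224 images and 12-token captions, `Generator()(imgs, caps)` yields logits of shape `(B, 12, vocab)` and `Discriminator()(imgs, caps)` a `(B, 1)` score.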
Pages: 375-385 (11 pages)