Improved Image Captioning Using GAN and ViT

Cited by: 0

Authors:

Rao, Vrushank D. [1]

Shashank, B. N. [1]

Bhattu, S. Nagesh [1]

Affiliations:

[1] Natl Inst Technol Andhra Pradesh, Dept Comp Sci & Engn, Tadepalligudem, India

Source:

COMPUTER VISION AND IMAGE PROCESSING, CVIP 2023, PT III | 2024 / Vol. 2011

Keywords:

Vision Transformers; Data2Vec; Image Captioning

DOI:

10.1007/978-3-031-58535-7_31

Chinese Library Classification:

TP18 [Artificial Intelligence Theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Encoder-decoder architectures are widely used for image captioning, most commonly with convolutional encoders and recurrent decoders. Recent transformer-based designs have achieved state-of-the-art (SOTA) performance on a range of language and vision tasks. This work investigates whether a transformer-based encoder and decoder can form an effective image captioning pipeline. An adversarial objective, implemented with a Generative Adversarial Network (GAN), is used to improve the diversity of the generated captions. The generator uses a ViT encoder and a transformer decoder to produce semantically meaningful captions for a given image. To improve the quality and authenticity of the generated captions, we introduce a discriminator built from a transformer decoder. The discriminator evaluates each caption by considering both the image and the caption produced by the generator. By training this architecture, we aim for the generator to produce captions that are indistinguishable from real captions, increasing the overall quality of the outputs. Through extensive experiments, we demonstrate that the approach generates diverse and contextually appropriate captions for various images. We evaluate the model on benchmark datasets and compare it against existing state-of-the-art image captioning methods. The proposed approach outperforms previous methods on caption accuracy metrics such as BLEU-3 and BLEU-4, among other measures.
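
The abstract does not state the exact loss used for the adversarial objective. As a minimal sketch, assuming the standard GAN binary cross-entropy losses with the non-saturating generator variant, the two training signals could be computed as below. The function names and the idea of passing discriminator probabilities as plain lists are illustrative assumptions, not the authors' implementation. Here d_real holds discriminator scores for (image, human caption) pairs and d_fake holds scores for (image, generated caption) pairs.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0); an illustrative choice


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Assumed standard GAN discriminator loss (binary cross-entropy).

    Real (image, caption) pairs are pushed toward a score of 1,
    generated pairs toward a score of 0.
    """
    real_term = _mean([math.log(max(p, EPS)) for p in d_real])
    fake_term = _mean([math.log(max(1.0 - p, EPS)) for p in d_fake])
    return -(real_term + fake_term)


def generator_loss(d_fake: Sequence[float]) -> float:
    """Assumed non-saturating generator loss.

    The generator is rewarded when the discriminator scores its
    captions as real, i.e. when d_fake is close to 1.
    """
    return -_mean([math.log(max(p, EPS)) for p in d_fake])


if __name__ == "__main__":
    # Hypothetical discriminator outputs for a batch of two pairs each.
    print(discriminator_loss([0.9, 0.8], [0.2, 0.3]))
    print(generator_loss([0.2, 0.3]))
```

Because captions are sequences of discrete tokens, a real training loop would also need some way to pass the discriminator's signal back to the generator (for example a policy-gradient or relaxation technique). The abstract does not say which one is used, so that step is left out of this sketch.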

Pages: 375-385

Number of pages: 11
