Improved Image Captioning Using GAN and ViT

Cited: 0
Authors
Rao, Vrushank D. [1 ]
Shashank, B. N. [1 ]
Bhattu, S. Nagesh [1 ]
Affiliations
[1] Natl Inst Technol Andhra Pradesh, Dept Comp Sci & Engn, Tadepalligudem, India
Source
COMPUTER VISION AND IMAGE PROCESSING, CVIP 2023, PT III | 2024, Vol. 2011
Keywords
Vision Transformers; Data2Vec; Image Captioning;
DOI
10.1007/978-3-031-58535-7_31
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Encoder-decoder architectures are widely used for image captioning, most prominently with convolutional encoders and recurrent decoders. Recent transformer-based designs have achieved state-of-the-art (SOTA) performance on a range of language and vision tasks. This work investigates whether a transformer-based encoder and decoder can form an effective image-captioning pipeline. An adversarial objective, implemented with a Generative Adversarial Network (GAN), is used to improve the diversity of the generated captions. The generator component of our model combines a ViT encoder with a transformer decoder to generate semantically meaningful captions for a given image. To enhance the quality and authenticity of the generated captions, we introduce a discriminator built from a transformer decoder; it evaluates each caption jointly with the image it describes. By training this architecture adversarially, we push the generator to produce captions that are indistinguishable from real captions, improving the overall quality of its outputs. Through extensive experimentation, we demonstrate the effectiveness of our approach in generating diverse and contextually appropriate captions for a variety of images. We evaluate our model on benchmark datasets against existing state-of-the-art image captioning methods; the proposed approach achieves superior results, as shown by improved scores on BLEU-3, BLEU-4, and other caption-accuracy metrics.
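To make the described architecture concrete, the following is a minimal PyTorch sketch of a ViT-encoder/transformer-decoder generator paired with a transformer-decoder discriminator that scores (image, caption) pairs. All module names (ViTEncoder, CaptionGenerator, CaptionDiscriminator, train_step), sizes, and training details are illustrative assumptions, not the authors' released code; in practice a pretrained ViT backbone would replace the stand-in encoder, and the generator's adversarial update would use REINFORCE or a Gumbel-softmax relaxation, since sampling discrete tokens is non-differentiable.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTEncoder(nn.Module):
    # Stand-in ViT: patchify with a strided conv, then transformer layers.
    # A pretrained backbone (e.g. timm's vit_base_patch16_224 via
    # forward_features) would normally replace this.
    def __init__(self, d_model=256, patch=16, img=224, layers=4, heads=8):
        super().__init__()
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        n_patches = (img // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, images):                                # (B, 3, H, W)
        x = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, d)
        return self.encoder(x + self.pos)                     # patch-token memory

class CaptionGenerator(nn.Module):
    # ViT encoder + transformer decoder producing per-token vocabulary logits.
    def __init__(self, vocab, d_model=256, layers=4, heads=8, max_len=30):
        super().__init__()
        self.vit = ViTEncoder(d_model)
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, images, captions):                      # captions: (B, T) ids
        memory = self.vit(images)
        T = captions.size(1)
        tgt = self.embed(captions) + self.pos[:, :T]
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(captions.device)
        h = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(h)                                   # (B, T, vocab)

class CaptionDiscriminator(nn.Module):
    # Transformer decoder that cross-attends to the image's patch tokens and
    # pools caption states into one real/fake logit for the (image, caption) pair.
    def __init__(self, vocab, d_model=256, layers=2, heads=8):
        super().__init__()
        self.vit = ViTEncoder(d_model)
        self.embed = nn.Embedding(vocab, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.score = nn.Linear(d_model, 1)

    def forward(self, images, captions):
        memory = self.vit(images)
        h = self.decoder(self.embed(captions), memory)
        return self.score(h.mean(dim=1))                      # (B, 1) logit

def train_step(G, D, opt_g, opt_d, images, real_caps):
    # One illustrative step: the generator is updated only with the usual
    # teacher-forcing cross-entropy loss; its adversarial update (e.g.
    # REINFORCE on D's score) is omitted because greedy/sampled tokens do not
    # carry gradients. The discriminator is trained to separate real captions
    # from greedy decodes of the generator.
    logits = G(images, real_caps[:, :-1])                     # predict next tokens
    mle = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          real_caps[:, 1:].reshape(-1))
    fake_caps = logits.argmax(-1).detach()                    # "generated" captions
    ones = torch.ones(images.size(0), 1, device=images.device)
    zeros = torch.zeros(images.size(0), 1, device=images.device)
    d_loss = (F.binary_cross_entropy_with_logits(D(images, real_caps[:, 1:]), ones) +
              F.binary_cross_entropy_with_logits(D(images, fake_caps), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    opt_g.zero_grad(); mle.backward(); opt_g.step()

# Illustrative wiring (hypothetical vocabulary size):
# G = CaptionGenerator(vocab=10000); D = CaptionDiscriminator(vocab=10000)
# train_step(G, D, torch.optim.Adam(G.parameters()),
#            torch.optim.Adam(D.parameters()), images, caps)

Conditioning the discriminator on the same patch-token memory that the generator attends to mirrors the abstract's claim that captions are judged "considering both the image and the caption", rather than on the caption text alone.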
Pages: 375-385
Page count: 11
Related papers (50 total)
  • [31] Shah, Parth; Bakrola, Vishvajit; Pati, Supriya. Image Captioning using Deep Neural Architectures. 2017 INTERNATIONAL CONFERENCE ON INNOVATIONS IN INFORMATION, EMBEDDED AND COMMUNICATION SYSTEMS (ICIIECS), 2017.
  • [32] Suzuki, Taku; Sato, Daisuke; Sugaya, Yoshihiro; Miyazaki, Tomo; Omachi, Shinichiro. Important Region Estimation Using Image Captioning. IEEE ACCESS, 2022, 10: 105546-105555.
  • [33] Nezami, O. M.; Dras, M.; Wan, S.; Paris, C. Image captioning using facial expression and attention. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2020, 68: 661-689.
  • [34] Elbedwehy, Samar; Medhat, T. Improved Arabic image captioning model using feature concatenation with pre-trained word embedding. NEURAL COMPUTING & APPLICATIONS, 2023, 35 (26): 19051-19067.
  • [36] Zhu, Hegui; Wang, Ru; Zhang, Xiangde. Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module. NEURAL PROCESSING LETTERS, 2021, 53 (02): 1101-1118.
  • [37] Wu, Bicheng; Wo, Yan. Incorporating semantic consistency for improved semi-supervised image captioning. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (17): 52931-52955.
  • [38] Santiesteban, Sergio Sanchez; Atito, Sara; Awais, Muhammad; Song, Yi-Zhe; Kittler, Josef. Improved Image Captioning via Knowledge Graph-Augmented Models. 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 2024: 4290-4294.