Improved Image Captioning Using GAN and ViT

Cited by: 0

Authors:

Rao, Vrushank D. [1]

Shashank, B. N. [1]

Bhattu, S. Nagesh [1]

Affiliations:

[1] National Institute of Technology Andhra Pradesh, Department of Computer Science and Engineering, Tadepalligudem, India

Source:

COMPUTER VISION AND IMAGE PROCESSING, CVIP 2023, PT III | 2024 / Vol. 2011

Keywords:

Vision Transformers; Data2Vec; Image Captioning

DOI:

10.1007/978-3-031-58535-7_31

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Encoder-decoder architectures are widely used for image captioning, most commonly pairing a convolutional encoder with a recurrent decoder. Recent transformer-based designs have achieved state-of-the-art (SOTA) performance on a range of language and vision tasks. This work investigates whether a transformer-based encoder and decoder can form an effective image captioning pipeline. An adversarial objective, implemented with a Generative Adversarial Network (GAN), is used to improve the diversity of the generated captions. The generator uses a Vision Transformer (ViT) encoder and a transformer decoder to produce semantically meaningful captions for a given image. To improve the quality and authenticity of the generated captions, we introduce a discriminator built on a transformer decoder, which evaluates each caption by considering both the image and the caption produced by the generator. Training this architecture is intended to make the generator's captions indistinguishable from real captions, raising the overall quality of the outputs. Through extensive experiments, we demonstrate that the approach generates diverse and contextually appropriate captions for a variety of images. We evaluate the model on benchmark datasets and compare it with existing state-of-the-art image captioning methods. The proposed approach outperforms previous methods on caption accuracy metrics such as BLEU-3 and BLEU-4, among other accuracy measures.
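The abstract reports results in terms of BLEU-3 and BLEU-4 but does not say how those scores were computed. As a rough illustration only, the sketch below follows the standard sentence-level BLEU definition (clipped n-gram precision, geometric mean with uniform weights, brevity penalty) with no smoothing. The function names and the whitespace tokenization are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped
    by its maximum count in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing
    (illustrative; max_n=3 gives BLEU-3, max_n=4 gives BLEU-4)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    c = len(candidate)
    # Reference length closest to the candidate; ties go to the shorter one.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical captions, whitespace-tokenized for the example.
generated = "a dog runs on the grass".split()
ground_truth = ["a dog runs on the green grass".split()]
bleu3 = bleu(generated, ground_truth, max_n=3)
bleu4 = bleu(generated, ground_truth, max_n=4)
```

Published captioning results are usually corpus-level scores from an established evaluation toolkit, so values from this sentence-level sketch are not directly comparable with the numbers reported in the paper.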


Pages: 375-385

Number of pages: 11

50 records in total

[1] Improved GAN for image resolution enhancement using ViT for breast cancer detection
Rautela, Kamakshi
Kumar, Dinesh
Kumar, Vijay
INTERNATIONAL JOURNAL OF IMAGING SYSTEMS AND TECHNOLOGY, 2024, 34 (02)
[2] ViT - Inception - GAN for Image Colourisation
Bana, Tejas
Loya, Jatan
Kulkarni, Siddhant
MACHINE LEARNING, OPTIMIZATION, AND DATA SCIENCE (LOD 2021), PT I, 2022, 13163 : 105 - 118
[3] Text to Image Synthesis for Improved Image Captioning
Hossain, Md. Zakir
Sohel, Ferdous
Shiratuddin, Mohd Fairuz
Laga, Hamid
Bennamoun, Mohammed
IEEE ACCESS, 2021, 9 : 64918 - 64928
[4] Image captioning improved visual question answering
Himanshu Sharma
Anand Singh Jalal
Multimedia Tools and Applications, 2022, 81 : 34775 - 34796
[5] CgT-GAN: CLIP-guided Text GAN for Image Captioning
Yu, Jiarui
Li, Haoran
Hao, Yanbin
Zhu, Bin
Xu, Tong
He, Xiangnan
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 2252 - 2263
[6] Improved Transformer with Parallel Encoders for Image Captioning
Lou, Liangshan
Lu, Ke
Xue, Jian
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 4072 - 4078
[7] Image captioning improved visual question answering
Sharma, Himanshu
Jalal, Anand Singh
MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (24) : 34775 - 34796
[8] Caps Captioning: A Modern Image Captioning Approach Based on Improved Capsule Network
Javanmardi, Shima
Latif, Ali Mohammad
Sadeghi, Mohammad Taghi
Jahanbanifard, Mehrdad
Bonsangue, Marcello
Verbeek, Fons J.
SENSORS, 2022, 22 (21)
[9] Memorial GAN With Joint Semantic Optimization for Unpaired Image Captioning
Song, Peipei
Guo, Dan
Zhou, Jinxing
Xu, Mingliang
Wang, Meng
IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (07) : 4388 - 4399
[10] Improving image captioning with Pyramid Attention and SC-GAN
Chen, Tianyu
Li, Zhixin
Wu, Jingli
Ma, Huifang
Su, Bianping
IMAGE AND VISION COMPUTING, 2022, 117
