Improved Image Captioning Using GAN and ViT

Times cited: 0
Authors
Rao, Vrushank D. [1 ]
Shashank, B. N. [1 ]
Bhattu, S. Nagesh [1 ]
Affiliations
[1] Natl Inst Technol Andhra Pradesh, Dept Comp Sci & Engn, Tadepalligudem, India
Source
COMPUTER VISION AND IMAGE PROCESSING, CVIP 2023, PT III | 2024, Vol. 2011
Keywords
Vision Transformers; Data2Vec; Image Captioning;
DOI
10.1007/978-3-031-58535-7_31
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Encoder-decoder architectures are widely used for image captioning, with convolutional encoders and recurrent decoders being the prominent choices. Recent transformer-based designs have achieved state-of-the-art performance on a range of language and vision tasks. This work investigates the research question of whether a transformer-based encoder and decoder can form an effective pipeline for image captioning. An adversarial objective, implemented with a Generative Adversarial Network (GAN), is used to improve the diversity of the generated captions. The generator component of our model combines a ViT encoder with a transformer decoder to produce semantically meaningful captions for a given image. To enhance the quality and authenticity of the generated captions, we introduce a discriminator component built from a transformer decoder, which evaluates each caption jointly with its image. By training this architecture adversarially, we aim to ensure that the generator produces captions indistinguishable from real ones, raising the overall quality of the generated outputs. Through extensive experimentation, we demonstrate the effectiveness of our approach in generating diverse and contextually appropriate captions for a variety of images. We evaluate the model on benchmark datasets and compare its performance against existing state-of-the-art image captioning methods; the proposed approach achieves superior results, as shown by improved caption accuracy metrics such as BLEU-3, BLEU-4, and other relevant measures.
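The abstract's architecture can be sketched in a few modules: a ViT-style patch encoder feeding a transformer decoder as the caption generator, and a second transformer decoder that scores (image, caption) pairs as the discriminator. The following is a minimal illustrative sketch in PyTorch; all dimensions, layer counts, and class names are assumptions for clarity, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT front end: split the image into patches, project each to d_model."""
    def __init__(self, patch=16, d_model=256):
        super().__init__()
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, x):                       # (B, 3, H, W)
        x = self.proj(x)                        # (B, d_model, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, d_model)

class Generator(nn.Module):
    """ViT encoder + transformer decoder emitting caption-token logits."""
    def __init__(self, vocab=1000, d_model=256):
        super().__init__()
        self.patches = PatchEmbed(d_model=d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.embed = nn.Embedding(vocab, d_model)
        dec = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, images, tokens):
        memory = self.encoder(self.patches(images))   # image features
        tgt = self.embed(tokens)
        # causal mask so each position only attends to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                         # (B, T, vocab)

class Discriminator(nn.Module):
    """Transformer decoder that cross-attends from caption tokens to image
    patches and outputs one real/fake logit per (image, caption) pair."""
    def __init__(self, vocab=1000, d_model=256):
        super().__init__()
        self.patches = PatchEmbed(d_model=d_model)
        self.embed = nn.Embedding(vocab, d_model)
        dec = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.score = nn.Linear(d_model, 1)

    def forward(self, images, tokens):
        out = self.decoder(self.embed(tokens), self.patches(images))
        return self.score(out.mean(dim=1))            # (B, 1) logit
```

Under the adversarial objective, the discriminator's logit on generated captions supplies the generator's extra loss term alongside the usual token-level cross-entropy; for a batch of 224x224 images and 12-token captions, `Generator()(imgs, caps)` yields logits of shape `(B, 12, vocab)` and `Discriminator()(imgs, caps)` a `(B, 1)` score.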
Pages: 375-385 (11 pages)