ETransCap: efficient transformer for image captioning

Cited by: 0
Authors
Mundu, Albert [1 ]
Singh, Satish Kumar [1 ]
Dubey, Shiv Ram [1 ]
Affiliations
[1] IIIT Allahabad, Department of IT, Computer Vision & Biometrics Lab (CVBL), Allahabad, India
Keywords
Deep learning; Natural language processing; Image captioning; Scene understanding; Transformers; Efficient transformers; ATTENTION
DOI
10.1007/s10489-024-05739-w
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Image captioning is a challenging computer vision task that automatically generates a textual description of an image by integrating visual and linguistic information: the generated caption must accurately describe the image's content while adhering to the conventions of natural language. We adopt the encoder-decoder framework used by most CNN-RNN-based captioning models of the past few years. Recently, CNN-Transformer-based models have surpassed traditional CNN-RNN-based ones in this area, and many researchers have concentrated on Transformers, exploring their vast possibilities. Transformer-based models also handle longer input sequences more efficiently than CNN-RNN-based models; however, they are resource-intensive to train and deploy, particularly for large-scale tasks or tasks that require real-time processing. In this work, we introduce a lightweight and efficient Transformer-based model, the Efficient Transformer Captioner (ETransCap), which consumes fewer computational resources to generate captions. The model operates in linear complexity and is trained and evaluated on the MS-COCO dataset. Comparisons with existing state-of-the-art models show that ETransCap achieves promising results, supporting its potential for real-time image captioning applications. Code for this project will be available at https://github.com/albertmundu/etranscap.
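The efficiency claim in the abstract hinges on linear-complexity attention: standard Transformer self-attention costs O(n^2) in the sequence length n because every token attends to every other token, whereas ETransCap is stated to run in O(n). The abstract does not specify which linearization the authors use, so the sketch below shows kernel-based linear attention (in the style of Katharopoulos et al., 2020), one common way to reach linear complexity, purely as an illustration. All names in it (LinearAttention, feature_map) are hypothetical and not taken from the authors' code.

```python
# Minimal sketch of kernel-based linear attention, assuming a phi(x) = elu(x) + 1
# feature map as in Katharopoulos et al. (2020). Illustrative only; not the
# ETransCap implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_map(x):
    # Positive feature map so the implied attention weights are non-negative.
    return F.elu(x) + 1.0

class LinearAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Split into heads: (b, h, n, d_head)
        q, k, v = (t.view(b, n, h, -1).transpose(1, 2) for t in (q, k, v))
        q, k = feature_map(q), feature_map(k)
        # Associativity trick: compute phi(K)^T V first, a (d_head x d_head)
        # matrix, so the cost is O(n * d_head^2) instead of O(n^2 * d_head).
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        # Per-token normalizer: phi(Q) @ sum_n phi(K).
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)

# Usage: a batch of 2 sequences of 8 tokens with 128-dim features.
attn = LinearAttention(dim=128, heads=8)
x = torch.randn(2, 8, 128)
print(attn(x).shape)  # torch.Size([2, 8, 128])
```

The key point is associativity: multiplying phi(K)^T by V first produces a matrix whose size is independent of n, so cost grows linearly with sequence length rather than quadratically as in softmax attention.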
Pages: 10748-10762
Number of pages: 15
Related Papers
50 records
  • [31] Semi-Autoregressive Transformer for Image Captioning
    Zhou, Yuanen
    Zhang, Yong
    Hu, Zhenzhen
    Wang, Meng
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3132 - 3136
  • [32] HIST: Hierarchical and sequential transformer for image captioning
    Lv, Feixiao
    Wang, Rui
    Jing, Lihua
    Dai, Pengwen
    IET COMPUTER VISION, 2024, 18 (07) : 1043 - 1056
  • [33] Dual-Spatial Normalized Transformer for image captioning
    Hu, Juntao
    Yang, You
    An, Yongzhi
    Yao, Lu
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
  • [34] External knowledge-assisted Transformer for image captioning
    Li, Zhixin
    Su, Qiang
    Chen, Tianyu
    IMAGE AND VISION COMPUTING, 2023, 140
  • [35] Caption TLSTMs: combining transformer with LSTMs for image captioning
    Yan, Jie
    Xie, Yuxiang
    Luan, Xidao
    Guo, Yanming
    Gong, Quanzhi
    Feng, Suru
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (02) : 111 - 121
  • [36] Reinforcement Learning Transformer for Image Captioning Generation Model
    Yan, Zhaojie
    FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
  • [37] Improving Stylized Image Captioning with Better Use of Transformer
    Tan, Yutong
    Lin, Zheng
    Liu, Huan
    Zuo, Fan
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 347 - 358
  • [38] Graph Alignment Transformer for More Grounded Image Captioning
    Tian, Canwei
    Hu, Haiyang
    Li, Zhongjin
    2022 INTERNATIONAL CONFERENCE ON INDUSTRIAL IOT, BIG DATA AND SUPPLY CHAIN, IIOTBDSC, 2022, : 95 - 102
  • [39] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    APPLIED INTELLIGENCE, 2024, 54 (06) : 4794 - 4813
  • [40] Spiking-Transformer Optimization on FPGA for Image Classification and Captioning
    Udeji, Uchechukwu Leo
    Margala, Martin
    SOUTHEASTCON 2024, 2024, : 1353 - 1357