A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning

Cited by: 0

Authors:

Bromonschenkel, Gabriel [1]

Oliveira, Hilark [1]

Paixao, Thiago M. [1]

Affiliations:

[1] Inst Fed Espirito Santo IFES, Programa Posgrad Comp Aplicada PPComp, Serra, Brazil

Source:

2024 37TH SIBGRAPI CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES, SIBGRAPI 2024 | 2024

DOI:

10.1109/SIBGRAPI62404.2024.10716325

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Image captioning refers to the process of creating a natural language description for one or more images. This task has several practical applications, from aiding in medical diagnoses through image descriptions to promoting social inclusion by providing visual context to people with impairments. Despite recent progress, especially in English, low-resource languages like Brazilian Portuguese face a shortage of datasets, models, and studies. This work seeks to contribute to this context by fine-tuning and investigating the performance of vision language models based on the Transformer architecture in Brazilian Portuguese. We leverage pre-trained vision model checkpoints (ViT, Swin, and DeiT) and neural language models (BERTimbau, DistilBERTimbau, and GPorTuguese-2). Several experiments were carried out to compare the efficiency of different model combinations using the #PraCegoVer-63K, a native Portuguese dataset, and a translated version of the Flickr30K dataset. The experimental results demonstrated that configurations using the Swin, DistilBERTimbau, and GPorTuguese-2 models generally achieved the best outcomes. Furthermore, the #PraCegoVer-63K dataset presents a series of challenges, such as descriptions made up of multiple sentences and the presence of proper names of places and people, which significantly decrease the performance of the investigated models.

Pages: 235-240

Page count: 6
