A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning

Times cited: 0
Authors
Bromonschenkel, Gabriel [1]
Oliveira, Hilark [1]
Paixao, Thiago M. [1]
Affiliation
[1] Instituto Federal do Espírito Santo (IFES), Programa de Pós-Graduação em Computação Aplicada (PPComp), Serra, Brazil
DOI
10.1109/SIBGRAPI62404.2024.10716325
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Image captioning is the task of generating a natural language description for one or more images. It has several practical applications, from supporting medical diagnosis through image descriptions to promoting social inclusion by providing visual context to people with visual impairments. Despite recent progress, especially in English, low-resource languages such as Brazilian Portuguese still suffer from a shortage of datasets, models, and studies. This work contributes to this context by fine-tuning and evaluating Transformer-based vision language models for Brazilian Portuguese. We combine pre-trained vision model checkpoints (ViT, Swin, and DeiT) with pre-trained neural language models (BERTimbau, DistilBERTimbau, and GPorTuguese-2). Several experiments were carried out to compare the performance of different encoder-decoder combinations on #PraCegoVer-63K, a native Portuguese dataset, and on a translated version of the Flickr30K dataset. The results show that configurations using the Swin, DistilBERTimbau, and GPorTuguese-2 models generally achieved the best outcomes. Furthermore, the #PraCegoVer-63K dataset poses several challenges, such as descriptions made up of multiple sentences and the presence of proper names of places and people, which significantly decrease the performance of the investigated models.
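As a rough illustration of the encoder-decoder pairing described above, the following sketch wires a Swin image encoder to a BERT-style text decoder with the Hugging Face `transformers` class `VisionEncoderDecoderModel`. It is an assumption for illustration, not the authors' code: both halves are randomly initialised from configs rather than loaded from the pretrained ViT/Swin/DeiT and BERTimbau/DistilBERTimbau/GPorTuguese-2 checkpoints, and the decoder sizes, vocabulary size, and special-token ids are hypothetical values chosen so it runs quickly on CPU. It requires `torch` and `transformers`.

```python
import torch
from transformers import (
    BertConfig,
    SwinConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

torch.manual_seed(0)

# Encoder: Swin with library defaults (Swin-Tiny layout, 224x224 input).
# Randomly initialised here; a real setup would load a pretrained checkpoint.
encoder_cfg = SwinConfig()

# Decoder: a small BERT-style model standing in for BERTimbau/DistilBERTimbau.
# All sizes below are hypothetical and only keep the example lightweight.
decoder_cfg = BertConfig(
    vocab_size=1000,
    hidden_size=256,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=512,
)

# from_encoder_decoder_configs marks the decoder as causal (is_decoder=True)
# and enables cross-attention over the encoder's patch features.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(encoder_cfg, decoder_cfg)

# Because the encoder width (768) differs from the decoder width (256),
# the model adds a linear projection between them.
model = VisionEncoderDecoderModel(config=config)

# Ids used to shift caption labels right into decoder inputs
# (hypothetical values for this random vocabulary).
model.config.decoder_start_token_id = 2
model.config.pad_token_id = 0

# A batch of two fake RGB images and two fake 12-token captions.
pixel_values = torch.randn(2, 3, 224, 224)
labels = torch.randint(5, decoder_cfg.vocab_size, (2, 12))

model.eval()
with torch.no_grad():
    out = model(pixel_values=pixel_values, labels=labels)

# Teacher-forced cross-entropy loss and per-token vocabulary logits.
loss, logits = out.loss, out.logits
print(loss.item(), tuple(logits.shape))
```

With real checkpoints, `VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_name, decoder_name)` performs the same pairing from model identifiers; fine-tuning then minimises this same teacher-forced loss over image-caption pairs. Swapping the encoder or decoder config is all it takes to obtain another combination in the experimental grid.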
Pages: 235–240
Number of pages: 6