A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning

Cited by: 0
Authors
Bromonschenkel, Gabriel [1 ]
Oliveira, Hilark [1 ]
Paixao, Thiago M. [1 ]
Affiliation
[1] Inst Fed Espirito Santo IFES, Programa Posgrad Comp Aplicada PPComp, Serra, Brazil
DOI
10.1109/SIBGRAPI62404.2024.10716325
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image captioning refers to the process of creating a natural language description for one or more images. This task has several practical applications, from aiding in medical diagnoses through image descriptions to promoting social inclusion by providing visual context to people with impairments. Despite recent progress, especially in English, low-resource languages like Brazilian Portuguese face a shortage of datasets, models, and studies. This work seeks to contribute to this context by fine-tuning and investigating the performance of vision language models based on the Transformer architecture in Brazilian Portuguese. We leverage pre-trained vision model checkpoints (ViT, Swin, and DeiT) and neural language models (BERTimbau, DistilBERTimbau, and GPorTuguese-2). Several experiments were carried out to compare the efficiency of different model combinations on #PraCegoVer-63K, a native Portuguese dataset, and on a translated version of the Flickr30K dataset. The experimental results demonstrated that configurations using the Swin, DistilBERTimbau, and GPorTuguese-2 models generally achieved the best outcomes. Furthermore, the #PraCegoVer-63K dataset presents a series of challenges, such as descriptions made up of multiple sentences and the presence of proper names of places and people, which significantly decrease the performance of the investigated models.
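The comparison described in the abstract can be pictured as a grid of encoder, decoder, and dataset choices. The following is a minimal sketch of how such a grid might be enumerated. The abstract does not state whether every pairing was actually evaluated, so the full cross product is an assumption, and the function name `experiment_grid` and the dictionary keys are hypothetical. The strings are the model and dataset names from the abstract, not checkpoint identifiers.

```python
from itertools import product

# Names taken from the abstract; these are labels, not checkpoint IDs.
ENCODERS = ["ViT", "Swin", "DeiT"]
DECODERS = ["BERTimbau", "DistilBERTimbau", "GPorTuguese-2"]
DATASETS = ["#PraCegoVer-63K", "Flickr30K (translated)"]


def experiment_grid(encoders, decoders, datasets):
    """Return every (encoder, decoder, dataset) combination as a dict.

    Assumption: a full cross product. The paper may have evaluated
    only a subset of these configurations.
    """
    return [
        {"encoder": enc, "decoder": dec, "dataset": data}
        for enc, dec, data in product(encoders, decoders, datasets)
    ]


grid = experiment_grid(ENCODERS, DECODERS, DATASETS)

if __name__ == "__main__":
    for config in grid:
        print(f'{config["encoder"]} + {config["decoder"]} on {config["dataset"]}')
```

Under the full cross product assumption, three encoders, three decoders, and two datasets give 18 candidate configurations.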
Pages: 235-240
Page count: 6