A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning

Cited by: 0
Authors
Bromonschenkel, Gabriel [1 ]
Oliveira, Hilário [1 ]
Paixao, Thiago M. [1 ]
Affiliations
[1] Inst Fed Espirito Santo IFES, Programa Posgrad Comp Aplicada PPComp, Serra, Brazil
DOI
10.1109/SIBGRAPI62404.2024.10716325
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Image captioning refers to the process of creating a natural language description for one or more images. The task has several practical applications, from aiding medical diagnosis through image descriptions to promoting social inclusion by providing visual context to people with visual impairments. Despite recent progress, especially in English, low-resource languages like Brazilian Portuguese face a shortage of datasets, models, and studies. This work contributes to this context by fine-tuning and investigating the performance of Transformer-based vision-language models in Brazilian Portuguese. We leverage pre-trained vision model checkpoints (ViT, Swin, and DeiT) and neural language models (BERTimbau, DistilBERTimbau, and GPorTuguese-2). Several experiments were carried out to compare the efficiency of different model combinations using #PraCegoVer-63K, a native Portuguese dataset, and a translated version of the Flickr30K dataset. The experimental results demonstrated that configurations using the Swin, DistilBERTimbau, and GPorTuguese-2 models generally achieved the best outcomes. Furthermore, the #PraCegoVer-63K dataset presents a series of challenges, such as descriptions composed of multiple sentences and the presence of proper names of places and people, which significantly decrease the performance of the investigated models.
Pages: 235-240
Page count: 6
Related Papers
50 records in total
  • [1] Study on Image Super-Resolution with Transformer-Based Encoder-Decoder Models
    Wang, Qing-You
    Lin, Yih-Lon
    2024 11TH INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS-TAIWAN, ICCE-TAIWAN 2024, 2024, : 213 - 214
  • [2] AnoViT: Unsupervised Anomaly Detection and Localization With Vision Transformer-Based Encoder-Decoder
    Lee, Yunseung
    Kang, Pilsung
    IEEE ACCESS, 2022, 10 : 46717 - 46724
  • [3] Parallel encoder-decoder framework for image captioning
    Saeidimesineh, Reyhane
    Adibi, Peyman
    Karshenas, Hossein
    Darvishy, Alireza
    KNOWLEDGE-BASED SYSTEMS, 2023, 282
  • [4] Transformer-based Encoder-Decoder Model for Surface Defect Detection
    Lu, Xiaofeng
    Fan, Wentao
    6TH INTERNATIONAL CONFERENCE ON INNOVATION IN ARTIFICIAL INTELLIGENCE, ICIAI2022, 2022, : 125 - 130
  • [5] Image Captioning Encoder-Decoder Models Using CNN-RNN Architectures: A Comparative Study
    Suresh, K. Revati
    Jarapala, Arun
    Sudeep, P. V.
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2022, 41 (10) : 5719 - 5742
  • [6] Deep Hierarchical Encoder-Decoder Network for Image Captioning
    Xiao, Xinyu
    Wang, Lingfeng
    Ding, Kun
    Xiang, Shiming
    Pan, Chunhong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (11) : 2942 - 2956
  • [7] Image Captioning: From Encoder-Decoder to Reinforcement Learning
    Tang, Yu
    2022 6TH INTERNATIONAL CONFERENCE ON IMAGING, SIGNAL PROCESSING AND COMMUNICATIONS, ICISPC, 2022, : 6 - 10
  • [8] Medical image super-resolution via transformer-based hierarchical encoder-decoder network
    Sun, Jianhao
    Zeng, Xiangqin
    Lei, Xiang
    Gao, Mingliang
    Li, Qilei
    Zhang, Housheng
    Ba, Fengli
    NETWORK MODELING AND ANALYSIS IN HEALTH INFORMATICS AND BIOINFORMATICS, 2024, 13 (01)
  • [9] Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning
    Lee, Hojun
    Cho, Hyunjun
    Park, Jieun
    Chae, Jinyeong
    Kim, Jihie
    SENSORS, 2022, 22 (04)
  • [10] Image Guidance Encoder-Decoder Model in Image Captioning and Its Application
    Yang, Zhen
    Zhou, Ziwei
    Wang, Chaoyang
    Xu, Liang
    IAENG International Journal of Computer Science, 2024, 51 (09) : 1385 - 1392