GLOVE-ING ATTENTION: A MULTI-MODAL NEURAL LEARNING APPROACH TO IMAGE CAPTIONING

Times Cited: 0
Authors
Anundskas, Lars Halvor [1 ]
Afridi, Hina [1 ,3 ]
Tarekegn, Adane Nega [1 ]
Yamin, Muhammad Mudassar [2 ]
Ullah, Mohib [1 ]
Yamin, Saira [2 ]
Cheikh, Faouzi Alaya [1 ]
Affiliations
[1] Norwegian Univ Sci & Technol NTNU, Dept Comp Sci, Trondheim, Norway
[2] Dept Management Sci, CUI Wah Campus, Wah Cantt, Pakistan
[3] Geno SA, Storhamargata 44, N-2317 Hamar, Norway
Keywords
Chest X-ray; Convolutional neural networks; Attention; GloVe embeddings; Gated recurrent units
DOI
10.1109/ICASSPW59220.2023.10193011
CLC Number (Chinese Library Classification)
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Describing images in natural language is a challenging task in computer vision. Image captioning, the generation of textual descriptions of images, is commonly addressed with learning frameworks that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, conventional RNNs suffer from exploding and vanishing gradients, which leads to inferior, non-evocative captions. In this paper, we propose an encoder-decoder deep neural network for image captioning that uses the state-of-the-art EfficientNet backbone as the encoder. The decoder is built on multimodal gated recurrent units (GRUs) that combine GloVe word embeddings for the text with visual attention over the image features. The network is trained on three datasets, Indiana Chest X-ray, COCO, and WIT, and the results are evaluated with the standard BLEU and METEOR metrics. The quantitative results show that the network achieves promising performance compared to state-of-the-art models. The source code is publicly available at https://bitbucket.org/larswise/imagecaptioning/src/master/wit_pipeline/.
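The abstract describes the pipeline only at a high level. Below is a minimal sketch, in PyTorch, of how such an encoder-decoder could be wired together: an EfficientNet feature extractor as the encoder, additive visual attention over the spatial feature map, and a GRU decoder whose embedding table is initialised from GloVe vectors. This is not the authors' released code; the class names, dimensions, attention formulation (additive/Bahdanau-style), and the choice to freeze the embeddings are all illustrative assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """EfficientNet-B0 backbone; keeps the spatial feature map, drops the classifier."""
    def __init__(self):
        super().__init__()
        backbone = models.efficientnet_b0(weights="IMAGENET1K_V1")
        self.features = backbone.features  # output: (B, 1280, H', W')

    def forward(self, images):
        fmap = self.features(images)
        B, C, H, W = fmap.shape
        # Flatten the spatial grid into a sequence of L = H'*W' region features.
        return fmap.view(B, C, H * W).permute(0, 2, 1)  # (B, L, 1280)

class Attention(nn.Module):
    """Additive attention: scores each image region against the decoder state."""
    def __init__(self, feat_dim, hid_dim, att_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim); hidden: (B, hid_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)           # attention weights over regions
        context = (alpha * feats).sum(dim=1)      # (B, feat_dim) weighted context
        return context, alpha

class Decoder(nn.Module):
    """One-step GRU decoder conditioned on GloVe embeddings and attention context."""
    def __init__(self, glove_weights, feat_dim=1280, hid_dim=512):
        super().__init__()
        vocab_size, emb_dim = glove_weights.shape
        # Initialise the embedding table from pretrained GloVe vectors.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.attend = Attention(feat_dim, hid_dim)
        self.gru = nn.GRUCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, tokens, hidden):
        context, _ = self.attend(feats, hidden)
        x = torch.cat([self.embed(tokens), context], dim=1)
        hidden = self.gru(x, hidden)              # update recurrent state
        return self.out(hidden), hidden           # vocabulary logits

# Illustrative single decoding step (random weights, dummy GloVe table):
# enc, dec = Encoder(), Decoder(torch.randn(10000, 300))
# feats = enc(torch.randn(2, 3, 224, 224))
# logits, h = dec(feats, torch.zeros(2, dtype=torch.long), torch.zeros(2, 512))

In training, the decoder would be unrolled over the caption tokens with teacher forcing and a cross-entropy loss; only a single step is shown here for brevity.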
Pages: 5