Contextualized Keyword Representations for Multi-modal Retinal Image Captioning

Cited by: 11
Authors
Huang, Jia-Hong [1 ]
Wu, Ting-Wei [2 ]
Worring, Marcel [1 ]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Georgia Inst Technol, Atlanta, GA 30332 USA
Keywords
OPTIC-NERVE; CLASSIFICATION
DOI
10.1145/3460426.3463667
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Medical image captioning automatically generates a description of the content of a given medical image. Traditional medical image captioning models produce a description from a single medical image input alone, so abstract medical descriptions or concepts are difficult to generate with such approaches, which limits their effectiveness. Multi-modal medical image captioning is one approach to addressing this problem: textual input, e.g., expert-defined keywords, serves as one of the main drivers of description generation. Effectively encoding both the textual input and the medical image is therefore crucial for multi-modal medical image captioning. In this work, a new end-to-end deep multi-modal medical image captioning model is proposed, built on contextualized keyword representations, textual feature reinforcement, and masked self-attention. Evaluated on an existing multi-modal medical image captioning dataset, the proposed model is effective, improving BLEU-avg by +53.2% and CIDEr by +18.6% over the state-of-the-art method.
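The abstract names masked self-attention as one building block of the model. As a rough illustration of that mechanism only, the sketch below applies masked scaled dot-product self-attention to a toy fused sequence of keyword-token and image-patch vectors; the sequence length, dimensionality, causal mask, and random features are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def masked_self_attention(x, mask):
    """Scaled dot-product self-attention over sequence x (n, d).

    mask is a boolean (n, n) array; position i may attend to
    position j only where mask[i, j] is True.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (n, n) similarity scores
    scores = np.where(mask, scores, -1e9)  # disallow masked positions
    # Row-wise softmax to obtain attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                     # weighted sum of values

# Toy fused sequence: 2 keyword-token vectors + 3 image-patch vectors, d = 4
rng = np.random.default_rng(0)
seq = rng.standard_normal((5, 4))

# Causal mask: position i attends only to positions <= i,
# as used when decoding a description token by token.
causal = np.tril(np.ones((5, 5), dtype=bool))
out = masked_self_attention(seq, causal)
print(out.shape)  # (5, 4)
```

With the causal mask, the first position can attend only to itself, so its output equals its input vector; later positions mix information from all preceding keyword and image tokens.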
Pages: 645-652 (8 pages)