Contextualized Keyword Representations for Multi-modal Retinal Image Captioning

Cited by: 11
Authors:
Huang, Jia-Hong [1 ]
Wu, Ting-Wei [2 ]
Worring, Marcel [1 ]
Affiliations:
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Georgia Inst Technol, Atlanta, GA 30332 USA
Keywords:
OPTIC-NERVE; CLASSIFICATION
DOI:
10.1145/3460426.3463667
Chinese Library Classification:
TP18 [Artificial Intelligence Theory]
Discipline codes:
081104; 0812; 0835; 1405
Abstract:
Medical image captioning automatically generates a medical description of the content of a given medical image. Traditional medical image captioning models generate the description from a single medical image input alone, so abstract medical descriptions or concepts are difficult to produce, which limits the effectiveness of such models. Multi-modal medical image captioning is one approach to addressing this problem: textual input, e.g., expert-defined keywords, is treated as one of the main drivers of description generation. Effectively encoding both the textual input and the medical image is therefore central to the task. In this work, a new end-to-end deep multi-modal medical image captioning model is proposed, built on contextualized keyword representations, textual feature reinforcement, and masked self-attention. Experimental results on an existing multi-modal medical image captioning dataset show that the proposed model is effective, improving BLEU-avg by +53.2% and CIDEr by +18.6% over the state-of-the-art method.
Pages: 645-652 (8 pages)
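
As an illustration of the kind of architecture the abstract describes, the sketch below shows a minimal multi-modal captioner in PyTorch: regional image features and contextualized keyword representations are fused into a shared memory, and a decoder with masked (causal) self-attention generates the description token by token. All module names, dimensions, the one-layer Transformer encoder standing in for a pretrained contextual keyword encoder, and the concatenation-based fusion are assumptions for illustration only; this is not the paper's exact architecture, and textual feature reinforcement is not modeled here.

# Minimal sketch (assumed names/dims, not the authors' implementation).
import torch
import torch.nn as nn

class MultiModalCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        # Image branch: project precomputed CNN region features to d_model.
        self.img_proj = nn.Linear(2048, d_model)
        # Keyword branch: the paper uses *contextualized* keyword representations
        # (e.g., from a pretrained language model); a one-layer Transformer
        # encoder over keyword embeddings stands in for that here.
        self.kw_embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.kw_encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        # Caption decoder: masked self-attention over previous tokens plus
        # cross-attention over the fused image/keyword memory.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, kw_ids, cap_ids):
        # img_feats: (B, R, 2048) region features; kw_ids: (B, K) keyword tokens;
        # cap_ids: (B, T) caption tokens for teacher-forced training.
        img = self.img_proj(img_feats)                  # (B, R, d)
        kws = self.kw_encoder(self.kw_embed(kw_ids))    # (B, K, d), contextualized
        memory = torch.cat([img, kws], dim=1)           # naive fusion by concatenation
        T = cap_ids.size(1)
        pos = torch.arange(T, device=cap_ids.device)
        tgt = self.tok_embed(cap_ids) + self.pos_embed(pos)
        # Causal (masked self-attention) mask: -inf above the diagonal so each
        # position attends only to itself and earlier tokens.
        causal = torch.triu(torch.full((T, T), float('-inf'),
                                       device=cap_ids.device), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)                     # (B, T, vocab) logits

model = MultiModalCaptioner(vocab_size=5000)
logits = model(torch.randn(2, 36, 2048),                # 2 images, 36 regions
               torch.randint(0, 5000, (2, 8)),          # 8 keywords per image
               torch.randint(0, 5000, (2, 20)))         # 20-token captions
print(logits.shape)  # torch.Size([2, 20, 5000])

Concatenating keyword and image tokens into a single memory lets the decoder's cross-attention weigh both modalities at every generated word, which is one simple way to let expert-defined keywords drive the description.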