Dual-Modal Transformer with Enhanced Inter- and Intra-Modality Interactions for Image Captioning

Cited by: 7
Authors
Kumar, Deepika [1 ]
Srivastava, Varun [2 ]
Popescu, Daniela Elena [3 ]
Hemanth, Jude D. [4 ]
Affiliations
[1] Bharati Vidyapeeths Coll Engn, Dept Comp Sci & Engn, New Delhi 110063, India
[2] Thapar Inst Engn & Technol, Dept Comp Sci & Engn, Patiala 147004, Punjab, India
[3] Univ Oradea, Fac Elect Engn & Informat Technol, Oradea 410087, Romania
[4] Karunya Inst Technol & Sci, Dept Elect & Commun Engn, Coimbatore 641114, Tamil Nadu, India
Source
APPLIED SCIENCES-BASEL | 2022 / Vol. 12 / Issue 13
Keywords
attention model; encoder-decoder model; multi-modal transformer; BLEU score; beam search;
DOI
10.3390/app12136733
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703 ;
Abstract
Image captioning aims to describe an image in words that convey a semantically meaningful, relatable account of the scene depicted. Different models can be used to accomplish this difficult task, depending on the context and the requirements of the application. An encoder-decoder model that takes image feature vectors as input to the encoder is often regarded as an appropriate choice for captioning. In the proposed work, a dual-modal transformer is used that captures intra- and inter-modal interactions simultaneously within an attention block. The transformer architecture is quantitatively evaluated on the publicly available Microsoft Common Objects in Context (MS COCO) dataset, yielding a Bilingual Evaluation Understudy (BLEU)-4 score of 85.01. The efficacy of the model is evaluated on the Flickr 8k, Flickr 30k, and MS COCO datasets, and the results are compared and analysed against state-of-the-art methods. The results show that the proposed model outperforms conventional models, such as the encoder-decoder model and the attention model.
Pages: 20