Exploring and Distilling Cross-Modal Information for Image Captioning

Cited by: 0
Authors
Liu, Fenglin [1 ]
Ren, Xuancheng [2 ]
Liu, Yuanxin [3 ]
Lei, Kai [1 ]
Sun, Xu [2 ]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn SECE, Shenzhen Key Lab Informat Centr Networking & Bloc, Beijing, Peoples R China
[2] Peking Univ, Sch EECS, MOE Key Lab Computat Linguist, Beijing, Peoples R China
[3] Beijing Univ Posts & Telecommun, Sch ICE, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet current methods still have great difficulty achieving deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in vision and language. Globally, it provides the aspect vector, a spatial and relational representation of images based on caption contexts, by extracting salient region groupings and attribute collocations; locally, it extracts fine-grained regions and attributes with reference to the aspect vector for word selection. Our fully attentive model achieves a CIDEr score of 129.3 in offline COCO evaluation with remarkable efficiency in terms of accuracy, speed, and parameter budget.
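The global-and-local procedure described in the abstract can be pictured as two attention passes over the same pool of region and attribute features: a global pass that explores the sources to build an aspect vector from the caption context, and a local pass that distills fine-grained features against that aspect vector. The PyTorch sketch below illustrates only that idea; the module name ExploreDistillAttention, the tensor shapes, and the single-query dot-product attention form are assumptions for illustration, not the authors' released implementation.

# Minimal sketch of the global-and-local exploring-and-distilling idea from the
# abstract. Module names, shapes, and the attention form are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExploreDistillAttention(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        # Global "exploring" step: score regions/attributes against the caption context.
        self.global_query = nn.Linear(d_model, d_model)
        self.global_key = nn.Linear(d_model, d_model)
        # Local "distilling" step: re-attend to fine-grained features, guided by the aspect vector.
        self.local_query = nn.Linear(d_model, d_model)
        self.local_key = nn.Linear(d_model, d_model)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, context, regions, attributes):
        # context:    (B, d)    caption context, e.g. the decoder state
        # regions:    (B, N, d) visual region features
        # attributes: (B, M, d) semantic attribute embeddings
        source = torch.cat([regions, attributes], dim=1)           # (B, N+M, d)

        # 1) Globally explore: soft-select correlated region groupings and
        #    attribute collocations to form an aspect vector for this context.
        q_g = self.global_query(context).unsqueeze(1)              # (B, 1, d)
        k_g = self.global_key(source)                              # (B, N+M, d)
        w_g = F.softmax((q_g * k_g).sum(-1) / k_g.size(-1) ** 0.5, dim=-1)
        aspect = torch.bmm(w_g.unsqueeze(1), source).squeeze(1)    # (B, d)

        # 2) Locally distill: extract fine-grained regions and attributes
        #    with reference to the aspect vector for word selection.
        q_l = self.local_query(aspect).unsqueeze(1)                # (B, 1, d)
        k_l = self.local_key(source)
        w_l = F.softmax((q_l * k_l).sum(-1) / k_l.size(-1) ** 0.5, dim=-1)
        distilled = torch.bmm(w_l.unsqueeze(1), source).squeeze(1)

        # Fuse the global aspect and the locally distilled features.
        return self.out(torch.cat([aspect, distilled], dim=-1))    # (B, d)

if __name__ == "__main__":
    # Toy shapes: batch of 2, 36 regions, 10 attributes, 512-d features.
    att = ExploreDistillAttention(d_model=512)
    out = att(torch.randn(2, 512), torch.randn(2, 36, 512), torch.randn(2, 10, 512))
    print(out.shape)  # torch.Size([2, 512])

In this sketch the two passes share one feature pool, mirroring the explore-then-distill order described above: the aspect vector summarizes what the caption context is about, and the second attention re-weights the same regions and attributes against it before the next word is predicted.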
Pages: 5095 - 5101
Page count: 7
Related Papers
50 records in total
  • [1] Exploiting Cross-Modal Prediction and Relation Consistency for Semisupervised Image Captioning
    Yang, Yang
    Wei, Hongchen
    Zhu, Hengshu
    Yu, Dianhai
    Xiong, Hui
    Yang, Jian
    IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (02) : 890 - 902
  • [2] Stacked cross-modal feature consolidation attention networks for image captioning
    Pourkeshavarz, Mozhgan
    Nabavi, Shahabedin
    Moghaddam, Mohsen Ebrahimi
    Shamsfard, Mehrnoush
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (04) : 12209 - 12233
  • [3] Learning Cross-modal Representations with Multi-relations for Image Captioning
    Cheng, Peng
    Le, Tung
    Racharak, Teeradaj
    Cao Yiming
    Kong Weikun
    Minh Le Nguyen
    PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION APPLICATIONS AND METHODS (ICPRAM), 2021, : 346 - 353
  • [4] Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning
    Li, Zhengxin
    Zhao, Wenzhe
    Du, Xuanyi
    Zhou, Guangyao
    Zhang, Songlin
    REMOTE SENSING, 2024, 16 (01)
  • [5] Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation
    Zhao, Wentian
    Wu, Xinxiao
    Luo, Jiebo
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 1180 - 1192
  • [6] Semantic-Enhanced Cross-Modal Fusion for Improved Unsupervised Image Captioning
    Xiang, Nan
    Chen, Ling
    Liang, Leiyan
    Rao, Xingdi
    Gong, Zehao
    ELECTRONICS, 2023, 12 (17)
  • [7] Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
    Liu, Zhiyue
    Liu, Jinyuan
    Ma, Fanrong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024, : 3864 - 3872
  • [8] Cross-Modal Graph With Meta Concepts for Video Captioning
    Wang, Hao
    Lin, Guosheng
    Hoi, Steven C. H.
    Miao, Chunyan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 5150 - 5162
  • [9] XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding
    Hsu, Chan-Jan
    Lee, Hung-yi
    Tsao, Yu
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): (SHORT PAPERS), VOL 2, 2022, : 479 - 489