Exploring and Distilling Cross-Modal Information for Image Captioning

Cited by: 0

Authors:

Liu, Fenglin [1]

Ren, Xuancheng [2]

Liu, Yuanxin [3]

Lei, Kai [1]

Sun, Xu [2]

Affiliations:

[1] Peking Univ, Sch Elect & Comp Engn SECE, Shenzhen Key Lab Informat Centr Networking & Bloc, Beijing, Peoples R China

[2] Peking Univ, Sch EECS, MOE Key Lab Computat Linguist, Beijing, Peoples R China

[3] Beijing Univ Posts & Telecommun, Sch ICE, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE | 2019

Funding:

National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet there is still great difficulty for the current methods to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach that explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations, and locally extracts the fine-grained regions and attributes in reference to the aspect vector for word selection. Our fully-attentive model achieves a CIDEr score of 129.3 in offline COCO evaluation with remarkable efficiency in terms of accuracy, speed, and parameter budget.


Pages: 5095-5101

Number of pages: 7
