Consensus Graph Representation Learning for Better Grounded Image Captioning

Cited by: 0

Authors:

Zhang, Wenqiao [1]

Shi, Haochen [1]

Tang, Siliang [1]

Xiao, Jun [1]

Yu, Qiang [2]

Zhuang, Yueting [1]

Affiliations:

[1] Zhejiang Univ, Hangzhou, Peoples R China

[2] Citycloud Technol, Hangzhou, Peoples R China

Source:

THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2021 / Vol. 35

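Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract: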

Contemporary visual captioning models frequently hallucinate objects that are not actually in a scene, owing to visual misclassification or over-reliance on priors, which results in semantic inconsistency between the visual information and the target lexical words. The most common remedy is to encourage the captioning model to dynamically link generated object words or phrases to appropriate regions of the image, i.e., grounded image captioning (GIC). However, GIC relies on an auxiliary task (grounding objects) that has not solved the key issue behind object hallucination, i.e., the semantic inconsistency. In this paper, we take a novel perspective on this issue: exploiting the semantic coherency between the visual and language modalities. Specifically, we propose the Consensus Graph Representation Learning framework (CGRL) for GIC, which incorporates a consensus representation into the grounded captioning pipeline. The consensus is learned by aligning the visual graph (e.g., a scene graph) to the language graph, considering both the nodes and the edges of each graph. With the aligned consensus, the captioning model can capture both the correct linguistic characteristics and visual relevance, and then ground appropriate image regions. We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRO) on the Flickr30k Entities dataset. CGRL is also evaluated with several automatic metrics and human evaluation; the results indicate that the proposed approach can simultaneously improve the performance of image captioning (+2.9 CIDEr) and grounding (+2.3 F1(LOC)).

Pages: 3394-3402

Number of pages: 9
