Consensus Graph Representation Learning for Better Grounded Image Captioning

Citations: 0
Authors
Zhang, Wenqiao [1 ]
Shi, Haochen [1 ]
Tang, Siliang [1 ]
Xiao, Jun [1 ]
Yu, Qiang [2 ]
Zhuang, Yueting [1 ]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Citycloud Technol, Hangzhou, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Contemporary visual captioning models frequently hallucinate objects that are not actually in a scene, due to visual misclassification or over-reliance on priors, resulting in semantic inconsistency between the visual information and the target lexical words. The most common remedy is to encourage the captioning model to dynamically link generated object words or phrases to appropriate regions of the image, i.e., grounded image captioning (GIC). However, GIC relies on an auxiliary task (grounding objects) that does not solve the key issue behind object hallucination, i.e., the semantic inconsistency. In this paper, we take a novel perspective on this issue: exploiting the semantic coherency between the visual and language modalities. Specifically, we propose the Consensus Graph Representation Learning framework (CGRL) for GIC, which incorporates a consensus representation into the grounded captioning pipeline. The consensus is learned by aligning the visual graph (e.g., a scene graph) to the language graph, considering both the nodes and the edges of each graph. With the aligned consensus, the captioning model can capture both the correct linguistic characteristics and visual relevance, and then ground the appropriate image regions. We validate the effectiveness of our model, observing a significant decline in object hallucination (-9% CHAIRo on the Flickr30k Entities dataset). In addition, CGRL is evaluated with several automatic metrics and human evaluation; the results indicate that the proposed approach can simultaneously improve the performance of image captioning (+2.9 CIDEr) and grounding (+2.3 F1LOC).
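To illustrate the graph-alignment idea described in the abstract, the sketch below shows one minimal way to align visual graph nodes with language graph nodes and form a consensus representation. This is an illustrative assumption, not the paper's actual CGRL implementation: the function name `align_graphs`, the cosine-similarity hard assignment, and the averaging step are all hypothetical simplifications (the paper also aligns edges and learns the alignment end to end).

```python
import numpy as np

def align_graphs(visual_nodes, language_nodes):
    """Align each visual node to its most similar language node
    and average the aligned pairs into a consensus representation.

    visual_nodes:   (Nv, d) array of visual node embeddings
    language_nodes: (Nl, d) array of language node embeddings
    """
    # L2-normalize so the dot product is cosine similarity
    v = visual_nodes / np.linalg.norm(visual_nodes, axis=1, keepdims=True)
    l = language_nodes / np.linalg.norm(language_nodes, axis=1, keepdims=True)
    sim = v @ l.T  # (Nv, Nl) cosine-similarity matrix

    # Greedy hard alignment: each visual node picks the closest language node
    assignment = sim.argmax(axis=1)

    # Consensus: average each visual node with its aligned language node
    consensus = (visual_nodes + language_nodes[assignment]) / 2.0
    return assignment, consensus
```

In the full framework, a soft (differentiable) alignment and an alignment loss over both nodes and edges would replace the hard argmax used here.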
Pages: 3394 - 3402
Number of pages: 9