Consensus Graph Representation Learning for Better Grounded Image Captioning

Cited by: 0
Authors
Zhang, Wenqiao [1]
Shi, Haochen [1]
Tang, Siliang [1]
Xiao, Jun [1]
Yu, Qiang [2]
Zhuang, Yueting [1]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Citycloud Technol, Hangzhou, Peoples R China
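Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes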
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Contemporary visual captioning models frequently hallucinate objects that are not actually in a scene, due to visual misclassification or over-reliance on priors, resulting in semantic inconsistency between the visual information and the target lexical words. The most common remedy is to encourage the captioning model to dynamically link generated object words or phrases to appropriate regions of the image, i.e., grounded image captioning (GIC). However, GIC relies on an auxiliary task (grounding objects) that does not solve the key issue behind object hallucination, i.e., the semantic inconsistency. In this paper, we take a novel perspective on this issue: exploiting the semantic coherency between the visual and language modalities. Specifically, we propose the Consensus Graph Representation Learning framework (CGRL) for GIC, which incorporates a consensus representation into the grounded captioning pipeline. The consensus is learned by aligning the visual graph (e.g., a scene graph) to the language graph, considering both the nodes and the edges of each graph. With the aligned consensus, the captioning model can capture both the correct linguistic characteristics and visual relevance, and then ground appropriate image regions. We validate the effectiveness of our model with a significant decline in object hallucination (-9% CHAIRO) on the Flickr30k Entities dataset. In addition, CGRL is evaluated with several automatic metrics and human evaluation; the results indicate that the proposed approach simultaneously improves the performance of image captioning (+2.9 CIDEr) and grounding (+2.3 F1(LOC)).
Pages: 3394 - 3402
Page count: 9