Consensus Graph Representation Learning for Better Grounded Image Captioning

Cited by: 0
Authors
Zhang, Wenqiao [1 ]
Shi, Haochen [1 ]
Tang, Siliang [1 ]
Xiao, Jun [1 ]
Yu, Qiang [2 ]
Zhuang, Yueting [1 ]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Citycloud Technol, Hangzhou, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Contemporary visual captioning models frequently hallucinate objects that are not actually in the scene, owing to visual misclassification or over-reliance on language priors, which results in semantic inconsistency between the visual information and the target lexical words. The most common remedy is to encourage the captioning model to dynamically link generated object words or phrases to appropriate regions of the image, i.e., grounded image captioning (GIC). However, GIC relies on an auxiliary task (object grounding) that does not address the key cause of object hallucination, namely the semantic inconsistency itself. In this paper, we take a novel perspective on the issue: exploiting the semantic coherency between the visual and language modalities. Specifically, we propose the Consensus Graph Representation Learning (CGRL) framework for GIC, which incorporates a consensus representation into the grounded captioning pipeline. The consensus is learned by aligning the visual graph (e.g., a scene graph) to the language graph, considering both the nodes and the edges of the graphs. With the aligned consensus, the captioning model can capture both the correct linguistic characteristics and the visual relevance, and then ground the appropriate image regions. We validate the effectiveness of our model, observing a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset. In addition, CGRL is evaluated by several automatic metrics and human evaluation; the results indicate that the proposed approach simultaneously improves image captioning (+2.9 CIDEr) and grounding (+2.3 F1_loc).
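The mechanism sketched in the abstract, aligning the nodes and edges of a visual scene graph with those of a language graph to learn a shared "consensus" representation, can be made concrete with a small example. The following Python/PyTorch sketch is illustrative only and is not the authors' implementation; the encoder architecture, the soft alignment, the loss, and all names (GraphEncoder, soft_align, consensus_loss) are assumptions introduced here.

# Minimal sketch of node-level graph alignment for a consensus
# representation, in the spirit of the CGRL abstract. Hypothetical code,
# not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphEncoder(nn.Module):
    """One round of mean-aggregation message passing over a dense adjacency."""
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)
        self.msg = nn.Linear(hid_dim, hid_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features; adj: (N, N) 0/1 adjacency matrix
        h = F.relu(self.proj(x))
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)  # avoid divide-by-zero
        neigh = adj @ h / deg                           # mean over neighbors
        return F.relu(h + self.msg(neigh))              # residual node update

def soft_align(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Soft correspondence from rows of `a` to rows of `b` (scaled dot-product)."""
    scores = a @ b.t() / a.size(-1) ** 0.5              # (Na, Nb)
    return scores.softmax(dim=-1)

def consensus_loss(vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
    """Pull each visual node toward its soft-aligned language counterpart."""
    attn = soft_align(vis, lang)                        # (Nv, Nl)
    consensus = attn @ lang                             # aligned language view
    return F.mse_loss(vis, consensus)

# Toy usage: 5 visual-graph nodes, 7 language-graph (caption parse) nodes.
vis_enc, lang_enc = GraphEncoder(2048, 512), GraphEncoder(300, 512)
vis_x, vis_adj = torch.randn(5, 2048), (torch.rand(5, 5) > 0.5).float()
lang_x, lang_adj = torch.randn(7, 300), (torch.rand(7, 7) > 0.5).float()
vis_h, lang_h = vis_enc(vis_x, vis_adj), lang_enc(lang_x, lang_adj)
loss = consensus_loss(vis_h, lang_h)  # added to the captioning objective

In practice such an alignment term would be trained jointly with the usual captioning and grounding objectives, and an analogous loss can be computed over edge (relationship) features, since the abstract states that edges are aligned alongside the nodes.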
Pages: 3394-3402
Page count: 9
Related Papers
50 records in total (first 10 shown)
  • [1] Graph Alignment Transformer for More Grounded Image Captioning
    Tian, Canwei
    Hu, Haiyang
    Li, Zhongjin
    2022 INTERNATIONAL CONFERENCE ON INDUSTRIAL IOT, BIG DATA AND SUPPLY CHAIN, IIOTBDSC, 2022, : 95 - 102
  • [2] Learning Scene Graph for Better Cross-Domain Image Captioning
    Jia, Junhua
    Xin, Xiaowei
    Gao, Xiaoyan
    Ding, Xiangqian
    Pang, Shunpeng
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT III, 2024, 14427 : 121 - 137
  • [3] Neural Symbolic Representation Learning for Image Captioning
    Wang, Xiaomei
    Ma, Lin
    Fu, Yanwei
    Xue, Xiangyang
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 312 - 321
  • [4] Distributed Attention for Grounded Image Captioning
    Chen, Nenglun
    Pan, Xingjia
    Chen, Runnan
    Yang, Lei
    Lin, Zhiwen
    Ren, Yuqiang
    Yuan, Haolei
    Guo, Xiaowei
    Huang, Feiyue
    Wang, Wenping
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1966 - 1975
  • [5] Scene graph captioner: Image captioning based on structural visual representation
    Xu, Ning
    Liu, An-An
    Liu, Jing
    Nie, Weizhi
    Su, Yuting
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2019, 58 : 477 - 485
  • [6] Towards Captioning an Image Collection from a Combined Scene Graph Representation Approach
    Phueaksri, Itthisak
    Kastner, Marc A.
    Kawanishi, Yasutomo
    Komamizu, Takahiro
    Ide, Ichiro
    MULTIMEDIA MODELING, MMM 2023, PT I, 2023, 13833 : 178 - 190
  • [7] Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning
    Dong, Xinzhi
    Long, Chengjiang
    Xu, Wenju
    Xiao, Chunxia
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2615 - 2624
  • [8] Graph Representation Learning for Spatial Image Steganalysis
    Liu, Qiyun
    Zhou, Limengnan
    Wu, Hanzhou
2022 IEEE 24TH INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), 2022
  • [9] STAIR: Learning Sparse Text and Image Representation in Grounded Tokens
    Chen, Chen
    Zhang, Bowen
    Cao, Liangliang
    Shen, Jiguang
    Gunter, Tom
    Jose, Albin Madappally
    Toshev, Alexander
    Zheng, Yantao
Shlens, Jonathon
    Pang, Ruoming
    Yang, Yinfei
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 15079 - 15094
  • [10] Image captioning with transformer and knowledge graph
    Zhang, Yu
    Shi, Xinyu
    Mi, Siya
    Yang, Xu
PATTERN RECOGNITION LETTERS, 2021, 143 : 43 - 49