MITIGATING DATASET BIAS IN IMAGE CAPTIONING THROUGH CLIP CONFOUNDER-FREE CAPTIONING NETWORK

Cited by: 2

Authors
Kim, Yeonju [1]
Kim, Junho [1]
Lee, Byung-Kwan [1]
Shin, Sebin [1]
Ro, Yong Man [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering, Image & Video Systems Lab, Daejeon, South Korea
Keywords
Image captioning; Causal inference; Dataset bias; Global visual confounder; CLIP
DOI
10.1109/ICIP49359.2023.10222502
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Dataset bias has been identified as a major challenge in image captioning. When an image captioning model predicts a word, it should rely on the visual evidence associated with that word; in practice, however, the model tends to exploit contextual cues arising from dataset bias and produces biased captions, especially when the dataset is skewed toward specific situations. To address this problem, we approach it from a causal inference perspective and design a causal graph. Based on this graph, we propose a novel method named C²Cap, a CLIP confounder-free captioning network. We use a global visual confounder to control the confounding factors in the image and train the model to produce debiased captions. We validate the proposed method on the MSCOCO benchmark and demonstrate its effectiveness. Code: https://github.com/yeonju7kim/C2Cap
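To make the approach concrete, below is a minimal PyTorch sketch of the kind of confounder-controlled fusion the abstract describes: a learned dictionary approximating the global visual confounder is attended over with a CLIP image embedding, and the result is fused with the visual feature before caption decoding, in the spirit of backdoor adjustment. All names here (ConfounderFreeFusion, num_confounders, the exact fusion) are illustrative assumptions, not the paper's actual C²Cap implementation.

# Hypothetical sketch of confounder-controlled fusion (not the official C2Cap code).
# Assumes a global CLIP image embedding of dimension d_clip and a learned
# confounder dictionary; the decoder then conditions on the debiased feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfounderFreeFusion(nn.Module):
    def __init__(self, d_clip=512, d_model=512, num_confounders=100):
        super().__init__()
        # Learned dictionary approximating the global visual confounder Z.
        self.confounders = nn.Parameter(torch.randn(num_confounders, d_model))
        self.q_proj = nn.Linear(d_clip, d_model)    # query from CLIP image feature
        self.out = nn.Linear(2 * d_model, d_model)  # fuse visual and confounder context

    def forward(self, clip_feat):
        # clip_feat: (B, d_clip) global CLIP image embedding.
        q = self.q_proj(clip_feat)                                        # (B, d_model)
        attn = F.softmax(q @ self.confounders.t() / q.size(-1) ** 0.5, dim=-1)
        z = attn @ self.confounders                                       # (B, d_model), soft estimate of E[z]
        # Concatenating the visual evidence with the confounder estimate lets the
        # decoder approximate P(word | image, z) rather than the biased P(word | image).
        return self.out(torch.cat([q, z], dim=-1))

# Usage: feed the fused feature to a caption decoder in place of the raw CLIP feature.
fusion = ConfounderFreeFusion()
debiased_feat = fusion(torch.randn(4, 512))  # (4, 512)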
Pages: 1720-1724 (5 pages)