HCCL: Hierarchical Counterfactual Contrastive Learning for Robust Visual Question Answering

Cited by: 0
Authors
Hao, Dongze [1 ,2 ]
Wang, Qunbo [1 ]
Zhu, Xinxin [1 ]
Liu, Jing [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Lab Cognit & Decis Intelligence Complex Syst, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual question answering; hierarchical counterfactual contrastive learning; robust VQA
DOI
10.1145/3673902
Chinese Library Classification
TP [Automation and Computer Technology]
Discipline Code
0812
Abstract
Although most state-of-the-art models achieve impressive performance on Visual Question Answering (VQA), they often exploit dataset biases to answer questions. Recently, some studies have synthesized counterfactual training samples to help models mitigate these biases. However, such synthetic samples require extra annotations and often contain noise. Moreover, these methods simply add the synthetic samples to the training data and train with the cross-entropy loss, which does not make the best use of the synthetic samples for bias mitigation. In this article, to mitigate the biases in VQA more effectively, we propose a Hierarchical Counterfactual Contrastive Learning (HCCL) method. First, to avoid introducing noise and extra annotations, our method automatically masks the unimportant features in original pairs to obtain positive samples and creates mismatched question-image pairs as negative samples. Then, it applies feature-level and answer-level contrastive learning to pull the original sample toward the positive samples in the feature space while pushing it away from the negative samples in both the feature and answer spaces. In this way, the VQA model learns robust multimodal features and attends to both visual and linguistic information when producing the answer. HCCL can be adopted on top of different baselines, and experimental results on the VQA v2, VQA-CP, and GQA-OOD datasets show that it effectively mitigates biases in VQA and improves model robustness.
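The feature-level contrast described in the abstract (masked-feature positives, mismatched question-image pairs as negatives) follows the standard InfoNCE formulation. A minimal NumPy sketch under that assumption; the variable names, importance scores, and the 20% masking threshold are illustrative, not taken from the paper:

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor: pull the positive close, push negatives away."""
    anchor, positive, negatives = l2norm(anchor), l2norm(positive), l2norm(negatives)
    pos = np.exp(np.dot(anchor, positive) / tau)
    neg = np.exp(negatives @ anchor / tau).sum()
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
fused = rng.normal(size=128)                        # fused question-image feature (original pair)
importance = rng.random(128)                        # stand-in for learned importance scores
mask = importance > np.quantile(importance, 0.2)    # mask out the 20% least important dims
positive = fused * mask                             # counterfactual positive: unimportant features masked
negatives = rng.normal(size=(8, 128))               # features of mismatched question-image pairs

loss = info_nce(fused, positive, negatives)
```

The answer-level objective in HCCL would apply the same pull/push idea in the answer space rather than the feature space; this sketch only illustrates the feature-level term.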
Pages: 21