Deconfounded fashion image captioning with transformer and multimodal retrieval

Authors
Tao PENG
Weiqiao YIN
Junping LIU
Li LI
Xinrong HU
Affiliation
[1] School of Computer Science and Artificial Intelligence, Wuhan Textile University
Keywords
DOI
Not available
Chinese Library Classification number
TP391.41 []; TS941.2 [Design, calculation, illustration];
Discipline classification code
080203 ; 1403 ;
Abstract
Background: Annotating fashion images is an important task in the fashion industry, social media, and e-commerce. However, the complexity and diversity of fashion images make the task challenging: fine-grained captions are scarce, and dataset bias introduces confounders. These confounders often lead models to learn spurious correlations, which reduces their ability to generalize. Method: We propose the Deconfounded Fashion Image Captioning (DFIC) framework. It first uses multimodal retrieval to enrich the predicted clothing captions, and then builds a detailed causal graph in the decoder, using causal inference to perform deconfounding. Multimodal retrieval obtains semantic words related to the image features; these words are fed into the decoder as prompt words to enrich the sentence descriptions. In the decoder, causal inference is applied to disentangle visual and semantic features while also eliminating visual and language confounding. Results: The framework was evaluated experimentally on the FACAD dataset. Overall, our method not only effectively enriches the captions of target images but also greatly reduces the confounding caused by the dataset.
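The retrieval step described in the abstract (obtaining semantic words related to image features and passing them to the decoder as prompt words) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional embeddings, and the use of cosine similarity as the ranking score are all assumptions made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_prompt_words(image_vec, word_vecs, k=2):
    """Return the k words whose embeddings are most similar to the image embedding.

    Hypothetical helper: it assumes image and word embeddings already live in a
    shared space, which the abstract does not specify.
    """
    ranked = sorted(
        word_vecs.items(),
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]

# Toy embeddings (made up for this example, not taken from FACAD).
word_vecs = {
    "floral": [0.9, 0.1, 0.0],
    "sleeveless": [0.7, 0.6, 0.1],
    "denim": [0.0, 0.2, 0.9],
}
image_vec = [1.0, 0.3, 0.0]

# The retrieved words would then be supplied to the caption decoder as prompts.
prompt_words = retrieve_prompt_words(image_vec, word_vecs, k=2)
print(prompt_words)
```

In a real system the embeddings would come from trained image and text encoders; the sketch only shows the ranking-and-selection idea.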
Pages: 127-138
Page count: 12