Deconfounded fashion image captioning with transformer and multimodal retrieval

Cited by: 0

Authors:

Tao PENG

Weiqiao YIN

Junping LIU

Li LI

Xinrong HU

Affiliation:

[1] School of Computer Science and Artificial Intelligence, Wuhan Textile University

Source:

虚拟现实与智能硬件(中英文) (Virtual Reality & Intelligent Hardware) | 2025 / Vol. 7 / Issue 02

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP391.41 []; TS941.2 [Design, calculation, illustration];

Discipline classification code:

080203 ; 1403 ;

Abstract:

Background: The annotation of fashion images is an important task in the fashion industry, as well as in social media and e-commerce. However, owing to the complexity and diversity of fashion images, this task poses multiple challenges, including a lack of fine-grained captions and confounders caused by dataset bias. Specifically, confounders often cause models to learn spurious correlations, thereby reducing their generalization capability. Method: In this work, we propose the Deconfounded Fashion Image Captioning (DFIC) framework, which first uses multimodal retrieval to enrich the predicted captions of clothing and then constructs a detailed causal graph in the decoder, using causal inference, to perform deconfounding. Multimodal retrieval obtains semantic words related to the image features; these words are fed into the decoder as prompt words to enrich the sentence descriptions. In the decoder, causal inference is applied to disentangle visual and semantic features while simultaneously eliminating visual and language confounding. Results: The proposed framework was experimentally verified on the FACAD dataset. Overall, our method not only effectively enriches the captions of target images but also greatly reduces the confounding caused by the dataset.
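The retrieval step described in the abstract (finding semantic words related to the image features and passing them to the decoder as prompt words) can be illustrated with a minimal sketch. It assumes that image features and word embeddings already live in a shared vector space and that relatedness is measured by cosine similarity; the abstract specifies neither, and the function names, toy vocabulary, and vectors below are hypothetical, not taken from the DFIC implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_prompt_words(image_feature, word_embeddings, k=2):
    """Return the k words whose embeddings are most similar to the image feature.

    word_embeddings maps each candidate word to its embedding vector
    (a hypothetical stand-in for a real multimodal retrieval index).
    """
    ranked = sorted(
        word_embeddings,
        key=lambda word: cosine(image_feature, word_embeddings[word]),
        reverse=True,
    )
    return ranked[:k]


# Toy data (assumed for illustration only): three-dimensional embeddings.
word_embeddings = {
    "floral": [1.0, 0.0, 0.0],
    "denim": [0.0, 1.0, 0.0],
    "sleeveless": [0.7, 0.0, 0.7],
    "leather": [0.0, 0.0, 1.0],
}
image_feature = [0.9, 0.1, 0.3]

# The retrieved words would then be supplied to the caption decoder as prompts.
prompt_words = retrieve_prompt_words(image_feature, word_embeddings, k=2)
print(prompt_words)
```

In a real system, the embeddings would come from a trained multimodal encoder and the ranking would typically use an approximate nearest-neighbor index rather than a full sort.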


Pages: 127-138

Page count: 12

50 entries in total

[1] Deconfounded Image Captioning: A Causal Retrospect
Yang, Xu
Zhang, Hanwang
Cai, Jianfei
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 12996 - 13010
[2] Retrieval-Augmented Transformer for Image Captioning
Sarto, Sara
Cornia, Marcella
Baraldi, Lorenzo
Cucchiara, Rita
19TH INTERNATIONAL CONFERENCE ON CONTENT-BASED MULTIMEDIA INDEXING, CBMI 2022, 2022, : 1 - 7
[3] Multimodal Transformer With Multi-View Visual Representation for Image Captioning
Yu, Jun
Li, Jing
Yu, Zhou
Huang, Qingming
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (12) : 4467 - 4480
[4] Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning
Ferrod, Roger
Di Caro, Luigi
Ienco, Dino
DISCOVERY SCIENCE, DS 2024, PT II, 2025, 15244 : 231 - 245
[5] Distance Transformer for Image Captioning
Wang, Jiarong
Lu, Tongwei
Liu, Xuanxuan
Yang, Qi
2021 4TH INTERNATIONAL CONFERENCE ON ROBOTICS, CONTROL AND AUTOMATION ENGINEERING (RCAE 2021), 2021, : 73 - 76
[6] Rotary Transformer for Image Captioning
Qiu, Yile
Zhu, Li
SECOND INTERNATIONAL CONFERENCE ON OPTICS AND IMAGE PROCESSING (ICOIP 2022), 2022, 12328
[7] Entangled Transformer for Image Captioning
Li, Guang
Zhu, Linchao
Liu, Ping
Yang, Yi
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 8927 - 8936
[8] Boosted Transformer for Image Captioning
Li, Jiangyun
Yao, Peng
Guo, Longteng
Zhang, Weicun
APPLIED SCIENCES-BASEL, 2019, 9 (16):
[9] Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates
Moratelli, Nicholas
Barraco, Manuele
Morelli, Davide
Cornia, Marcella
Baraldi, Lorenzo
Cucchiara, Rita
SENSORS, 2023, 23 (03)
[10] ATTRIBUTE CONDITIONED FASHION IMAGE CAPTIONING
Cai, Chen
Yap, Kim-Hui
Wang, Suchen
2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1921 - 1925
