Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval

Times cited: 0

Authors:

Huang, Hailang [1]

Nie, Zhijie [1, 2]

Wang, Ziqiao [3]

Shang, Ziyu [4]

Affiliations:

[1] Beihang Univ, Sch Comp Sci & Engn, SKLSDE, Beijing, Peoples R China

[2] Beihang Univ, Shen Yuan Honors Coll, Beijing, Peoples R China

[3] Univ Ottawa, Sch Elect Engn & Comp Sci, Ottawa, ON, Canada

[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China

Source:

THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16 | 2024

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Theory of artificial intelligence]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method uses uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. In addition, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and to improve similarity recognition between uni-modal samples. Our method is designed to be plug-and-play: it can be applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method consistently improves image-text retrieval performance and achieves new state-of-the-art results. Furthermore, our method also boosts the uni-modal retrieval performance of image-text retrieval models, enabling them to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
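The soft-label supervision described in the abstract can be illustrated with a small sketch. It rests on our own assumptions, not on details taken from the paper. First, a frozen uni-modal teacher's similarity scores are turned into target distributions with a temperature softmax. Second, the retrieval model's scores for the same batch are pulled toward those targets with KL divergence.

All names (`soft_label_alignment_loss`, `softmax`, `kl_divergence`) and the temperature values are hypothetical, and this is not the authors' implementation. For a CSA-like setting, one plausible pairing is image-text scores from the retrieval model as `student_sim` and text-text scores from a uni-modal text encoder as `teacher_sim`. For a USA-like setting, both would be image-image (or text-text) scores.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float], temperature: float) -> List[float]:
    """Numerically stable softmax over one row of similarity scores."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def soft_label_alignment_loss(student_sim: Matrix, teacher_sim: Matrix,
                              student_temp: float = 0.05,
                              teacher_temp: float = 0.05) -> float:
    """Hypothetical soft-label alignment loss (a sketch, not CUSA itself).

    Row i of teacher_sim holds similarities from a frozen uni-modal model
    and becomes a soft target distribution. Row i of student_sim holds the
    retrieval model's scores for the same batch; it is softmax-normalized
    and compared to the target with KL divergence, averaged over rows.
    """
    if len(student_sim) != len(teacher_sim):
        raise ValueError("student and teacher batches must have equal size")
    total = 0.0
    for s_row, t_row in zip(student_sim, teacher_sim):
        target = softmax(t_row, teacher_temp)  # soft label from the teacher
        pred = softmax(s_row, student_temp)    # student's distribution
        total += kl_divergence(target, pred)
    return total / len(student_sim)


# Toy batch of three samples: the teacher rates samples 0 and 1 as near-paraphrases.
teacher = [[1.0, 0.9, 0.1],
           [0.9, 1.0, 0.1],
           [0.1, 0.1, 1.0]]
# A student that only trusts the diagonal (a one-hot view of matches).
hard_student = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]]
loss = soft_label_alignment_loss(hard_student, teacher)
```

Consider a one-hot target, as in a standard contrastive loss. There, any score the model gives to a near-duplicate sample in the same batch is penalized as a negative. The soft target above behaves differently: it spreads probability mass over samples the teacher considers similar. This is one way soft labels can ease the false-negative problem the abstract mentions.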

Pages: 18298-18306

Page count: 9