Cross-modal alignment with synthetic caption for text-based person search

Cited by: 0

Authors:

Weichen Zhao [1]

Yuxing Lu [3]

Zhiyuan Liu [2]

Yuan Yang [3]

Ge Jiao [1]

Affiliations:

[1] Hengyang Normal University, College of Computer Science and Technology

[2] Peking University, College of Future Technology

[3] Soochow University, School of Computer Science and Technology

Source:

International Journal of Multimedia Information Retrieval | 2025, Vol. 14, No. 2

Keywords:

Text-based person search; Cross-modal retrieval; Cross-modal alignment; Synthetic caption

DOI:

10.1007/s13735-025-00356-w

Abstract:

Text-based person search aims to retrieve a target person from a large gallery based on a natural language description. Existing methods treat it as either a one-to-one or a many-to-many embedding matching problem. The former relies on the assumption that text and images are strongly aligned, while the latter inevitably suffers from intra-class variation. Rather than being confined to these two approaches, we propose a new strategy, named CASC, that achieves cross-modal alignment with synthetic captions for joint image-text-caption optimization. The core of this strategy lies in generating fine-grained captions that are informative for multimodal alignment. To realize this, we introduce two novel components: the Granularity Awareness Sensor (GAS) and Conditional Contrastive Learning (CCL). GAS selects relevant features through an adaptive masking strategy, giving the model an enhanced perception of discriminative features. CCL aligns the modalities by placing further constraints on the synthetic captions, comparing the similarity of hard negative samples to guard against disruption from noisy content. With the extra caption supervision, the model can learn a more comprehensive feature representation, which in turn boosts retrieval performance during inference. Experiments demonstrate that CASC outperforms existing state-of-the-art methods by 1.20%, 2.35% and 2.29% in terms of Rank@1 on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively.
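
The abstract reports results as Rank@1. As a minimal sketch, assuming the usual retrieval definition (a text query counts as a hit when at least one of its k highest-scoring gallery images shares the query's person identity), the metric could be computed as below. The function name `rank_at_k`, its parameters, and the toy scores are hypothetical illustrations and are not taken from the paper or its evaluation code.

```python
def rank_at_k(similarity, query_ids, gallery_ids, k=1):
    """Fraction of queries whose top-k gallery items include a matching identity.

    similarity[i][j] is an assumed text-to-image score between query i and
    gallery image j (higher means more similar).
    """
    hits = 0
    for row, qid in zip(similarity, query_ids):
        # Gallery indices sorted by descending similarity to this query.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(gallery_ids[j] == qid for j in order[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy data (made up): 3 text queries against 4 gallery images.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: top image is index 0 (identity 7), a hit
    [0.2, 0.4, 0.8, 0.1],  # query 1: top image is index 2 (identity 9), a miss at k=1
    [0.1, 0.2, 0.3, 0.7],  # query 2: top image is index 3 (identity 9), a hit
]
query_ids = [7, 8, 9]
gallery_ids = [7, 8, 9, 9]

rank1 = rank_at_k(similarity, query_ids, gallery_ids, k=1)
rank2 = rank_at_k(similarity, query_ids, gallery_ids, k=2)
```

On this toy data, two of the three queries are hits at k=1, and all three are hits at k=2 because query 1's correct image is ranked second.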
