Cross-modal alignment with synthetic caption for text-based person search

Authors
Weichen Zhao [1]
Yuxing Lu [3]
Zhiyuan Liu [2]
Yuan Yang [3]
Ge Jiao [1]
Affiliations
[1] Hengyang Normal University, College of Computer Science and Technology
[2] Peking University, College of Future Technology
[3] Soochow University, School of Computer Science and Technology
Keywords
Text-based person search; Cross-modal retrieval; Cross-modal alignment; Synthetic caption
DOI
10.1007/s13735-025-00356-w
Abstract
Text-based person search aims to retrieve a target person from a large gallery based on a natural language description. Existing methods treat it as either a one-to-one or a many-to-many embedding matching problem. The former relies on the assumption that text and images are strongly aligned, while the latter inevitably suffers from intra-class variation. Rather than being confined to these two approaches, we propose a new strategy, named CASC, that achieves cross-modal alignment with synthetic captions through joint image-text-caption optimization. The core of this strategy lies in generating fine-grained captions that are informative for multimodal alignment. To realize this, we introduce two novel components: the Granularity Awareness Sensor (GAS) and Conditional Contrastive Learning (CCL). GAS selects relevant features through an adaptive masking strategy, giving the model an enhanced perception of discriminative features. CCL aligns the modalities by placing further constraints on the synthetic captions, comparing the similarity of hard negative samples to guard against disruption from noisy content. With the extra caption supervision, the model can learn a more comprehensive feature representation, which in turn boosts retrieval performance during inference. Experiments demonstrate that CASC outperforms existing state-of-the-art methods by 1.20%, 2.35% and 2.29% in Rank@1 on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively.
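
The Rank@1 figures quoted in the abstract refer to a standard retrieval metric: the fraction of text queries whose top-ranked gallery image shows the queried person. The sketch below illustrates how such a metric is commonly computed from embeddings. It is not the authors' evaluation code; the function name `rank_at_k`, the use of cosine similarity, and the toy embeddings and identity labels are all assumptions made for illustration.

```python
import numpy as np


def rank_at_k(text_emb, image_emb, text_ids, image_ids, k=1):
    """Fraction of text queries whose top-k gallery images include the queried identity.

    Hypothetical helper for illustration; assumes cosine similarity as the ranking score.
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = t @ g.T  # shape: (num_queries, num_gallery)

    # Indices of the k most similar gallery images for each query.
    topk = np.argsort(-sim, axis=1)[:, :k]

    # A query counts as a hit if any of its top-k images has the same identity.
    hits = (image_ids[topk] == text_ids[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data (made up): three gallery images and three text queries in a 2-D space.
image_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
image_ids = np.array([0, 1, 2])
text_emb = np.array([[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]])
text_ids = np.array([0, 1, 2])

print(rank_at_k(text_emb, image_emb, text_ids, image_ids, k=1))
print(rank_at_k(text_emb, image_emb, text_ids, image_ids, k=2))
```

In this toy setup the third query ranks the wrong identity first, so two of the three queries are hits at k=1, and all three are hits at k=2.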