ClusterE-ZSL: A Novel Cluster-Based Embedding for Enhanced Zero-Shot Learning in Contrastive Pre-Training Cross-Modal Retrieval

Cited by: 1
Authors
Tariq, Umair [1 ]
Hu, Zonghai [1 ]
Tasneem, Khawaja Tauseef [2 ]
Bin Heyat, Md Belal [3 ]
Iqbal, Muhammad Shahid [4 ]
Aziz, Kamran [5 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Elect Engn, Beijing, Peoples R China
[2] Saudi Elect Univ, Coll Comp & Informat, Informat Technol Dept, Riyadh, Saudi Arabia
[3] Westlake Univ, CenBRAIN Neurotech Ctr Excellence, Sch Engn, Hangzhou 310024, Zhejiang, Peoples R China
[4] Anhui Univ, Sch Comp Sci & Technol, Hefei, Anhui, Peoples R China
[5] Wuhan Univ, Sch Cyber Sci & Engn, Lab Aerosp Informat Secur & Trusted Comp, Minist Educ, Wuhan 430072, Peoples R China
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Key Research and Development Program of China;
Keywords
Zero-shot learning; Vectors; Computational modeling; Data models; Accuracy; Visualization; Semantics; Contrastive learning; Transformers; Training; Clustering methods; Self-supervised learning; Machine learning; embedded; cluster; embedded computing; cross-modal retrieval; multi-model machine learning; TEXT CLASSIFICATION; IMAGE;
DOI
10.1109/ACCESS.2024.3476082
Chinese Library Classification
TP [Automation technology; computer technology];
Discipline Code
0812;
Abstract
Zero-shot learning (ZSL) in a multimodal environment presents significant challenges and opportunities for improving cross-modal retrieval and object detection on unseen data. This study introduces a novel embedding approach based on vector-space clustering to address image-to-text and text-to-image retrieval problems effectively. We propose an iterative training strategy; unlike the CLIP model, which directly compares the visual and textual modalities, our model clusters trained image and text features in a common vector space. We use a cross-modal contrastive loss and a multi-stage contrastive loss to improve the unsupervised learning of our model. This integration yields well-separated clusters in the embedding space, which improves image-text matching in zero-shot learning tasks. We rigorously evaluated our model on standard benchmark datasets, including Flickr30K, Flickr8K, and MSCOCO 5K, achieving notable improvements with accuracies of 91.3%, 88.8%, and 90.3%, respectively. The results not only demonstrate the superior performance of our model over existing methods but also show its effectiveness in enhancing cross-modal retrieval in zero-shot learning.
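The cross-modal contrastive objective the abstract refers to can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it assumes a standard symmetric InfoNCE formulation (as in CLIP-style training), where matched image/text embedding pairs lie on the diagonal of a similarity matrix and the loss is cross-entropy applied in both retrieval directions. The function names and the `temperature` value are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched image/text embedding pairs.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matched pair.
    Returns the mean of the image->text and text->image cross-entropy losses.
    """
    img = l2_normalize(np.asarray(img_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(txt_emb, dtype=np.float64))
    logits = img @ txt.T / temperature           # (N, N) similarity matrix
    labels = np.arange(logits.shape[0])          # matched pairs lie on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()      # negative log-prob of the diagonal

    # average over both retrieval directions (image->text and text->image)
    return 0.5 * (xent(logits) + xent(logits.T))
```

With orthogonal, correctly paired embeddings the loss is near zero; permuting one modality so pairs no longer match drives it up sharply, which is the signal that pulls matched pairs together and pushes mismatched ones apart in the shared space.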
Pages: 162622-162637
Page count: 16
Related Papers (26 total)
• [21] Mou, Miao; Zhao, Xiaoqiang; Liu, Kai; Hui, Yongyong. Variational autoencoder based on distributional semantic embedding and cross-modal reconstruction for generalized zero-shot fault diagnosis of industrial processes. PROCESS SAFETY AND ENVIRONMENTAL PROTECTION, 2023, 177: 1154-1167
• [22] Xu, Guanglong; Hu, Zhensheng; Cai, Jia. WAD-CMSN: Wasserstein distance-based cross-modal semantic network for zero-shot sketch-based image retrieval. INTERNATIONAL JOURNAL OF WAVELETS MULTIRESOLUTION AND INFORMATION PROCESSING, 2023, 21 (02)
• [23] Du, Jiale; Liu, Yang; Gao, Xinbo; Han, Jungong; Zhang, Lei. Zero-Shot Sketch-Based Image Retrieval with teacher-guided and student-centered cross-modal bidirectional knowledge distillation. PATTERN RECOGNITION, 2025, 164
• [24] Dong, Xingning; Feng, Zipeng; Zhou, Chunluan; Yu, Xuzheng; Yang, Ming; Guo, Qingpei. M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval. PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024: 2156-2166
• [25] Xu, Dongqin; Li, Junhui; Zhu, Muhua; Zhang, Min; Zhou, Guodong. XLPT-AMR: Cross-Lingual Pre-Training via Multi-Task Learning for Zero-Shot AMR Parsing and Text Generation. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021: 896-907
• [26] Gong, Ziyu; Huang, Yihua; Yu, Chunhua; Dai, Peng; Ge, Xing; Shen, Yiming; Liu, Yafei. ACF-R+: An asymmetry-sensitive method for image-text retrieval enhanced by cross-modal fusion and re-ranking based on contrastive learning. NEUROCOMPUTING, 2025, 628