ClusterE-ZSL: A Novel Cluster-Based Embedding for Enhanced Zero-Shot Learning in Contrastive Pre-Training Cross-Modal Retrieval

Cited by: 1
Authors
Tariq, Umair [1 ]
Hu, Zonghai [1 ]
Tasneem, Khawaja Tauseef [2 ]
Bin Heyat, Md Belal [3 ]
Iqbal, Muhammad Shahid [4 ]
Aziz, Kamran [5 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Elect Engn, Beijing, Peoples R China
[2] Saudi Elect Univ, Coll Comp & Informat, Informat Technol Dept, Riyadh, Saudi Arabia
[3] Westlake Univ, CenBRAIN Neurotech Ctr Excellence, Sch Engn, Hangzhou 310024, Zhejiang, Peoples R China
[4] Anhui Univ, Sch Comp Sci & Technol, Hefei, Anhui, Peoples R China
[5] Wuhan Univ, Sch Cyber Sci & Engn, Lab Aerosp Informat Secur & Trusted Comp, Minist Educ, Wuhan 430072, Peoples R China
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Key Research and Development Program of China
Keywords
Zero-shot learning; Vectors; Computational modeling; Data models; Accuracy; Visualization; Semantics; Contrastive learning; Transformers; Training; Clustering methods; Self-supervised learning; Machine learning; embedded; cluster; self-supervised learning; embedded computing; cross-modal retrieval; multi-modal machine learning; TEXT CLASSIFICATION; IMAGE
DOI
10.1109/ACCESS.2024.3476082
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Zero-shot learning (ZSL) in a multi-modal environment presents significant challenges and opportunities for improving cross-modal retrieval and object detection on unseen data. This study introduces a novel embedding approach based on vector-space clustering to address image-to-text and text-to-image retrieval effectively. We propose an iterative training strategy: unlike the CLIP model, which directly compares the visual and textual modalities, our model clusters trained image and text features in a common vector space. We use cross-modal contrastive and multi-stage contrastive losses to improve the model's unsupervised learning. This integration produces well-separated clusters in the embedding space, which improves image-text matching in zero-shot learning tasks. We rigorously evaluate our model on standard benchmark datasets, including Flickr30K, Flickr8K, and MSCOCO 5K, achieving notable improvements with accuracies of 91.3%, 88.8%, and 90.3%, respectively. The results not only demonstrate that our model outperforms existing methods but also show its effectiveness in enhancing cross-modal retrieval in zero-shot learning.
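The abstract describes the method only at a high level, and the authors' code is not reproduced here. The following minimal PyTorch sketch illustrates the two ingredients the abstract names: a CLIP-style symmetric cross-modal contrastive loss and a clustering step over image and text features projected into a shared vector space. All names (cross_modal_contrastive_loss, cluster_assignments) and parameters (temperature=0.07, k=4) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image-text pairs (CLIP-style).

    image_emb, text_emb: (batch, dim) L2-normalized embeddings, where
    row i of each tensor describes the same image-caption pair.
    """
    logits = image_emb @ text_emb.t() / temperature         # pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)             # image -> text retrieval
    loss_t2i = F.cross_entropy(logits.t(), targets)         # text -> image retrieval
    return 0.5 * (loss_i2t + loss_t2i)

def cluster_assignments(embeddings, centroids):
    """Hard-assign each embedding to its nearest centroid (one k-means step)."""
    dists = torch.cdist(embeddings, centroids)              # (n, k) distances
    return dists.argmin(dim=1)

# Usage sketch: normalize both modalities into a shared space, apply the
# contrastive loss, then cluster the pooled embeddings.
if __name__ == "__main__":
    torch.manual_seed(0)
    img = F.normalize(torch.randn(8, 128), dim=1)           # stand-ins for encoder outputs
    txt = F.normalize(torch.randn(8, 128), dim=1)
    loss = cross_modal_contrastive_loss(img, txt)
    joint = torch.cat([img, txt], dim=0)                    # common vector space
    centroids = joint[torch.randperm(joint.size(0))[:4]]    # random init, k=4 (assumed)
    print(loss.item(), cluster_assignments(joint, centroids)[:5])
```

In the iterative strategy the abstract describes, the contrastive loss and the clustering step would presumably alternate across training iterations; here a single random batch stands in for real encoder outputs.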
Pages: 162622-162637
Page count: 16