Improving Text Embeddings with Large Language Models

Cited: 0
Authors
Wang, Liang [1 ]
Yang, Nan [1 ]
Huang, Xiaolong [1 ]
Yang, Linjun [1 ]
Majumder, Rangan [1 ]
Wei, Furu [1 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pretraining with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
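The abstract describes fine-tuning decoder-only LLMs on synthetic (query, passage) pairs with a "standard contrastive loss". A minimal sketch of one common such objective, InfoNCE with in-batch negatives, is shown below; the temperature value and the use of cosine similarity are illustrative assumptions, not details confirmed by the abstract:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(queries, passages, temperature=0.05):
    """InfoNCE contrastive loss with in-batch negatives:
    the i-th passage is the positive for the i-th query, and every
    other passage in the batch serves as a negative."""
    n = len(queries)
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)  # cross-entropy on the positive
    return total / n

# Aligned pairs give a near-zero loss; mismatched pairs give a large one.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
```

In a real fine-tuning run the vectors would be the model's pooled hidden states and the loss would be backpropagated; this pure-Python version only illustrates the objective itself.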
Pages: 11897-11916 (20 pages)
Related Papers
50 in total
  • [1] Text clustering with large language model embeddings
    Petukhova, Alina
    Matos-Carvalho, João P.
    Fachada, Nuno
    International Journal of Cognitive Computing in Engineering, 2025, 6 : 100 - 108
  • [2] Improving Zero-Shot Text Matching for Financial Auditing with Large Language Models
    Hillebrand, Lars
    Berger, Armin
    Deusser, Tobias
    Dilmaghani, Tim
    Khaled, Mohamed
    Kliem, Bernd
    Loitz, Ruediger
    Pielka, Maren
    Leonhard, David
    Bauckhage, Christian
    Sifa, Rafet
    PROCEEDINGS OF THE 2023 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, DOCENG 2023, 2023,
  • [3] Text Classification via Large Language Models
    Sun, Xiaofei
    Li, Xiaoya
    Li, Jiwei
    Wu, Fei
    Guo, Shangwei
    Zhang, Tianwei
    Wang, Guoyin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 8990 - 9005
  • [4] CoLLM: Integrating Collaborative Embeddings Into Large Language Models for Recommendation
    Zhang, Yang
    Feng, Fuli
    Zhang, Jizhi
    Bao, Keqin
    Wang, Qifan
    He, Xiangnan
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2025, 37 (05) : 2329 - 2340
  • [5] Improving Recommender Systems with Large Language Models
    Lubos, Sebastian
    ADJUNCT PROCEEDINGS OF THE 32ND ACM CONFERENCE ON USER MODELING, ADAPTATION AND PERSONALIZATION, UMAP 2024, 2024, : 40 - 44
  • [6] Utility of word embeddings from large language models in medical diagnosis
    Yazdani, Shahram
    Henry, Ronald Claude
    Byrne, Avery
    Henry, Isaac Claude
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2025, 32 (03) : 526 - 534
  • [7] From Sentence Embeddings to Large Language Models to Detect and Understand Wordplay
    Dsilva, Ryan Rony
    EXPERIMENTAL IR MEETS MULTILINGUALITY, MULTIMODALITY, AND INTERACTION, PT I, CLEF 2024, 2024, 14958 : 205 - 214
  • [8] CLUSTERLLM: Large Language Models as a Guide for Text Clustering
    Zhang, Yuwei
    Wang, Zihan
    Shang, Jingbo
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 13903 - 13920
  • [9] Enabling Large Language Models to Generate Text with Citations
    Gao, Tianyu
    Yen, Howard
    Yu, Jiatong
    Chen, Danqi
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 6465 - 6488
  • [10] Best Practices for Text Annotation with Large Language Models
    Toernberg, Petter
    SOCIOLOGICA-INTERNATIONAL JOURNAL FOR SOCIOLOGICAL DEBATE, 2024, 18 (02): : 67 - 85