Improving Text Embeddings with Large Language Models

Cited by: 0
Authors
Wang, Liang [1 ]
Yang, Nan [1 ]
Huang, Xiaolong [1 ]
Yang, Linjun [1 ]
Majumder, Rangan [1 ]
Wei, Furu [1 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
Keywords: (none listed)
DOI: (not available)
CLC classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pretraining with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
Pages: 11897-11916 (20 pages)
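The abstract states that open-source decoder-only LLMs are fine-tuned on synthetic data "using standard contrastive loss." The specific formulation is not given in this record; as a rough illustration only, a minimal NumPy sketch of an InfoNCE-style contrastive loss with in-batch negatives might look like the following (the function name and the temperature value are illustrative assumptions, not details from the paper):

```python
import numpy as np

def info_nce_loss(query_emb, doc_emb, temperature=0.02):
    """InfoNCE contrastive loss with in-batch negatives.

    query_emb, doc_emb: (batch, dim) arrays. Row i of doc_emb is the
    positive document for query i; all other rows act as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature                 # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    # Row-wise log-softmax; the diagonal entries are the positive pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With perfectly aligned query/document embeddings the loss approaches zero, while mismatched pairs drive it up, which is the gradient signal that pulls matching texts together in embedding space.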