Improving Text Embeddings with Large Language Models

Authors
Wang, Liang [1]
Yang, Nan [1]
Huang, Xiaolong [1]
Yang, Linjun [1]
Majumder, Rangan [1]
Wei, Furu [1]
Affiliation
[1] Microsoft Corp, Redmond, WA 98052 USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and fewer than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pretraining with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using a standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
Pages: 11897-11916
Page count: 20