Improving Text Embeddings with Large Language Models

Cited by: 0
Authors
Wang, Liang [1 ]
Yang, Nan [1 ]
Huang, Xiaolong [1 ]
Yang, Linjun [1 ]
Majumder, Rangan [1 ]
Wei, Furu [1 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pretraining with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
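The "standard contrastive loss" mentioned in the abstract is the InfoNCE objective with in-batch negatives, commonly paired with last-token pooling when the encoder is a decoder-only LM. Below is a minimal PyTorch sketch of that objective; the function names, the pooling choice, and the temperature value are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def last_token_pool(hidden_states, attention_mask):
    # Use the hidden state of each sequence's final non-padding token
    # as its embedding (a common pooling choice for decoder-only models).
    last_idx = attention_mask.sum(dim=1) - 1                   # (batch,)
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]                  # (batch, dim)

def info_nce_loss(query_emb, doc_emb, temperature=0.02):
    # In-batch-negative contrastive loss: doc_emb[i] is the positive
    # for query_emb[i]; every other row in the batch acts as a negative.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                             # scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

Given a batch of (query, document) pairs encoded by the fine-tuned LM, this loss pulls each query toward its paired document and pushes it away from the other in-batch documents; the low temperature (0.02 here, an assumed value) sharpens the softmax over similarities.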
Pages: 11897-11916
Page count: 20
Related Papers
50 records in total
  • [41] Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models
    Reif, Emily
    Kahng, Minsuk
    Petridis, Savvas
    2023 IEEE VISUALIZATION AND VISUAL ANALYTICS (VIS), 2023: 236-240
  • [42] Large language models recover scientific collaboration networks from text
    Jeyaram, Rathin
    Ward, Robert N.
    Santolini, Marc
    APPLIED NETWORK SCIENCE, 2024, 9 (01)
  • [43] Medical foundation large language models for comprehensive text analysis and beyond
    Xie, Qianqian
    Chen, Qingyu
    Chen, Aokun
    Peng, Cheng
    Hu, Yan
    Lin, Fongci
    Peng, Xueqing
    Huang, Jimin
    Zhang, Jeffrey
    Keloth, Vipina
    Zhou, Xinyu
    Qian, Lingfei
    He, Huan
    Shung, Dennis
    Ohno-Machado, Lucila
    Wu, Yonghui
    Xu, Hua
    Bian, Jiang
    NPJ DIGITAL MEDICINE, 2025, 8 (01)
  • [44] From text to insight: large language models for chemical data extraction
    Schilling-Wilhelmi, Mara
    Rios-Garcia, Martino
    Shabih, Sherjeel
    Gil, Maria Victoria
    Miret, Santiago
    Koch, Christoph T.
    Marquez, Jose A.
    Jablonka, Kevin Maik
    CHEMICAL SOCIETY REVIEWS, 2025, 54 (03): 1125-1150
  • [45] Fine-tuning large language models for chemical text mining
    Zhang, Wei
    Wang, Qinggong
    Kong, Xiangtai
    Xiong, Jiacheng
    Ni, Shengkun
    Cao, Duanhua
    Niu, Buying
    Chen, Mingan
    Li, Yameng
    Zhang, Runze
    Wang, Yitian
    Zhang, Lehan
    Li, Xutong
    Xiong, Zhaoping
    Shi, Qian
    Huang, Ziming
    Fu, Zunyun
    Zheng, Mingyue
    CHEMICAL SCIENCE, 2024, 15 (27): 10600-10611
  • [46] Large language models overcome the challenges of unstructured text data in ecology
    Castro, Andry
    Pinto, Joao
    Reino, Luis
    Pipek, Pavel
    Capinha, Cesar
    ECOLOGICAL INFORMATICS, 2024, 82
  • [47] Evaluation and Analysis of Large Language Models for Clinical Text Augmentation and Generation
    Latif, Atif
    Kim, Jihie
    IEEE ACCESS, 2024, 12: 48987-48996
  • [48] Learning the Visualness of Text Using Large Vision-Language Models
    Verma, Gaurav
    Rossi, Ryan A.
    Tensmeyer, Christopher
    Gu, Jiuxiang
    Nenkova, Ani
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 2394-2408
  • [49] Large language models as a substitute for human experts in annotating political text
    Heseltine, Michael
    von Hohenberg, Bernhard Clemm
    RESEARCH & POLITICS, 2024, 11 (01)
  • [50] A Two-Stage Adaptation of Large Language Models for Text Ranking
    Zhang, Longhui
    Zhang, Yanzhao
    Long, Dingkun
    Xie, Pengjun
    Zhang, Meishan
    Zhang, Min
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024: 11880-11891