C-Pack: Packed Resources For General Chinese Embeddings

Cited by: 6
Authors
Xiao, Shitao [1]
Liu, Zheng [1]
Zhang, Peitian [2]
Muennighoff, Niklas [3]
Lian, Defu [4]
Nie, Jian-Yun [5]
Affiliations
[1] Beijing Acad AI, Beijing, Peoples R China
[2] Renmin Univ China, Beijing, Peoples R China
[3] HuggingFace, Beijing, Peoples R China
[4] USTC, Hefei, Peoples R China
[5] Univ Montreal, Montreal, PQ, Canada
Source
PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024 | 2024
Keywords
Text Embeddings; Training Data; Benchmark; Pre-trained Models
DOI
10.1145/3626772.3657878
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We introduce C-Pack, a package of resources that significantly advances the field of general text embeddings for Chinese. C-Pack includes three critical resources. 1) C-MTP is a massive training dataset for text embedding, which is based on the curation of vast unlabeled corpora and the integration of high-quality labeled corpora. 2) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 3) BGE is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by more than 10% at the time of release. We also integrate and optimize the entire suite of training methods for BGE. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models also achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. Both Chinese and English datasets are the largest public release of training data for text embeddings.
Pages: 641-649
Page count: 9