Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

Cited by: 0
Authors
Wang, Xiao [1 ]
Zhou, Weikang [1 ]
Zhang, Qi [1 ]
Zhou, Jie [1 ]
Gao, Songyang [1 ]
Wang, Junzhe [1 ]
Zhang, Menghan [2 ]
Gao, Xiang [3 ]
Chen, Yunwen [3 ]
Gui, Tao [2 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Fudan Univ, Inst Modern Languages & Linguist, Shanghai, Peoples R China
[3] DataGrand Informat Technol Shanghai Co Ltd, Shanghai, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China
Keywords
DOI
Not available
CLC number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, which has resulted in significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language models, which explicitly uses end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples expected to provide the most positive influence on end-task performance. Furthermore, we design a gradient-matching-based influence estimation method, which drastically reduces the time needed to compute influence. With only 0.45% of the data and a computational cost three orders of magnitude lower, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
Pages: 555-568
Page count: 14
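
The abstract describes scoring pretraining samples by their estimated influence on the end-task via gradient matching. Below is a minimal, hypothetical sketch of that idea, not the authors' implementation: a toy PyTorch model scores each candidate sample by the dot product between its loss gradient and the aggregate end-task gradient, then keeps the top 0.45% (the budget reported in the abstract). The model, losses, and random data are all placeholders.

```python
# Minimal sketch (assumed, not the paper's exact algorithm) of
# gradient-matching-based influence scoring for subset selection.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for an encoder and its two objectives.
model = nn.Linear(16, 4)                    # placeholder "language model"
pretrain_loss = nn.CrossEntropyLoss()       # stands in for the pretraining (MLM) loss
endtask_loss = nn.CrossEntropyLoss()        # stands in for the end-task loss

pretrain_x = torch.randn(1000, 16)          # candidate pretraining corpus (toy)
pretrain_y = torch.randint(0, 4, (1000,))
task_x = torch.randn(64, 16)                # small labeled end-task set (toy)
task_y = torch.randint(0, 4, (64,))

def flat_grad(loss):
    """Flatten the gradient of `loss` w.r.t. all model parameters into one vector."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Aggregate gradient of the end-task loss, computed once.
task_grad = flat_grad(endtask_loss(model(task_x), task_y))

# Score each pretraining sample by how well its gradient aligns with the end-task gradient.
scores = torch.empty(len(pretrain_x))
for i in range(len(pretrain_x)):
    g_i = flat_grad(pretrain_loss(model(pretrain_x[i:i + 1]), pretrain_y[i:i + 1]))
    scores[i] = torch.dot(g_i, task_grad)   # higher = more positive estimated influence

# Keep only the top 0.45% of candidates (the budget reported in the abstract).
k = max(1, int(0.0045 * len(pretrain_x)))
selected = torch.topk(scores, k).indices
print(f"Selected {k} of {len(pretrain_x)} candidate samples:", selected.tolist())
```

In the paper's setting the candidates would be pretraining documents scored against a masked-language-modeling loss and a task loss on an actual language-model checkpoint; the toy linear model above only illustrates the selection mechanics.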