Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

Cited by: 0
|
Authors
Wang, Xiao [1]
Zhou, Weikang [1]
Zhang, Qi [1]
Zhou, Jie [1]
Gao, Songyang [1]
Wang, Junzhe [1]
Zhang, Menghan [2]
Gao, Xiang [3]
Chen, Yunwen [3]
Gui, Tao [2]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Fudan Univ, Inst Modern Languages & Linguist, Shanghai, Peoples R China
[3] DataGrand Informat Technol Shanghai Co Ltd, Shanghai, Peoples R China
Funding
Shanghai Natural Science Foundation; National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, which incurs significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language models, which explicitly uses end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will provide the most positive influence on end-task performance. Furthermore, we design a gradient-matching-based influence estimation method that drastically reduces the computation time of influence. With only 0.45% of the data and a computational cost three orders of magnitude lower, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
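To make the abstract's core idea concrete, the sketch below illustrates one plausible reading of gradient-matching influence scoring: each pretraining sample is scored by how well its gradient aligns with the end-task gradient, and only the best-aligned fraction is kept. This is an illustrative assumption, not the authors' released implementation; the function name, the dot-product score, the 0.45% default, and the toy data are all hypothetical.

import numpy as np

def select_influential_subset(pretrain_grads, endtask_grad, fraction=0.0045):
    # Gradient matching: a pretraining sample whose gradient points in the same
    # direction as the end-task gradient is expected to lower the end-task loss,
    # so its dot-product score is used as an approximate influence value.
    scores = pretrain_grads @ endtask_grad
    k = max(1, int(fraction * len(pretrain_grads)))
    return np.argsort(-scores)[:k]  # indices of the k highest-scoring samples

# Toy usage: random vectors stand in for real per-sample gradients.
rng = np.random.default_rng(0)
grads = rng.normal(size=(10_000, 64))   # one flattened gradient per sample
task_grad = rng.normal(size=64)         # aggregated end-task gradient
subset = select_influential_subset(grads, task_grad)
print(f"selected {len(subset)} of {len(grads)} samples")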
Pages: 555-568
Number of pages: 14
Related Papers
50 records in total
• [1] On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model. Shin, Seongjin; Lee, Sang-Woo; Ahn, Hwijeen; Kim, Sungdong; Kim, HyoungSeok; Kim, Boseop; Cho, Kyunghyun; Lee, Gichang; Park, Woomyoung; Ha, Jung-Woo; Sung, Nako. NAACL 2022: The 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022: 5168-5186.
• [2] FedID: Federated Interactive Distillation for Large-Scale Pretraining Language Models. Ma, Xinge; Liu, Jiangming; Wang, Jin; Zhang, Xuejie. 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 2023: 8566-8577.
• [3] Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning. Yang, Antoine; Nagrani, Arsha; Seo, Paul Hongsuck; Miech, Antoine; Pont-Tuset, Jordi; Laptev, Ivan; Sivic, Josef; Schmid, Cordelia. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 10714-10726.
• [4] 3D Vision and Language Pretraining with Large-Scale Synthetic Data. Yang, Dejie; Xu, Zhu; Mo, Wentao; Chen, Qingchao; Huang, Siyuan; Liu, Yang. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), 2024: 1552-1560.
• [5] Greedy column subset selection for large-scale data sets. Farahat, Ahmed K.; Elgohary, Ahmed; Ghodsi, Ali; Kamel, Mohamed S. Knowledge and Information Systems, 2015, 45(1): 1-34.
• [6] Distributed Pareto Optimization for Large-Scale Noisy Subset Selection. Qian, Chao. IEEE Transactions on Evolutionary Computation, 2020, 24(4): 694-707.
• [7] Submodular Subset Selection for Large-Scale Speech Training Data. Wei, Kai; Liu, Yuzong; Kirchhoff, Katrin; Bartels, Chris; Bilmes, Jeff. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
• [8] Antenna Subset Selection in MU Large-scale MIMO Systems. Ni, Yan; Zhang, Wence; Chen, Ming. 2013 International Conference on Wireless Communications and Signal Processing (WCSP 2013), 2013.
• [9] An Improved Local Search Method for Large-Scale Hypervolume Subset Selection. Nan, Yang; Shang, Ke; Ishibuchi, Hisao; He, Linjun. IEEE Transactions on Evolutionary Computation, 2023, 27(6): 1690-1704.