Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

Authors
Wang, Xiao [1]
Zhou, Weikang [1]
Zhang, Qi [1]
Zhou, Jie [1]
Gao, Songyang [1]
Wang, Junzhe [1]
Zhang, Menghan [2]
Gao, Xiang [3]
Chen, Yunwen [3]
Gui, Tao [2]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Fudan Univ, Inst Modern Languages & Linguist, Shanghai, Peoples R China
[3] DataGrand Informat Technol Shanghai Co Ltd, Shanghai, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger datasets, resulting in significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language models, which explicitly uses end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will have the most positive influence on end-task performance. Furthermore, we design a gradient-matching-based influence estimation method that drastically reduces the time needed to compute influence. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
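The abstract does not spell out how the gradient-matching estimate is computed, so the following is only a minimal illustrative sketch and not the authors' implementation. It assumes a toy linear model with squared loss and scores each pretraining sample by the dot product between its gradient and the mean end-task gradient. To first order, a positive score means that a gradient step on that sample would lower the end-task loss. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], float]  # (features, target)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def grad_squared_loss(w: Sequence[float], x: Sequence[float], y: float) -> List[float]:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (toy stand-in for an LM loss)."""
    err = dot(w, x) - y
    return [err * xi for xi in x]


def mean_gradient(w: Sequence[float], samples: Sequence[Sample]) -> List[float]:
    """Average per-sample gradient over a set of samples."""
    grads = [grad_squared_loss(w, x, y) for x, y in samples]
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(w))]


def influence_scores(
    w: Sequence[float], pretrain: Sequence[Sample], end_task: Sequence[Sample]
) -> List[float]:
    """Assumed scoring rule: gradient dot product with the mean end-task gradient."""
    target = mean_gradient(w, end_task)
    return [dot(grad_squared_loss(w, x, y), target) for x, y in pretrain]


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring pretraining samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical toy data: one end-task example and three candidate pretraining samples.
w0 = [0.0, 0.0]
end_task_data = [([1.0, 0.0], 1.0)]
pretrain_data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], -1.0)]

scores = influence_scores(w0, pretrain_data, end_task_data)
selected = select_top_k(scores, k=1)
```

In this toy setup the first candidate pulls the weights in the same direction as the end-task example, the second is orthogonal to it, and the third pulls the opposite way, so only the first is selected. A real language model would need per-sample gradients over a much larger parameter space, which this sketch does not attempt to address.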
Pages: 555-568
Page count: 14