Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

Cited by: 0
Authors
Wang, Xiao [1 ]
Zhou, Weikang [1 ]
Zhang, Qi [1 ]
Zhou, Jie [1 ]
Gao, Songyang [1 ]
Wang, Junzhe [1 ]
Zhang, Menghan [2 ]
Gao, Xiang [3 ]
Chen, Yunwen [3 ]
Gui, Tao [2 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Fudan Univ, Inst Modern Languages & Linguist, Shanghai, Peoples R China
[3] DataGrand Informat Technol Shanghai Co Ltd, Shanghai, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, resulting in significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language models, which explicitly uses end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples expected to exert the most positive influence on end-task performance. Furthermore, we design a gradient-matching-based influence estimation method that drastically reduces the time needed to compute influence. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
Pages: 555-568
Page count: 14
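The abstract describes the method only at a high level. Below is a minimal sketch of what gradient-matching influence estimation could look like, assuming the common first-order approximation in which a pretraining sample's influence is scored by the dot product between its gradient and the end-task gradient. All names here (`influence_scores`, `select_subset`, the user-supplied `loss_fn`) are hypothetical illustrations, not the authors' released code.

```python
# Minimal sketch of gradient-matching influence estimation for
# pretraining-subset selection. Hypothetical illustration only: the
# function names (`influence_scores`, `select_subset`) and the
# user-supplied `loss_fn(model, batch)` callable are invented here,
# and the first-order score g_sample . g_endtask is one common way
# to instantiate "influence via gradient matching", not necessarily
# the exact estimator used in the paper.
import torch


def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def influence_scores(model, loss_fn, pretrain_samples, endtask_batch):
    """Score each pretraining sample by how well its gradient aligns
    with the end-task gradient (higher = more positive estimated
    influence on end-task performance)."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Reference direction: gradient of the end-task loss.
    g_task = flat_grad(loss_fn(model, endtask_batch), params)
    scores = []
    for sample in pretrain_samples:
        g_i = flat_grad(loss_fn(model, sample), params)
        scores.append(torch.dot(g_i, g_task).item())
    return scores


def select_subset(scores, fraction=0.0045):
    """Keep the indices of the top `fraction` of samples; 0.45% matches
    the data budget reported in the abstract."""
    k = max(1, int(len(scores) * fraction))
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

Replacing the inverse-Hessian-vector products of classical influence functions with a plain gradient dot product is what makes estimates of this kind drastically cheaper, consistent with the abstract's claim of a large reduction in influence-computation time; in practice one would also typically restrict the gradients to a subset of parameters (e.g., the last layer) to keep the per-sample cost manageable.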