LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Cited: 0
Authors
Javaheripi, Mojan [1 ]
de Rosa, Gustavo H. [2 ]
Mukherjee, Subhabrata [2 ]
Shah, Shital [2 ]
Religa, Tomasz L. [3 ]
Mendes, Caio C. T. [2 ]
Bubeck, Sebastien [2 ]
Koushanfar, Farinaz [1 ]
Dey, Debadeepta [2 ]
Affiliations
[1] University of California San Diego, La Jolla, CA, USA
[2] Microsoft Research, Curitiba, Parana, Brazil
[3] Microsoft, Oxford, England
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of various hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5× and 2.5× faster runtime and 1.2× and 2.0× lower peak memory utilization, respectively. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy compared to the 350M-parameter OPT across 14 tasks, with up to 1.6× lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong simple baseline for future NAS methods in autoregressive language modeling.
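The abstract describes the core of the method: score candidate decoder architectures by their decoder parameter count (the training-free proxy for perplexity), measure hardware cost directly on the target device, and keep the Pareto-optimal set. The sketch below illustrates that idea only; it is not the authors' implementation, and the configuration fields, parameter-count formula, and the latency stand-in are illustrative assumptions.

```python
# Minimal sketch of a training-free, proxy-based Pareto search (illustrative, not LTS itself).
import itertools
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class DecoderConfig:
    n_layer: int   # number of decoder blocks
    d_model: int   # hidden size
    n_head: int    # attention heads
    d_inner: int   # feed-forward inner size

def decoder_params(cfg: DecoderConfig) -> int:
    """Rough decoder-only parameter count (embeddings excluded), used as the
    training-free proxy: more decoder parameters ~ lower perplexity."""
    attn = 4 * cfg.d_model * cfg.d_model   # Q, K, V, and output projections
    ffn = 2 * cfg.d_model * cfg.d_inner    # two feed-forward weight matrices
    return cfg.n_layer * (attn + ffn)

def measure_latency(cfg: DecoderConfig) -> float:
    """Placeholder for an on-target-device latency measurement; here a crude
    analytic stand-in proportional to parameter count plus noise."""
    return 1e-9 * decoder_params(cfg) * (1.0 + 0.05 * random.random())

def pareto_frontier(candidates):
    """Keep configs not dominated by another config with a proxy score at least
    as high and latency at least as low (strictly better in one objective)."""
    frontier = []
    for cfg, score, lat in candidates:
        dominated = any(
            s >= score and l <= lat and (s > score or l < lat)
            for _, s, l in candidates
        )
        if not dominated:
            frontier.append((cfg, score, lat))
    return frontier

if __name__ == "__main__":
    random.seed(0)
    search_space = [
        DecoderConfig(n_layer, d_model, n_head, d_inner)
        for n_layer, d_model, n_head, d_inner in itertools.product(
            [4, 8, 12, 16], [256, 512, 768], [4, 8], [1024, 2048, 3072]
        )
    ]
    sampled = random.sample(search_space, 32)
    scored = [(c, decoder_params(c), measure_latency(c)) for c in sampled]
    for cfg, score, lat in sorted(pareto_frontier(scored), key=lambda t: t[2]):
        print(f"{cfg}  proxy(params)={score:,}  latency~{lat * 1e3:.2f} ms")
```

In an actual deployment the proxy stays as written, but the latency stand-in would be replaced by timing real forward passes on the target hardware, which is what allows the search to run without GPUs or any training.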
Pages: 14
Related Papers (50 items)
  • [31] Effective, Efficient and Robust Neural Architecture Search
    Yue, Zhixiong
    Lin, Baijiong
    Zhang, Yu
    Liang, Christy
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [32] Revisiting Training-free NAS Metrics: An Efficient Training-based Method
    Yang, Taojiannan
    Yang, Linjie
    Jin, Xiaojie
    Chen, Chen
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 4740 - 4749
  • [33] Lightweight Multi-Objective and Many-Objective Problem Formulations for Evolutionary Neural Architecture Search with the Training-Free Performance Metric Synaptic Flow
    Vo A.
    Pham T.N.
    Nguyen V.B.
    Luong N.H.
    Informatica (Slovenia), 2023, 47 (03): 303 - 314
  • [34] SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
    Udandarao, Vishaal
    Gupta, Ankush
    Albanie, Samuel
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2725 - 2736
  • [35] Training-Free Diffusion Models for Content-Style Synthesis
    Xu, Ruipeng
    Shen, Fei
    Xie, Xu
    Li, Zongyi
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT X, ICIC 2024, 2024, 14871 : 308 - 319
  • [36] CryoSAM: Training-Free CryoET Tomogram Segmentation with Foundation Models
    Zhao, Yizhou
    Bian, Hengwei
    Mu, Michael
    Uddin, Mostofa R.
    Li, Zhenyang
    Li, Xiang
    Wang, Tianyang
    Xu, Min
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT VIII, 2024, 15008 : 124 - 134
  • [37] Neural Architecture Search without Training
    Mellor, Joseph
    Turner, Jack
    Storkey, Amos
    Crowley, Elliot J.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [38] Efficient Global Neural Architecture Search
    Shahid Siddiqui
    Christos Kyrkou
    Theocharis Theocharides
    SN Computer Science, 6 (3)
  • [39] Efficient Evolution for Neural Architecture Search
    Chen, Zihao
    Li, Bin
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [40] Training-free hyperparameter optimization of neural networks for electronic structures in matter
    Fiedler, Lenz
    Hoffmann, Nils
    Mohammed, Parvez
    Popoola, Gabriel A.
    Yovell, Tamar
    Oles, Vladyslav
    Ellis, J. Austin
    Rajamanickam, Sivasankaran
    Cangi, Attila
    MACHINE LEARNING-SCIENCE AND TECHNOLOGY, 2022, 3 (04):