LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Cited by: 0
Authors
Javaheripi, Mojan [1 ]
de Rosa, Gustavo H. [2 ]
Mukherjee, Subhabrata [2 ]
Shah, Shital [2 ]
Religa, Tomasz L. [3 ]
Mendes, Caio C. T. [2 ]
Bubeck, Sebastien [2 ]
Koushanfar, Farinaz [1 ]
Dey, Debadeepta [2 ]
Affiliations
[1] Univ Calif San Diego, La Jolla, CA USA
[2] Microsoft Res, Curitiba, Parana, Brazil
[3] Microsoft, Oxford, England
Keywords
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of various hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5x and 2.5x faster runtime and 1.2x and 2.0x lower peak memory utilization. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy compared to the 350M-parameter OPT across 14 tasks, with up to 1.6x lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong, simple baseline for future NAS methods in autoregressive language modeling.
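The search described in the abstract amounts to: score each candidate decoder configuration by its parameter count (the training-free perplexity proxy), measure its hardware cost on the target device, and keep the Pareto-optimal set. The sketch below illustrates that loop in Python under stated assumptions; the DecoderConfig fields, the simplified parameter-count formula, and the measure_latency stub are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of a training-free, proxy-based search over decoder
# configurations: higher decoder parameter count stands in for lower
# perplexity, and latency is the hardware cost. The config fields, the
# rough parameter formula, and the latency stub are assumptions made
# for illustration only.
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    n_layer: int   # number of decoder layers
    d_model: int   # hidden size
    n_head: int    # attention heads
    d_inner: int   # feed-forward inner size


def decoder_params(cfg: DecoderConfig) -> int:
    """Rough decoder parameter count (attention + FFN), ignoring embeddings."""
    attn = 4 * cfg.d_model * cfg.d_model   # Q, K, V, and output projections
    ffn = 2 * cfg.d_model * cfg.d_inner    # two feed-forward projections
    return cfg.n_layer * (attn + ffn)


def measure_latency(cfg: DecoderConfig) -> float:
    """Placeholder for an on-target-device latency measurement (seconds).
    In practice this would time a forward pass of the instantiated model."""
    return 1e-9 * decoder_params(cfg) + random.uniform(0.0, 1e-3)


def pareto_frontier(candidates):
    """Keep configurations not dominated in the two objectives:
    maximize the parameter-count proxy, minimize measured latency."""
    scored = [(decoder_params(c), measure_latency(c), c) for c in candidates]
    frontier = []
    for p, lat, c in scored:
        dominated = any(
            p2 >= p and lat2 <= lat and (p2 > p or lat2 < lat)
            for p2, lat2, _ in scored
        )
        if not dominated:
            frontier.append((p, lat, c))
    return sorted(frontier, key=lambda t: t[1])


if __name__ == "__main__":
    random.seed(0)
    space = [
        DecoderConfig(n_layer=l, d_model=d, n_head=d // 64, d_inner=4 * d)
        for l in (4, 8, 12, 16)
        for d in (256, 512, 768, 1024)
    ]
    for params, latency, cfg in pareto_frontier(space):
        print(f"{cfg}  params={params / 1e6:.1f}M  latency={latency * 1e3:.2f}ms")
```

Swapping the latency stub for a real on-device timing (and, optionally, a peak-memory probe as a third objective) would yield the kind of perplexity-versus-hardware-cost frontier the abstract reports, without any model training during the search.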
Pages: 14
Related papers
50 records in total
  • [21] Auto-GAS: Automated Proxy Discovery for Training-Free Generative Architecture Search
    Li, Lujun
    Sun, Haosen
    Li, Shiwen
    Dong, Peijie
    Luo, Wenhan
    Xue, Wei
    Liu, Qifeng
    Guo, Yike
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 38 - 55
  • [22] Training-Free Transformer Architecture Search With Zero-Cost Proxy Guided Evolution
    Zhou, Qinqin
    Sheng, Kekai
    Zheng, Xiawu
    Li, Ke
    Tian, Yonghong
    Chen, Jie
    Ji, Rongrong
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (10) : 6525 - 6541
  • [23] Multi-Objective Evolutionary Search of Compact Convolutional Neural Networks with Training-Free Estimation
    Huang, Junhao
    Xue, Bing
    Sun, Yanan
    Zhang, Mengjie
    PROCEEDINGS OF THE 2023 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE COMPANION, GECCO 2023 COMPANION, 2023, : 655 - 658
  • [24] Training-Free Neural Matte Extraction for Visual Effects
    Elcott, Sharif
    Lewis, J. P.
    Kanazawa, Nori
    Bregler, Christoph
    SIGGRAPH ASIA 2022 TECHNICAL COMMUNICATIONS PROCEEDINGS, SIGGRAPH 2022, 2022,
  • [25] Training-Free Compressed Sensing for Wireless Neural Recording
    Sun, Biao
    Ni, Yuming
    Zhao, Wenfeng
    PROCEEDINGS OF 2016 IEEE BIOMEDICAL CIRCUITS AND SYSTEMS CONFERENCE (BIOCAS), 2016, : 18 - 21
  • [26] Auto-DAS: Automated Proxy Discovery for Training-Free Distillation-Aware Architecture Search
    Sun, Haosen
    Li, Lujun
    Dong, Peijie
    Wei, Zimian
    Shao, Shitong
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 56 - 73
  • [27] Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery
    Wei, Zimian
    Dong, Peijie
    Hui, Zheng
    Li, Anggeng
    Li, Lujun
    Lu, Menglong
    Pan, Hengyue
    Li, Dongsheng
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 14, 2024, : 15814 - 15822
  • [28] ShuffleNASNets: Efficient CNN models through modified Efficient Neural Architecture Search
    Laube, Kevin A.
    Zell, Andreas
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [29] Developing Language-Specific Models Using a Neural Architecture Search
    Yoo, YongSuk
    Park, Kang-moon
    APPLIED SCIENCES-BASEL, 2021, 11 (21):
  • [30] EvoPrompting: Language Models for Code-Level Neural Architecture Search
    Chen, Angelica
    Dohan, David M.
    So, David R.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,