LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Cited: 0
Authors
Javaheripi, Mojan [1 ]
de Rosa, Gustavo H. [2 ]
Mukherjee, Subhabrata [2 ]
Shah, Shital [2 ]
Religa, Tomasz L. [3 ]
Mendes, Caio C. T. [2 ]
Bubeck, Sebastien [2 ]
Koushanfar, Farinaz [1 ]
Dey, Debadeepta [2 ]
Affiliations
[1] Univ Calif San Diego, La Jolla, CA USA
[2] Microsoft Res, Curitiba, Parana, Brazil
[3] Microsoft, Oxford, England
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of diverse hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices, from ARM CPUs to NVIDIA GPUs, and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5× and 2.5× faster runtime, respectively, and 1.2× and 2.0× lower peak memory utilization. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy than the 350M-parameter OPT across 14 tasks, with up to 1.6× lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong and simple baseline for future NAS methods in autoregressive language modeling.
Pages: 14
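The abstract describes the core of the LTS search: rank candidate decoder architectures by their total decoder (non-embedding) parameter count, which serves as a training-free proxy for perplexity, measure a hardware cost such as latency directly on the target device, and keep the candidates on the Pareto frontier of those two axes. The following is a minimal Python sketch of that idea, not the authors' implementation; the class name, the per-layer parameter formula, and the latency stand-in are all illustrative assumptions.

# Minimal sketch of a decoder-parameter-count proxy plus Pareto-frontier
# extraction, in the spirit of LTS. All names and the latency stand-in are
# illustrative assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, List
import random


@dataclass(frozen=True)
class DecoderConfig:
    """One candidate architecture from the search space."""
    n_layers: int
    d_model: int
    d_inner: int  # feed-forward inner dimension

    def decoder_params(self) -> int:
        # Rough non-embedding parameter count per decoder layer:
        # attention projections (~4 * d_model^2) plus feed-forward
        # (~2 * d_model * d_inner). Embeddings are deliberately excluded,
        # since the proxy uses decoder parameters only.
        per_layer = 4 * self.d_model ** 2 + 2 * self.d_model * self.d_inner
        return self.n_layers * per_layer


def pareto_frontier(
    candidates: List[DecoderConfig],
    measure_cost: Callable[[DecoderConfig], float],
) -> List[DecoderConfig]:
    """Keep candidates not dominated in (proxy quality, hardware cost).

    Higher decoder_params() is treated as better (proxy for lower
    perplexity); lower measured cost (e.g. on-device latency) is better.
    """
    scored = [(c.decoder_params(), measure_cost(c), c) for c in candidates]
    frontier = []
    for params, cost, cfg in scored:
        dominated = any(
            p2 >= params and c2 <= cost and (p2 > params or c2 < cost)
            for p2, c2, _ in scored
        )
        if not dominated:
            frontier.append(cfg)
    return frontier


if __name__ == "__main__":
    random.seed(0)

    # Stand-in for an on-target-device latency measurement; the real
    # search would time the untrained candidate model on the actual
    # target hardware instead.
    def fake_latency_ms(cfg: DecoderConfig) -> float:
        return cfg.decoder_params() / 1e6 * random.uniform(0.8, 1.2)

    search_space = [
        DecoderConfig(n_layers=l, d_model=d, d_inner=4 * d)
        for l in (4, 8, 12, 16)
        for d in (256, 512, 768)
    ]
    for cfg in pareto_frontier(search_space, fake_latency_ms):
        print(cfg, f"~{cfg.decoder_params() / 1e6:.1f}M decoder params")

The quadratic dominance check is adequate at this scale; the point of the proxy is that the quality axis needs no training, so the whole loop can run on the target device using only cheap measurements of the untrained candidates.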