LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Cited: 0
Authors
Javaheripi, Mojan [1 ]
de Rosa, Gustavo H. [2 ]
Mukherjee, Subhabrata [2 ]
Shah, Shital [2 ]
Religa, Tomasz L. [3 ]
Mendes, Caio C. T. [2 ]
Bubeck, Sebastien [2 ]
Koushanfar, Farinaz [1 ]
Dey, Debadeepta [2 ]
Affiliations
[1] Univ Calif San Diego, La Jolla, CA USA
[2] Microsoft Res, Curitiba, Parana, Brazil
[3] Microsoft, Oxford, England
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of diverse hardware platforms. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5×, 2.5× faster runtime and 1.2×, 2.0× lower peak memory utilization. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy than the 350M-parameter OPT across 14 tasks, with up to 1.6× lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong, simple baseline for future NAS methods in autoregressive language modeling.
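As a concrete illustration of the search procedure the abstract describes, the Python sketch below scores randomly sampled decoder configurations by their decoder parameter count (the training-free perplexity proxy) and by a hardware cost, then keeps the non-dominated points as the Pareto frontier. The search space, the rough parameter-count formula, and the latency stub are illustrative assumptions rather than the authors' exact search space or measurement code; in practice the latency function would time a forward pass on the target device.

import random


def decoder_params(n_layer, d_model, n_head, d_inner):
    """Rough decoder-only parameter count for a GPT-2-style block:
    attention projections (4 * d_model^2) plus the feed-forward layers
    (2 * d_model * d_inner); biases, layer norms, and embeddings are
    ignored for brevity. n_head does not change the count."""
    per_layer = 4 * d_model * d_model + 2 * d_model * d_inner
    return n_layer * per_layer


def measure_latency(cfg):
    """Placeholder for an on-target-device measurement. Replace the body
    with a timed forward pass on the deployment hardware; here latency is
    assumed to grow with model size purely for illustration."""
    return 1e-9 * decoder_params(**cfg)


def sample_config():
    # Illustrative search space; the paper searches richer configurations.
    return dict(
        n_layer=random.choice([2, 4, 8, 12, 16]),
        d_model=random.choice([256, 512, 768, 1024]),
        n_head=random.choice([4, 8, 12]),
        d_inner=random.choice([1024, 2048, 3072]),
    )


def pareto_frontier(points):
    """Keep points not dominated in (parameter count up, latency down):
    a higher decoder parameter count proxies a lower perplexity."""
    frontier = []
    for c in points:
        dominated = any(
            o["params"] >= c["params"]
            and o["latency"] <= c["latency"]
            and (o["params"], o["latency"]) != (c["params"], c["latency"])
            for o in points
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda p: p["latency"])


if __name__ == "__main__":
    pool = []
    for _ in range(200):
        cfg = sample_config()
        pool.append(dict(cfg=cfg,
                         params=decoder_params(**cfg),
                         latency=measure_latency(cfg)))
    for point in pareto_frontier(pool):
        print(point["cfg"],
              f"params={point['params']:,}",
              f"latency={point['latency']:.4f}s")

Swapping measure_latency for a peak-memory measurement would yield the memory Pareto frontier in the same way, mirroring how the abstract pairs the parameter-count proxy with arbitrary on-device hardware costs.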
Pages: 14
Related Papers
50 entries in total
  • [41] Improving Neural Architecture Search by Mixing a FireFly algorithm with a Training Free Evaluation
    Mokhtari, Nassim
    Nedelec, Alexis
    Gilles, Marlene
    De Loor, Pierre
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [42] Neural Architecture Search for Parameter-Efficient Fine-tuning of Large Pre-trained Language Models
    Lawton, Neal
    Kumar, Anoop
    Thattai, Govind
    Galstyan, Aram
    Ver Steeg, Greg
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 8506 - 8515
  • [43] Streamlined photoacoustic image processing with foundation models: A training-free solution
    Deng, Handi
    Zhou, Yucheng
    Xiang, Jiaxuan
    Gu, Liujie
    Luo, Yan
    Feng, Hai
    Liu, Mingyuan
    Ma, Cheng
    JOURNAL OF INNOVATIVE OPTICAL HEALTH SCIENCES, 2025, 18 (01)
  • [44] CartoonDiff: Training-free Cartoon Image Generation with Diffusion Transformer Models
    He, Feihong
    Li, Gang
    Si, Lingyu
    Yan, Leilei
    Hou, Shimeng
    Dong, Hongwei
    Li, Fanzhang
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 3825 - 3829
  • [45] Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
    Wang, Hongjie
    Liu, Difan
    Kang, Yan
    Li, Yijun
    Lin, Zhe
    Jha, Niraj K.
    Liu, Yuchen
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 16080 - 16089
  • [46] An Efficient and Training-Free Blind Image Blur Assessment in the Spatial Domain
    Bong, David B. L.
    Khoo, Bee Ee
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2014, E97D (07): 1864 - 1871
  • [47] Efficient Spiking Neural Architecture Search with Mixed Neuron Models and Variable Thresholds
    Xie, Zaipeng
    Liu, Ziang
    Chen, Peng
    Zhang, Jianan
    NEURAL INFORMATION PROCESSING, ICONIP 2023, PT II, 2024, 14448 : 466 - 481
  • [48] TRAMS: Training-free Memory Selection for Long-range Language Modeling
    Yu, Haofei
    Wang, Cunxiang
    Zhang, Yue
    Bi, Wei
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 4966 - 4972
  • [49] FRDiff: Feature Reuse for Universal Training-Free Acceleration of Diffusion Models
    So, Junhyuk
    Lee, Jungwon
    Park, Eunhyeok
    COMPUTER VISION - ECCV 2024, PT LXXIII, 2025, 15131 : 328 - 344
  • [50] The effect of reduced training in neural architecture search
    Kyriakides, George
    Margaritis, Konstantinos
    NEURAL COMPUTING AND APPLICATIONS, 2020, 32 : 17321 - 17332