LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Cited: 0
Authors
Javaheripi, Mojan [1 ]
de Rosa, Gustavo H. [2 ]
Mukherjee, Subhabrata [2 ]
Shah, Shital [2 ]
Religa, Tomasz L. [3 ]
Mendes, Caio C. T. [2 ]
Bubeck, Sebastien [2 ]
Koushanfar, Farinaz [1 ]
Dey, Debadeepta [2 ]
Affiliations
[1] Univ Calif San Diego, La Jolla, CA USA
[2] Microsoft Res, Curitiba, Parana, Brazil
[3] Microsoft, Oxford, England
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of diverse hardware platforms. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5×, 2.5× faster runtime and 1.2×, 2.0× lower peak memory utilization. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy than the 350M-parameter OPT across 14 tasks, with up to 1.6× lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong, simple baseline for future NAS methods in autoregressive language modeling.
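As a concrete illustration of the search procedure the abstract describes, the Python sketch below scores randomly sampled decoder configurations by their decoder parameter count (the training-free perplexity proxy) and by a hardware cost, then keeps the non-dominated points as the Pareto frontier. The search space, the rough parameter-count formula, and the latency stub are illustrative assumptions rather than the authors' exact search space or measurement code; in practice the latency function would time a forward pass on the target device.

import random


def decoder_params(n_layer, d_model, n_head, d_inner):
    """Rough decoder-only parameter count for a GPT-2-style block:
    attention projections (4 * d_model^2) plus the feed-forward layers
    (2 * d_model * d_inner); biases, layer norms, and embeddings are
    ignored for brevity. n_head does not change the count."""
    per_layer = 4 * d_model * d_model + 2 * d_model * d_inner
    return n_layer * per_layer


def measure_latency(cfg):
    """Placeholder for an on-target-device measurement. Replace the body
    with a timed forward pass on the deployment hardware; here latency is
    assumed to grow with model size purely for illustration."""
    return 1e-9 * decoder_params(**cfg)


def sample_config():
    # Illustrative search space; the paper searches richer configurations.
    return dict(
        n_layer=random.choice([2, 4, 8, 12, 16]),
        d_model=random.choice([256, 512, 768, 1024]),
        n_head=random.choice([4, 8, 12]),
        d_inner=random.choice([1024, 2048, 3072]),
    )


def pareto_frontier(points):
    """Keep points not dominated in (parameter count up, latency down):
    a higher decoder parameter count proxies a lower perplexity."""
    frontier = []
    for c in points:
        dominated = any(
            o["params"] >= c["params"]
            and o["latency"] <= c["latency"]
            and (o["params"], o["latency"]) != (c["params"], c["latency"])
            for o in points
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda p: p["latency"])


if __name__ == "__main__":
    pool = []
    for _ in range(200):
        cfg = sample_config()
        pool.append(dict(cfg=cfg,
                         params=decoder_params(**cfg),
                         latency=measure_latency(cfg)))
    for point in pareto_frontier(pool):
        print(point["cfg"],
              f"params={point['params']:,}",
              f"latency={point['latency']:.4f}s")

Swapping measure_latency for a peak-memory measurement would yield the memory Pareto frontier in the same way, mirroring how the abstract pairs the parameter-count proxy with arbitrary on-device hardware costs.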
Pages: 14
Related Papers
50 entries in total
  • [41] Improving Neural Architecture Search by Mixing a FireFly algorithm with a Training Free Evaluation
    Mokhtari, Nassim
    Nedelec, Alexis
    Gilles, Marlene
    De Loor, Pierre
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [42] Neural Architecture Search for Parameter-Efficient Fine-tuning of Large Pre-trained Language Models
    Lawton, Neal
    Kumar, Anoop
    Thattai, Govind
    Galstyan, Aram
    Ver Steeg, Greg
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 8506 - 8515
  • [43] Streamlined photoacoustic image processing with foundation models: A training-free solution
    Deng, Handi
    Zhou, Yucheng
    Xiang, Jiaxuan
    Gu, Liujie
    Luo, Yan
    Feng, Hai
    Liu, Mingyuan
    Ma, Cheng
    JOURNAL OF INNOVATIVE OPTICAL HEALTH SCIENCES, 2025, 18 (01)
  • [44] CartoonDiff: Training-free Cartoon Image Generation with Diffusion Transformer Models
    He, Feihong
    Li, Gang
    Si, Lingyu
    Yan, Leilei
    Hou, Shimeng
    Dong, Hongwei
    Li, Fanzhang
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 3825 - 3829
  • [45] Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
    Wang, Hongjie
    Liu, Difan
    Kang, Yan
    Li, Yijun
    Lin, Zhe
    Jha, Niraj K.
    Liu, Yuchen
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 16080 - 16089
  • [46] An Efficient and Training-Free Blind Image Blur Assessment in the Spatial Domain
    Bong, David B. L.
    Khoo, Bee Ee
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2014, E97D (07): 1864 - 1871
  • [47] Efficient Spiking Neural Architecture Search with Mixed Neuron Models and Variable Thresholds
    Xie, Zaipeng
    Liu, Ziang
    Chen, Peng
    Zhang, Jianan
    NEURAL INFORMATION PROCESSING, ICONIP 2023, PT II, 2024, 14448 : 466 - 481
  • [48] TRAMS: Training-free Memory Selection for Long-range Language Modeling
    Yu, Haofei
    Wang, Cunxiang
    Zhang, Yue
    Bi, Wei
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 4966 - 4972
  • [49] FRDiff: Feature Reuse for Universal Training-Free Acceleration of Diffusion Models
    So, Junhyuk
    Lee, Jungwon
    Park, Eunhyeok
    COMPUTER VISION - ECCV 2024, PT LXXIII, 2025, 15131 : 328 - 344
  • [50] The effect of reduced training in neural architecture search
    Kyriakides, George
    Margaritis, Konstantinos
    NEURAL COMPUTING AND APPLICATIONS, 2020, 32 : 17321 - 17332