LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Cited: 0
Authors
Javaheripi, Mojan [1 ]
de Rosa, Gustavo H. [2 ]
Mukherjee, Subhabrata [2 ]
Shah, Shital [2 ]
Religa, Tomasz L. [3 ]
Mendes, Caio C. T. [2 ]
Bubeck, Sebastien [2 ]
Koushanfar, Farinaz [1 ]
Dey, Debadeepta [2 ]
Affiliations
[1] Univ Calif San Diego, La Jolla, CA USA
[2] Microsoft Res, Curitiba, Parana, Brazil
[3] Microsoft, Oxford, England
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of diverse hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices, from ARM CPUs to NVIDIA GPUs, and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5× and 2.5× faster runtime, respectively, and 1.2× and 2.0× lower peak memory utilization. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy than the 350M-parameter OPT across 14 tasks, with up to 1.6× lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong and simple baseline for future NAS methods in autoregressive language modeling.
Pages: 14
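The abstract describes the core of the LTS search: rank candidate decoder architectures by their total decoder (non-embedding) parameter count, which serves as a training-free proxy for perplexity, measure a hardware cost such as latency directly on the target device, and keep the candidates on the Pareto frontier of those two axes. The following is a minimal Python sketch of that idea, not the authors' implementation; the class name, the per-layer parameter formula, and the latency stand-in are all illustrative assumptions.

# Minimal sketch of a decoder-parameter-count proxy plus Pareto-frontier
# extraction, in the spirit of LTS. All names and the latency stand-in are
# illustrative assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, List
import random


@dataclass(frozen=True)
class DecoderConfig:
    """One candidate architecture from the search space."""
    n_layers: int
    d_model: int
    d_inner: int  # feed-forward inner dimension

    def decoder_params(self) -> int:
        # Rough non-embedding parameter count per decoder layer:
        # attention projections (~4 * d_model^2) plus feed-forward
        # (~2 * d_model * d_inner). Embeddings are deliberately excluded,
        # since the proxy uses decoder parameters only.
        per_layer = 4 * self.d_model ** 2 + 2 * self.d_model * self.d_inner
        return self.n_layers * per_layer


def pareto_frontier(
    candidates: List[DecoderConfig],
    measure_cost: Callable[[DecoderConfig], float],
) -> List[DecoderConfig]:
    """Keep candidates not dominated in (proxy quality, hardware cost).

    Higher decoder_params() is treated as better (proxy for lower
    perplexity); lower measured cost (e.g. on-device latency) is better.
    """
    scored = [(c.decoder_params(), measure_cost(c), c) for c in candidates]
    frontier = []
    for params, cost, cfg in scored:
        dominated = any(
            p2 >= params and c2 <= cost and (p2 > params or c2 < cost)
            for p2, c2, _ in scored
        )
        if not dominated:
            frontier.append(cfg)
    return frontier


if __name__ == "__main__":
    random.seed(0)

    # Stand-in for an on-target-device latency measurement; the real
    # search would time the untrained candidate model on the actual
    # target hardware instead.
    def fake_latency_ms(cfg: DecoderConfig) -> float:
        return cfg.decoder_params() / 1e6 * random.uniform(0.8, 1.2)

    search_space = [
        DecoderConfig(n_layers=l, d_model=d, d_inner=4 * d)
        for l in (4, 8, 12, 16)
        for d in (256, 512, 768)
    ]
    for cfg in pareto_frontier(search_space, fake_latency_ms):
        print(cfg, f"~{cfg.decoder_params() / 1e6:.1f}M decoder params")

The quadratic dominance check is adequate at this scale; the point of the proxy is that the quality axis needs no training, so the whole loop can run on the target device using only cheap measurements of the untrained candidates.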