LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Cited by: 0
Authors
Javaheripi, Mojan [1 ]
de Rosa, Gustavo H. [2 ]
Mukherjee, Subhabrata [2 ]
Shah, Shital [2 ]
Religa, Tomasz L. [3 ]
Mendes, Caio C. T. [2 ]
Bubeck, Sebastien [2 ]
Koushanfar, Farinaz [1 ]
Dey, Debadeepta [2 ]
Affiliations
[1] Univ Calif San Diego, La Jolla, CA USA
[2] Microsoft Res, Curitiba, Parana, Brazil
[3] Microsoft, Oxford, England
Keywords
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of various hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5x and 2.5x faster runtime and 1.2x and 2.0x lower peak memory utilization. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy compared to the 350M-parameter OPT across 14 tasks, with up to 1.6x lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong, simple baseline for future NAS methods in autoregressive language modeling.
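The search described in the abstract amounts to: score each candidate decoder configuration by its parameter count (the training-free perplexity proxy), measure its hardware cost on the target device, and keep the Pareto-optimal set. The sketch below illustrates that loop in Python under stated assumptions; the DecoderConfig fields, the simplified parameter-count formula, and the measure_latency stub are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of a training-free, proxy-based search over decoder
# configurations: higher decoder parameter count stands in for lower
# perplexity, and latency is the hardware cost. The config fields, the
# rough parameter formula, and the latency stub are assumptions made
# for illustration only.
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    n_layer: int   # number of decoder layers
    d_model: int   # hidden size
    n_head: int    # attention heads
    d_inner: int   # feed-forward inner size


def decoder_params(cfg: DecoderConfig) -> int:
    """Rough decoder parameter count (attention + FFN), ignoring embeddings."""
    attn = 4 * cfg.d_model * cfg.d_model   # Q, K, V, and output projections
    ffn = 2 * cfg.d_model * cfg.d_inner    # two feed-forward projections
    return cfg.n_layer * (attn + ffn)


def measure_latency(cfg: DecoderConfig) -> float:
    """Placeholder for an on-target-device latency measurement (seconds).
    In practice this would time a forward pass of the instantiated model."""
    return 1e-9 * decoder_params(cfg) + random.uniform(0.0, 1e-3)


def pareto_frontier(candidates):
    """Keep configurations not dominated in the two objectives:
    maximize the parameter-count proxy, minimize measured latency."""
    scored = [(decoder_params(c), measure_latency(c), c) for c in candidates]
    frontier = []
    for p, lat, c in scored:
        dominated = any(
            p2 >= p and lat2 <= lat and (p2 > p or lat2 < lat)
            for p2, lat2, _ in scored
        )
        if not dominated:
            frontier.append((p, lat, c))
    return sorted(frontier, key=lambda t: t[1])


if __name__ == "__main__":
    random.seed(0)
    space = [
        DecoderConfig(n_layer=l, d_model=d, n_head=d // 64, d_inner=4 * d)
        for l in (4, 8, 12, 16)
        for d in (256, 512, 768, 1024)
    ]
    for params, latency, cfg in pareto_frontier(space):
        print(f"{cfg}  params={params / 1e6:.1f}M  latency={latency * 1e3:.2f}ms")
```

Swapping the latency stub for a real on-device timing (and, optionally, a peak-memory probe as a third objective) would yield the kind of perplexity-versus-hardware-cost frontier the abstract reports, without any model training during the search.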
Pages: 14
Related papers
50 records in total
  • [21] Auto-GAS: Automated Proxy Discovery for Training-Free Generative Architecture Search
    Li, Lujun
    Sun, Haosen
    Li, Shiwen
    Dong, Peijie
    Luo, Wenhan
    Xue, Wei
    Liu, Qifeng
    Guo, Yike
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 38 - 55
  • [22] Training-Free Transformer Architecture Search With Zero-Cost Proxy Guided Evolution
    Zhou, Qinqin
    Sheng, Kekai
    Zheng, Xiawu
    Li, Ke
    Tian, Yonghong
    Chen, Jie
    Ji, Rongrong
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (10) : 6525 - 6541
  • [23] Multi-Objective Evolutionary Search of Compact Convolutional Neural Networks with Training-Free Estimation
    Huang, Junhao
    Xue, Bing
    Sun, Yanan
    Zhang, Mengjie
    PROCEEDINGS OF THE 2023 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE COMPANION, GECCO 2023 COMPANION, 2023, : 655 - 658
  • [24] Training-Free Neural Matte Extraction for Visual Effects
    Elcott, Sharif
    Lewis, J. P.
    Kanazawa, Nori
    Bregler, Christoph
    SIGGRAPH ASIA 2022 TECHNICAL COMMUNICATIONS PROCEEDINGS, SIGGRAPH 2022, 2022,
  • [25] Training-Free Compressed Sensing for Wireless Neural Recording
    Sun, Biao
    Ni, Yuming
    Zhao, Wenfeng
    PROCEEDINGS OF 2016 IEEE BIOMEDICAL CIRCUITS AND SYSTEMS CONFERENCE (BIOCAS), 2016, : 18 - 21
  • [26] Auto-DAS: Automated Proxy Discovery for Training-Free Distillation-Aware Architecture Search
    Sun, Haosen
    Li, Lujun
    Dong, Peijie
    Wei, Zimian
    Shao, Shitong
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 56 - 73
  • [27] Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery
    Wei, Zimian
    Dong, Peijie
    Hui, Zheng
    Li, Anggeng
    Li, Lujun
    Lu, Menglong
    Pan, Hengyue
    Li, Dongsheng
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 14, 2024, : 15814 - 15822
  • [28] ShuffleNASNets: Efficient CNN models through modified Efficient Neural Architecture Search
    Laube, Kevin A.
    Zell, Andreas
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [29] Developing Language-Specific Models Using a Neural Architecture Search
    Yoo, YongSuk
    Park, Kang-moon
    APPLIED SCIENCES-BASEL, 2021, 11 (21):
  • [30] EvoPrompting: Language Models for Code-Level Neural Architecture Search
    Chen, Angelica
    Dohan, David M.
    So, David R.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,