LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Cited: 0
Authors
Javaheripi, Mojan [1 ]
de Rosa, Gustavo H. [2 ]
Mukherjee, Subhabrata [2 ]
Shah, Shital [2 ]
Religa, Tomasz L. [3 ]
Mendes, Caio C. T. [2 ]
Bubeck, Sebastien [2 ]
Koushanfar, Farinaz [1 ]
Dey, Debadeepta [2 ]
Affiliations
[1] University of California San Diego, La Jolla, CA, USA
[2] Microsoft Research, Curitiba, Parana, Brazil
[3] Microsoft, Oxford, England
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of various hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5× and 2.5× faster runtime and 1.2× and 2.0× lower peak memory utilization, respectively. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy compared to the 350M-parameter OPT across 14 tasks, with up to 1.6× lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong simple baseline for future NAS methods in autoregressive language modeling.
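The abstract describes the core of the method: score candidate decoder architectures by their decoder parameter count (the training-free proxy for perplexity), measure hardware cost directly on the target device, and keep the Pareto-optimal set. The sketch below illustrates that idea only; it is not the authors' implementation, and the configuration fields, parameter-count formula, and the latency stand-in are illustrative assumptions.

```python
# Minimal sketch of a training-free, proxy-based Pareto search (illustrative, not LTS itself).
import itertools
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class DecoderConfig:
    n_layer: int   # number of decoder blocks
    d_model: int   # hidden size
    n_head: int    # attention heads
    d_inner: int   # feed-forward inner size

def decoder_params(cfg: DecoderConfig) -> int:
    """Rough decoder-only parameter count (embeddings excluded), used as the
    training-free proxy: more decoder parameters ~ lower perplexity."""
    attn = 4 * cfg.d_model * cfg.d_model   # Q, K, V, and output projections
    ffn = 2 * cfg.d_model * cfg.d_inner    # two feed-forward weight matrices
    return cfg.n_layer * (attn + ffn)

def measure_latency(cfg: DecoderConfig) -> float:
    """Placeholder for an on-target-device latency measurement; here a crude
    analytic stand-in proportional to parameter count plus noise."""
    return 1e-9 * decoder_params(cfg) * (1.0 + 0.05 * random.random())

def pareto_frontier(candidates):
    """Keep configs not dominated by another config with a proxy score at least
    as high and latency at least as low (strictly better in one objective)."""
    frontier = []
    for cfg, score, lat in candidates:
        dominated = any(
            s >= score and l <= lat and (s > score or l < lat)
            for _, s, l in candidates
        )
        if not dominated:
            frontier.append((cfg, score, lat))
    return frontier

if __name__ == "__main__":
    random.seed(0)
    search_space = [
        DecoderConfig(n_layer, d_model, n_head, d_inner)
        for n_layer, d_model, n_head, d_inner in itertools.product(
            [4, 8, 12, 16], [256, 512, 768], [4, 8], [1024, 2048, 3072]
        )
    ]
    sampled = random.sample(search_space, 32)
    scored = [(c, decoder_params(c), measure_latency(c)) for c in sampled]
    for cfg, score, lat in sorted(pareto_frontier(scored), key=lambda t: t[2]):
        print(f"{cfg}  proxy(params)={score:,}  latency~{lat * 1e3:.2f} ms")
```

In an actual deployment the proxy stays as written, but the latency stand-in would be replaced by timing real forward passes on the target hardware, which is what allows the search to run without GPUs or any training.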
Pages: 14
Related Papers (50 items)
  • [31] Effective, Efficient and Robust Neural Architecture Search
    Yue, Zhixiong
    Lin, Baijiong
    Zhang, Yu
    Liang, Christy
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [32] Revisiting Training-free NAS Metrics: An Efficient Training-based Method
    Yang, Taojiannan
    Yang, Linjie
    Jin, Xiaojie
    Chen, Chen
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 4740 - 4749
  • [33] Lightweight Multi-Objective and Many-Objective Problem Formulations for Evolutionary Neural Architecture Search with the Training-Free Performance Metric Synaptic Flow
    Vo A.
    Pham T.N.
    Nguyen V.B.
    Luong N.H.
    Informatica (Slovenia), 2023, 47 (03): 303 - 314
  • [34] SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
    Udandarao, Vishaal
    Gupta, Ankush
    Albanie, Samuel
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2725 - 2736
  • [35] Training-Free Diffusion Models for Content-Style Synthesis
    Xu, Ruipeng
    Shen, Fei
    Xie, Xu
    Li, Zongyi
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT X, ICIC 2024, 2024, 14871 : 308 - 319
  • [36] CryoSAM: Training-Free CryoET Tomogram Segmentation with Foundation Models
    Zhao, Yizhou
    Bian, Hengwei
    Mu, Michael
    Uddin, Mostofa R.
    Li, Zhenyang
    Li, Xiang
    Wang, Tianyang
    Xu, Min
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT VIII, 2024, 15008 : 124 - 134
  • [37] Neural Architecture Search without Training
    Mellor, Joseph
    Turner, Jack
    Storkey, Amos
    Crowley, Elliot J.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [38] Efficient Global Neural Architecture Search
    Shahid Siddiqui
    Christos Kyrkou
    Theocharis Theocharides
    SN Computer Science, 6 (3)
  • [39] Efficient Evolution for Neural Architecture Search
    Chen, Zihao
    Li, Bin
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [40] Training-free hyperparameter optimization of neural networks for electronic structures in matter
    Fiedler, Lenz
    Hoffmann, Nils
    Mohammed, Parvez
    Popoola, Gabriel A.
    Yovell, Tamar
    Oles, Vladyslav
    Ellis, J. Austin
    Rajamanickam, Sivasankaran
    Cangi, Attila
    MACHINE LEARNING-SCIENCE AND TECHNOLOGY, 2022, 3 (04):