Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models

Cited: 0
Authors
Narayanan, Deepak [1 ,3 ]
Santhanam, Keshav [2 ]
Henderson, Peter [2 ]
Bommasani, Rishi [2 ]
Lee, Tony [2 ]
Liang, Percy [2 ]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Stanford Univ, Stanford, CA 94305 USA
[3] Microsoft Res, Redmond, WA 98052 USA
Keywords
DOI
None available
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Large language models (LLMs) are highly capable but also computationally expensive. Characterizing the fundamental tradeoff between inference efficiency and model capabilities is thus important, but it requires an efficiency metric that is comparable across models from different providers. Unfortunately, raw runtimes measured through black-box APIs do not satisfy this property: model providers can implement software and hardware optimizations orthogonal to the model, and shared infrastructure introduces performance contention. We propose a new metric for inference efficiency, called idealized runtime, which puts models on an equal footing as though they were served on uniform hardware and software without performance contention, along with a cost model to efficiently estimate this metric for autoregressive Transformer models. We also propose variants of the idealized runtime that incorporate the number and type of accelerators needed to serve the model. Using these metrics, we compare ten LLMs developed in 2022 to provide the first analysis of inference efficiency-capability tradeoffs; among other observations, we find that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model. Our code is open sourced at https://github.com/stanford-crfm/helm-efficiency.
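The paper's full cost model is not reproduced in this record; as a rough, hedged illustration of the idea behind an idealized runtime, the following minimal Python sketch applies a standard roofline-style approximation (about 2 FLOPs per parameter per generated token, with the weights read from memory once per decoded token). The function name and all hardware constants are assumptions, loosely based on a single NVIDIA A100-class accelerator in FP16, and the sketch deliberately ignores attention FLOPs, KV-cache traffic, batching, and multi-accelerator parallelism, all of which a real cost model like the paper's would need to handle:

# A minimal sketch (not the paper's actual cost model): a roofline-style
# estimate of idealized generation time for a dense decoder-only Transformer.
# Assumed hardware constants are loosely based on one NVIDIA A100 (FP16).

def idealized_runtime_seconds(
    num_params: float,             # model size in parameters, e.g. 175e9
    prompt_tokens: int,            # tokens processed in the prefill phase
    output_tokens: int,            # tokens generated autoregressively
    peak_flops: float = 312e12,    # assumed peak FP16 throughput (FLOP/s)
    mem_bandwidth: float = 1.6e12, # assumed memory bandwidth (bytes/s)
    bytes_per_param: float = 2.0,  # FP16 weights
) -> float:
    # Prefill: ~2 FLOPs per parameter per prompt token, compute-bound.
    prefill_time = (2.0 * num_params * prompt_tokens) / peak_flops
    # Decode: each new token needs ~2 FLOPs per parameter of compute and one
    # full read of the weights from memory; the slower of the two dominates.
    per_token_compute = 2.0 * num_params / peak_flops
    per_token_memory = bytes_per_param * num_params / mem_bandwidth
    decode_time = output_tokens * max(per_token_compute, per_token_memory)
    return prefill_time + decode_time

# Hypothetical example: a 175B-parameter model, 512-token prompt, 64 outputs.
print(f"{idealized_runtime_seconds(175e9, 512, 64):.1f} s")  # ~14.6 s

Under these assumptions, decoding is memory-bandwidth-bound (roughly 0.22 s per token of weight traffic versus about 1 ms of compute), which is why per-token latency in such models tends to scale with model size and memory bandwidth rather than with raw FLOPs.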
Pages: 21