Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models

Cited: 0
Authors
Narayanan, Deepak [1 ,3 ]
Santhanam, Keshav [2 ]
Henderson, Peter [2 ]
Bommasani, Rishi [2 ]
Lee, Tony [2 ]
Liang, Percy [2 ]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Stanford Univ, Stanford, CA 94305 USA
[3] Microsoft Res, Redmond, WA 98052 USA
Keywords
DOI
None available
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Large language models (LLMs) are highly capable but also computationally expensive. Characterizing the fundamental tradeoff between inference efficiency and model capabilities is thus important, but it requires an efficiency metric that is comparable across models from different providers. Unfortunately, raw runtimes measured through black-box APIs do not satisfy this property: model providers can implement software and hardware optimizations orthogonal to the model, and shared infrastructure introduces performance contention. We propose a new metric for inference efficiency, called idealized runtime, which puts models on an equal footing as though they were served on uniform hardware and software without performance contention, along with a cost model to efficiently estimate this metric for autoregressive Transformer models. We also propose variants of the idealized runtime that incorporate the number and type of accelerators needed to serve the model. Using these metrics, we compare ten LLMs developed in 2022 to provide the first analysis of inference efficiency-capability tradeoffs; among other observations, we find that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model. Our code is open sourced at https://github.com/stanford-crfm/helm-efficiency.
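The paper's full cost model is not reproduced in this record; as a rough, hedged illustration of the idea behind an idealized runtime, the following minimal Python sketch applies a standard roofline-style approximation (about 2 FLOPs per parameter per generated token, with the weights read from memory once per decoded token). The function name and all hardware constants are assumptions, loosely based on a single NVIDIA A100-class accelerator in FP16, and the sketch deliberately ignores attention FLOPs, KV-cache traffic, batching, and multi-accelerator parallelism, all of which a real cost model like the paper's would need to handle:

# A minimal sketch (not the paper's actual cost model): a roofline-style
# estimate of idealized generation time for a dense decoder-only Transformer.
# Assumed hardware constants are loosely based on one NVIDIA A100 (FP16).

def idealized_runtime_seconds(
    num_params: float,             # model size in parameters, e.g. 175e9
    prompt_tokens: int,            # tokens processed in the prefill phase
    output_tokens: int,            # tokens generated autoregressively
    peak_flops: float = 312e12,    # assumed peak FP16 throughput (FLOP/s)
    mem_bandwidth: float = 1.6e12, # assumed memory bandwidth (bytes/s)
    bytes_per_param: float = 2.0,  # FP16 weights
) -> float:
    # Prefill: ~2 FLOPs per parameter per prompt token, compute-bound.
    prefill_time = (2.0 * num_params * prompt_tokens) / peak_flops
    # Decode: each new token needs ~2 FLOPs per parameter of compute and one
    # full read of the weights from memory; the slower of the two dominates.
    per_token_compute = 2.0 * num_params / peak_flops
    per_token_memory = bytes_per_param * num_params / mem_bandwidth
    decode_time = output_tokens * max(per_token_compute, per_token_memory)
    return prefill_time + decode_time

# Hypothetical example: a 175B-parameter model, 512-token prompt, 64 outputs.
print(f"{idealized_runtime_seconds(175e9, 512, 64):.1f} s")  # ~14.6 s

Under these assumptions, decoding is memory-bandwidth-bound (roughly 0.22 s per token of weight traffic versus about 1 ms of compute), which is why per-token latency in such models tends to scale with model size and memory bandwidth rather than with raw FLOPs.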
Pages: 21