Minimax Off-Policy Evaluation for Multi-Armed Bandits

Cited by: 3
Authors
Ma, Cong [1]
Zhu, Banghua [2]
Jiao, Jiantao [2, 3]
Wainwright, Martin J. [2, 3]
Affiliations
[1] Univ Chicago, Dept Stat, Chicago, IL 60637 USA
[2] Univ Calif Berkeley UC Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley UC Berkeley, Dept Stat, Berkeley, CA 94720 USA
Keywords
Switches; Probability; Monte Carlo methods; Chebyshev approximation; Measurement; Computational modeling; Sociology; Off-policy evaluation; multi-armed bandits; minimax optimality; importance sampling; POLYNOMIALS;
DOI
10.1109/TIT.2022.3162335
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must have mean-squared error larger, relative to the oracle estimator equipped with knowledge of the behavior policy, by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial knowledge setting, in which the minimum probability taken by the behavior policy is assumed to be known. We show that the plug-in estimator is optimal for relatively large values of the minimum probability, but is sub-optimal when the minimum probability is low. To remedy this gap, we propose a new estimator based on approximation by Chebyshev polynomials that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
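The three estimators named in the abstract for the known-behavior-policy setting can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the threshold parameter `tau`, and the rule "use plug-in on arms whose likelihood ratio exceeds `tau`, importance sampling elsewhere" are assumptions made for illustration; the abstract says only that the Switch estimator alternates between the two. The sketch also assumes the behavior policy puts positive probability on every arm.

```python
import numpy as np


def empirical_means(actions, rewards, k):
    """Per-arm average reward; arms never pulled get mean 0 (an assumption)."""
    counts = np.bincount(actions, minlength=k)
    sums = np.bincount(actions, weights=rewards, minlength=k)
    return np.divide(sums, counts, out=np.zeros(k), where=counts > 0)


def plug_in(actions, rewards, target):
    """Plug-in estimate: target-weighted sum of empirical arm means."""
    means = empirical_means(actions, rewards, len(target))
    return float(target @ means)


def importance_sampling(actions, rewards, target, behavior):
    """Importance sampling estimate: average of reweighted rewards."""
    ratio = target / behavior  # assumes behavior > 0 on every arm
    return float(np.mean(ratio[actions] * rewards))


def switch(actions, rewards, target, behavior, tau):
    """Hypothetical Switch rule: plug-in on arms with ratio > tau, IS elsewhere."""
    ratio = target / behavior  # assumes behavior > 0 on every arm
    heavy = ratio > tau  # arms where importance weights would be large
    means = empirical_means(actions, rewards, len(target))
    plug_part = float(np.sum(target[heavy] * means[heavy]))
    is_part = float(np.mean(ratio[actions] * rewards * ~heavy[actions]))
    return plug_part + is_part


# Toy logged data: arm indices pulled under the behavior policy, and rewards in [0, 1].
actions = np.array([0, 0, 0, 1])
rewards = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([0.2, 0.8])
behavior = np.array([0.5, 0.5])

print(plug_in(actions, rewards, target))
print(importance_sampling(actions, rewards, target, behavior))
print(switch(actions, rewards, target, behavior, tau=1.0))
```

Under this assumed rule, setting `tau` to infinity recovers the importance sampling estimate, and setting it below every likelihood ratio recovers the plug-in estimate.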
Pages: 5314 - 5339
Page count: 26
Related papers
50 in total
  • [31] Decentralized Exploration in Multi-Armed Bandits
    Feraud, Raphael
    Alami, Reda
    Laroche, Romain
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [32] Multi-armed bandits with episode context
    Rosin, Christopher D.
    ANNALS OF MATHEMATICS AND ARTIFICIAL INTELLIGENCE, 2011, 61 (03) : 203 - 230
  • [33] Introduction to Multi-Armed Bandits Preface
    Slivkins, Aleksandrs
    FOUNDATIONS AND TRENDS IN MACHINE LEARNING, 2019, 12 (1-2) : 1 - 286
  • [34] Federated Multi-armed Bandits with Personalization
    Shi, Chengshuai
    Shen, Cong
    Yang, Jing
    24TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS (AISTATS), 2021, 130
  • [35] Off-policy Bandits with Deficient Support
    Sachdeva, Noveen
    Su, Yi
    Joachims, Thorsten
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020 : 965 - 975
  • [36] ON THE IDENTIFICATION AND MITIGATION OF WEAKNESSES IN THE KNOWLEDGE GRADIENT POLICY FOR MULTI-ARMED BANDITS
    Edwards, James
    Fearnhead, Paul
    Glazebrook, Kevin
    PROBABILITY IN THE ENGINEERING AND INFORMATIONAL SCIENCES, 2017, 31 (02) : 239 - 263
  • [37] Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits
    Shimizu, Tatsuhiro
    Tanaka, Koichi
    Kishimoto, Ren
    Kiyohara, Haruka
    Nomura, Masahiro
    Saito, Yuta
    PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024 : 733 - 741
  • [38] Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
    Taufiq, Muhammad Faaiz
    Doucet, Arnaud
    Cornish, Rob
    Ton, Jean-Francois
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [39] Statistical and Computational Trade-off in Multi-Agent Multi-Armed Bandits
    Vannella, Filippo
    Proutiere, Alexandre
    Jeong, Jaeseong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [40] LEVY BANDITS: MULTI-ARMED BANDITS DRIVEN BY LEVY PROCESSES
    Kaspi, Haya
    Mandelbaum, Avi
    ANNALS OF APPLIED PROBABILITY, 1995, 5 (02) : 541 - 565