Minimax Off-Policy Evaluation for Multi-Armed Bandits

Cited by: 3
Authors
Ma, Cong [1]
Zhu, Banghua [2]
Jiao, Jiantao [2,3]
Wainwright, Martin J. [2,3]
Affiliations
[1] Univ Chicago, Dept Stat, Chicago, IL 60637 USA
[2] Univ Calif Berkeley UC Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley UC Berkeley, Dept Stat, Berkeley, CA 94720 USA
Keywords
Switches; probability; Monte Carlo methods; Chebyshev approximation; measurement; computational modeling; sociology; off-policy evaluation; multi-armed bandits; minimax optimality; importance sampling; polynomials
DOI
10.1109/TIT.2022.3162335
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must incur mean-squared error larger, relative to the oracle estimator equipped with knowledge of the behavior policy, by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial-knowledge setting, in which the minimum probability assigned by the behavior policy is assumed known. We show that the plug-in estimator is optimal for relatively large values of the minimum probability, but is sub-optimal when the minimum probability is low. To remedy this gap, we propose a new estimator, based on approximation by Chebyshev polynomials, that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
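The Switch-style estimator described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the fixed threshold `tau` on the importance ratio, and the zero-imputation for unobserved arms are all assumptions made here for exposition. The idea it demonstrates is the one named in the abstract: use importance sampling on arms where the ratio of target to behavior probability is small, and fall back to the plug-in (empirical-mean) estimator on arms where that ratio is large.

```python
import numpy as np

def switch_estimator(actions, rewards, pi_b, pi_t, tau):
    """Sketch of a Switch-style off-policy value estimate in a K-armed bandit.

    actions : int array of arms pulled under the behavior policy
    rewards : float array of observed rewards (same length as actions)
    pi_b    : behavior-policy probabilities over the K arms (assumed known)
    pi_t    : target-policy probabilities over the K arms
    tau     : threshold on the importance ratio pi_t / pi_b (hypothetical rule)
    """
    ratio = pi_t / pi_b
    low = ratio <= tau  # arms handled by importance sampling

    # Importance-sampling component: reweight samples from low-ratio arms,
    # contributing zero for samples drawn from high-ratio arms.
    is_part = np.mean(np.where(low[actions], ratio[actions] * rewards, 0.0))

    # Plug-in component: empirical mean reward for each high-ratio arm,
    # weighted by the target policy's probability of that arm.
    plug_part = 0.0
    for a in np.flatnonzero(~low):
        pulls = rewards[actions == a]
        mu_hat = pulls.mean() if pulls.size > 0 else 0.0
        plug_part += pi_t[a] * mu_hat

    return is_part + plug_part
```

Setting `tau` very large recovers pure importance sampling, while `tau = 0` recovers the pure plug-in estimator; the paper's contribution concerns how to interpolate between the two to achieve the minimax rate.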
Pages: 5314-5339
Page count: 26