Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Times cited: 0
Authors
Wang, Yu-Xiang [1 ]
Agarwal, Alekh [2 ]
Dudik, Miroslav [2 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Microsoft Res, New York, NY 10011 USA
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
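The three estimators named in the abstract are easy to state concretely. The NumPy sketch below is illustrative only, not the authors' code: pi and mu are assumed (n, K) arrays of the target and logging policies' action probabilities per logged context, actions and rewards are the logged choices and outcomes, r_hat is an assumed (n, K) matrix of reward-model predictions, and tau is an assumed SWITCH weight threshold. SWITCH applies IPS where the importance weight is at most tau and falls back to the reward model elsewhere; the paper's SWITCH-DR variant, which uses DR rather than IPS on the small-weight part, is omitted for brevity.

import numpy as np

def ips_estimate(pi, mu, actions, rewards):
    # Inverse propensity scoring: reweight each logged reward by pi/mu.
    n = len(actions)
    rho = (pi / np.clip(mu, 1e-12, None))[np.arange(n), actions]
    return np.mean(rho * rewards)

def dr_estimate(pi, mu, actions, rewards, r_hat):
    # Doubly robust: model-based value plus an IPS correction of the
    # model's residual at the logged action.
    n = len(actions)
    rho = (pi / np.clip(mu, 1e-12, None))[np.arange(n), actions]
    baseline = np.sum(pi * r_hat, axis=1)  # E_{a ~ pi}[r_hat(x, a)]
    residual = rewards - r_hat[np.arange(n), actions]
    return np.mean(baseline + rho * residual)

def switch_estimate(pi, mu, actions, rewards, r_hat, tau):
    # SWITCH: IPS on actions whose importance weight is <= tau,
    # the (possibly biased) reward model on the rest.
    n = len(actions)
    rho = pi / np.clip(mu, 1e-12, None)      # (n, K) weights for all actions
    rho_logged = rho[np.arange(n), actions]
    ips_part = np.where(rho_logged <= tau, rho_logged * rewards, 0.0)
    model_part = np.sum(np.where(rho > tau, pi * r_hat, 0.0), axis=1)
    return np.mean(ips_part + model_part)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
n, K = 1000, 5
mu = rng.dirichlet(np.ones(K), size=n)
pi = rng.dirichlet(np.ones(K), size=n)
actions = np.array([rng.choice(K, p=p) for p in mu])
rewards = rng.random(n)
r_hat = rng.random((n, K))
print(ips_estimate(pi, mu, actions, rewards))
print(dr_estimate(pi, mu, actions, rewards, r_hat))
print(switch_estimate(pi, mu, actions, rewards, r_hat, tau=5.0))

Note that as tau grows, switch_estimate reduces to pure IPS, and as tau shrinks to zero it reduces to the purely model-based (direct method) estimate; the threshold is how SWITCH trades the variance of large importance weights against the bias of the reward model.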
Pages: 9
Related Papers
50 records in total
  • [21] Evaluating the Robustness of Off-Policy Evaluation
    Saito, Yuta
    Udagawa, Takuma
    Kiyohara, Haruka
    Mogi, Kazuki
    Narita, Yusuke
    Tateno, Kei
15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021), 2021: 114-123
  • [22] Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
    Xie, Tengyang
    Ma, Yifei
    Wang, Yu-Xiang
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [23] Adaptive Optimal Control of Linear Periodic Systems: An Off-Policy Value Iteration Approach
    Pang, Bo
    Jiang, Zhong-Ping
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2021, 66 (02) : 888 - 894
  • [24] Representation Balancing MDPs for Off-Policy Policy Evaluation
    Liu, Yao
    Gottesman, Omer
    Raghu, Aniruddh
    Komorowski, Matthieu
    Faisal, Aldo
    Doshi-Velez, Finale
    Brunskill, Emma
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [25] Consistent On-Line Off-Policy Evaluation
    Hallak, Assaf
    Mannor, Shie
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70, 2017, 70
  • [26] IntOPE: Off-Policy Evaluation in the Presence of Interference
    Bai, Yuqi
    Zhao, Ziyu
    Zhu, Minqin
    Kuang, Kun
arXiv, 2024
  • [27] Off-Policy Evaluation via the Regularized Lagrangian
    Yang, Mengjiao
    Nachum, Ofir
    Dai, Bo
    Li, Lihong
    Schuurmans, Dale
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [28] Safe Optimal Design with Applications in Off-Policy Learning
    Zhu, Ruihao
    Kveton, Branislav
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151
  • [29] Offline RL Without Off-Policy Evaluation
    Brandfonbrener, David
    Whitney, William F.
    Ranganath, Rajesh
    Bruna, Joan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [30] Learning Action Embeddings for Off-Policy Evaluation
    Cief, Matej
    Golebiowski, Jacek
    Schmidt, Philipp
    Abedjan, Ziawasch
    Bekasov, Artur
ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608: 108-122