Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Cited by: 0
Authors
Wang, Yu-Xiang [1]
Agarwal, Alekh [2]
Dudik, Miroslav [2]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Microsoft Res, New York, NY 10011 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
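The abstract names three estimators but gives no formulas. The following is a minimal sketch of how they are commonly written, not the authors' implementation. All function and argument names (`ips`, `dr`, `switch`, `pi_all`, `mu_all`, `rhat_all`, `tau`) are hypothetical. It assumes a finite action set, known logging propensities, and the basic form of the switching rule: keep the importance-weighted reward when the weight is at most a threshold `tau`, and fall back on the reward model for actions whose weight exceeds it.

```python
import numpy as np

def _weights(actions, pi_all, mu_all):
    """Importance weights pi/mu for every action and for the logged action."""
    rho_all = pi_all / mu_all                      # shape (n, K)
    rows = np.arange(len(actions))
    return rho_all, rho_all[rows, actions]         # shapes (n, K) and (n,)

def ips(actions, rewards, pi_all, mu_all):
    """Inverse propensity scoring: mean of weight * observed reward."""
    _, rho = _weights(actions, pi_all, mu_all)
    return float(np.mean(rho * rewards))

def dr(actions, rewards, pi_all, mu_all, rhat_all):
    """Doubly robust: model prediction plus a weighted residual correction."""
    _, rho = _weights(actions, pi_all, mu_all)
    rows = np.arange(len(actions))
    direct = np.sum(pi_all * rhat_all, axis=1)     # model value under target policy
    return float(np.mean(direct + rho * (rewards - rhat_all[rows, actions])))

def switch(actions, rewards, pi_all, mu_all, rhat_all, tau):
    """Assumed switching rule: IPS where weight <= tau, reward model elsewhere."""
    rho_all, rho = _weights(actions, pi_all, mu_all)
    ips_part = rho * rewards * (rho <= tau)
    model_part = np.sum(pi_all * rhat_all * (rho_all > tau), axis=1)
    return float(np.mean(ips_part + model_part))
```

Under this sketch, a very large `tau` reduces `switch` to `ips`, and a `tau` below every weight reduces it to the pure reward-model estimate. Intermediate values trade the variance of large importance weights against the bias of a possibly inconsistent reward model, which is the tradeoff the abstract describes.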
Pages: 9