Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Times cited: 0
Authors
Wang, Yu-Xiang [1 ]
Agarwal, Alekh [2 ]
Dudik, Miroslav [2 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Microsoft Res, New York, NY 10011 USA
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
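The three estimators named in the abstract are easy to state concretely. The NumPy sketch below is illustrative only, not the authors' code: pi and mu are assumed (n, K) arrays of the target and logging policies' action probabilities per logged context, actions and rewards are the logged choices and outcomes, r_hat is an assumed (n, K) matrix of reward-model predictions, and tau is an assumed SWITCH weight threshold. SWITCH applies IPS where the importance weight is at most tau and falls back to the reward model elsewhere; the paper's SWITCH-DR variant, which uses DR rather than IPS on the small-weight part, is omitted for brevity.

import numpy as np

def ips_estimate(pi, mu, actions, rewards):
    # Inverse propensity scoring: reweight each logged reward by pi/mu.
    n = len(actions)
    rho = (pi / np.clip(mu, 1e-12, None))[np.arange(n), actions]
    return np.mean(rho * rewards)

def dr_estimate(pi, mu, actions, rewards, r_hat):
    # Doubly robust: model-based value plus an IPS correction of the
    # model's residual at the logged action.
    n = len(actions)
    rho = (pi / np.clip(mu, 1e-12, None))[np.arange(n), actions]
    baseline = np.sum(pi * r_hat, axis=1)  # E_{a ~ pi}[r_hat(x, a)]
    residual = rewards - r_hat[np.arange(n), actions]
    return np.mean(baseline + rho * residual)

def switch_estimate(pi, mu, actions, rewards, r_hat, tau):
    # SWITCH: IPS on actions whose importance weight is <= tau,
    # the (possibly biased) reward model on the rest.
    n = len(actions)
    rho = pi / np.clip(mu, 1e-12, None)      # (n, K) weights for all actions
    rho_logged = rho[np.arange(n), actions]
    ips_part = np.where(rho_logged <= tau, rho_logged * rewards, 0.0)
    model_part = np.sum(np.where(rho > tau, pi * r_hat, 0.0), axis=1)
    return np.mean(ips_part + model_part)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
n, K = 1000, 5
mu = rng.dirichlet(np.ones(K), size=n)
pi = rng.dirichlet(np.ones(K), size=n)
actions = np.array([rng.choice(K, p=p) for p in mu])
rewards = rng.random(n)
r_hat = rng.random((n, K))
print(ips_estimate(pi, mu, actions, rewards))
print(dr_estimate(pi, mu, actions, rewards, r_hat))
print(switch_estimate(pi, mu, actions, rewards, r_hat, tau=5.0))

Note that as tau grows, switch_estimate reduces to pure IPS, and as tau shrinks to zero it reduces to the purely model-based (direct method) estimate; the threshold is how SWITCH trades the variance of large importance weights against the bias of the reward model.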
Pages: 9
Related Papers
50 records in total
  • [21] Evaluating the Robustness of Off-Policy Evaluation
    Saito, Yuta
    Udagawa, Takuma
    Kiyohara, Haruka
    Mogi, Kazuki
    Narita, Yusuke
    Tateno, Kei
15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021), 2021: 114-123
  • [22] Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
    Xie, Tengyang
    Ma, Yifei
    Wang, Yu-Xiang
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [23] Adaptive Optimal Control of Linear Periodic Systems: An Off-Policy Value Iteration Approach
    Pang, Bo
    Jiang, Zhong-Ping
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2021, 66 (02) : 888 - 894
  • [24] Representation Balancing MDPs for Off-Policy Policy Evaluation
    Liu, Yao
    Gottesman, Omer
    Raghu, Aniruddh
    Komorowski, Matthieu
    Faisal, Aldo
    Doshi-Velez, Finale
    Brunskill, Emma
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [25] Consistent On-Line Off-Policy Evaluation
    Hallak, Assaf
    Mannor, Shie
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70, 2017, 70
  • [26] IntOPE: Off-Policy Evaluation in the Presence of Interference
    Bai, Yuqi
    Zhao, Ziyu
    Zhu, Minqin
    Kuang, Kun
arXiv, 2024
  • [27] Off-Policy Evaluation via the Regularized Lagrangian
    Yang, Mengjiao
    Nachum, Ofir
    Dai, Bo
    Li, Lihong
    Schuurmans, Dale
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [28] Safe Optimal Design with Applications in Off-Policy Learning
    Zhu, Ruihao
    Kveton, Branislav
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151
  • [29] Offline RL Without Off-Policy Evaluation
    Brandfonbrener, David
    Whitney, William F.
    Ranganath, Rajesh
    Bruna, Joan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [30] Learning Action Embeddings for Off-Policy Evaluation
    Cief, Matej
    Golebiowski, Jacek
    Schmidt, Philipp
    Abedjan, Ziawasch
    Bekasov, Artur
ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608: 108-122