Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits

Cited by: 11
Authors
Zhan, Ruohan [1]
Hadad, Vitor [1]
Hirshberg, David A. [1]
Athey, Susan [1]
Affiliations
[1] Stanford University, Stanford, CA 94305, USA
Source
KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining | 2021
Keywords
contextual bandits; off-policy evaluation; adaptive weighting; variance reduction
DOI
10.1145/3447548.3467456
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
It has become increasingly common for data to be collected adaptively, for example using contextual bandits. Historical data of this type can be used to evaluate other treatment assignment policies to guide future innovation or experiments. However, policy evaluation is challenging if the target policy differs from the one used to collect data, and popular estimators, including doubly robust (DR) estimators, can be plagued by bias, excessive variance, or both. In particular, when the pattern of treatment assignment in the collected data looks little like the pattern generated by the policy to be evaluated, the importance weights used in DR estimators explode, leading to excessive variance. In this paper, we improve the DR estimator by adaptively weighting observations to control its variance. We show that a t-statistic based on our improved estimator is asymptotically normal under certain conditions, allowing us to form confidence intervals and test hypotheses. Using synthetic data and public benchmarks, we provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
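To make the idea concrete, below is a minimal sketch (not the authors' code) of a weighted doubly robust (AIPW) value estimate on synthetic adaptively collected bandit data. The drifting propensity schedule `e`, the crude outcome model `m_hat`, and the stabilizing weight `h_t = 1 / sqrt(sum_a pi(a)^2 / e_t(a))` are all illustrative assumptions; the paper develops the precise adaptive weight choices and the conditions under which the studentized statistic is asymptotically normal.

```python
# Hedged sketch: adaptively weighted DR (AIPW) off-policy evaluation.
# Assumes known logging propensities e_t and a fixed outcome model m_hat;
# contexts are omitted for brevity, so m_hat is a per-arm mean estimate.
import numpy as np

rng = np.random.default_rng(0)
T, K = 5000, 3                            # rounds, number of arms

true_means = np.array([0.3, 0.5, 0.7])    # hypothetical arm means
target_pi = np.array([0.1, 0.1, 0.8])     # policy to evaluate

# Logging propensities drift over time (a deterministic drift stands in
# for an adaptive algorithm): early rounds explore uniformly, later
# rounds concentrate on arm 0, so ratios pi/e_t for arm 2 grow large.
mix = np.linspace(0.0, 0.9, T)[:, None]
e = (1 - mix) * np.full(K, 1.0 / K) + mix * np.array([0.8, 0.1, 0.1])

actions = np.array([rng.choice(K, p=e_t) for e_t in e])
rewards = rng.normal(true_means[actions], 1.0)
m_hat = np.full(K, 0.5)                   # crude plug-in outcome model

# AIPW / DR score per round:
#   Gamma_t = sum_a pi(a) m_hat(a) + pi(A_t)/e_t(A_t) * (Y_t - m_hat(A_t))
dm = target_pi @ m_hat
ratio = target_pi[actions] / e[np.arange(T), actions]
gamma = dm + ratio * (rewards - m_hat[actions])

dr = gamma.mean()                         # uniform weights: standard DR

# Adaptive weights downweight rounds whose importance ratios can explode.
# This inverse-root-variance-proxy choice is an assumption in the spirit
# of the paper's stabilizing weights, not its exact formula.
h = 1.0 / np.sqrt((target_pi**2 / e).sum(axis=1))
adr = (h * gamma).sum() / h.sum()

# Studentized interval from the weighted scores, as suggested by the
# paper's asymptotic normality result.
se = np.sqrt((h**2 * (gamma - adr) ** 2).sum()) / h.sum()
print(f"true value {target_pi @ true_means:.3f}, DR {dr:.3f}, "
      f"adaptive DR {adr:.3f} +/- {1.96 * se:.3f}")
```

Setting all `h_t` equal recovers the usual DR estimator, so the sketch isolates the effect of the weighting: rounds where the logging policy rarely plays the target policy's preferred arm contribute less, trading a small bias-free reweighting for a large variance reduction.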
Pages: 2125-2135
Number of pages: 11