Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits

Cited by: 11
Authors
Zhan, Ruohan [1]
Hadad, Vitor [1]
Hirshberg, David A. [1]
Athey, Susan [1]
Affiliations
[1] Stanford University, Stanford, CA 94305, USA
Source
KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining | 2021
Keywords
contextual bandits; off-policy evaluation; adaptive weighting; variance reduction
DOI
10.1145/3447548.3467456
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
It has become increasingly common for data to be collected adaptively, for example using contextual bandits. Historical data of this type can be used to evaluate other treatment assignment policies to guide future innovation or experiments. However, policy evaluation is challenging if the target policy differs from the one used to collect data, and popular estimators, including doubly robust (DR) estimators, can be plagued by bias, excessive variance, or both. In particular, when the pattern of treatment assignment in the collected data looks little like the pattern generated by the policy to be evaluated, the importance weights used in DR estimators explode, leading to excessive variance. In this paper, we improve the DR estimator by adaptively weighting observations to control its variance. We show that a t-statistic based on our improved estimator is asymptotically normal under certain conditions, allowing us to form confidence intervals and test hypotheses. Using synthetic data and public benchmarks, we provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
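To make the idea concrete, below is a minimal sketch (not the authors' code) of a weighted doubly robust (AIPW) value estimate on synthetic adaptively collected bandit data. The drifting propensity schedule `e`, the crude outcome model `m_hat`, and the stabilizing weight `h_t = 1 / sqrt(sum_a pi(a)^2 / e_t(a))` are all illustrative assumptions; the paper develops the precise adaptive weight choices and the conditions under which the studentized statistic is asymptotically normal.

```python
# Hedged sketch: adaptively weighted DR (AIPW) off-policy evaluation.
# Assumes known logging propensities e_t and a fixed outcome model m_hat;
# contexts are omitted for brevity, so m_hat is a per-arm mean estimate.
import numpy as np

rng = np.random.default_rng(0)
T, K = 5000, 3                            # rounds, number of arms

true_means = np.array([0.3, 0.5, 0.7])    # hypothetical arm means
target_pi = np.array([0.1, 0.1, 0.8])     # policy to evaluate

# Logging propensities drift over time (a deterministic drift stands in
# for an adaptive algorithm): early rounds explore uniformly, later
# rounds concentrate on arm 0, so ratios pi/e_t for arm 2 grow large.
mix = np.linspace(0.0, 0.9, T)[:, None]
e = (1 - mix) * np.full(K, 1.0 / K) + mix * np.array([0.8, 0.1, 0.1])

actions = np.array([rng.choice(K, p=e_t) for e_t in e])
rewards = rng.normal(true_means[actions], 1.0)
m_hat = np.full(K, 0.5)                   # crude plug-in outcome model

# AIPW / DR score per round:
#   Gamma_t = sum_a pi(a) m_hat(a) + pi(A_t)/e_t(A_t) * (Y_t - m_hat(A_t))
dm = target_pi @ m_hat
ratio = target_pi[actions] / e[np.arange(T), actions]
gamma = dm + ratio * (rewards - m_hat[actions])

dr = gamma.mean()                         # uniform weights: standard DR

# Adaptive weights downweight rounds whose importance ratios can explode.
# This inverse-root-variance-proxy choice is an assumption in the spirit
# of the paper's stabilizing weights, not its exact formula.
h = 1.0 / np.sqrt((target_pi**2 / e).sum(axis=1))
adr = (h * gamma).sum() / h.sum()

# Studentized interval from the weighted scores, as suggested by the
# paper's asymptotic normality result.
se = np.sqrt((h**2 * (gamma - adr) ** 2).sum()) / h.sum()
print(f"true value {target_pi @ true_means:.3f}, DR {dr:.3f}, "
      f"adaptive DR {adr:.3f} +/- {1.96 * se:.3f}")
```

Setting all `h_t` equal recovers the usual DR estimator, so the sketch isolates the effect of the weighting: rounds where the logging policy rarely plays the target policy's preferred arm contribute less, trading a small bias-free reweighting for a large variance reduction.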
Pages: 2125-2135
Number of pages: 11