Pessimistic Reward Models for Off-Policy Learning in Recommendation

Cited by: 27
Authors
Jeunen, Olivier [1 ]
Goethals, Bart [1 ]
Affiliations
[1] Univ Antwerp, Adrem Data Lab, Antwerp, Belgium
Source
15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021) | 2021
Keywords
Contextual Bandits; Offline Reinforcement Learning; Probabilistic Models
DOI
10.1145/3460231.3474247
CLC classification code
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Methods for bandit learning from user interactions often require a model of the reward a certain context-action pair will yield: for example, the probability of a click on a recommendation. This common machine learning task is highly non-trivial, as the data-generating process for contexts and actions is often skewed by the recommender system itself. Indeed, when the deployed recommendation policy at data collection time does not pick its actions uniformly at random, this leads to a selection bias that can impede effective reward modelling. This in turn makes off-policy learning, the typical setup in industry, particularly challenging. In this work, we propose and validate a general pessimistic reward modelling approach for off-policy learning in recommendation. Bayesian uncertainty estimates allow us to express scepticism about our own reward model, which can in turn be used to generate a conservative decision rule. We show how it alleviates a well-known decision-making phenomenon known as the Optimiser's Curse, and draw parallels with existing work on pessimistic policy learning. Leveraging the available closed-form expressions for both the posterior mean and variance when a ridge regressor models the reward, we show how to apply pessimism effectively and efficiently to an off-policy recommendation use-case. Empirical observations in a wide range of environments show that being conservative in decision-making leads to a significant and robust increase in recommendation performance. The merits of our approach are most pronounced in realistic settings with limited logging randomisation, limited training samples, and larger action spaces.
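To make the mechanism concrete, the following is a minimal Python sketch (not the authors' code) of the idea summarised above: a Bayesian ridge reward model with closed-form posterior mean and variance, combined with a lower-confidence-bound decision rule. The function names, synthetic data, and pessimism strength alpha are illustrative assumptions only.

    import numpy as np

    def fit_bayesian_ridge(X, y, lam=1.0):
        # Closed-form posterior of a Bayesian ridge reward model.
        # X   : (n, d) matrix of logged context-action features
        # y   : (n,) vector of observed rewards (e.g. clicks)
        # lam : prior precision / ridge regularisation strength
        d = X.shape[1]
        A = lam * np.eye(d) + X.T @ X        # posterior precision
        A_inv = np.linalg.inv(A)
        mu = A_inv @ (X.T @ y)               # posterior mean of the weights
        return mu, A_inv

    def pessimistic_action(mu, A_inv, candidates, alpha=1.0):
        # Pick the action maximising a lower confidence bound on reward.
        # candidates : (k, d) matrix, one feature row per candidate action
        # alpha      : pessimism strength (alpha = 0 recovers the usual
        #              "highest posterior mean" rule)
        mean = candidates @ mu                                   # predictive means
        var = np.sum((candidates @ A_inv) * candidates, axis=1)  # x^T A^{-1} x
        return int(np.argmax(mean - alpha * np.sqrt(var)))       # conservative pick

    # Illustrative usage on synthetic logged data (assumed feature construction).
    rng = np.random.default_rng(0)
    X_log = rng.normal(size=(500, 8))
    y_log = rng.binomial(1, 0.1, size=500).astype(float)
    mu, A_inv = fit_bayesian_ridge(X_log, y_log)
    best = pessimistic_action(mu, A_inv, rng.normal(size=(20, 8)), alpha=2.0)

Subtracting a multiple of the posterior standard deviation penalises actions whose reward estimates are uncertain, which is the conservative behaviour the abstract credits with alleviating the Optimiser's Curse.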
Pages: 63 - 74
Number of pages: 12
Related papers
50 records in total
  • [1] Off-policy Learning over Heterogeneous Information for Recommendation
    Wang, Xiangmeng
    Li, Qian
    Yu, Dianer
    Xu, Guandong
    PROCEEDINGS OF THE ACM WEB CONFERENCE 2022 (WWW'22), 2022, : 2348 - 2359
  • [2] Pessimistic Off-Policy Multi-Objective Optimization
    Alizadeh, Shima
    Bhargava, Aniruddha
    Gopalswamy, Karthick
    Jain, Lalit
    Kveton, Branislav
    Liu, Ge
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [3] Off-policy evaluation for slate recommendation
    Swaminathan, Adith
    Krishnamurthy, Akshay
    Agarwal, Alekh
    Dudik, Miroslav
    Langford, John
    Jose, Damien
    Zitouni, Imed
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [4] Doubly Pessimistic Algorithms for Strictly Safe Off-Policy Optimization
    Amani, Sanae
    Yang, Lin F.
    2022 56TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS), 2022, : 113 - 118
  • [5] Boosted Off-Policy Learning
    London, Ben
    Lu, Levi
    Sandler, Ted
    Joachims, Thorsten
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 206, 2023, 206
  • [6] Debiased Off-Policy Evaluation for Recommendation Systems
    Narita, Yusuke
    Yasui, Shota
    Yata, Kohei
    15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021), 2021, : 372 - 379
  • [7] RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning
    Hisaki, Yukinari
    Ono, Isao
    arXiv,
  • [8] Learning with Options that Terminate Off-Policy
    Harutyunyan, Anna
    Vrancx, Peter
    Bacon, Pierre-Luc
    Precup, Doina
    Nowe, Ann
    THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 3173 - 3182
  • [9] Online Learning with Off-Policy Feedback
    Gabbianelli, Germano
    Neu, Gergely
    Papini, Matteo
    INTERNATIONAL CONFERENCE ON ALGORITHMIC LEARNING THEORY, VOL 201, 2023, 201 : 620 - 641
  • [10] Average-Reward Off-Policy Policy Evaluation with Function Approximation
    Zhang, Shangtong
    Wan, Yi
    Sutton, Richard S.
    Whiteson, Shimon
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139