Pessimistic Reward Models for Off-Policy Learning in Recommendation

Cited by: 27
Authors
Jeunen, Olivier [1 ]
Goethals, Bart [1 ]
Affiliations
[1] Univ Antwerp, Adrem Data Lab, Antwerp, Belgium
Source
15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021) | 2021
Keywords
Contextual Bandits; Offline Reinforcement Learning; Probabilistic Models
DOI
10.1145/3460231.3474247
CLC classification code
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Methods for bandit learning from user interactions often require a model of the reward a certain context-action pair will yield: for example, the probability of a click on a recommendation. This common machine learning task is highly non-trivial, as the data-generating process for contexts and actions is often skewed by the recommender system itself. Indeed, when the deployed recommendation policy at data collection time does not pick its actions uniformly at random, this leads to a selection bias that can impede effective reward modelling. This in turn makes off-policy learning, the typical setup in industry, particularly challenging. In this work, we propose and validate a general pessimistic reward modelling approach for off-policy learning in recommendation. Bayesian uncertainty estimates allow us to express scepticism about our own reward model, which can in turn be used to generate a conservative decision rule. We show how it alleviates a well-known decision-making phenomenon known as the Optimiser's Curse, and draw parallels with existing work on pessimistic policy learning. Leveraging the available closed-form expressions for both the posterior mean and variance when a ridge regressor models the reward, we show how to apply pessimism effectively and efficiently to an off-policy recommendation use-case. Empirical observations in a wide range of environments show that being conservative in decision-making leads to a significant and robust increase in recommendation performance. The merits of our approach are most pronounced in realistic settings with limited logging randomisation, limited training samples, and larger action spaces.
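To make the mechanism concrete, the following is a minimal Python sketch (not the authors' code) of the idea summarised above: a Bayesian ridge reward model with closed-form posterior mean and variance, combined with a lower-confidence-bound decision rule. The function names, synthetic data, and pessimism strength alpha are illustrative assumptions only.

    import numpy as np

    def fit_bayesian_ridge(X, y, lam=1.0):
        # Closed-form posterior of a Bayesian ridge reward model.
        # X   : (n, d) matrix of logged context-action features
        # y   : (n,) vector of observed rewards (e.g. clicks)
        # lam : prior precision / ridge regularisation strength
        d = X.shape[1]
        A = lam * np.eye(d) + X.T @ X        # posterior precision
        A_inv = np.linalg.inv(A)
        mu = A_inv @ (X.T @ y)               # posterior mean of the weights
        return mu, A_inv

    def pessimistic_action(mu, A_inv, candidates, alpha=1.0):
        # Pick the action maximising a lower confidence bound on reward.
        # candidates : (k, d) matrix, one feature row per candidate action
        # alpha      : pessimism strength (alpha = 0 recovers the usual
        #              "highest posterior mean" rule)
        mean = candidates @ mu                                   # predictive means
        var = np.sum((candidates @ A_inv) * candidates, axis=1)  # x^T A^{-1} x
        return int(np.argmax(mean - alpha * np.sqrt(var)))       # conservative pick

    # Illustrative usage on synthetic logged data (assumed feature construction).
    rng = np.random.default_rng(0)
    X_log = rng.normal(size=(500, 8))
    y_log = rng.binomial(1, 0.1, size=500).astype(float)
    mu, A_inv = fit_bayesian_ridge(X_log, y_log)
    best = pessimistic_action(mu, A_inv, rng.normal(size=(20, 8)), alpha=2.0)

Subtracting a multiple of the posterior standard deviation penalises actions whose reward estimates are uncertain, which is the conservative behaviour the abstract credits with alleviating the Optimiser's Curse.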
Pages: 63 - 74
Number of pages: 12
Related papers
50 records in total
  • [1] Off-policy Learning over Heterogeneous Information for Recommendation
    Wang, Xiangmeng
    Li, Qian
    Yu, Dianer
    Xu, Guandong
    PROCEEDINGS OF THE ACM WEB CONFERENCE 2022 (WWW'22), 2022, : 2348 - 2359
  • [2] Pessimistic Off-Policy Multi-Objective Optimization
    Alizadeh, Shima
    Bhargava, Aniruddha
    Gopalswamy, Karthick
    Jain, Lalit
    Kveton, Branislav
    Liu, Ge
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [3] Off-policy evaluation for slate recommendation
    Swaminathan, Adith
    Krishnamurthy, Akshay
    Agarwal, Alekh
    Dudik, Miroslav
    Langford, John
    Jose, Damien
    Zitouni, Imed
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [4] Doubly Pessimistic Algorithms for Strictly Safe Off-Policy Optimization
    Amani, Sanae
    Yang, Lin F.
    2022 56TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS), 2022, : 113 - 118
  • [5] Boosted Off-Policy Learning
    London, Ben
    Lu, Levi
    Sandler, Ted
    Joachims, Thorsten
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 206, 2023, 206
  • [6] Debiased Off-Policy Evaluation for Recommendation Systems
    Narita, Yusuke
    Yasui, Shota
    Yata, Kohei
    15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021), 2021, : 372 - 379
  • [7] RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning
    Hisaki, Yukinari
    Ono, Isao
    arXiv,
  • [8] Learning with Options that Terminate Off-Policy
    Harutyunyan, Anna
    Vrancx, Peter
    Bacon, Pierre-Luc
    Precup, Doina
    Nowe, Ann
    THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 3173 - 3182
  • [9] Online Learning with Off-Policy Feedback
    Gabbianelli, Germano
    Neu, Gergely
    Papini, Matteo
    INTERNATIONAL CONFERENCE ON ALGORITHMIC LEARNING THEORY, VOL 201, 2023, 201 : 620 - 641
  • [10] Average-Reward Off-Policy Policy Evaluation with Function Approximation
    Zhang, Shangtong
    Wan, Yi
    Sutton, Richard S.
    Whiteson, Shimon
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139