Off-Policy Actor-critic for Recommender Systems

Cited by: 22
Authors
Chen, Minmin [1 ]
Xu, Can [2 ]
Gatto, Vince [2 ]
Jain, Devanshu [2 ]
Kumar, Aviral [1 ]
Chi, Ed [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Google Inc, Mountain View, CA USA
Keywords
reinforcement learning; batch RL; off-policy actor-critic; pessimism; recommender systems; REINFORCEMENT; GO; GAME;
DOI
10.1145/3523227.3546758
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Industrial recommendation platforms are increasingly concerned with making recommendations that improve users' long-term experience on the platform. Reinforcement learning (RL) naturally emerged as an appealing approach for its promise in 1) combating the feedback loop effect resulting from myopic system behaviors; and 2) sequential planning to optimize long-term outcomes. Scaling RL algorithms to production recommender systems serving billions of users and contents, however, remains challenging. The sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables the use of off-policy data and batch learning; on the other hand, it faces significant challenges in learning due to distribution shift. A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised-learning production system. Off-policy correction was employed to learn from logged data; the algorithm partially mitigates the distribution shift by employing a one-step importance weighting. We resort to off-policy actor-critic algorithms to address the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates the value of any state-action pair under the learned target policy through temporal-difference learning. We demonstrate in offline and live experiments that the new framework outperforms the baseline and improves long-term user experience. An interesting discovery along our investigation is that recommendation agents employing a softmax policy parameterization can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.
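The two ingredients the abstract describes — a softmax-parameterized actor and a critic trained by temporal-difference learning on logged, off-policy data — can be illustrated with a minimal tabular sketch. This is a hypothetical toy in numpy, not the paper's production implementation; all names, shapes, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))   # actor: softmax logits per state
Q = np.zeros((n_states, n_actions))       # critic: tabular Q(s, a)
gamma, lr_actor, lr_critic = 0.9, 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def update(s, a, r, s_next):
    """One off-policy actor-critic step on a single logged transition."""
    # Critic: TD(0) update toward the value of the *target* (learned) policy,
    # not the behavior policy that logged the data.
    pi_next = softmax(theta[s_next])
    td_target = r + gamma * pi_next @ Q[s_next]
    Q[s, a] += lr_critic * (td_target - Q[s, a])
    # Actor: policy-gradient step using the critic's Q as the action value.
    # (The production agent additionally applies off-policy correction,
    # e.g. an importance weight, which is omitted in this toy.)
    pi = softmax(theta[s])
    grad_log = -pi
    grad_log[a] += 1.0                    # grad of log pi(a|s) for softmax
    theta[s] += lr_actor * Q[s, a] * grad_log

# Replay logged transitions (s, a, r, s') collected by a uniform
# behavior policy on a toy MDP where action (s mod 3) is rewarded.
for _ in range(200):
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))
    r = 1.0 if a == s % n_actions else 0.0
    update(s, a, r, (s + 1) % n_states)

# Greedy action of the learned policy in each state.
print([int(np.argmax(softmax(theta[s]))) for s in range(n_states)])
```

The key structural point matches the abstract: the critic bootstraps with the target policy's expected next-state value (temporal-difference learning under the learned policy), and the actor is a softmax policy updated with the critic's value estimates rather than raw logged rewards.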
Pages: 338-349
Page count: 12
Related papers
50 in total
  • [31] Boosting On-Policy Actor-Critic With Shallow Updates in Critic
    Li, Luntong
    Zhu, Yuanheng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, : 1 - 10
  • [32] Fast and stable learning of quasi-passive dynamic walking by an unstable biped robot based on off-policy natural actor-critic
    Ueno, Tsuyoshi
    Nakamura, Yutaka
    Takuma, Takashi
    Shibata, Tomohiro
    Hosoda, Koh
    Ishii, Shin
    2006 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, VOLS 1-12, 2006, : 5226 - +
  • [33] Off-policy Learning in Two-stage Recommender Systems
    Ma, Jiaqi
    Zhao, Zhe
    Yi, Xinyang
    Yang, Ji
    Chen, Minmin
    Tang, Jiaxi
    Hong, Lichan
    Chi, Ed H.
    WEB CONFERENCE 2020: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2020), 2020, : 463 - 473
  • [34] Actor-critic algorithms
    Konda, VR
    Tsitsiklis, JN
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 12, 2000, 12 : 1008 - 1014
  • [35] On actor-critic algorithms
    Konda, VR
    Tsitsiklis, JN
    SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 2003, 42 (04) : 1143 - 1166
  • [36] Natural Actor-Critic
    Peters, Jan
    Schaal, Stefan
    NEUROCOMPUTING, 2008, 71 (7-9) : 1180 - 1190
  • [37] Natural Actor-Critic
    Peters, J
    Vijayakumar, S
    Schaal, S
    MACHINE LEARNING: ECML 2005, PROCEEDINGS, 2005, 3720 : 280 - 291
  • [38] Optimal Actor-Critic Policy With Optimized Training Datasets
    Banerjee, Chayan
    Chen, Zhiyong
    Noman, Nasimul
    Zamani, Mohsen
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2022, 6 (06): : 1324 - 1334
  • [39] Policy-Gradient Based Actor-Critic Algorithms
    Awate, Yogesh P.
    PROCEEDINGS OF THE 2009 WRI GLOBAL CONGRESS ON INTELLIGENT SYSTEMS, VOL III, 2009, : 505 - 509
  • [40] Exploring Policy Diversity in Parallel Actor-Critic Learning
    Zhang, Yanqiang
    Zhai, Yuanzhao
    Zhou, Gongqian
    Ding, Bo
    Feng, Dawei
    Liu, Songwang
    2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 1196 - 1203