Off-Policy Actor-critic for Recommender Systems

Citations: 22
Authors
Chen, Minmin [1 ]
Xu, Can [2 ]
Gatto, Vince [2 ]
Jain, Devanshu [2 ]
Kumar, Aviral [1 ]
Chi, Ed [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Google Inc, Mountain View, CA USA
Keywords
reinforcement learning; batch RL; off-policy actor-critic; pessimism; recommender systems; REINFORCEMENT; GO; GAME;
DOI
10.1145/3523227.3546758
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Industrial recommendation platforms are increasingly concerned with making recommendations that improve users' long-term experience on the platform. Reinforcement learning (RL) emerged naturally as an appealing approach for its promise in 1) combating the feedback loop effect that results from myopic system behaviors; and 2) sequential planning to optimize long-term outcomes. Scaling RL algorithms to production recommender systems serving billions of users and content items, however, remains challenging. The sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables the use of off-policy data and batch learning; on the other hand, it faces significant learning challenges due to distribution shift. A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised-learning production system. Off-policy correction was employed to learn from logged data, and the algorithm partially mitigates the distribution shift through a one-step importance weighting. We resort to off-policy actor-critic algorithms to address the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates the value of any state-action pair under the target learned policy through temporal difference learning. We demonstrate in offline and live experiments that the new framework outperforms the baseline and improves long-term user experience. An interesting discovery along our investigation is that recommendation agents employing a softmax policy parameterization can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.
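The abstract outlines the agent's core recipe: keep an importance-weighted REINFORCE-style actor as in [3], parameterized as a softmax over items, and add a critic that learns Q(s, a) under the target policy via temporal-difference learning, which the actor then uses in place of observed returns. The following is a minimal illustrative sketch of that setup, not the authors' implementation; it assumes a PyTorch environment, a small item vocabulary, and logged (state, action, reward, next state, behavior-policy probability) tuples, and every module name, dimension, and the clipping constant is a hypothetical choice.

# Minimal sketch (assumptions as noted above), not the paper's production code.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ITEMS, STATE_DIM, GAMMA = 1000, 64, 0.97  # hypothetical sizes / discount

class Actor(nn.Module):
    """Softmax policy pi(a|s) over the item vocabulary; returns log-probabilities."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Linear(STATE_DIM, NUM_ITEMS)
    def forward(self, state):
        return F.log_softmax(self.logits(state), dim=-1)

class Critic(nn.Module):
    """Q(s, .): estimated value of every item in state s under the target policy."""
    def __init__(self):
        super().__init__()
        self.q = nn.Linear(STATE_DIM, NUM_ITEMS)
    def forward(self, state):
        return self.q(state)

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def train_step(state, action, reward, next_state, behavior_prob):
    """state/next_state: (B, STATE_DIM); action: (B,) long; reward, behavior_prob: (B,)."""
    # Critic update: one-step temporal-difference learning toward
    # r + gamma * E_{a' ~ pi}[Q(s', a')], with the target held fixed.
    with torch.no_grad():
        next_pi = actor(next_state).exp()                      # pi(.|s')
        v_next = (next_pi * critic(next_state)).sum(-1)        # E_pi[Q(s', .)]
        td_target = reward + GAMMA * v_next
    q_sa = critic(state).gather(1, action.unsqueeze(1)).squeeze(1)
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: REINFORCE-style gradient on the logged action, with a
    # one-step importance weight pi(a|s)/beta(a|s) (clipped) and the critic's
    # Q(s, a) standing in for the observed long-term return.
    log_pi = actor(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        weight = (log_pi.exp() / behavior_prob).clamp(max=10.0)
        q_value = critic(state).gather(1, action.unsqueeze(1)).squeeze(1)
    actor_loss = -(weight * q_value * log_pi).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

How the critic's Q-values behave on out-of-distribution actions (overly pessimistic versus optimistic estimates) is, per the abstract, the key design tension such a sketch leaves open.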
Pages: 338-349
Number of pages: 12