Off-Policy Actor-critic for Recommender Systems

Cited by: 22
|
Authors
Chen, Minmin [1 ]
Xu, Can [2 ]
Gatto, Vince [2 ]
Jain, Devanshu [2 ]
Kumar, Aviral [1 ]
Chi, Ed [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Google Inc, Mountain View, CA USA
Keywords
reinforcement learning; batch RL; off-policy actor-critic; pessimism; recommender systems; REINFORCEMENT; GO; GAME;
DOI
10.1145/3523227.3546758
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Industrial recommendation platforms are increasingly concerned with making recommendations that cause users to enjoy their long-term experience on the platform. Reinforcement learning has emerged as an appealing approach for its promise in 1) combating the feedback loop effect resulting from myopic system behaviors, and 2) sequential planning to optimize long-term outcomes. Scaling RL algorithms to production recommender systems serving billions of users and contents, however, remains challenging. Sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables the usage of off-policy data and batch learning; it, on the other hand, faces significant learning challenges due to distribution shift. A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised learning production system. Off-policy correction was employed to learn from logged data: the algorithm partially mitigates the distribution shift by employing a one-step importance weighting. We resort to off-policy actor-critic algorithms to address the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates the value of any state-action pair under the learned target policy through temporal difference learning. We demonstrate in offline and live experiments that the new framework outperforms the baseline and improves long-term user experience. An interesting discovery along our investigation is that recommendation agents employing a softmax policy parameterization can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.
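The abstract describes extending an importance-weighted REINFORCE agent with a critic learned by temporal difference (TD) updates. A minimal tabular sketch of that update rule is below; this is an illustration under toy assumptions (tabular states/actions, a uniform logging policy, a synthetic reward), not the paper's production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))  # softmax policy logits (actor)
V = np.zeros(n_states)                   # state-value estimates (critic)
gamma, alpha_pi, alpha_v = 0.9, 0.1, 0.1


def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()


def update(s, a, r, s_next, behavior_prob):
    """One off-policy actor-critic step on a logged tuple (s, a, r, s')."""
    pi = softmax(theta[s])
    w = pi[a] / behavior_prob                 # one-step importance weight
    td_error = r + gamma * V[s_next] - V[s]   # temporal-difference error
    V[s] += alpha_v * w * td_error            # critic: off-policy TD(0)
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                     # gradient of log softmax(theta[s])[a]
    theta[s] += alpha_pi * w * td_error * grad_log_pi  # actor update
    return td_error


# Replay synthetic logged interactions from a uniform behavior policy;
# action 0 is rewarding in every state of this toy environment.
for _ in range(2000):
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))
    r = 1.0 if a == 0 else 0.0
    s_next = int(rng.integers(n_states))
    update(s, a, r, s_next, 1.0 / n_actions)

# After training, the learned policy should favor the rewarding action.
learned = [int(softmax(theta[s]).argmax()) for s in range(n_states)]
```

The importance weight `w` reweights each logged transition by how likely the current target policy is to take the logged action, which is the one-step off-policy correction the abstract attributes to [3]; the TD error then replaces the Monte-Carlo return as the learning signal for both critic and actor.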
Pages: 338 - 349
Page count: 12
Related Papers
50 records in total
  • [1] Generalized Off-Policy Actor-Critic
    Zhang, Shangtong
    Boehmer, Wendelin
    Whiteson, Shimon
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [2] SOAC: Supervised Off-Policy Actor-Critic for Recommender Systems
    Wu, Shiqing
    Xu, Guandong
    Wang, Xianzhi
    23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, ICDM 2023, 2023, : 14121 - 14626
  • [3] Meta attention for Off-Policy Actor-Critic
    Huang, Jiateng
    Huang, Wanrong
    Lan, Long
    Wu, Dan
    NEURAL NETWORKS, 2023, 163 : 86 - 96
  • [4] Off-Policy Actor-Critic with Emphatic Weightings
    Graves, Eric
    Imani, Ehsan
    Kumaraswamy, Raksha
    White, Martha
    JOURNAL OF MACHINE LEARNING RESEARCH, 2023, 24
  • [5] Noisy Importance Sampling Actor-Critic: An Off-Policy Actor-Critic With Experience Replay
    Tasfi, Norman
    Capretz, Miriam
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [6] Variance Penalized On-Policy and Off-Policy Actor-Critic
    Jain, Arushi
    Patil, Gandharv
    Jain, Ayush
    Khetarpal, Khimya
    Precup, Doina
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 7899 - 7907
  • [7] Off-Policy Actor-Critic Structure for Optimal Control of Unknown Systems With Disturbances
    Song, Ruizhuo
    Lewis, Frank L.
    Wei, Qinglai
    Zhang, Huaguang
    IEEE TRANSACTIONS ON CYBERNETICS, 2016, 46 (05) : 1041 - 1050
  • [8] Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality
    Xu, Tengyu
    Yang, Zhuoran
    Wang, Zhaoran
    Liang, Yingbin
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [9] Distributed off-Policy Actor-Critic Reinforcement Learning with Policy Consensus
    Zhang, Yan
    Zavlanos, Michael M.
    2019 IEEE 58TH CONFERENCE ON DECISION AND CONTROL (CDC), 2019, : 4674 - 4679
  • [10] Online Meta-Critic Learning for Off-Policy Actor-Critic Methods
    Zhou, Wei
    Li, Yiying
    Yang, Yongxin
    Wang, Huaimin
    Hospedales, Timothy M.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33