Supervised Advantage Actor-Critic for Recommender Systems

Cited by: 14

Authors:

Xin, Xin [1]

Karatzoglou, Alexandros [2]

Arapakis, Ioannis [3]

Jose, Joemon M. [4]

Affiliations:

[1] Shandong Univ, Jinan, Peoples R China

[2] Google Res, London, England

[3] Tel Res, Barcelona, Spain

[4] Univ Glasgow, Glasgow, Lanark, Scotland

Source:

WSDM'22: PROCEEDINGS OF THE FIFTEENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING | 2022

Funding:

National Key Research and Development Program of China;

Keywords:

Recommendation; Reinforcement Learning; Actor-Critic; Q-learning; Advantage Actor-Critic; Negative Sampling;

DOI:

10.1145/3488560.3498494

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Casting session-based or sequential recommendation as reinforcement learning (RL) through reward signals is a promising research direction towards recommender systems (RS) that maximize cumulative profits. However, the direct use of RL algorithms in the RS setting is impractical due to challenges like off-policy training, huge action spaces and lack of sufficient reward signals. Recent RL approaches for RS attempt to tackle these challenges by combining RL and (self-)supervised sequential learning, but still suffer from certain limitations. For example, the estimation of Q-values tends to be biased toward positive values due to the lack of negative reward signals. Moreover, the Q-values depend heavily on the specific timestamp of a sequence. To address the above problems, we propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning. We call this method Supervised Negative Q-learning (SNQN). Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case, which can be further utilized as a normalized weight for learning the supervised sequential part. This leads to another learning framework: Supervised Advantage Actor-Critic (SA2C). We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets. Experimental results show that the proposed approaches achieve significantly better performance than state-of-the-art supervised methods and existing self-supervised RL methods.
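
The advantage idea in the abstract can be illustrated with a small sketch. This is only one plausible reading of the abstract, not the paper's exact formulation: the function names (`sample_negatives`, `advantage`, `weighted_cross_entropy`), the uniform sampling scheme, and the choice of baseline (the mean Q-value over the positive item plus the sampled negatives) are all assumptions made here for illustration.

```python
import math
import random


def sample_negatives(num_items, positive, k, rng):
    """Uniformly sample k distinct item ids other than the positive one.

    Hypothetical helper: the sampling distribution is an assumption.
    """
    candidates = [i for i in range(num_items) if i != positive]
    return rng.sample(candidates, k)


def advantage(q_values, positive, negatives):
    """Q-value of the positive action minus an 'average case' baseline.

    Assumption: the baseline is the mean Q-value over the positive item
    and the sampled negative items.
    """
    pool = [positive] + list(negatives)
    baseline = sum(q_values[i] for i in pool) / len(pool)
    return q_values[positive] - baseline


def weighted_cross_entropy(logits, positive, weight):
    """Softmax cross-entropy for the positive item, scaled by a weight."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return weight * (log_z - logits[positive])


# Usage: weight the supervised loss of one step by its advantage.
rng = random.Random(0)
q = [0.2, 1.0, 0.4, 0.6]          # made-up Q-values for 4 items
pos = 1                           # the item the user actually interacted with
negs = sample_negatives(len(q), pos, 2, rng)
adv = advantage(q, pos, negs)
loss = weighted_cross_entropy([0.1, 0.7, 0.0, 0.2], pos, adv)
```

In this sketch the advantage is treated as a fixed scalar when it scales the supervised loss; whether and how gradients flow through it in the actual method is not stated in the abstract.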


Pages: 1186-1196

Number of pages: 11

