Prioritized Experience Replay based on Multi-armed Bandit

Cited by: 12
Authors
Liu, Ximing [1]
Zhu, Tianqing [2]
Jiang, Cuiqing [1]
Ye, Dayong [2]
Zhao, Fuqing [3]
Affiliations
[1] Hefei Univ Technol, Sch Management, Hefei, Anhui, Peoples R China
[2] Univ Technol Sydney, Sch Comp Sci, Sydney, NSW, Australia
[3] Lanzhou Univ Technol, Sch Comp & Commun Technol, Lanzhou 730050, Peoples R China
Keywords
Deep reinforcement learning; Q-learning; Deep Q-network; Experience replay; Multi-armed Bandit;
DOI
10.1016/j.eswa.2021.116023
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Experience replay has been widely used in deep reinforcement learning. The technique allows online reinforcement learning agents to remember and reuse past experiences. To further improve the sampling efficiency of experience replay, the most useful experiences should be sampled with higher frequency. Existing methods usually design their sampling strategy according to a few criteria, but they tend to combine the criteria in a linear or fixed manner, so the strategy is static and independent of the learning agent. This ignores the dynamic nature of the environment and can only lead to suboptimal performance. In this work, we propose a dynamic experience replay strategy based on the interaction between the agent and the environment, called Prioritized Experience Replay based on Multi-armed Bandit (PERMAB). PERMAB adaptively combines multiple priority criteria to measure the importance of each experience. In particular, the weight of each assessing criterion is adaptively adjusted from episode to episode according to its contribution to the agent's performance, which guarantees that criteria useful in the current state are weighted more heavily. The proposed replay strategy takes both sample informativeness and diversity into consideration, which significantly boosts the learning ability and speed of the game agent. Experimental results show that PERMAB accelerates network learning and achieves better performance than baseline algorithms on seven benchmark environments of varying difficulty.
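To make the idea of bandit-weighted prioritization concrete, the following is a minimal sketch, assuming two illustrative priority criteria (TD-error magnitude and recency) and an EXP3-style exponential weight update. The class name BanditPrioritizedReplay, the parameter eta, and the per-criterion gain signal are hypothetical illustrations, not the exact criteria or bandit rule used in the paper.

# Minimal sketch of bandit-weighted prioritized replay (assumed design,
# not the authors' exact formulation).
import random
import math
from collections import deque

class BanditPrioritizedReplay:
    def __init__(self, capacity=10_000, n_criteria=2, eta=0.1):
        self.buffer = deque(maxlen=capacity)
        self.weights = [1.0] * n_criteria   # one bandit weight per priority criterion
        self.eta = eta                       # bandit learning rate (hypothetical)

    def add(self, transition, td_error, step):
        # store the transition together with the data needed for each criterion
        self.buffer.append((transition, abs(td_error), step))

    @staticmethod
    def _norm(xs):
        # normalize a list of non-negative scores into a distribution
        s = sum(xs) or 1.0
        return [x / s for x in xs]

    def _scores(self, latest_step):
        # criterion 1: TD-error magnitude; criterion 2: recency
        td = [item[1] for item in self.buffer]
        rec = [1.0 / (1 + latest_step - item[2]) for item in self.buffer]
        return td, rec

    def sample(self, batch_size, latest_step):
        td, rec = self._scores(latest_step)
        crit = [self._norm(td), self._norm(rec)]
        w_sum = sum(self.weights)
        # combined priority = weight-averaged mixture of the normalized criteria
        probs = [sum((w / w_sum) * crit[c][i] for c, w in enumerate(self.weights))
                 for i in range(len(self.buffer))]
        idx = random.choices(range(len(self.buffer)), weights=probs, k=batch_size)
        return [self.buffer[i][0] for i in idx]

    def update_weights(self, per_criterion_gain):
        # bandit-style update: criteria whose samples contributed more to the
        # episode's return receive exponentially larger weight next episode
        for c, g in enumerate(per_criterion_gain):
            self.weights[c] *= math.exp(self.eta * g)

In this sketch, after each episode the agent would call update_weights with an estimate of how much each criterion's samples improved the return, so the next episode's sampling distribution leans toward whichever criterion is currently more useful.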
Pages: 11