Prioritized Experience Replay based on Multi-armed Bandit

Cited by: 12
Authors
Liu, Ximing [1 ]
Zhu, Tianqing [2 ]
Jiang, Cuiqing [1 ]
Ye, Dayong [2 ]
Zhao, Fuqing [3 ]
Affiliations
[1] Hefei Univ Technol, Sch Management, Hefei, Anhui, Peoples R China
[2] Univ Technol Sydney, Sch Comp Sci, Sydney, NSW, Australia
[3] Lanzhou Univ Technol, Sch Comp & Commun Technol, Lanzhou 730050, Peoples R China
Keywords
Deep reinforcement learning; Q-learning; Deep Q-network; Experience replay; Multi-armed Bandit;
DOI
10.1016/j.eswa.2021.116023
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Experience replay has been widely used in deep reinforcement learning. The technique allows online reinforcement learning agents to remember and reuse past experiences. To further improve sampling efficiency, the most useful experiences should be sampled with higher frequency. Existing methods usually design their sampling strategy according to a few criteria, but they tend to combine those criteria in a linear or fixed manner, so the strategy is static and independent of the learning agent. This ignores the dynamic nature of the environment and can only lead to suboptimal performance. In this work, we propose a dynamic experience replay strategy driven by the interaction between the agent and the environment, called Prioritized Experience Replay based on Multi-armed Bandit (PERMAB). PERMAB adaptively combines multiple priority criteria to measure the importance of each experience. In particular, the weight of each criterion is adjusted from episode to episode according to its contribution to the agent's performance, which ensures that criteria useful in the current state are weighted more heavily. The proposed replay strategy takes both sample informativeness and diversity into consideration, which significantly boosts the learning ability and speed of the game agent. Experimental results show that PERMAB accelerates network learning and achieves better performance than baseline algorithms on seven benchmark environments of various difficulties.
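To make the idea in the abstract concrete, the following is a minimal, hypothetical Python sketch of a replay buffer that combines several priority criteria (here, TD-error magnitude and a novelty score) with mixture weights maintained by an exponential-weights, EXP3-style bandit updated from episode returns. All names and the specific update rule (BanditWeightedReplay, eta, explore, update_bandit, the episode-return baseline) are illustrative assumptions and are not taken from the paper; the exact criteria and bandit update used by PERMAB are defined in the article itself.

```python
import numpy as np


class BanditWeightedReplay:
    """Illustrative replay buffer: several priority criteria are combined with
    weights chosen by an exponential-weights (EXP3-style) bandit.
    A sketch of the general idea, not the paper's exact algorithm."""

    def __init__(self, capacity, n_criteria=2, eta=0.1, explore=0.1):
        self.capacity = capacity
        self.buffer = []                     # stored transitions
        self.scores = []                     # per-transition criterion scores
        self.eta = eta                       # bandit learning rate (assumed)
        self.explore = explore               # uniform exploration mass (assumed)
        self.log_weights = np.zeros(n_criteria)

    def criterion_probs(self):
        # Mix exponential weights with uniform exploration, EXP3-style.
        w = np.exp(self.log_weights - self.log_weights.max())
        return (1.0 - self.explore) * w / w.sum() + self.explore / len(w)

    def add(self, transition, td_error, novelty):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.scores.pop(0)
        self.buffer.append(transition)
        # Two example criteria: TD-error magnitude and a diversity/novelty score.
        self.scores.append(np.array([abs(td_error), novelty], dtype=float))

    def sample(self, batch_size):
        probs = self.criterion_probs()
        scores = np.stack(self.scores)               # shape (N, n_criteria)
        priority = scores @ probs + 1e-8             # weighted combination
        p = priority / priority.sum()
        idx = np.random.choice(len(self.buffer), size=batch_size, p=p)
        return [self.buffer[i] for i in idx]

    def update_bandit(self, episode_return, baseline_return):
        # Credit each criterion with the episode's improvement over a baseline,
        # scaled by how much that criterion contributed to the sampling mixture.
        probs = self.criterion_probs()
        reward = episode_return - baseline_return
        self.log_weights += self.eta * reward * probs
```

In a training loop, `update_bandit` would be called once per episode, with the episode return and, for example, a running average of previous returns as the baseline, so that criteria associated with above-average episodes gain weight in subsequent sampling.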
Pages: 11