Prioritized Experience Replay based on Multi-armed Bandit

Cited by: 12
Authors
Liu, Ximing [1 ]
Zhu, Tianqing [2 ]
Jiang, Cuiqing [1 ]
Ye, Dayong [2 ]
Zhao, Fuqing [3 ]
Affiliations
[1] Hefei Univ Technol, Sch Management, Hefei, Anhui, Peoples R China
[2] Univ Technol Sydney, Sch Comp Sci, Sydney, NSW, Australia
[3] Lanzhou Univ Technol, Sch Comp & Commun Technol, Lanzhou 730050, Peoples R China
Keywords
Deep reinforcement learning; Q-learning; Deep Q-network; Experience replay; Multi-armed bandit
DOI
10.1016/j.eswa.2021.116023
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Experience replay is widely used in deep reinforcement learning: it allows online reinforcement learning agents to remember and reuse past experiences. To further improve sampling efficiency, the most useful experiences should be sampled with higher frequency. Existing methods usually design their sampling strategies according to a few criteria, but they tend to combine these criteria in a linear or fixed manner, so the strategy is static and independent of the learning agent. This ignores the dynamic nature of the environment and can only lead to suboptimal performance. In this work, we propose a dynamic experience replay strategy driven by the interaction between the agent and the environment, called Prioritized Experience Replay based on Multi-armed Bandit (PERMAB). PERMAB adaptively combines multiple priority criteria to measure the importance of each experience. In particular, the weight of each criterion is adjusted from episode to episode according to its contribution to the agent's performance, which ensures that criteria useful in the current state are weighted more heavily. The proposed replay strategy takes both sample informativeness and diversity into consideration, which significantly boosts the learning ability and speed of the game agent. Experimental results show that PERMAB accelerates network learning and achieves better performance than baseline algorithms on seven benchmark environments of varying difficulty.
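The abstract describes PERMAB only at a high level, so the sketch below is a minimal illustration rather than the paper's algorithm: it assumes two priority criteria (TD-error magnitude for informativeness, recency for diversity), treats each criterion as a bandit arm, and uses an exponential-weights (EXP3-style) update driven by the change in episodic return. The class name PERMABSketch, the step size eta, and the proportional credit-assignment rule are all illustrative assumptions.

```python
import numpy as np

class PERMABSketch:
    """Bandit-weighted replay sampling (illustrative sketch, not the
    paper's exact formulation).

    Each arm is a priority criterion; an exponential-weights update
    adapts the mixing coefficients episode by episode, using the
    change in episodic return as the bandit reward.
    """

    def __init__(self, n_criteria, eta=0.1):
        self.log_weights = np.zeros(n_criteria)  # one weight per criterion
        self.eta = eta                           # bandit step size (assumed)
        self.last_return = None                  # previous episodic return

    def criterion_weights(self):
        # Softmax of the bandit weights -> mixing coefficients.
        w = np.exp(self.log_weights - self.log_weights.max())
        return w / w.sum()

    def sample(self, buffer, scores, batch_size, rng):
        """Sample a batch; `scores` is an (n_criteria, len(buffer)) array
        of nonnegative per-criterion priorities (e.g., |TD error|, recency)."""
        mix = self.criterion_weights()
        priority = mix @ scores            # convex combination of criteria
        probs = priority / priority.sum()  # sampling distribution
        idx = rng.choice(len(buffer), size=batch_size, p=probs)
        # Returning probs[idx] permits an importance-sampling correction,
        # as in standard prioritized experience replay.
        return [buffer[i] for i in idx], probs[idx]

    def update(self, episode_return):
        # Credit the return improvement to each criterion in proportion
        # to its current mixing weight (a simplifying assumption).
        if self.last_return is not None:
            gain = episode_return - self.last_return
            self.log_weights += self.eta * gain * self.criterion_weights()
        self.last_return = episode_return
```

A typical loop would call sample() once per gradient step and update() once per episode, e.g. sampler = PERMABSketch(n_criteria=2); batch, probs = sampler.sample(buffer, scores, 32, np.random.default_rng(0)); sampler.update(episode_return).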
Pages: 11
Related Papers
50 items in total
  • [41] A Multi-Armed Bandit Hyper-Heuristic
    Ferreira, Alexandre Silvestre
    Goncalves, Richard Aderbal
    Ramirez Pozo, Aurora Trinidad
    2015 BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS 2015), 2015, : 13 - 18
  • [42] Bridging Adversarial and Nonstationary Multi-Armed Bandit
    Chen, Ningyuan
    Yang, Shuoguang
    Zhang, Hailun
PRODUCTION AND OPERATIONS MANAGEMENT, 2025
  • [43] Variational inference for the multi-armed contextual bandit
    Urteaga, Inigo
    Wiggins, Chris H.
INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 84, 2018
  • [44] Automatic Quality of Experience Management for WLAN Networks using Multi-Armed Bandit
    Moura, Henrique D.
    Macedo, Daniel Fernandes
    Vieira, Marcos A. M.
    2019 IFIP/IEEE SYMPOSIUM ON INTEGRATED NETWORK AND SERVICE MANAGEMENT (IM), 2019, : 279 - 288
  • [45] Opportunistic Spectrum Access Based on a Constrained Multi-Armed Bandit Formulation
    Ai, Jing
    Abouzeid, Alhussein A.
    JOURNAL OF COMMUNICATIONS AND NETWORKS, 2009, 11 (02) : 134 - 147
  • [46] Research on Modelling Single Keyword Selection Based on Multi-armed Bandit
    Zhou, Baojian
    Qi, Wei
    Chen, Ligang
    2ND INTERNATIONAL CONFERENCE ON COMMUNICATION AND TECHNOLOGY (ICCT 2015), 2015, : 266 - 273
  • [47] Multi-Armed Bandit-Based User Network Node Selection
    Gao, Qinyan
    Xie, Zhidong
    SENSORS, 2024, 24 (13)
  • [48] Personalized clinical trial based on multi-armed bandit algorithms with covariates
    Shao, Yifei
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ALGORITHMS, SOFTWARE ENGINEERING, AND NETWORK SECURITY, ASENS 2024, 2024, : 12 - 17
  • [49] Multi-Armed Bandit-Based Client Scheduling for Federated Learning
    Xia, Wenchao
    Quek, Tony Q. S.
    Guo, Kun
    Wen, Wanli
    Yang, Howard H.
    Zhu, Hongbo
    IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, 2020, 19 (11) : 7108 - 7123
  • [50] Thompson Sampling Based Mechanisms for Stochastic Multi-Armed Bandit Problems
    Ghalme, Ganesh
    Jain, Shweta
    Gujar, Sujit
    Narahari, Y.
    AAMAS'17: PROCEEDINGS OF THE 16TH INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS AND MULTIAGENT SYSTEMS, 2017, : 87 - 95