Sample strategy based on TD-error for offline reinforcement learning

Cited by: 0

Authors:

Zhang L. [1]

Feng Y. [1]

Liang X. [1]

Liu S. [1]

Cheng G. [1]

Huang J. [1]

Affiliation:

[1] College of Systems Engineering, National University of Defense Technology, Changsha

Source:

Gongcheng Kexue Xuebao/Chinese Journal of Engineering | 2023, Vol. 45, No. 12

Keywords:

experience replay buffer; offline; reinforcement learning; sample strategy; TD-error

DOI:

10.13374/j.issn2095-9389.2022.10.22.001

Abstract:

Offline reinforcement learning learns a policy from pre-collected expert data or other logged experience, without interacting with the environment. Compared with online reinforcement learning, it has lower interaction costs and lower trial-and-error risk. However, because Q-value estimation errors cannot be corrected promptly through interaction with the environment, offline reinforcement learning often suffers from severe extrapolation error and low sample utilization. To address this, this paper proposes a sampling strategy for offline reinforcement learning based on the temporal-difference error (TD-error). The TD-error serves as the priority measure for prioritized sampling, and prioritized sampling is combined with uniform sampling to improve sampling efficiency and mitigate out-of-distribution error. In addition, using a double Q-value estimation network, the paper compares the algorithms obtained under three ways of computing the target value, and hence the TD-error: the minimum, the maximum, and a convex combination of the two Q-network outputs. To remove the training bias introduced by prioritized sampling, the paper uses an importance sampling mechanism. Compared on the D4RL benchmark with existing offline reinforcement learning methods that incorporate sampling strategies, the proposed algorithm performs better in final performance, data efficiency, and training stability. Two ablation experiments were conducted to confirm the contribution of each component.
Experiment 1 shows that combining uniform sampling with prioritized sampling outperforms either uniform sampling alone or prioritized sampling alone in sample utilization and policy stability. Experiment 2 compares three TD-error calculation methods based on the double Q-value estimation network: the maximum, the minimum, and a convex combination of the maximum and minimum of the two network outputs. The results show that the variant using the minimum of the two networks performs better overall and in data utilization than the other two, but its policy variance is higher. The approach can be combined with any offline reinforcement learning method based on Q-value estimation. It offers stable performance, straightforward implementation, and high scalability, and it supports the use of reinforcement learning in real-world settings. © 2023 Science Press. All rights reserved.
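
The ingredients named in the abstract (TD-error priorities, a mix of prioritized and uniform sampling, importance-sampling correction, and three ways of forming the target from two Q-estimates) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the mixture rule, the exponents `alpha` and `beta`, the mixing weight `uniform_frac`, the convex coefficient `lam`, and all function names are assumptions chosen for illustration, loosely following common prioritized-replay practice.

```python
import random


def td_target(reward, done, q1_next, q2_next, gamma=0.99, mode="min", lam=0.75):
    """TD target built from two target-network estimates of Q(s', a').

    mode selects the minimum, the maximum, or a convex combination of the
    two estimates; `lam` (hypothetical) weights the minimum in the latter.
    """
    q_min, q_max = min(q1_next, q2_next), max(q1_next, q2_next)
    if mode == "min":
        q_next = q_min
    elif mode == "max":
        q_next = q_max
    elif mode == "convex":
        q_next = lam * q_min + (1.0 - lam) * q_max
    else:
        raise ValueError(f"unknown mode: {mode}")
    # No bootstrapping from terminal transitions.
    return reward + gamma * (0.0 if done else 1.0) * q_next


def priorities_from_td_errors(td_errors, alpha=0.6, eps=1e-6):
    """Priority grows with |TD-error|; eps keeps every sample reachable."""
    return [(abs(d) + eps) ** alpha for d in td_errors]


def sample_mixed(priorities, batch_size, uniform_frac=0.5, beta=0.4, rng=None):
    """Draw indices from a mixture of uniform and prioritized sampling.

    Assumption: the two schemes are combined as one mixture distribution.
    Returns the indices and importance-sampling weights (normalized so the
    largest weight in the batch is 1) that offset the non-uniform sampling.
    """
    rng = rng or random.Random()
    n = len(priorities)
    total = sum(priorities)
    probs = [uniform_frac / n + (1.0 - uniform_frac) * p / total
             for p in priorities]
    idx = rng.choices(range(n), weights=probs, k=batch_size)
    raw = [(n * probs[i]) ** (-beta) for i in idx]
    w_max = max(raw)
    return idx, [w / w_max for w in raw]


# Usage: priorities from TD-errors, then a mixed batch with its weights.
pri = priorities_from_td_errors([0.0, 1.0, -2.0, 0.5])
batch_idx, is_weights = sample_mixed(pri, batch_size=8, rng=random.Random(0))
```

In a training loop, the per-sample loss would be multiplied by `is_weights` and the priorities of the sampled indices refreshed from the new TD-errors. Setting `uniform_frac=1.0` recovers plain uniform sampling and `uniform_frac=0.0` recovers pure prioritized sampling, which correspond to the two baselines the first ablation compares against.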

Pages: 2118-2128

Number of pages: 10
