Offline Deterministic Actor-Critic Based on Uncertainty Estimation

Citations: 0
Authors
Feng H.-T. [1,2]
Cheng Y.-H. [1]
Wang X.-S. [1]
Affiliations
[1] School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu
[2] School of Intelligent Engineering, Jiangsu Vocational College of Information Technology, Wuxi, Jiangsu
Keywords
Actor-Critic; convex combination; offline reinforcement learning; out-of-distribution sampling; uncertainty estimation
DOI
10.11897/SP.J.1016.2024.00717
Abstract
Actor-critic is a reinforcement learning method that learns a policy from samples collected through online trial-and-error interaction with the environment, and it is an effective tool for solving sequential perception and decision-making problems. However, the active learning paradigm of online interaction raises cost and safety issues when samples must be collected in complex real-world environments. Offline reinforcement learning, as a data-driven reinforcement learning paradigm, emphasizes learning a policy from a static sample dataset without exploratory interaction with the environment; it has become a research hotspot in recent years and provides a feasible route to real-world deployment in applications such as robotics, autonomous driving, and healthcare. At present, offline reinforcement learning methods face the challenge of distribution shift between the learned policy and the behavior policy, which produces extrapolation errors in the value function estimates for actions that are out-of-distribution (OOD) with respect to the static dataset. These extrapolation errors accumulate through the Bellman bootstrapping operation, leading to performance degradation or even non-convergence of offline reinforcement learning. To deal with the distribution shift problem, policy constraints or value-function regularization are usually employed to restrict the agent's access to OOD actions, which may result in overly conservative learning and hinder both the generalization of the value function network and the improvement of the policy. To this end, an offline deterministic actor-critic method based on uncertainty estimation (ODACUE) is proposed, which balances the generalization and conservatism of value function learning by means of uncertainty estimation and OOD sampling. Firstly, for the deterministic policy, an uncertainty estimation operator is defined according to the different ways of estimating the Q value function for in-dataset and OOD actions. The value function of in-dataset actions is estimated via the Bellman bootstrapping operation together with ensemble uncertainty estimation, whereas the value function of OOD actions is estimated from a pseudo-target constructed by ensemble uncertainty estimation and an OOD sampling method. The pessimism of the uncertainty estimation operator is analyzed theoretically with ξ-uncertainty estimation theory: with appropriately chosen parameters, the Q value function learned under the uncertainty estimation operator is a pessimistic estimate of the optimal Q value function. Then, by applying the uncertainty estimation operator to the deterministic actor-critic framework, the objective function for critic learning is constructed as a convex combination of the in-dataset and OOD action value function terms, so that the convex combination coefficient balances the conservative constraint and the generalization of value function learning. Moreover, during in-dataset action value learning, the uncertainty estimation operator is implemented by the critic target network; during OOD action value learning, OOD sampling is performed by the actor main network and the uncertainty estimation operator is implemented by the critic main network. Finally, ODACUE and several state-of-the-art baseline algorithms are evaluated on the D4RL benchmark.
Experimental results show that, compared with the baseline algorithms, ODACUE improves overall performance on the 11 datasets of different quality levels by at least 9.56% and at most 64.92%. In addition, parameter analysis and ablation experiments further validate the stability and generalization ability of ODACUE. © 2024 Science Press. All rights reserved.
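To make the critic objective described in the abstract concrete, the following is a minimal PyTorch sketch of an uncertainty-penalized ensemble critic update that combines an in-dataset Bellman term with an OOD pseudo-target term through a convex combination. This is an illustrative sketch, not the authors' implementation: the names QEnsemble, Actor, lcb, beta, lam, and ood_noise, as well as the lower-confidence-bound form of the ensemble uncertainty penalty (mean minus beta times standard deviation) and Gaussian-perturbed OOD sampling, are assumptions made for this example.

```python
# Illustrative sketch (assumed details, not the paper's code): ensemble critic
# update with a convex combination of in-dataset and OOD value terms.
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic policy: maps states to actions in [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)


class QEnsemble(nn.Module):
    """Ensemble of K independent Q-networks Q_k(s, a)."""
    def __init__(self, state_dim, action_dim, k=5, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            ) for _ in range(k)
        ])

    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        return torch.stack([m(x) for m in self.members], dim=0)  # (K, B, 1)


def lcb(q, beta):
    """Pessimistic ensemble estimate as a lower confidence bound: mean - beta * std."""
    return q.mean(dim=0) - beta * q.std(dim=0)


def critic_loss(critic, critic_target, actor, batch,
                gamma=0.99, beta=1.0, lam=0.75, ood_noise=0.3):
    # Batch tensors come from the static offline dataset; r and done are (B, 1).
    s, a, r, s_next, done = batch

    # In-dataset term: Bellman bootstrapping with the critic *target* ensemble.
    with torch.no_grad():
        a_next = actor(s_next)
        y_in = r + gamma * (1.0 - done) * lcb(critic_target(s_next, a_next), beta)
    q_in = critic(s, a)                                    # (K, B, 1)
    loss_in = ((q_in - y_in.unsqueeze(0)) ** 2).mean()

    # OOD term: actions sampled around the actor *main* network; the pseudo-target
    # is the pessimistic ensemble estimate from the critic *main* network.
    with torch.no_grad():
        a_ood = (actor(s) + ood_noise * torch.randn_like(a)).clamp(-1.0, 1.0)
        y_ood = lcb(critic(s, a_ood), beta)
    q_ood = critic(s, a_ood)
    loss_ood = ((q_ood - y_ood.unsqueeze(0)) ** 2).mean()

    # Convex combination: lam in [0, 1] trades off the bootstrapped in-dataset
    # term against the pessimistic OOD term (generalization vs. conservatism).
    return lam * loss_in + (1.0 - lam) * loss_ood
```

In this sketch, pulling each ensemble member toward the lower confidence bound at OOD actions plays the role of the pessimistic pseudo-target described above; other ensemble uncertainty penalties would fit the same convex-combination structure.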
Pages: 717-732
Page count: 15