Generalized Maximum Entropy Reinforcement Learning via Reward Shaping

Cited by: 2
Authors
Tao F. [1 ]
Wu M. [2 ]
Cao Y. [2 ]
Affiliations
[1] Volvo Car Technology USA LLC, Sunnyvale, CA 94085
[2] Department of Electrical Engineering, University of Texas at San Antonio, San Antonio, TX 78249
Source
IEEE Transactions on Artificial Intelligence
Keywords
Entropy; reinforcement learning (RL); reward shaping
DOI
10.1109/TAI.2023.3297988
Abstract
Entropy regularization is a commonly used technique in reinforcement learning to improve exploration and to cultivate a better pre-trained policy for later adaptation. Recent studies further show that entropy regularization can smooth the optimization landscape and simplify the policy optimization process, underscoring the value of integrating entropy into reinforcement learning. However, existing studies consider the policy's entropy at the current state only as an extra regularization term in the policy gradient or in the objective function, without formally integrating the entropy into the reward function. In this article, we propose a shaped reward that incorporates the agent's policy entropy into the reward function. In particular, the agent's expected entropy over the next-state distribution is added to the immediate reward associated with the current state. This addition is shown to yield a new soft Q-function and state-value function that are concise and modular. Moreover, the new reinforcement learning framework can be readily applied to existing standard reinforcement learning algorithms, such as deep Q-network (DQN) and proximal policy optimization (PPO), while inheriting the benefits of entropy regularization. We further present a soft stochastic policy gradient theorem based on the shaped reward and propose a new practical reinforcement learning algorithm. Finally, experimental studies conducted in the MuJoCo environment demonstrate that our method outperforms an existing state-of-the-art off-policy maximum entropy reinforcement learning approach, soft actor-critic (SAC), by 5%-150% in terms of average return. © 2023 IEEE.
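As a reading aid, a minimal sketch of the shaped reward described in the abstract, written in generic maximum-entropy notation; the symbols r (reward), alpha (temperature), p (transition kernel), pi (policy), and H (entropy) are conventional choices here, not notation quoted from the paper, and the exact weighting of the entropy term (e.g., whether a discount factor also multiplies it) follows the paper itself:

    \tilde{r}(s_t, a_t) = r(s_t, a_t)
        + \alpha \, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}
          \left[ \mathcal{H}\big( \pi(\cdot \mid s_{t+1}) \big) \right],
    \qquad
    \mathcal{H}\big( \pi(\cdot \mid s) \big)
        = - \, \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ \log \pi(a \mid s) \right].

Because the entropy bonus is folded into the scalar reward itself rather than appended to the objective, any algorithm that consumes rewards unchanged (e.g., DQN or PPO) can substitute \tilde{r} for r, which is how the abstract's claim of easy integration with standard methods should be read.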
Pages: 1563-1572
Number of pages: 10
Related papers (50 in total; items [41]-[50] shown)
  • [41] Energy management strategy via maximum entropy reinforcement learning for an extended range logistics vehicle
    Xiao, Boyi
    Yang, Weiwei
    Wu, Jiamin
    Walker, Paul D.
    Zhang, Nong
    ENERGY, 2022, 253
  • [42] Maximum Entropy Semi-Supervised Inverse Reinforcement Learning
    Audiffren, Julien
    Valko, Michal
    Lazaric, Alessandro
    Ghavamzadeh, Mohammad
    PROCEEDINGS OF THE TWENTY-FOURTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI), 2015, : 3315 - 3321
  • [43] MaxEnt Dreamer: Maximum Entropy Reinforcement Learning with World Model
    Ma, Hongying
    Xue, Wuyang
    Ying, Rendong
    Liu, PeiLin
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [44] Comparison and Deduction of Maximum Entropy Deep Inverse Reinforcement Learning
    Chen, Guannan
    Fu, Yanfang
    Liu, Yu
    Dang, Xiangbin
    Hao, Jiajun
    Liu, Xinchen
2023 IEEE 2ND INDUSTRIAL ELECTRONICS SOCIETY ANNUAL ON-LINE CONFERENCE, ONCON, 2023
  • [45] Adaptive generative adversarial maximum entropy inverse reinforcement learning
    Song, Li
    Li, Dazi
    Xu, Xin
    INFORMATION SCIENCES, 2025, 695
  • [46] Maximum-Entropy Progressive State Aggregation for Reinforcement Learning
    Mavridis, Christos N.
    Suriyarachchi, Nilesh
    Baras, John S.
    2021 60TH IEEE CONFERENCE ON DECISION AND CONTROL (CDC), 2021, : 5144 - 5149
  • [47] A Study of Continuous Maximum Entropy Deep Inverse Reinforcement Learning
    Chen, Xi-liang
    Cao, Lei
    Xu, Zhi-xiong
    Lai, Jun
    Li, Chen-xi
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2019, 2019
  • [48] Bi-level Optimization Method for Automatic Reward Shaping of Reinforcement Learning
    Wang, Ludi
    Wang, Zhaolei
    Gong, Qinghai
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 382 - 393
  • [49] Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks
    Jiang, Yuqian
    Bharadwaj, Suda
    Wu, Bo
    Shah, Rishi
    Topcu, Ufuk
    Stone, Peter
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 7995 - 8003
  • [50] A Reward Shaping Approach for Reserve Price Optimization using Deep Reinforcement Learning
    Afshar, Reza Refaei
    Rhuggenaath, Jason
    Zhang, Yingqian
    Kaymak, Uzay
2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021