Generalized Maximum Entropy Reinforcement Learning via Reward Shaping

Cited by: 2
Authors
Tao F. [1 ]
Wu M. [2 ]
Cao Y. [2 ]
Affiliations
[1] Volvo Car Technology USA LLC, Sunnyvale, CA 94085
[2] Department of Electrical Engineering, University of Texas, San Antonio, TX 78249
Keywords
Entropy; reinforcement learning (RL); reward shaping
DOI
10.1109/TAI.2023.3297988
Abstract
Entropy regularization is a commonly used technique in reinforcement learning to improve exploration and cultivate a better pre-trained policy for later adaptation. Recent studies further show that entropy regularization can smooth the optimization landscape and simplify the policy optimization process, indicating the value of integrating entropy into reinforcement learning. However, existing studies only consider the policy's entropy at the current state as an extra regularization term in the policy gradient or in the objective function, without formally integrating the entropy into the reward function. In this article, we propose a shaped reward that incorporates the agent's policy entropy into the reward function. In particular, the agent's expected entropy over the distribution of the next state is added to the immediate reward associated with the current state. Adding the agent's expected policy entropy over the next-state distribution is shown to yield a new soft Q-function and state function that are concise and modular. Moreover, the new reinforcement learning framework can be easily applied to existing standard reinforcement learning algorithms, such as deep Q-network (DQN) and proximal policy optimization (PPO), while inheriting the benefits of employing entropy regularization. We further present a soft stochastic policy gradient theorem based on the shaped reward and propose a new practical reinforcement learning algorithm. Finally, experimental studies conducted in MuJoCo environments demonstrate that our method can outperform soft actor-critic (SAC), an existing state-of-the-art off-policy maximum entropy reinforcement learning approach, by 5%-150% in terms of average return. © 2020 IEEE.
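The abstract's construction can be read as a reward-shaping rule: the immediate reward r(s, a) is augmented with the expected entropy of the policy over the distribution of the next state, r'(s, a) = r(s, a) + α E_{s'~P(·|s,a)}[H(π(·|s'))]. The minimal sketch below illustrates this for a discrete, tabular setting; the function names, the dictionary representation of P(s'|s, a), and the weight alpha are illustrative assumptions based on the abstract, not the authors' implementation.

import numpy as np

def policy_entropy(action_probs):
    # Shannon entropy H(pi(.|s)) of a discrete action distribution.
    p = np.clip(np.asarray(action_probs, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def shaped_reward(r, next_state_dist, policy, alpha=0.2):
    # Immediate reward r(s, a) plus alpha times the expected policy entropy
    # over the next-state distribution, as described in the abstract.
    #   next_state_dist: dict mapping next state s' -> P(s' | s, a)
    #   policy: callable returning the action-probability vector pi(.|s')
    #   alpha: entropy weight (hypothetical; not specified in this record)
    expected_entropy = sum(
        p_sp * policy_entropy(policy(sp)) for sp, p_sp in next_state_dist.items()
    )
    return r + alpha * expected_entropy

# Toy usage: a uniform policy over two actions at either possible next state.
uniform_policy = lambda s: [0.5, 0.5]
print(shaped_reward(1.0, {"s1": 0.7, "s2": 0.3}, uniform_policy))  # 1.0 + 0.2 * ln 2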
Pages: 1563-1572
Number of pages: 9
Related Papers
50 entries in total
  • [31] Bid optimization using maximum entropy reinforcement learning
    Liu, Mengjuan
    Liu, Jinyu
    Hu, Zhengning
    Ge, Yuchen
    Nie, Xuyun
    NEUROCOMPUTING, 2022, 501 : 529 - 543
  • [32] Maximum causal entropy inverse constrained reinforcement learning
    Baert, Mattijs
    Mazzaglia, Pietro
    Leroux, Sam
    Simoens, Pieter
    MACHINE LEARNING, 2025, 114 (04)
  • [33] A Maximum Entropy Deep Reinforcement Learning Neural Tracker
    Balaram, Shafa
    Arulkumaran, Kai
    Dai, Tianhong
    Bharath, Anil Anthony
    MACHINE LEARNING IN MEDICAL IMAGING (MLMI 2019), 2019, 11861 : 400 - 408
  • [34] Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications
    Ibrahim, Sinan
    Mostafa, Mostafa
    Jnadi, Ali
    Salloum, Hadi
    Osinenko, Pavel
    IEEE ACCESS, 2024, 12 : 175473 - 175500
  • [35] A new Potential-Based Reward Shaping for Reinforcement Learning Agent
    Badnava, Babak
    Esmaeili, Mona
    Mozayani, Nasser
    Zarkesh-Ha, Payman
    2023 IEEE 13TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE, CCWC, 2023, : 630 - 635
  • [36] Subgoal-Based Reward Shaping to Improve Efficiency in Reinforcement Learning
    Okudo, Takato
    Yamada, Seiji
    IEEE ACCESS, 2021, 9 : 97557 - 97568
  • [37] An Improvement on Mapless Navigation with Deep Reinforcement Learning: A Reward Shaping Approach
    Alipanah, Arezoo
    Moosavian, S. Ali A.
    2022 10TH RSI INTERNATIONAL CONFERENCE ON ROBOTICS AND MECHATRONICS (ICROM), 2022, : 261 - 266
  • [38] MEDIRL: Predicting the Visual Attention of Drivers via Maximum Entropy Deep Inverse Reinforcement Learning
    Baee, Sonia
    Pakdamanian, Erfan
    Kim, Inki
    Feng, Lu
    Ordonez, Vicente
    Barnes, Laura
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 13158 - 13168
  • [39] Sparse online maximum entropy inverse reinforcement learning via proximal optimization and truncated gradient
    Song L.
    Li D.
    Xu X.
    KNOWLEDGE-BASED SYSTEMS, 2022, 252
  • [40] Energy management strategy via maximum entropy reinforcement learning for an extended range logistics vehicle
    Xiao, Boyi
    Yang, Weiwei
    Wu, Jiamin
    Walker, Paul D.
    Zhang, Nong
    ENERGY, 2022, 253