Proximal Policy Optimization with Entropy Regularization

Cited by: 0
Author
Shen, Yuqing [1]
Affiliation
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
reinforcement learning; policy gradient; entropy regularization;
DOI
10.1109/ICCCR61138.2024.10585473
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
This study presents a revision of the Proximal Policy Optimization (PPO) algorithm, aimed primarily at improving the stability of PPO during training while maintaining a balance between exploration and exploitation. Recognizing the inherent difficulty of achieving this balance in complex environments, the proposed method adopts an entropy regularization technique similar to the one used in the Asynchronous Advantage Actor-Critic (A3C) algorithm. The main purpose of this design is to encourage exploration in the early stages of training, preventing the agent from prematurely converging to a sub-optimal policy. A detailed theoretical explanation of how the entropy term improves the robustness of the learning trajectory is provided. Experimental results demonstrate that the revised PPO not only retains the original strengths of the PPO algorithm but also shows a significant improvement in training stability. This work contributes to ongoing research in reinforcement learning and offers a promising direction for future work on applying PPO in environments with complicated dynamics.
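As a rough illustration of the technique described in the abstract (not the authors' code), the minimal PyTorch sketch below computes a PPO clipped surrogate loss with an added entropy bonus for a discrete-action policy. The function name ppo_entropy_loss and the coefficients clip_eps and ent_coef are illustrative assumptions, not values reported in the paper.

import torch

def ppo_entropy_loss(logits, old_log_probs, actions, advantages,
                     clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO surrogate loss minus an entropy bonus (to be minimized)."""
    dist = torch.distributions.Categorical(logits=logits)
    new_log_probs = dist.log_prob(actions)

    # Probability ratio between the current policy and the old (data-collecting) policy.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Standard PPO clipped surrogate objective.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Entropy bonus: subtracting it from the loss encourages exploration,
    # which is the role the abstract attributes to the entropy term.
    entropy = dist.entropy().mean()
    return policy_loss - ent_coef * entropy

# Toy usage with random tensors, only to show the expected shapes.
if __name__ == "__main__":
    logits = torch.randn(8, 4, requires_grad=True)  # batch of 8, 4 discrete actions
    actions = torch.randint(0, 4, (8,))
    old_log_probs = torch.randn(8)                  # log-probs under the behavior policy
    advantages = torch.randn(8)
    loss = ppo_entropy_loss(logits, old_log_probs, actions, advantages)
    loss.backward()
    print(float(loss))

In this formulation, a larger ent_coef keeps the policy distribution closer to uniform early in training, matching the abstract's stated goal of delaying premature convergence to a sub-optimal policy.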
Pages: 380 - 383
Number of pages: 4
Related Papers (50 in total)
  • [1] Entropy adjustment by interpolation for exploration in Proximal Policy Optimization (PPO)
    Boudlal, Ayoub
    Khafaji, Abderahim
    Elabbadi, Jamal
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 133
  • [2] Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization
    Gao, Yuanxiang
    Chen, Li
    Li, Baochun
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [3] An inexact proximal regularization method for unconstrained optimization
    Paul Armand
    Isaï Lankoandé
    Mathematical Methods of Operations Research, 2017, 85 : 43 - 59
  • [4] Proximal Policy Optimization With Policy Feedback
    Gu, Yang
    Cheng, Yuhu
    Chen, C. L. Philip
    Wang, Xuesong
    IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2022, 52 (07) : 4600 - 4610
  • [5] An inexact proximal regularization method for unconstrained optimization
    Armand, Paul
    Lankoande, Isai
    MATHEMATICAL METHODS OF OPERATIONS RESEARCH, 2017, 85 (01) : 43 - 59
  • [6] Trust region policy optimization via entropy regularization for Kullback-Leibler divergence constraint
    Xu, Haotian
    Xuan, Junyu
    Zhang, Guangquan
    Lu, Jie
    NEUROCOMPUTING, 2024, 589
  • [7] Coordinated Proximal Policy Optimization
    Wu, Zifan
    Yu, Chao
    Ye, Deheng
    Zhang, Junge
    Piao, Haiyin
    Zhuo, Hankz Hankui
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [8] Truly Proximal Policy Optimization
    Wang, Yuhui
    He, Hao
    Tan, Xiaoyang
    35TH UNCERTAINTY IN ARTIFICIAL INTELLIGENCE CONFERENCE (UAI 2019), 2020, 115 : 113 - 122
  • [9] Fast Proximal Policy Optimization
    Zhao, Weiqi
    Jiang, Haobo
    Xie, Jin
    PATTERN RECOGNITION, ACPR 2021, PT II, 2022, 13189 : 73 - 86
  • [10] Off-Policy Proximal Policy Optimization
    Meng, Wenjia
    Zheng, Qian
    Pan, Gang
    Yin, Yilong
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023 : 9162 - 9170