Model-Free Trajectory-based Policy Optimization with Monotonic Improvement

Cited by: 0
Authors
Akrour, Riad [1 ]
Abdolmaleki, Abbas [2 ]
Abdulsamad, Hany [1 ]
Peters, Jan [1 ,3 ]
Neumann, Gerhard [1 ,4 ]
Affiliations
[1] Tech Univ Darmstadt, CLAS IAS, Hsch Str 10, D-64289 Darmstadt, Germany
[2] DeepMind, London N1C 4AG, England
[3] Max Planck Inst Intelligent Syst, Max Planck Ring 4, Tübingen, Germany
[4] Univ Lincoln, L CAS, Lincoln LN6 7TS, England
Funding
EU Horizon 2020;
Keywords
Reinforcement Learning; Policy Optimization; Trajectory Optimization; Robotics;
DOI
Not available
CLC Classification
TP [Automation Technology; Computer Technology];
Discipline Code
0812;
Abstract
Many recent trajectory optimization algorithms alternate between a linear approximation of the system dynamics around the mean trajectory and a conservative policy update. One way of constraining the policy change is to bound the Kullback-Leibler (KL) divergence between successive policies. These approaches have already demonstrated great experimental success on challenging problems such as end-to-end control of physical systems. However, the linear approximation of the system dynamics can introduce a bias in the policy update and prevent convergence to the optimal policy. In this article, we propose a new model-free trajectory-based policy optimization algorithm with guaranteed monotonic improvement. Instead of a model of the system dynamics, the algorithm backpropagates a local, quadratic, and time-dependent Q-function learned from trajectory data. Our policy update ensures exact satisfaction of the KL constraint without simplifying assumptions on the system dynamics. We experimentally demonstrate on highly non-linear control tasks that our algorithm outperforms approaches that linearize the system dynamics. To establish the monotonic improvement of our algorithm, we additionally conduct a theoretical analysis of our policy update scheme and derive a lower bound on the change in policy return between successive iterations.
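The two algorithmic ingredients named in the abstract are (i) a local, quadratic, time-dependent Q-function fitted from trajectory data and (ii) an exactly KL-constrained Gaussian policy update. The sketch below is a minimal, illustrative Python rendering of one such update at a single time step, not the paper's reference implementation: it assumes an exponential-reweighting form pi_new ∝ pi_old · exp(Q/eta) with the temperature eta found by bisection so the KL bound is met, and all function names, the ridge regularizer, and the bisection bracket are assumptions made here for illustration.

```python
# Minimal sketch (assumed, not the paper's reference implementation) of a
# model-free, KL-constrained Gaussian policy update at a single time step,
# driven by a quadratic Q-function fitted from sampled actions and returns.
import numpy as np


def fit_quadratic_q(actions, returns, ridge=1e-6):
    """Fit Q(a) ~= a^T A a + b^T a + c by ridge regression on quadratic features."""
    n, d = actions.shape
    iu = np.triu_indices(d)                       # upper-triangle monomials a_i * a_j
    feats = np.hstack([
        (actions[:, :, None] * actions[:, None, :])[:, iu[0], iu[1]],
        actions,
        np.ones((n, 1)),
    ])
    w = np.linalg.solve(feats.T @ feats + ridge * np.eye(feats.shape[1]),
                        feats.T @ returns)
    A = np.zeros((d, d))
    A[iu] = w[:len(iu[0])]
    A = 0.5 * (A + A.T)                           # symmetric quadratic term
    b = w[len(iu[0]):len(iu[0]) + d]
    return A, b


def kl_gauss(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) )."""
    d = len(mu0)
    S1i = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1i @ S0) + diff @ S1i @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))


def kl_constrained_update(mu, Sigma, A, b, epsilon, iters=60):
    """Soft-max update pi_new(a) ~ pi_old(a) * exp(Q(a) / eta).

    For a Gaussian pi_old and quadratic Q the updated policy stays Gaussian in
    closed form; eta is found by bisection so that KL(pi_new || pi_old) meets
    the bound epsilon (approximately) with equality.
    """
    P = np.linalg.inv(Sigma)                      # old precision matrix

    def updated(eta):
        P_new = P - 2.0 * A / eta                 # new precision
        np.linalg.cholesky(P_new)                 # raises LinAlgError if not pos. def.
        S_new = np.linalg.inv(P_new)
        mu_new = S_new @ (P @ mu + b / eta)       # new mean
        return mu_new, S_new

    lo, hi = 1e-6, 1e6                            # assumed bracket for the temperature
    for _ in range(iters):
        eta = np.sqrt(lo * hi)
        try:
            mu_new, S_new = updated(eta)
            kl = kl_gauss(mu_new, S_new, mu, Sigma)
        except np.linalg.LinAlgError:
            kl = np.inf
        if kl > epsilon:
            lo = eta                              # step too greedy: raise temperature
        else:
            hi = eta                              # feasible: try a greedier step
    return updated(hi)                            # hi always satisfies KL <= epsilon


# Toy usage: one update on Monte-Carlo returns of a quadratic objective.
rng = np.random.default_rng(0)
mu, Sigma = np.zeros(2), np.eye(2)
actions = rng.multivariate_normal(mu, Sigma, size=200)
returns = -np.sum((actions - 1.0) ** 2, axis=1)   # optimum at a = (1, 1)
A, b = fit_quadratic_q(actions, returns)
mu_new, Sigma_new = kl_constrained_update(mu, Sigma, A, b, epsilon=0.1)
```

In the toy example, repeating the update moves the Gaussian's mean toward the optimum while the KL bound limits each step; the paper's actual algorithm additionally backpropagates the time-dependent Q-function along the trajectory, which is omitted in this single-step sketch.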
Pages: 25
Related Papers
50 in total
  • [21] Lévy Flight Trajectory-Based Whale Optimization Algorithm for Global Optimization
    Ling, Ying
    Zhou, Yongquan
    Luo, Qifang
    IEEE ACCESS, 2017, 5 : 6168 - 6186
  • [22] Trajectory-based Probabilistic Policy Gradient for Learning Locomotion Behaviors
    Choi, Sungjoon
    Kim, Joohyung
    2019 INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2019, : 1 - 7
  • [23] Trajectory-Based Off-Policy Deep Reinforcement Learning
    Doerr, Andreas
    Volpp, Michael
    Toussaint, Marc
    Trimpe, Sebastian
    Daniel, Christian
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [24] Monotonic Robust Policy Optimization with Model Discrepancy
    Jiang, Yuankun
    Li, Chenglin
    Dai, Wenrui
    Zou, Junni
    Xiong, Hongkai
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [25] Model-Free Optimization Based Feedforward Control for an Inkjet Printhead
    Ezzeldin, M.
    van den Bosch, P. P. J.
    Jokic, A.
    Waarsing, R.
    2010 IEEE INTERNATIONAL CONFERENCE ON CONTROL APPLICATIONS, 2010, : 967 - 972
  • [26] Model-free optimization in cement plants
    Holmes, DS
    IEEE-IAS/PCA 2003 CEMENT INDUSTRY TECHNICAL CONFERENCE, CONFERENCE RECORD, 2003, : 159 - 173
  • [27] Model-Free Nonlinear Feedback Optimization
    He, Zhiyu
    Bolognani, Saverio
    He, Jianping
    Dorfler, Florian
    Guan, Xinping
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2024, 69 (07) : 4554 - 4569
  • [28] Trajectory Tracking for Autonomous Underwater Vehicle Based on Model-Free Predictive Control
    Xu, Weiwei
    Xiao, Yuchen
    Li, Hongran
    Zhang, Jian
    Zhang, Heng
    2019 IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE SWITCHING AND ROUTING (IEEE HPSR), 2019,
  • [29] Model-Free Trajectory Optimisation for Wireless Data Ferries
    Pearre, Ben
    IEEE LOCAL COMPUTER NETWORK CONFERENCE, 2010, : 777 - 784
  • [30] Model-Free and Model-Based Policy Evaluation when Causality is Uncertain
    Bruns-Smith, David
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139