Model-Free Trajectory-based Policy Optimization with Monotonic Improvement

Cited by: 0
Authors
Akrour, Riad [1 ]
Abdolmaleki, Abbas [2 ]
Abdulsamad, Hany [1 ]
Peters, Jan [1 ,3 ]
Neumann, Gerhard [1 ,4 ]
Affiliations
[1] Tech Univ Darmstadt, CLAS IAS, Hsch Str 10, D-64289 Darmstadt, Germany
[2] DeepMind, London N1C 4AG, England
[3] Max Planck Inst Intelligent Syst, Max Planck Ring 4, Tubingen, Germany
[4] Univ Lincoln, L CAS, Lincoln LN6 7TS, England
Funding
EU Horizon 2020
Keywords
Reinforcement Learning; Policy Optimization; Trajectory Optimization; Robotics
DOI
Not available
Chinese Library Classification (CLC)
TP [Automation and Computer Technology]
Subject Classification Code
0812
Abstract
Many recent trajectory optimization algorithms alternate between a linear approximation of the system dynamics around the mean trajectory and a conservative policy update. One way of constraining the policy change is to bound the Kullback-Leibler (KL) divergence between successive policies. These approaches have already demonstrated great experimental success on challenging problems such as end-to-end control of physical systems. However, the linear approximation of the system dynamics can introduce a bias in the policy update and prevent convergence to the optimal policy. In this article, we propose a new model-free trajectory-based policy optimization algorithm with guaranteed monotonic improvement. Instead of a model of the system dynamics, the algorithm backpropagates a local, quadratic and time-dependent Q-function learned from trajectory data. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics. On highly non-linear control tasks, we experimentally demonstrate that our algorithm improves performance compared to approaches that linearize the system dynamics. To show the monotonic improvement of our algorithm, we additionally conduct a theoretical analysis of our policy update scheme and derive a lower bound on the change in policy return between successive iterations.
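The following is a minimal, single-time-step sketch (Python/NumPy) of the kind of update the abstract describes: a quadratic Q-function is fitted to sampled returns instead of a dynamics model, and a Gaussian policy is updated under an exact KL bound by searching over the temperature of an exponential tilting. This is only an illustration under strong simplifying assumptions (one time step, a one-dimensional state-independent action, a hand-picked toy return), not the paper's algorithm; the helpers fit_quadratic_q and kl_constrained_update and all parameter values are invented for this example.

```python
import numpy as np

def fit_quadratic_q(actions, returns):
    """Least-squares fit of a local quadratic model Q(a) ~ q2*a^2 + q1*a + q0
    from sampled (action, return) pairs at a single time step."""
    q2, q1, q0 = np.polyfit(actions, returns, deg=2)
    return q2, q1, q0

def gaussian_kl(mean_new, var_new, mean_old, var_old):
    """KL( N(mean_new, var_new) || N(mean_old, var_old) ) for 1-D Gaussians."""
    return 0.5 * (np.log(var_old / var_new)
                  + (var_new + (mean_new - mean_old) ** 2) / var_old - 1.0)

def kl_constrained_update(mean_old, var_old, q2, q1, epsilon,
                          eta_low=1e-6, eta_high=1e6, iters=100):
    """Exponential-tilting update pi_new(a) proportional to pi_old(a) * exp(Q(a) / eta).
    The temperature eta is found by bisection so that KL(pi_new || pi_old)
    approximately meets the step-size bound epsilon."""
    prec_old = 1.0 / var_old

    def tilt(eta):
        # Multiplying a Gaussian by exp((q2*a^2 + q1*a) / eta) stays Gaussian.
        prec_new = max(prec_old - 2.0 * q2 / eta, 1e-8)
        mean_new = (prec_old * mean_old + q1 / eta) / prec_new
        return mean_new, 1.0 / prec_new

    for _ in range(iters):
        eta = 0.5 * (eta_low + eta_high)
        mean_new, var_new = tilt(eta)
        if gaussian_kl(mean_new, var_new, mean_old, var_old) > epsilon:
            eta_low = eta      # step too large -> increase the temperature
        else:
            eta_high = eta     # step allowed -> try a lower temperature
    return tilt(eta_high)      # eta_high always satisfies the KL bound

# Toy usage: the (unknown) return is maximal at a = 2.0.
rng = np.random.default_rng(0)
mean, var = 0.0, 1.0
for _ in range(25):
    actions = rng.normal(mean, np.sqrt(var), size=200)            # roll-outs under pi_old
    returns = -(actions - 2.0) ** 2 + 0.1 * rng.normal(size=200)  # noisy returns
    q2, q1, _ = fit_quadratic_q(actions, returns)
    mean, var = kl_constrained_update(mean, var, q2, q1, epsilon=0.05)
print(f"final policy mean ~ {mean:.2f} (optimum at a = 2.0)")
```

The article itself, per the abstract, applies this kind of KL-constrained update per time step with a local, quadratic, time-dependent Q-function learned from trajectory data; the sketch above only conveys the two ingredients of fitting Q from samples and enforcing the KL bound exactly through the dual temperature.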
Pages: 25
Related Papers
50 records in total
  • [1] Model-Free Trajectory Optimization for Reinforcement Learning
    Akrour, Riad
    Abdolmaleki, Abbas
    Abdulsamad, Hany
    Neumann, Gerhard
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 48, 2016, 48
  • [2] Policy Improvement by a Model-Free Dyna Architecture
    Hwang, Kao-Shing
    Lo, Chia-Yue
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2013, 24 (05) : 776 - 788
  • [3] Model-Free Imitation Learning with Policy Optimization
    Ho, Jonathan
    Gupta, Jayesh K.
    Ermon, Stefano
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 48, 2016, 48
  • [4] Model-free based Automated Trajectory Optimization for UAVs Toward Data Transmission
    Cui, Jingjing
    Ding, Zhiguo
    Deng, Yansha
    Nallanathan, Arumugam
    2019 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2019,
  • [5] Trajectory-Based Modified Policy Iteration
    Sharma, R.
    Gopal, M.
    PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY, VOL 12, 2006, 12 : 103 - +
  • [6] OPTIMIZATION OF TRAJECTORY-BASED HCCI COMBUSTION
    Zhang, Chen
    Sun, Zongxuan
    PROCEEDINGS OF THE ASME 9TH ANNUAL DYNAMIC SYSTEMS AND CONTROL CONFERENCE, 2016, VOL 2, 2017,
  • [7] Accelerating Model-Free Policy Optimization Using Model-Based Gradient: A Composite Optimization Perspective
    Li, Yansong
    Han, Shuo
    LEARNING FOR DYNAMICS AND CONTROL CONFERENCE, VOL 168, 2022, 168
  • [8] Single trajectory-based policy optimization for discrete-time stochastic systems
    Lai, Jing
    Xiong, Junlin
    INTERNATIONAL JOURNAL OF GENERAL SYSTEMS, 2025,
  • [9] Fast, Scalable, Model-free Trajectory Optimization for Wireless Data Ferries
    Pearre, Ben
    Brown, Timothy X.
    2011 20TH INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATIONS AND NETWORKS (ICCCN), 2011,
  • [10] Dyna-style Model-based reinforcement learning with Model-Free Policy Optimization
    Dong, Kun
    Luo, Yongle
    Wang, Yuxin
    Liu, Yu
    Qu, Chengeng
    Zhang, Qiang
    Cheng, Erkang
    Sun, Zhiyong
    Song, Bo
    KNOWLEDGE-BASED SYSTEMS, 2024, 287