Continuous-Time Fitted Value Iteration for Robust Policies

Cited by: 3
Authors
Lutter, Michael [1 ]
Belousov, Boris [1 ]
Mannor, Shie [2 ,3 ]
Fox, Dieter [4 ]
Garg, Animesh [5 ]
Peters, Jan [1 ]
Affiliations
[1] Tech Univ Darmstadt, Comp Sci Dept, Intelligent Autonomous Syst Grp, D-64289 Darmstadt, Germany
[2] Technion Israel Inst Technol, Andrew & Erna Viterbi Fac Elect & Comp Engn, IL-3200003 Haifa, Israel
[3] NVIDIA, IL-6121002 Tel Aviv, Israel
[4] Univ Washington, Allen Sch Comp Sci & Engn, NVIDIA, Seattle, WA 98195 USA
[5] Univ Toronto, Comp Sci Dept, NVIDIA, Toronto, ON M5S 1A4, Canada
Keywords
Mathematical models; Optimization; Differential equations; Robots; Heuristic algorithms; Reinforcement learning; Costs; Value iteration; continuous control; dynamic programming; adversarial reinforcement learning; REINFORCEMENT; COST
DOI
10.1109/TPAMI.2022.3215769
CLC classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Solving the Hamilton-Jacobi-Bellman equation is important in many domains, including control, robotics, and economics. For continuous control in particular, solving this differential equation and its extension, the Hamilton-Jacobi-Isaacs equation, is important because it yields the optimal policy, i.e., the policy that achieves the maximum reward on a given task. In the case of the Hamilton-Jacobi-Isaacs equation, which includes an adversary that controls the environment and minimizes the reward, the obtained policy is also robust to perturbations of the dynamics. In this paper, we propose continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI). These algorithms leverage the non-linear control-affine dynamics and separable state and action reward of many continuous control problems to derive the optimal policy and optimal adversary in closed form. This analytic expression simplifies the differential equations and enables us to solve for the optimal value function using value iteration for continuous actions and states, as well as in the adversarial case. Notably, the resulting algorithms do not require discretization of states or actions. We apply the resulting algorithms to the Furuta pendulum and cartpole and show that both obtain the optimal policy. Sim2Real robustness experiments on the physical systems show that the policies successfully achieve the task in the real world. When changing the masses of the pendulum, we observe that robust fitted value iteration is more robust than deep reinforcement learning algorithms and the non-robust variant of our algorithm. Videos of the experiments are available at https://sites.google.com/view/rfvi.
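
The closed-form step the abstract refers to can be made concrete with a short sketch. Assuming control-affine dynamics x_dot = a(x) + B(x)u and a separable reward r(x, u) = q(x) - 0.5 u^T R u (a quadratic action cost, one common choice satisfying the convexity assumption), the maximization inside the HJB equation is solved analytically by u*(x) = R^{-1} B(x)^T dV/dx, so fitted value iteration never has to search over actions. The Python sketch below is illustrative rather than the authors' implementation; the function names, the explicit Euler step, and the toy double-integrator system are our own assumptions, and rFVI would obtain the optimal adversary from an analogous closed-form expression.

import numpy as np

def optimal_action(dVdx, B_x, R_inv):
    # Analytic maximizer of dV/dx^T B(x) u - 0.5 u^T R u (hypothetical helper).
    return R_inv @ B_x.T @ dVdx

def vi_target(x, V, dVdx, a, B, q, R, R_inv, dt=1e-2, rho=0.1):
    # One fitted value-iteration target: explicit Euler step of the
    # control-affine dynamics plus exponential discounting exp(-rho * dt).
    grad = dVdx(x)
    u = optimal_action(grad, B(x), R_inv)
    reward = (q(x) - 0.5 * u @ R @ u) * dt   # separable state/action reward
    x_next = x + (a(x) + B(x) @ u) * dt      # x_dot = a(x) + B(x) u
    return reward + np.exp(-rho * dt) * V(x_next)

# Toy usage on a double integrator with a quadratic value-function guess.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
Bc = np.array([[0.0], [1.0]])
R = np.array([[1.0]]); R_inv = np.linalg.inv(R)
P = np.eye(2)                                 # stand-in value Hessian
V = lambda x: -0.5 * x @ P @ x                # value estimate
dVdx = lambda x: -P @ x                       # its gradient
q = lambda x: -x @ x                          # state reward
print(vi_target(np.array([1.0, 0.0]), V, dVdx,
                lambda x: A @ x, lambda x: Bc, q, R, R_inv))

With the quadratic cost assumed here, the resulting controller u = -R^{-1} B^T P x is the familiar LQR-style feedback, which is why no discretization of the action space is needed.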
Pages: 5534-5548
Page count: 15