Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Cited by: 4
Authors:
Bertsekas, Dimitri P. [1 ]
Yu, Huizhen [2 ]
Affiliations:
[1] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Funding:
Academy of Finland
Keywords:
STOCHASTIC-APPROXIMATION; ALGORITHMS
DOI:
10.1109/CDC.2010.5717930
CLC Classification: TP [Automation Technology, Computer Technology]
Subject Classification Code: 0812
Abstract
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy-iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm involves (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations when a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99] when feature-based Q-factor approximations are used. In its exact/lookup-table form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively resolves the difficulties that inadequate exploration creates for existing schemes.
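For the lookup-table case, the scheme the abstract describes can be sketched concretely. The following minimal Python/NumPy sketch is an illustration under stated assumptions, not the paper's implementation: the improvement step fixes a greedy policy mu and a stopping cost J(j) = min_v Q(j, v), and the evaluation step runs a few synchronous sweeps for the optimal stopping problem "stop at successor state j and receive J(j), or continue under mu and accrue Q(j, mu(j))". The iteration counts, array layout, and the omission of the paper's asynchronous/stochastic sampling are simplifying assumptions.

    import numpy as np

    def enhanced_policy_iteration(P, g, alpha, n_outer=50, n_eval=20):
        # P[u][i, j]: transition probability p_ij(u); g[u][i, j]: stage cost
        # g(i, u, j); alpha: discount factor in (0, 1). n_outer and n_eval
        # are illustrative iteration counts, not prescribed by the paper.
        n_controls = len(P)
        n_states = P[0].shape[0]
        Q = np.zeros((n_states, n_controls))
        for _ in range(n_outer):
            mu = Q.argmin(axis=1)   # policy improvement: greedy policy
            J = Q.min(axis=1)       # stopping cost, held fixed during evaluation
            for _ in range(n_eval):
                # cost at successor j: stop (receive J) or continue under mu (Q)
                cont = np.minimum(J, Q[np.arange(n_states), mu])
                # one sweep: Q(i,u) <- sum_j p_ij(u) (g(i,u,j) + alpha * cont(j))
                Q = np.stack([(P[u] * (g[u] + alpha * cont[None, :])).sum(axis=1)
                              for u in range(n_controls)], axis=1)
        return Q, Q.argmin(axis=1)

    # Tiny 2-state, 2-control test problem (made-up numbers, illustration only)
    P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.6, 0.4]])]
    g = [np.array([[1.0, 2.0], [0.0, 3.0]]), np.array([[2.0, 0.5], [1.0, 1.0]])]
    Q, mu = enhanced_policy_iteration(P, g, alpha=0.9)

Because J is frozen between improvement steps, each evaluation sweep contracts toward the Q-factors of a stopping problem rather than toward the solution of the policy's linear system, which is what makes the inexact, asynchronous implementations mentioned in the abstract possible.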
Pages: 1409-1416
Number of pages: 8
Related Papers (50 in total)
  • [31] An Efficient Policy Iteration Algorithm for Dynamic Programming Equations
    Alla, Alessandro
    Falcone, Maurizio
    Kalise, Dante
    SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2015, 37 (01): A181 - A200
  • [32] Convergence of Policy Iteration in Contracting Dynamic Programming
    PUTERMAN, ML
    ADVANCES IN APPLIED PROBABILITY, 1978, 10 (02) : 312 - 312
  • [33] Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems
    Lee, Jae Young
    Park, Jin Bae
    Choi, Yoon Ho
    AUTOMATICA, 2012, 48 (11) : 2850 - 2859
  • [34] Final Iteration Convergence Bound of Q-Learning: Switching System Approach
    Lee, Donghwan
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2024, 69 (07) : 4765 - 4772
  • [35] Multi-Agent Reward-Iteration Fuzzy Q-Learning
    Leng, Lixiong
    Li, Jingchen
    Zhu, Jinhui
    Hwang, Kao-Shing
    Shi, Haobin
    INTERNATIONAL JOURNAL OF FUZZY SYSTEMS, 2021, 23 (06) : 1669 - 1679
  • [36] Reinforcement Q-learning based on Multirate Generalized Policy Iteration and Its Application to a 2-DOF Helicopter
    Chun, Tae Yoon
    Park, Jin Bae
    Choi, Yoon Ho
    INTERNATIONAL JOURNAL OF CONTROL AUTOMATION AND SYSTEMS, 2018, 16 (01) : 377 - 386
  • [39] Policy Iteration Q-Learning for Linear Itô Stochastic Systems With Markovian Jumps and Its Application to Power Systems
    Ming, Zhongyang
    Zhang, Huaguang
    Wang, Yingchun
    Dai, Jing
    IEEE TRANSACTIONS ON CYBERNETICS, 2024: 1 - 10
  • [40] Dynamic Choice of State Abstraction in Q-Learning
    Tamassia, Marco
    Zambetta, Fabio
    Raffe, William L.
    Mueller, Florian 'Floyd'
    Li, Xiaodong
    ECAI 2016: 22ND EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2016, 285 : 46 - 54