Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Times Cited: 4
Authors
Bertsekas, Dimitri P. [1]
Yu, Huizhen [2]
Affiliations
[1] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Funding
Academy of Finland;
Keywords
STOCHASTIC-APPROXIMATION; ALGORITHMS;
DOI
10.1109/CDC.2010.5717930
CLC Classification
TP [Automation & Computer Technology];
Subject Classification Code
0812;
Abstract
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy-iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm involves a (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations in the case where a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99] in the case where feature-based Q-factor approximations are used. In exact/lookup table representation form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower computational overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively resolves the inherent difficulties of existing schemes caused by inadequate exploration.
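The abstract describes a policy-iteration-like scheme in which the policy evaluation phase is replaced by a (possibly inexact) solution of an optimal stopping problem. The following is a minimal lookup-table sketch of that idea, assuming a finite MDP given by a transition tensor P[u, i, j], one-stage costs g[u, i, j], and a discount factor alpha; the function name, the synchronous sweeps, and the fixed number of evaluation iterations m_eval are illustrative assumptions, not the paper's exact algorithm or notation.

```python
import numpy as np

def enhanced_policy_iteration(P, g, alpha, num_outer=50, m_eval=10):
    """Sketch: policy-iteration-like Q-learning with optimal-stopping evaluation.

    P      : (nU, nS, nS) array, P[u, i, j] = probability of i -> j under action u
    g      : (nU, nS, nS) array, g[u, i, j] = one-stage cost
    alpha  : discount factor in (0, 1)
    m_eval : number of (inexact) evaluation sweeps per outer iteration (assumed fixed here)
    """
    nU, nS, _ = P.shape
    Q = np.zeros((nS, nU))              # lookup-table Q-factors Q[i, u]
    states = np.arange(nS)

    for _ in range(num_outer):
        mu = Q.argmin(axis=1)           # greedy policy w.r.t. the current Q-factors

        # "Policy evaluation" as an optimal stopping problem: at each successor
        # state j, either stop and pay the frozen Q-factor Q[j, mu[j]], or
        # continue and pay the current evaluation iterate J[j, mu[j]].
        J = Q.copy()
        for _ in range(m_eval):
            cont = np.minimum(Q[states, mu], J[states, mu])     # min{Q, J} at successors
            J = np.stack(
                [(P[u] * (g[u] + alpha * cont[None, :])).sum(axis=1) for u in range(nU)],
                axis=1,
            )
        Q = J                           # adopt the evaluated Q-factors and repeat

    return Q, Q.argmin(axis=1)
```

With m_eval = 1 the evaluation step collapses to ordinary Q-value iteration (the min{Q, J} term equals the frozen Q itself), while letting m_eval grow moves the scheme toward exact policy iteration, which mirrors the asynchronous/modified policy iteration spirit mentioned in the abstract.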
Pages: 1409-1416
Page count: 8
Related Papers (50 records in total)
  • [1] Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming
    Bertsekas, Dimitri P.
    Yu, Huizhen
    MATHEMATICS OF OPERATIONS RESEARCH, 2012, 37 (01) : 66 - 94
  • [2] New value iteration and Q-learning methods for the average cost dynamic programming problem
    Bertsekas, DP
    PROCEEDINGS OF THE 37TH IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-4, 1998, : 2692 - 2697
  • [3] Modified policy iteration algorithms are not strongly polynomial for discounted dynamic programming
    Feinberg, Eugene A.
    Huang, Jefferson
    Scherrer, Bruno
    OPERATIONS RESEARCH LETTERS, 2014, 42 (6-7) : 429 - 431
  • [4] Q-learning and policy iteration algorithms for stochastic shortest path problems
    Yu, Huizhen
    Bertsekas, Dimitri P.
    ANNALS OF OPERATIONS RESEARCH, 2013, 208 (01) : 95 - 132
  • [5] A dynamic channel assignment policy through Q-learning
    Nie, JH
    Haykin, S
    IEEE TRANSACTIONS ON NEURAL NETWORKS, 1999, 10 (06): : 1443 - 1455
  • [6] Discounted UCB1-tuned for Q-Learning
    Saito, Koki
    Notsu, Akira
    Honda, Katsuhiro
    2014 JOINT 7TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND INTELLIGENT SYSTEMS (SCIS) AND 15TH INTERNATIONAL SYMPOSIUM ON ADVANCED INTELLIGENT SYSTEMS (ISIS), 2014, : 966 - 970
  • [7] The value iteration algorithm is not strongly polynomial for discounted dynamic programming
    Feinberg, Eugene A.
    Huang, Jefferson
    OPERATIONS RESEARCH LETTERS, 2014, 42 (02) : 130 - 131
  • [8] Dynamic programming with NAR model versus Q-learning - Case study
    Chrobak, J
    Pacut, A
    NEURAL NETWORKS AND SOFT COMPUTING, 2003, : 728 - 733