Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Times Cited: 4
Authors
Bertsekas, Dimitri P. [1]
Yu, Huizhen [2]
Affiliations
[1] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Funding
Academy of Finland;
Keywords
STOCHASTIC-APPROXIMATION; ALGORITHMS;
DOI
10.1109/CDC.2010.5717930
CLC Classification
TP [Automation & Computer Technology];
Subject Classification Code
0812;
Abstract
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy-iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm involves a (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations in the case where a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99] in the case where feature-based Q-factor approximations are used. In exact/lookup table representation form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower computational overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively resolves the inherent difficulties of existing schemes caused by inadequate exploration.
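The abstract describes a policy-iteration-like scheme in which the policy evaluation phase is replaced by a (possibly inexact) solution of an optimal stopping problem. The following is a minimal lookup-table sketch of that idea, assuming a finite MDP given by a transition tensor P[u, i, j], one-stage costs g[u, i, j], and a discount factor alpha; the function name, the synchronous sweeps, and the fixed number of evaluation iterations m_eval are illustrative assumptions, not the paper's exact algorithm or notation.

```python
import numpy as np

def enhanced_policy_iteration(P, g, alpha, num_outer=50, m_eval=10):
    """Sketch: policy-iteration-like Q-learning with optimal-stopping evaluation.

    P      : (nU, nS, nS) array, P[u, i, j] = probability of i -> j under action u
    g      : (nU, nS, nS) array, g[u, i, j] = one-stage cost
    alpha  : discount factor in (0, 1)
    m_eval : number of (inexact) evaluation sweeps per outer iteration (assumed fixed here)
    """
    nU, nS, _ = P.shape
    Q = np.zeros((nS, nU))              # lookup-table Q-factors Q[i, u]
    states = np.arange(nS)

    for _ in range(num_outer):
        mu = Q.argmin(axis=1)           # greedy policy w.r.t. the current Q-factors

        # "Policy evaluation" as an optimal stopping problem: at each successor
        # state j, either stop and pay the frozen Q-factor Q[j, mu[j]], or
        # continue and pay the current evaluation iterate J[j, mu[j]].
        J = Q.copy()
        for _ in range(m_eval):
            cont = np.minimum(Q[states, mu], J[states, mu])     # min{Q, J} at successors
            J = np.stack(
                [(P[u] * (g[u] + alpha * cont[None, :])).sum(axis=1) for u in range(nU)],
                axis=1,
            )
        Q = J                           # adopt the evaluated Q-factors and repeat

    return Q, Q.argmin(axis=1)
```

With m_eval = 1 the evaluation step collapses to ordinary Q-value iteration (the min{Q, J} term equals the frozen Q itself), while letting m_eval grow moves the scheme toward exact policy iteration, which mirrors the asynchronous/modified policy iteration spirit mentioned in the abstract.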
Pages: 1409-1416
Page count: 8
Related Papers (50 records in total)
  • [1] Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming
    Bertsekas, Dimitri P.
    Yu, Huizhen
    MATHEMATICS OF OPERATIONS RESEARCH, 2012, 37 (01) : 66 - 94
  • [2] New value iteration and Q-learning methods for the average cost dynamic programming problem
    Bertsekas, DP
    PROCEEDINGS OF THE 37TH IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-4, 1998, : 2692 - 2697
  • [3] Modified policy iteration algorithms are not strongly polynomial for discounted dynamic programming
    Feinberg, Eugene A.
    Huang, Jefferson
    Scherrer, Bruno
    OPERATIONS RESEARCH LETTERS, 2014, 42 (6-7) : 429 - 431
  • [4] Q-learning and policy iteration algorithms for stochastic shortest path problems
    Yu, Huizhen
    Bertsekas, Dimitri P.
    ANNALS OF OPERATIONS RESEARCH, 2013, 208 (01) : 95 - 132
  • [5] A dynamic channel assignment policy through Q-learning
    Nie, JH
    Haykin, S
    IEEE TRANSACTIONS ON NEURAL NETWORKS, 1999, 10 (06): : 1443 - 1455
  • [6] Discounted UCB1-tuned for Q-Learning
    Saito, Koki
    Notsu, Akira
    Honda, Katsuhiro
    2014 JOINT 7TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND INTELLIGENT SYSTEMS (SCIS) AND 15TH INTERNATIONAL SYMPOSIUM ON ADVANCED INTELLIGENT SYSTEMS (ISIS), 2014, : 966 - 970
  • [7] The value iteration algorithm is not strongly polynomial for discounted dynamic programming
    Feinberg, Eugene A.
    Huang, Jefferson
    OPERATIONS RESEARCH LETTERS, 2014, 42 (02) : 130 - 131
  • [8] Dynamic programming with NAR model versus Q-learning - Case study
    Chrobak, J
    Pacut, A
    NEURAL NETWORKS AND SOFT COMPUTING, 2003, : 728 - 733