Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Cited by: 4
Authors
Bertsekas, Dimitri P. [1]
Yu, Huizhen [2]
Affiliations
[1] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Funding
Academy of Finland;
Keywords
STOCHASTIC-APPROXIMATION; ALGORITHMS;
DOI
10.1109/CDC.2010.5717930
CLC Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm involves (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations when a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99] when feature-based Q-factor approximations are used. In its exact/lookup table form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively resolves the inherent difficulties that inadequate exploration causes for existing schemes.
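To make the lookup table case concrete, the sketch below shows a synchronous, deterministic version of this policy iteration-like scheme in Python/NumPy. It is a minimal illustration assuming a known model with expected one-stage costs; the function and variable names (enhanced_policy_iteration, P, g, n_eval, n_outer) are our own, not the authors', and the asynchronous, stochastic Q-learning iterations described in the abstract are replaced here by a fixed number of exact fixed-point iterations of the optimal-stopping mapping.

```python
import numpy as np

def enhanced_policy_iteration(P, g, alpha, n_eval=20, n_outer=200, tol=1e-8):
    # Illustrative sketch (not the authors' code).
    # P: (n_actions, n_states, n_states) transition matrices, P[u][i][j] = p_ij(u).
    # g: (n_actions, n_states) expected one-stage costs, g[u][i] = E[g(i, u, j)].
    # alpha: discount factor in (0, 1).
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))   # Q-factor lookup table
    J = Q.min(axis=1)                     # current cost estimate J(i) = min_u Q(i, u)
    mu = Q.argmin(axis=1)                 # current greedy policy mu(i)

    for _ in range(n_outer):
        # "Policy evaluation" by (inexactly) solving an optimal stopping
        # problem: iterate the mapping
        #   (F Q)(i, u) = sum_j p_ij(u) * (g(i, u) + alpha * min{J(j), Q(j, mu(j))}),
        # where min{J(j), Q(j, mu(j))} chooses between stopping (taking cost J)
        # and continuing under the current policy mu.
        for _ in range(n_eval):
            w = np.minimum(J, Q[np.arange(n_states), mu])
            Q = np.stack([g[u] + alpha * (P[u] @ w) for u in range(n_actions)],
                         axis=1)
        # Policy improvement: greedy update of J and mu from the new Q-factors.
        J_new = Q.min(axis=1)
        mu = Q.argmin(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    return Q, J, mu

# Usage on a small random MDP (illustrative only):
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)         # normalize rows into stochastic matrices
g = rng.random((n_actions, n_states))
Q, J, mu = enhanced_policy_iteration(P, g, alpha=0.9)
```

With n_eval=1 the inner loop reduces to Q-factor value iteration, while large n_eval approaches exact solution of the stopping problem; this mirrors the exact/inexact evaluation trade-off of modified policy iteration mentioned in the abstract.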
Pages: 1409-1416
Page count: 8