Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Cited by: 4
Authors
Bertsekas, Dimitri P. [1]
Yu, Huizhen [2]
Affiliations
[1] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Funding
Academy of Finland;
Keywords
STOCHASTIC-APPROXIMATION; ALGORITHMS;
DOI
10.1109/CDC.2010.5717930
CLC Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm involves (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations when a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99] when feature-based Q-factor approximations are used. In its exact/lookup table form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively resolves the inherent difficulties that inadequate exploration causes for existing schemes.
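To make the lookup table case concrete, the sketch below shows a synchronous, deterministic version of this policy iteration-like scheme in Python/NumPy. It is a minimal illustration assuming a known model with expected one-stage costs; the function and variable names (enhanced_policy_iteration, P, g, n_eval, n_outer) are our own, not the authors', and the asynchronous, stochastic Q-learning iterations described in the abstract are replaced here by a fixed number of exact fixed-point iterations of the optimal-stopping mapping.

```python
import numpy as np

def enhanced_policy_iteration(P, g, alpha, n_eval=20, n_outer=200, tol=1e-8):
    # Illustrative sketch (not the authors' code).
    # P: (n_actions, n_states, n_states) transition matrices, P[u][i][j] = p_ij(u).
    # g: (n_actions, n_states) expected one-stage costs, g[u][i] = E[g(i, u, j)].
    # alpha: discount factor in (0, 1).
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))   # Q-factor lookup table
    J = Q.min(axis=1)                     # current cost estimate J(i) = min_u Q(i, u)
    mu = Q.argmin(axis=1)                 # current greedy policy mu(i)

    for _ in range(n_outer):
        # "Policy evaluation" by (inexactly) solving an optimal stopping
        # problem: iterate the mapping
        #   (F Q)(i, u) = sum_j p_ij(u) * (g(i, u) + alpha * min{J(j), Q(j, mu(j))}),
        # where min{J(j), Q(j, mu(j))} chooses between stopping (taking cost J)
        # and continuing under the current policy mu.
        for _ in range(n_eval):
            w = np.minimum(J, Q[np.arange(n_states), mu])
            Q = np.stack([g[u] + alpha * (P[u] @ w) for u in range(n_actions)],
                         axis=1)
        # Policy improvement: greedy update of J and mu from the new Q-factors.
        J_new = Q.min(axis=1)
        mu = Q.argmin(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    return Q, J, mu

# Usage on a small random MDP (illustrative only):
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)         # normalize rows into stochastic matrices
g = rng.random((n_actions, n_states))
Q, J, mu = enhanced_policy_iteration(P, g, alpha=0.9)
```

With n_eval=1 the inner loop reduces to Q-factor value iteration, while large n_eval approaches exact solution of the stopping problem; this mirrors the exact/inexact evaluation trade-off of modified policy iteration mentioned in the abstract.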
Pages: 1409-1416
Page count: 8