Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Cited by: 4
Authors
Bertsekas, Dimitri P. [1]
Yu, Huizhen [2]
Institutions
[1] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Funding
Academy of Finland
Keywords
STOCHASTIC-APPROXIMATION; ALGORITHMS
DOI
10.1109/CDC.2010.5717930
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal Q-factors. Instead of performing policy evaluation by solving a linear system of equations, the algorithm computes a (possibly inexact) solution of an optimal stopping problem. When a lookup table representation is used, this stopping problem can be solved with simple Q-learning iterations; when feature-based Q-factor approximations are used, it can be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99]. In its exact/lookup table form, the algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, the algorithm effectively resolves the difficulties that inadequate exploration causes in existing schemes.
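The evaluation step described in the abstract has a straightforward lookup-table instantiation. Below is a minimal Python sketch, not the authors' implementation: it assumes a known model (transition tensor p, stage-cost tensor g, discount factor alpha), replaces the asynchronous/stochastic Q-learning updates the paper allows with synchronous fixed-point iterations on the optimal stopping mapping, and uses hypothetical names (enhanced_policy_iteration, m_eval).

import numpy as np

def enhanced_policy_iteration(p, g, alpha, n_cycles=50, m_eval=20):
    # Hypothetical lookup-table sketch of the policy-iteration-like scheme
    # from the abstract: each "evaluation" solves an optimal stopping problem
    # by fixed-point iterations instead of solving a linear system.
    #   p[u, i, j] : transition probability i -> j under control u
    #   g[u, i, j] : stage cost of that transition
    #   alpha      : discount factor in (0, 1)
    n_u, n_s, _ = p.shape
    Q = np.zeros((n_s, n_u))            # Q-factor table
    mu = np.zeros(n_s, dtype=int)       # current policy
    for _ in range(n_cycles):
        # Stopping cost: minimal Q-factors at the start of the cycle.
        J_bar = Q.min(axis=1)
        # (Possibly inexact) evaluation: m_eval iterations of the stopping
        # mapping F Q(i,u) = sum_j p(u,i,j) * ( g(u,i,j)
        #                      + alpha * min{ J_bar(j), Q(j, mu(j)) } ).
        for _ in range(m_eval):
            cont = Q[np.arange(n_s), mu]        # continuation cost Q(j, mu(j))
            w = np.minimum(J_bar, cont)         # stop or continue, whichever is cheaper
            Q = np.stack([(p[u] * (g[u] + alpha * w)).sum(axis=1)
                          for u in range(n_u)], axis=1)
        # Policy improvement.
        mu = Q.argmin(axis=1)
    return Q, mu

# Usage on a small random MDP (illustrative only):
rng = np.random.default_rng(0)
n_s, n_u = 5, 3
p = rng.random((n_u, n_s, n_s)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_u, n_s, n_s))
Q_opt, mu_opt = enhanced_policy_iteration(p, g, alpha=0.9)

A small m_eval makes the evaluation inexact, in the spirit of modified policy iteration; larger values approach exact solution of the stopping problem.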
Pages: 1409-1416
Page count: 8
Related Papers
50 items total
  • [21] Inverse Value Iteration and Q-Learning: Algorithms, Stability, and Robustness
    Lian, Bosen
    Xue, Wenqian
    Lewis, Frank L.
    Davoudi, Ali
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, : 1 - 11
  • [22] Task decomposition and dynamic policy merging in the distributed Q-learning classifier system
    Chapman, KL
    Bay, JS
    1997 IEEE INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN ROBOTICS AND AUTOMATION - CIRA '97, PROCEEDINGS: TOWARDS NEW COMPUTATIONAL PRINCIPLES FOR ROBOTICS AND AUTOMATION, 1997, : 166 - 171
  • [23] Policy Iteration Q-Learning for Linear Itô Stochastic Systems With Markovian Jumps and Its Application to Power Systems
    Ming, Zhongyang
    Zhang, Huaguang
    Wang, Yingchun
    Dai, Jing
    IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (12) : 7804 - 7813
  • [24] Convergence and stability analysis of value iteration Q-learning under non-discounted cost for discrete-time optimal control
    Song, Shijie
    Zhao, Mingming
    Gong, Dawei
    Zhu, Minglei
    NEUROCOMPUTING, 2024, 606
  • [25] A Method Integrating Q-Learning With Approximate Dynamic Programming for Gantry Work Cell Scheduling
    Ou, Xinyan
    Chang, Qing
    Chakraborty, Nilanjan
    IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2021, 18 (01) : 85 - 93
  • [26] Q-Learning, Policy Iteration and Actor-Critic Reinforcement Learning Combined With Metaheuristic Algorithms in Servo System Control
    Zamfirache, Iuliu Alexandru
    Precup, Radu-Emil
    Petriu, Emil M.
    FACTA UNIVERSITATIS-SERIES MECHANICAL ENGINEERING, 2023, 21 (04) : 615 - 630
  • [27] Discounted linear Q-learning control with novel tracking cost and its stability
    Wang, Ding
    Ren, Jin
    Ha, Mingming
    INFORMATION SCIENCES, 2023, 626 : 339 - 353
  • [28] Q-Learning with probability based action policy
    Ugurlu, Ekin Su
    Biricik, Goksel
    2006 IEEE 14TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS, VOLS 1 AND 2, 2006: 210+
  • [29] Cooperative Q-Learning Based on Maturity of the Policy
    Yang, Mao
    Tian, Yantao
    Liu, Xiaomei
    2009 IEEE INTERNATIONAL CONFERENCE ON MECHATRONICS AND AUTOMATION, VOLS 1-7, CONFERENCE PROCEEDINGS, 2009, : 1352 - 1356
  • [30] Performance Investigation of UCB Policy in Q-Learning
    Saito, Koki
    Notsu, Akira
    Ubukata, Seiki
    Honda, Katsuhiro
    2015 IEEE 14TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2015, : 777 - 780