Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Cited by: 4
Authors
Bertsekas, Dimitri P. [1]
Yu, Huizhen [2]
Institutions
[1] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Funding
Academy of Finland
Keywords
STOCHASTIC-APPROXIMATION; ALGORITHMS
DOI
10.1109/CDC.2010.5717930
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal Q-factors. Instead of performing policy evaluation by solving a linear system of equations, the algorithm computes a (possibly inexact) solution of an optimal stopping problem. When a lookup table representation is used, this stopping problem can be solved with simple Q-learning iterations; when feature-based Q-factor approximations are used, it can be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99]. In its exact/lookup table form, the algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, the algorithm effectively resolves the difficulties that inadequate exploration causes in existing schemes.
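The evaluation step described in the abstract has a straightforward lookup-table instantiation. Below is a minimal Python sketch, not the authors' implementation: it assumes a known model (transition tensor p, stage-cost tensor g, discount factor alpha), replaces the asynchronous/stochastic Q-learning updates the paper allows with synchronous fixed-point iterations on the optimal stopping mapping, and uses hypothetical names (enhanced_policy_iteration, m_eval).

import numpy as np

def enhanced_policy_iteration(p, g, alpha, n_cycles=50, m_eval=20):
    # Hypothetical lookup-table sketch of the policy-iteration-like scheme
    # from the abstract: each "evaluation" solves an optimal stopping problem
    # by fixed-point iterations instead of solving a linear system.
    #   p[u, i, j] : transition probability i -> j under control u
    #   g[u, i, j] : stage cost of that transition
    #   alpha      : discount factor in (0, 1)
    n_u, n_s, _ = p.shape
    Q = np.zeros((n_s, n_u))            # Q-factor table
    mu = np.zeros(n_s, dtype=int)       # current policy
    for _ in range(n_cycles):
        # Stopping cost: minimal Q-factors at the start of the cycle.
        J_bar = Q.min(axis=1)
        # (Possibly inexact) evaluation: m_eval iterations of the stopping
        # mapping F Q(i,u) = sum_j p(u,i,j) * ( g(u,i,j)
        #                      + alpha * min{ J_bar(j), Q(j, mu(j)) } ).
        for _ in range(m_eval):
            cont = Q[np.arange(n_s), mu]        # continuation cost Q(j, mu(j))
            w = np.minimum(J_bar, cont)         # stop or continue, whichever is cheaper
            Q = np.stack([(p[u] * (g[u] + alpha * w)).sum(axis=1)
                          for u in range(n_u)], axis=1)
        # Policy improvement.
        mu = Q.argmin(axis=1)
    return Q, mu

# Usage on a small random MDP (illustrative only):
rng = np.random.default_rng(0)
n_s, n_u = 5, 3
p = rng.random((n_u, n_s, n_s)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_u, n_s, n_s))
Q_opt, mu_opt = enhanced_policy_iteration(p, g, alpha=0.9)

A small m_eval makes the evaluation inexact, in the spirit of modified policy iteration; larger values approach exact solution of the stopping problem.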
Pages: 1409-1416
Page count: 8
Related Papers
50 items total
  • [21] Inverse Value Iteration and Q-Learning: Algorithms, Stability, and Robustness
    Lian, Bosen
    Xue, Wenqian
    Lewis, Frank L.
    Davoudi, Ali
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, : 1 - 11
  • [22] Task decomposition and dynamic policy merging in the distributed Q-learning classifier system
    Chapman, KL
    Bay, JS
    1997 IEEE INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN ROBOTICS AND AUTOMATION - CIRA '97, PROCEEDINGS: TOWARDS NEW COMPUTATIONAL PRINCIPLES FOR ROBOTICS AND AUTOMATION, 1997, : 166 - 171
  • [23] Policy Iteration Q-Learning for Linear Itô Stochastic Systems With Markovian Jumps and Its Application to Power Systems
    Ming, Zhongyang
    Zhang, Huaguang
    Wang, Yingchun
    Dai, Jing
    IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (12) : 7804 - 7813
  • [24] Convergence and stability analysis of value iteration Q-learning under non-discounted cost for discrete-time optimal control
    Song, Shijie
    Zhao, Mingming
    Gong, Dawei
    Zhu, Minglei
    NEUROCOMPUTING, 2024, 606
  • [25] A Method Integrating Q-Learning With Approximate Dynamic Programming for Gantry Work Cell Scheduling
    Ou, Xinyan
    Chang, Qing
    Chakraborty, Nilanjan
    IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2021, 18 (01) : 85 - 93
  • [26] Q-Learning, Policy Iteration and Actor-Critic Reinforcement Learning Combined With Metaheuristic Algorithms in Servo System Control
    Zamfirache, Iuliu Alexandru
    Precup, Radu-Emil
    Petriu, Emil M.
    FACTA UNIVERSITATIS-SERIES MECHANICAL ENGINEERING, 2023, 21 (04) : 615 - 630
  • [27] Discounted linear Q-learning control with novel tracking cost and its stability
    Wang, Ding
    Ren, Jin
    Ha, Mingming
    INFORMATION SCIENCES, 2023, 626 : 339 - 353
  • [28] Q-Learning with probability based action policy
    Ugurlu, Ekin Su
    Biricik, Goksel
    2006 IEEE 14TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS, VOLS 1 AND 2, 2006: 210+
  • [29] Cooperative Q-Learning Based on Maturity of the Policy
    Yang, Mao
    Tian, Yantao
    Liu, Xiaomei
    2009 IEEE INTERNATIONAL CONFERENCE ON MECHATRONICS AND AUTOMATION, VOLS 1-7, CONFERENCE PROCEEDINGS, 2009, : 1352 - 1356
  • [30] Performance Investigation of UCB Policy in Q-Learning
    Saito, Koki
    Notsu, Akira
    Ubukata, Seiki
    Honda, Katsuhiro
    2015 IEEE 14TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2015, : 777 - 780