Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Cited by: 4
Authors:
Bertsekas, Dimitri P. [1 ]
Yu, Huizhen [2 ]
Affiliations:
[1] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Funding:
Academy of Finland
Keywords:
STOCHASTIC-APPROXIMATION; ALGORITHMS
DOI:
10.1109/CDC.2010.5717930
CLC Classification: TP [Automation Technology, Computer Technology]
Subject Classification Code: 0812
Abstract
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy-iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm involves (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations when a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99] when feature-based Q-factor approximations are used. In its exact/lookup-table form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively resolves the difficulties that inadequate exploration creates for existing schemes.
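For the lookup-table case, the scheme the abstract describes can be sketched concretely. The following minimal Python/NumPy sketch is an illustration under stated assumptions, not the paper's implementation: the improvement step fixes a greedy policy mu and a stopping cost J(j) = min_v Q(j, v), and the evaluation step runs a few synchronous sweeps for the optimal stopping problem "stop at successor state j and receive J(j), or continue under mu and accrue Q(j, mu(j))". The iteration counts, array layout, and the omission of the paper's asynchronous/stochastic sampling are simplifying assumptions.

    import numpy as np

    def enhanced_policy_iteration(P, g, alpha, n_outer=50, n_eval=20):
        # P[u][i, j]: transition probability p_ij(u); g[u][i, j]: stage cost
        # g(i, u, j); alpha: discount factor in (0, 1). n_outer and n_eval
        # are illustrative iteration counts, not prescribed by the paper.
        n_controls = len(P)
        n_states = P[0].shape[0]
        Q = np.zeros((n_states, n_controls))
        for _ in range(n_outer):
            mu = Q.argmin(axis=1)   # policy improvement: greedy policy
            J = Q.min(axis=1)       # stopping cost, held fixed during evaluation
            for _ in range(n_eval):
                # cost at successor j: stop (receive J) or continue under mu (Q)
                cont = np.minimum(J, Q[np.arange(n_states), mu])
                # one sweep: Q(i,u) <- sum_j p_ij(u) (g(i,u,j) + alpha * cont(j))
                Q = np.stack([(P[u] * (g[u] + alpha * cont[None, :])).sum(axis=1)
                              for u in range(n_controls)], axis=1)
        return Q, Q.argmin(axis=1)

    # Tiny 2-state, 2-control test problem (made-up numbers, illustration only)
    P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.6, 0.4]])]
    g = [np.array([[1.0, 2.0], [0.0, 3.0]]), np.array([[2.0, 0.5], [1.0, 1.0]])]
    Q, mu = enhanced_policy_iteration(P, g, alpha=0.9)

Because J is frozen between improvement steps, each evaluation sweep contracts toward the Q-factors of a stopping problem rather than toward the solution of the policy's linear system, which is what makes the inexact, asynchronous implementations mentioned in the abstract possible.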
Pages: 1409-1416
Number of pages: 8
Related Papers (50 in total)
  • [31] An Efficient Policy Iteration Algorithm for Dynamic Programming Equations
    Alla, Alessandro
    Falcone, Maurizio
    Kalise, Dante
    SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2015, 37 (01): A181 - A200
  • [32] Convergence of Policy Iteration in Contracting Dynamic Programming
    PUTERMAN, ML
    ADVANCES IN APPLIED PROBABILITY, 1978, 10 (02) : 312 - 312
  • [33] Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems
    Lee, Jae Young
    Park, Jin Bae
    Choi, Yoon Ho
    AUTOMATICA, 2012, 48 (11) : 2850 - 2859
  • [34] Final Iteration Convergence Bound of Q-Learning: Switching System Approach
    Lee, Donghwan
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2024, 69 (07) : 4765 - 4772
  • [35] Multi-Agent Reward-Iteration Fuzzy Q-Learning
    Leng, Lixiong
    Li, Jingchen
    Zhu, Jinhui
    Hwang, Kao-Shing
    Shi, Haobin
    INTERNATIONAL JOURNAL OF FUZZY SYSTEMS, 2021, 23 (06) : 1669 - 1679
  • [36] Reinforcement Q-learning based on Multirate Generalized Policy Iteration and Its Application to a 2-DOF Helicopter
    Chun, Tae Yoon
    Park, Jin Bae
    Choi, Yoon Ho
    INTERNATIONAL JOURNAL OF CONTROL AUTOMATION AND SYSTEMS, 2018, 16 (01) : 377 - 386
  • [39] Policy Iteration Q-Learning for Linear Itô Stochastic Systems With Markovian Jumps and Its Application to Power Systems
    Ming, Zhongyang
    Zhang, Huaguang
    Wang, Yingchun
    Dai, Jing
    IEEE TRANSACTIONS ON CYBERNETICS, 2024: 1 - 10
  • [40] Dynamic Choice of State Abstraction in Q-Learning
    Tamassia, Marco
    Zambetta, Fabio
    Raffe, William L.
    Mueller, Florian 'Floyd'
    Li, Xiaodong
    ECAI 2016: 22ND EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2016, 285 : 46 - 54