Efficient data use in incremental actor-critic algorithms

Cited by: 7
Authors
Cheng, Yuhu [1 ]
Feng, Huanting [1 ]
Wang, Xuesong [1 ]
Affiliations
[1] China Univ Min & Technol, Sch Informat & Elect Engn, Xuzhou 221116, Jiangsu, Peoples R China
Funding
Specialized Research Fund for the Doctoral Program of Higher Education; National Natural Science Foundation of China;
Keywords
Actor-critic; Reinforcement learning; Incremental least-squares temporal difference; Recursive least-squares temporal difference; Policy evaluation; Function approximation;
DOI
10.1016/j.neucom.2011.11.034
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Actor-critic (AC) reinforcement learning methods are on-line approximations to policy iteration and are widely applied to large-scale Markov decision problems and high-dimensional learning control problems. To overcome the data inefficiency of incremental AC algorithms based on temporal difference learning (AC-TD), two new incremental AC algorithms (AC-RLSTD and AC-iLSTD) are proposed by applying a recursive least-squares TD (RLSTD(lambda)) algorithm and an incremental least-squares TD (iLSTD(lambda)) algorithm to the Critic evaluation, both of which make more efficient use of data than TD. The Critic estimates a value function using the RLSTD(lambda) or iLSTD(lambda) algorithm, and the Actor updates the policy with a regular gradient driven by the TD error. The improved evaluation efficiency of the Critic contributes to better policy learning performance of the Actor. Simulation results on the learning control of an inverted pendulum and a mountain-car problem illustrate the effectiveness of the two proposed AC algorithms in comparison with the AC-TD algorithm. In addition, the AC-iLSTD with a greedy selection mechanism performs much better than the AC-iLSTD with a random selection mechanism. The simulations also analyze the effect of different eligibility-trace parameter settings on the learning performance of the AC algorithms. Furthermore, it is found that the initial value of the variance matrix in the AC-RLSTD algorithm should be chosen appropriately for each learning problem to obtain good performance. Crown Copyright (C) 2012 Published by Elsevier B.V. All rights reserved.
Pages: 346-354
Number of pages: 9
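
Illustrative sketch: the abstract describes a critic updated by recursive least-squares TD, RLSTD(lambda), whose TD error drives a regular policy-gradient update of the actor. The Python below is a minimal sketch of that structure, not the authors' code; the softmax policy, step size, and feature handling are assumptions added for illustration.

```python
# Minimal sketch (assumed details) of an actor-critic step with an RLSTD(lambda) critic.
import numpy as np

class RLSTDCritic:
    """Linear value-function critic updated by recursive least-squares TD(lambda)."""
    def __init__(self, n_features, gamma=0.99, lam=0.8, p_init=1.0):
        self.gamma, self.lam = gamma, lam
        self.theta = np.zeros(n_features)        # value-function weights
        self.P = p_init * np.eye(n_features)     # "variance" matrix (initial value matters, per the abstract)
        self.z = np.zeros(n_features)            # eligibility trace

    def update(self, phi, reward, phi_next):
        """One recursive least-squares TD update; returns the TD error."""
        self.z = self.gamma * self.lam * self.z + phi
        dphi = phi - self.gamma * phi_next
        Pz = self.P @ self.z
        gain = Pz / (1.0 + dphi @ Pz)            # Kalman-style gain
        td_error = reward + self.gamma * (self.theta @ phi_next) - self.theta @ phi
        self.theta += gain * td_error            # critic weight update
        self.P -= np.outer(gain, dphi @ self.P)  # Sherman-Morrison rank-1 update of P
        return td_error

class SoftmaxActor:
    """Assumed Gibbs (softmax) policy over discrete actions with linear preferences."""
    def __init__(self, n_features, n_actions, alpha=0.05):
        self.w = np.zeros((n_actions, n_features))
        self.alpha = alpha

    def policy(self, phi):
        prefs = self.w @ phi
        prefs -= prefs.max()                     # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def update(self, phi, action, td_error):
        """Regular policy-gradient step scaled by the critic's TD error."""
        p = self.policy(phi)
        grad_log = -np.outer(p, phi)             # d log pi(a|s) / d w
        grad_log[action] += phi
        self.w += self.alpha * td_error * grad_log
```

A driver loop would, at each transition, call `critic.update` to obtain the TD error and pass it to `actor.update`. The AC-iLSTD variant of the paper instead maintains the least-squares statistics incrementally and updates only selected feature dimensions (greedy or random selection), avoiding the full matrix recursion used here.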