Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Cited by: 0
Authors
Ishfaq, Haque [1 ,2 ]
Cui, Qiwen [3 ]
Viet Nguyen [1 ,2 ]
Ayoub, Alex [4 ,5 ]
Yang, Zhuoran [6 ]
Wang, Zhaoran [7 ]
Precup, Doina [1 ,2 ,8 ]
Yang, Lin F. [9 ]
Affiliations
[1] Mila, Montreal, PQ, Canada
[2] McGill Univ, Sch Comp Sci, Montreal, PQ, Canada
[3] Peking Univ, Sch Math Sci, Beijing, Peoples R China
[4] Univ Alberta, Amii, Edmonton, AB, Canada
[5] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
[6] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[7] Northwestern Univ, Ind Engn Management Sci, Evanston, IL 60208 USA
[8] DeepMind, Montreal, PQ, Canada
[9] Univ Calif Los Angeles, Dept Elect & Comp Engn, Los Angeles, CA USA
Keywords
DOI: Not available
CLC Classification Number: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class F, our algorithm achieves a worst-case regret bound of Õ(poly(d_E·H)·√T), where T is the time elapsed, H is the planning horizon, and d_E is the eluder dimension of F. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an Õ(√(d³H³T)) regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.
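The perturbation idea described in the abstract can be illustrated, for the linear setting, with a minimal NumPy sketch: each value-iteration backup fits several ridge regressions on rewards perturbed with i.i.d. Gaussian noise (plus a perturbed regularizer), and the optimistic Q-estimate is the pointwise maximum over the resulting ensemble. The function names, the noise scale sigma, the ridge parameter lam, and the ensemble size M below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def perturbed_regression(phi, targets, lam=1.0, sigma=1.0, rng=None):
    """Ridge regression on targets perturbed with i.i.d. Gaussian noise,
    plus a Gaussian perturbation of the regularizer (the "prior").
    phi: (n, d) feature matrix; targets: (n,) regression targets.
    lam, sigma, and the prior-perturbation scale are illustrative assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = phi.shape
    noisy_targets = targets + sigma * rng.standard_normal(n)      # perturb the data
    prior_noise = np.sqrt(lam) * sigma * rng.standard_normal(d)   # perturb the prior
    A = phi.T @ phi + lam * np.eye(d)
    b = phi.T @ noisy_targets + prior_noise
    return np.linalg.solve(A, b)                                  # weights w, shape (d,)

def optimistic_q(phi_query, ensemble):
    """Optimistic Q-values: pointwise maximum over the ensemble of perturbed fits."""
    return np.max(phi_query @ np.stack(ensemble, axis=1), axis=1)

# One backward-induction (LSVI) step on synthetic data, at the last horizon step
# where Q_{H+1} = 0; for h < H the targets would be r + max_a' Q_{h+1}(s', a').
rng = np.random.default_rng(0)
n, d, M = 200, 5, 10                       # M = number of perturbed regressions
phi = rng.standard_normal((n, d))          # features of visited (s, a) pairs
rewards = rng.standard_normal(n)
targets = rewards                          # plus next-step optimistic values for h < H
ensemble = [perturbed_regression(phi, targets, rng=rng) for _ in range(M)]
q_values = optimistic_q(phi, ensemble)     # optimistic estimates that drive exploration
```

In the full backward-induction loop, the optimistic ensemble at step h would supply the regression targets for step h-1, so exploration comes entirely from the injected noise rather than from an explicit UCB bonus.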
Pages: 10
Related Papers (50 records in total)
  • [31] Parallel reinforcement learning with linear function approximation
    Grounds, Matthew
    Kudenko, Daniel
    ADAPTIVE AGENTS AND MULTI-AGENT SYSTEMS, 2008, 4865 : 60 - 74
  • [32] Safe Reinforcement Learning with Linear Function Approximation
    Amani, Sanae
    Thrampoulidis, Christos
    Yang, Lin F.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [33] The Value Function Polytope in Reinforcement Learning
    Dadashi, Robert
    Taiga, Adrien Ali
    Le Roux, Nicolas
    Schuurmans, Dale
Bellemare, Marc G.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [34] Integrating Symmetry of Environment by Designing Special Basis functions for Value Function Approximation in Reinforcement Learning
    Wang, Guo-fang
    Fang, Zhou
    Li, Bo
    Li, Ping
    2016 14TH INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION, ROBOTICS AND VISION (ICARCV), 2016,
  • [36] Local and soft feature selection for value function approximation in batch reinforcement learning for robot navigation
    Fathinezhad, Fatemeh
    Adibi, Peyman
    Shoushtarian, Bijan
    Chanussot, Jocelyn
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (08): : 10720 - 10745
  • [37] Rethinking Value Function Learning for Generalization in Reinforcement Learning
    Moon, Seungyong
    Lee, JunYeong
    Song, Hyun Oh
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [38] Reinforcement learning with function approximation for cooperative navigation tasks
    Melo, Francisco S.
    Ribeiro, M. Isabel
    2008 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, VOLS 1-9, 2008, : 3321 - +
  • [39] Online Model Selection for Reinforcement Learning with Function Approximation
    Lee, Jonathan N.
    Pacchiano, Aldo
    Muthukumar, Vidya
    Kong, Weihao
    Brunskill, Emma
    24TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS (AISTATS), 2021, 130
  • [40] Reinforcement Learning With Function Approximation for Traffic Signal Control
    Prashanth, L. A.
    Bhatnagar, Shalabh
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2011, 12 (02) : 412 - 421