Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Cited by: 0
Authors
Ishfaq, Haque [1 ,2 ]
Cui, Qiwen [3 ]
Nguyen, Viet [1 ,2 ]
Ayoub, Alex [4 ,5 ]
Yang, Zhuoran [6 ]
Wang, Zhaoran [7 ]
Precup, Doina [1 ,2 ,8 ]
Yang, Lin F. [9 ]
Affiliations
[1] Mila, Montreal, PQ, Canada
[2] McGill Univ, Sch Comp Sci, Montreal, PQ, Canada
[3] Peking Univ, Sch Math Sci, Beijing, Peoples R China
[4] Univ Alberta, Amii, Edmonton, AB, Canada
[5] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
[6] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[7] Northwestern Univ, Ind Engn Management Sci, Evanston, IL 60208 USA
[8] DeepMind, Montreal, PQ, Canada
[9] Univ Calif Los Angeles, Dept Elect & Comp Engn, Los Angeles, CA USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class F, our algorithm achieves a worst-case regret bound of $\widetilde{O}(\mathrm{poly}(d_E H)\sqrt{T})$, where $T$ is the time elapsed, $H$ is the planning horizon and $d_E$ is the eluder dimension of F. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an $\widetilde{O}(\sqrt{d^3 H^3 T})$ regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.
Pages: 10
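As a rough illustration of the mechanism described in the abstract (perturb the regression targets with i.i.d. scalar noise, fit several independently perturbed least-squares solutions, and take their pointwise maximum as an optimistic value estimate), the sketch below shows the idea in the linear setting. It is not code from the paper: the function names (perturbed_lsvi_step, optimistic_q) and the parameters M, sigma, and lam are assumptions made for illustration, and the actual LSVI-PHE algorithm prescribes specific noise scales and further details omitted here.

```python
import numpy as np

def perturbed_lsvi_step(Phi, targets, M=10, sigma=1.0, lam=1.0, rng=None):
    """Hypothetical sketch of one perturbed least-squares regression step.

    Phi     : (n, d) feature matrix of visited state-action pairs
    targets : (n,) regression targets, e.g. r + max_a Q_{h+1}(s', a)
    M       : ensemble size (assumed parameter, not from the record)
    sigma   : std of the i.i.d. scalar noise added to each target (assumed)
    lam     : ridge regularization strength (assumed)
    Returns an (M, d) array of perturbed ridge-regression solutions.
    """
    rng = rng or np.random.default_rng()
    n, d = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(d)               # regularized Gram matrix
    weights = []
    for _ in range(M):
        noise = sigma * rng.standard_normal(n)       # i.i.d. scalar noise per sample
        w = np.linalg.solve(A, Phi.T @ (targets + noise))
        weights.append(w)
    return np.stack(weights)

def optimistic_q(phi_sa, weights):
    """Optimistic Q estimate at one (s, a): max over the perturbed ensemble."""
    return np.max(weights @ phi_sa)
```

Run backward over the planning horizon (h = H, ..., 1) with greedy action selection against the ensemble-max Q, this sketch mirrors the optimistic reward sampling idea without any UCB-style bonus term.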
Related papers
50 results in total
  • [1] Efficient exploration through active learning for value function approximation in reinforcement learning
    Akiyama, Takayuki
    Hachiya, Hirotaka
    Sugiyama, Masashi
    NEURAL NETWORKS, 2010, 23 (05) : 639 - 648
  • [2] Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning
    Akiyama, Takayuki
    Hachiya, Hirotaka
    Sugiyama, Masashi
    21ST INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-09), PROCEEDINGS, 2009, : 980 - 985
  • [3] Toward General Function Approximation in Nonstationary Reinforcement Learning
    Feng, Songtao
    Yin, Ming
    Huang, Ruiquan
    Wang, Yu-Xiang
    Yang, Jing
    Liang, Yingbin
    IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, 2024, 5 : 190 - 206
  • [4] CBR for state value function approximation in reinforcement learning
    Gabel, T
    Riedmiller, M
    CASE-BASED REASONING RESEARCH AND DEVELOPMENT, PROCEEDINGS, 2005, 3620 : 206 - 221
  • [5] Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation
    Foster, Dylan J.
    Krishnamurthy, Akshay
    Simchi-Levi, David
    Xu, Yunzong
    CONFERENCE ON LEARNING THEORY, VOL 178, 2022, 178
  • [6] Distributed Value Function Approximation for Collaborative Multiagent Reinforcement Learning
    Stankovic, Milos S.
    Beko, Marko
    Stankovic, Srdjan S.
    IEEE TRANSACTIONS ON CONTROL OF NETWORK SYSTEMS, 2021, 8 (03): : 1270 - 1280
  • [7] A grey approximation approach to state value function in reinforcement learning
    Hwang, Kao-Shing
    Chen, Yu-Jen
    Lee, Guar-Yuan
    2007 IEEE INTERNATIONAL CONFERENCE ON INTEGRATION TECHNOLOGY, PROCEEDINGS, 2007, : 379 - +
  • [8] Parameterized Indexed Value Function for Efficient Exploration in Reinforcement Learning
    Tan, Tian
    Xiong, Zhihan
    Dwaracherla, Vikranth R.
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 5948 - 5955
  • [9] Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension
    Wang, Ruosong
    Salakhutdinov, Ruslan
    Yang, Lin F.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [10] Corruption-Robust Offline Reinforcement Learning with General Function Approximation
    Ye, Chenlu
    Yang, Rui
    Gu, Quanquan
    Zhang, Tong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,