Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

Cited by: 0
Authors
Zhang, Weitong [1 ]
Zhou, Dongruo [1 ]
Gu, Quanquan [1 ]
Affiliations
[1] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA 90095 USA
Funding
National Science Foundation (NSF), USA
Keywords
(none listed)
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, the agent operates in two phases. In the exploration phase, it interacts with the environment and collects samples without observing rewards. In the planning phase, it is given a specific reward function and uses the samples collected during exploration to learn a good policy. We propose a new provably efficient algorithm, UCRL-RFE, under the linear mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state. We show that to obtain an ε-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most Õ(H^5 d^2 ε^{-2}) episodes during the exploration phase, where H is the length of an episode and d is the dimension of the feature mapping. We also propose a variant of UCRL-RFE with a Bernstein-type bonus and show that it needs to sample at most Õ(H^4 d(H + d) ε^{-2}) episodes to achieve an ε-optimal policy. By constructing a special class of linear mixture MDPs, we further prove that any reward-free algorithm needs to sample at least Ω̃(H^2 d ε^{-2}) episodes to obtain an ε-optimal policy. Our upper bound matches the lower bound in its dependence on ε, and in its dependence on d when H ≥ d.
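The two-phase protocol described in the abstract can be sketched in code. The following is a minimal illustrative sketch, not the authors' UCRL-RFE algorithm: all names (`phi`, `theta_star`, the uniform exploration policy, the count-based model estimate) are assumptions made for illustration. It shows the linear mixture structure, where the transition kernel is a linear function of a known feature map over (state, action, next state), and the separation between reward-free exploration and reward-aware planning.

```python
import numpy as np

d, H = 3, 4                       # feature dimension, episode length
n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

# Linear mixture MDP assumption: P(s' | s, a) = <theta*, phi(s, a, s')>
# for an unknown parameter theta* and a known feature map phi.
phi = rng.random((n_states, n_actions, n_states, d))
theta_star = rng.random(d)
P = phi @ theta_star                       # shape (S, A, S')
P /= P.sum(axis=2, keepdims=True)          # normalize into a valid kernel

def explore(num_episodes):
    """Exploration phase: collect transitions without observing any reward.
    A uniform-random policy stands in for the algorithm's bonus-driven one."""
    data = []
    for _ in range(num_episodes):
        s = 0
        for _ in range(H):
            a = rng.integers(n_actions)
            s_next = rng.choice(n_states, p=P[s, a])
            data.append((s, a, s_next))
            s = s_next
    return data

def plan(data, reward):
    """Planning phase: the reward function is revealed only now.
    Estimate the model from reward-free data, then run finite-horizon
    value iteration against the revealed reward."""
    counts = np.ones((n_states, n_actions, n_states))  # Laplace smoothing
    for s, a, s_next in data:
        counts[s, a, s_next] += 1
    P_hat = counts / counts.sum(axis=2, keepdims=True)
    V = np.zeros(n_states)
    for _ in range(H):
        Q = reward[:, None] + P_hat @ V    # Q(s, a); reward depends on s only
        V = Q.max(axis=1)
    return V

reward = rng.random(n_states)              # arbitrary reward, given post hoc
V = plan(explore(200), reward)
```

Because the data are collected before any reward is seen, the same dataset can serve the planning phase for every reward function; the paper's sample-complexity bounds quantify how many exploration episodes suffice for this to yield an ε-optimal policy uniformly over rewards.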
Pages: 12