Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Times cited: 0

Authors:

Zhang, Zihan [1]

Jiang, Yuhang [1]

Zhou, Yuan [2,3]

Ji, Xiangyang [1]

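Affiliations:

[1] Tsinghua University, Department of Automation, Beijing, People's Republic of China

[2] Tsinghua University, Yau Mathematical Sciences Center, Beijing, People's Republic of China

[3] Tsinghua University, Department of Mathematical Sciences, Beijing, People's Republic of China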

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022 | 2022

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent is required to commit to a time schedule for updating its policy before any interaction begins; this framework is particularly suitable for scenarios where the agent suffers extensively from changing the policy adaptively. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes using $O(H + \log_2\log_2(K))$ batches with confidence parameter $\delta$. To the best of our knowledge, it is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound with $O(H + \log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S, A, H)\sqrt{K})$ regret, the number of batches must be at least $\Omega(H/\log_A(K) + \log_2\log_2(K))$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
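To give a feel for the quantities in the abstract, the following minimal Python sketch evaluates the leading terms of the stated bounds for concrete values of S, A, H, K and delta. It is only an illustration: all hidden constants and the logarithmic factors absorbed by the tilde notation are set to 1 (an assumption), and the function names are hypothetical, not taken from the paper.

```python
import math


def batch_upper_bound(H, K):
    """Leading term of the batch upper bound, H + log2(log2(K)).

    Assumption: the constant hidden by O(.) is taken to be 1.
    """
    return H + math.log2(math.log2(K))


def batch_lower_bound(H, A, K):
    """Leading term of the batch lower bound, H / log_A(K) + log2(log2(K)).

    Assumption: the constant hidden by Omega(.) is taken to be 1.
    """
    return H / math.log(K, A) + math.log2(math.log2(K))


def regret_scale(S, A, H, K, delta):
    """Leading term of the regret bound, sqrt(S * A * H^3 * K * ln(1/delta)).

    Assumption: logarithmic factors hidden by the tilde are ignored.
    """
    return math.sqrt(S * A * H**3 * K * math.log(1.0 / delta))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    S, A, H, K, delta = 20, 4, 10, 2**16, 0.05
    print("batch upper bound term:", batch_upper_bound(H, K))
    print("batch lower bound term:", batch_lower_bound(H, A, K))
    print("regret scale:", regret_scale(S, A, H, K, delta))
```

With these illustrative values, the two batch expressions differ only in the first term: H in the upper bound versus H / log_A(K) in the lower bound, which is the logarithmic gap the abstract refers to.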


Pages: 11
