Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

被引：0

作者：

Zhang, Zihan ^{[1
]}

Jiang, Yuhang ^{[1
]}

Zhou, Yuan ^{[2
,3
]}

Ji, Xiangyang ^{[1
]}

机构：

[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China

[2] Tsinghua Univ, Yau Math Sci Ctr, Beijing, Peoples R China

[3] Tsinghua Univ, Dept Math Sci, Beijing, Peoples R China

来源：

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022 | 2022年

关键词：

D O I：

暂无

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches. The multi-batch reinforcement learning framework, where the agent is required to provide a time schedule to update policy before everything, which is particularly suitable for the scenarios where the agent suffers extensively from changing the policy adaptively. Given a finite-horizon MDP with S states, A actions and planning horizon H, we design a computational efficient algorithm to achieve near-optimal regret of (O) over tilde (root SAH(3)K ln(1/delta))(5) in K episodes using O (H + log(2) log(2)(K)) batches with confidence parameter delta. To our best of knowledge, it is the first (O) over tilde ( root SAH(3)K) regret bound with O(H + log(2) log(2)(K)) batch complexity. Meanwhile, we show that to achieve (O) over tilde (poly(S, A, H)root K) regret, the number of batches is at least Omega(H/log(A)(K) + log(2) log(2)(K)), which matches our upper bound up to logarithmic terms. Our technical contribution are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) an computational efficient algorithm to explore certain directions with an approximated transition model.

引用

页数：11

共 50 条

[41] Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
Jin, Tiancheng
Lancewicki, Tal
Luo, Haipeng
Mansour, Yishay
Rosenberg, Aviv
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
[42] A Bayesian reinforcement learning approach in markov games for computing near-optimal policies
Julio B. Clempner
Annals of Mathematics and Artificial Intelligence, 2023, 91 : 675 - 690
[43] Near-optimal reinforcement learning framework for energy-aware sensor communications
Pandana, C
Liu, KJR
IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2005, 23 (04) : 788 - 797
[44] A Bayesian reinforcement learning approach in markov games for computing near-optimal policies
Clempner, Julio B.
ANNALS OF MATHEMATICS AND ARTIFICIAL INTELLIGENCE, 2023, 91 (05) : 675 - 690
[45] Near-Optimal Provable Uniform Convergence in Offine Policy Evaluation for Reinforcement Learning
Yin, Ming
Bai, Yu
Wang, Yu-Xiang
24TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS (AISTATS), 2021, 130
[46] Near-Optimal Vehicular Crowdsensing Task Allocation Empowered by Deep Reinforcement Learning
Xiang C.-C.
Li Y.-Y.
Feng L.
Chen C.
Guo S.-T.
Yang P.-L.
Jisuanji Xuebao/Chinese Journal of Computers, 2022, 45 (05): : 918 - 934
[47] Near-Optimal Sample Complexity Bounds for Constrained MDPs
Vaswani, Sharan
Yang, Lin F.
Szepesvari, Csaba
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
[48] Near-optimal lower bounds on the multi-party communication complexity of set disjointness
Chakrabarti, A
Khot, S
Sun, XD
18TH IEEE ANNUAL CONFERENCE ON COMPUTATIONAL COMPLEXITY, PROCEEDINGS, 2003, : 107 - 117
[49] Near-Optimal Complexity Bounds for Fragments of the Skolem Problem
Akshay, S.
Balaji, Nikhil
Murhekar, Aniket
Varma, Rohith
Vyas, Nikhil
37TH INTERNATIONAL SYMPOSIUM ON THEORETICAL ASPECTS OF COMPUTER SCIENCE (STACS 2020), 2020, 154
[50] Safe Learning for Near-Optimal Scheduling
Busatto-Gaston, Damien
Chakraborty, Debraj
Guha, Shibashis
Perez, Guillermo A.
Raskin, Jean-Francois
QUANTITATIVE EVALUATION OF SYSTEMS (QEST 2021), 2021, 12846 : 235 - 254

← 1 2 3 4 5 →