Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Cited by: 0
Authors
Zhang, Zihan [1 ]
Jiang, Yuhang [1 ]
Zhou, Yuan [2 ,3 ]
Ji, Xiangyang [1 ]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[2] Tsinghua Univ, Yau Math Sci Ctr, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Math Sci, Beijing, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent must commit to a schedule of policy updates in advance, which is particularly suitable for scenarios where adaptively changing the policy is costly. Given a finite-horizon MDP with $S$ states, $A$ actions, and planning horizon $H$, we design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3 K \ln(1/\delta)})$ over $K$ episodes using $O(H + \log_2\log_2(K))$ batches, where $\delta$ is the confidence parameter. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3 K})$ regret bound with $O(H + \log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S, A, H)\sqrt{K})$ regret, the number of batches must be at least $\Omega(H/\log_A(K) + \log_2\log_2(K))$, which matches our upper bound up to logarithmic factors. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
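The doubly logarithmic term in the batch count is characteristic of a doubling-of-exponents batch grid, as used in batched bandits. The following display is a sketch for intuition under that assumption about the schedule's form, not necessarily the paper's exact construction. Consider batch endpoints

\[ t_i = \big\lfloor K^{1 - 2^{-i}} \big\rfloor, \qquad i = 1, 2, \dots, m. \]

Taking $m = \lceil \log_2 \log_2 K \rceil$ gives $2^{-m} \log_2 K \le 1$, hence $K^{2^{-m}} \le 2$ and

\[ t_m \;\ge\; K^{1 - 2^{-m}} - 1 \;=\; K \cdot K^{-2^{-m}} - 1 \;\ge\; \tfrac{K}{2} - 1, \]

i.e., after $\lceil \log_2 \log_2 K \rceil$ batches the grid already covers half of the $K$ episodes, and $O(1)$ further batches finish the rest. The remaining $O(H)$ batches in the bound are presumably devoted to the stagewise exploration across the planning horizon described in the technical contributions.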
Pages: 11