Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Cited by: 0
Authors
Zhang, Zihan [1]
Jiang, Yuhang [1]
Zhou, Yuan [2, 3]
Ji, Xiangyang [1]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[2] Tsinghua Univ, Yau Math Sci Ctr, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Math Sci, Beijing, Peoples R China
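Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes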
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent is required to fix, before any interaction, a time schedule for updating its policy. This framework is particularly suitable for scenarios where changing the policy adaptively is costly for the agent. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes using $O(H + \log_2\log_2(K))$ batches with confidence parameter $\delta$. To the best of our knowledge, it is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound with $O(H + \log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S, A, H)\sqrt{K})$ regret, the number of batches is at least $\Omega(H/\log_A(K) + \log_2\log_2(K))$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
Pages: 11
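To give a feel for the batch-complexity expressions quoted in the abstract, the following minimal Python sketch evaluates the growth terms H + log2 log2(K) (upper bound) and H / log_A(K) + log2 log2(K) (lower bound) for concrete values. It is an illustration only: the function names are hypothetical, the constants hidden by the O and Omega notation are assumed to be 1, and nothing here reproduces the paper's algorithm.

```python
import math


def batch_upper_growth(H: int, K: int) -> float:
    """Growth term H + log2(log2(K)) of the upper bound.

    Assumption: the constant hidden by O(.) is taken to be 1.
    """
    if K <= 2:
        raise ValueError("K must exceed 2 so that log2(log2(K)) is positive")
    return H + math.log2(math.log2(K))


def batch_lower_growth(H: int, A: int, K: int) -> float:
    """Growth term H / log_A(K) + log2(log2(K)) of the lower bound.

    Assumption: the constant hidden by Omega(.) is taken to be 1.
    """
    if A < 2 or K <= 2:
        raise ValueError("need A >= 2 and K > 2")
    log_A_K = math.log2(K) / math.log2(A)  # change of base: log_A(K)
    return H / log_A_K + math.log2(math.log2(K))


if __name__ == "__main__":
    H, A, K = 10, 2, 2 ** 16
    print(batch_upper_growth(H, K))
    print(batch_lower_growth(H, A, K))
```

With H = 10, A = 2 and K = 2^16, the two expressions evaluate to 14 and 4.625. The gap between them comes from the 1/log_A(K) factor on H, which is the logarithmic term up to which the abstract says the bounds match.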