Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Cited by: 0
Authors
Zhang, Zihan [1]
Jiang, Yuhang [1]
Zhou, Yuan [2,3]
Ji, Xiangyang [1]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[2] Tsinghua Univ, Yau Math Sci Ctr, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Math Sci, Beijing, Peoples R China
Keywords: (none listed)
DOI: Not available
CLC Number: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent must commit to a time schedule for policy updates before learning begins; this setting is particularly suitable for scenarios in which the agent incurs a substantial cost whenever the policy is changed adaptively. Given a finite-horizon MDP with $S$ states, $A$ actions, and planning horizon $H$, we design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3 K \ln(1/\delta)})$ over $K$ episodes using $O(H + \log_2 \log_2(K))$ batches, where $\delta$ is the confidence parameter. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3 K})$ regret bound achieved with $O(H + \log_2 \log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S, A, H)\sqrt{K})$ regret, the number of batches must be at least $\Omega(H/\log_A(K) + \log_2 \log_2(K))$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme for exploring the unlearned states; 2) a computationally efficient algorithm for exploring certain directions under an approximated transition model.
Pages: 11
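The $O(H + \log_2 \log_2(K))$ batch complexity in the abstract implies that the entire policy-update schedule can be fixed before the first episode. As a rough illustration (not the paper's construction), the Python sketch below builds such a pre-committed schedule by combining $H$ single-episode warm-up batches with the standard doubling-exponent grid $t_i = K^{1-2^{-i}}$ from the batched-bandit literature; the function name and details are hypothetical.

```python
import math

def batch_schedule(K: int, H: int) -> list[int]:
    """Pre-committed batch endpoints covering K episodes.

    Illustrative only: H single-episode warm-up batches (mirroring the
    O(H) term of the batch complexity), followed by the standard
    batched-bandit grid t_i = K**(1 - 2**(-i)), which reaches K after
    roughly log2(log2(K)) further batches. Assumes H >= 1 and K > H.
    """
    M = max(1, math.ceil(math.log2(math.log2(K))))  # ~log2 log2 K grid points
    endpoints = list(range(1, H + 1))               # O(H) warm-up batches
    for i in range(1, M + 1):
        t = int(K ** (1.0 - 2.0 ** (-i)))
        if t > endpoints[-1]:                       # keep endpoints increasing
            endpoints.append(t)
    if endpoints[-1] < K:
        endpoints.append(K)                         # final batch ends at K
    return endpoints

# For K = 1_000_000 and H = 10 the schedule needs only
# 10 + ceil(log2(log2(K))) + 1 = 16 batches.
print(len(batch_schedule(1_000_000, 10)), batch_schedule(1_000_000, 10))
```

The grid stops after $M \approx \log_2 \log_2(K)$ points because $t_M = K^{1-2^{-M}} \ge K^{1-1/\log_2 K} = K/2$, so a single final batch suffices to reach episode $K$.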