Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Authors
Zhang, Zihan [1]
Jiang, Yuhang [1]
Zhou, Yuan [2, 3]
Ji, Xiangyang [1]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[2] Tsinghua Univ, Yau Math Sci Ctr, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Math Sci, Beijing, Peoples R China
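Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract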
In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent is required to commit to a time schedule for policy updates before the interaction begins, which makes the framework particularly suitable for scenarios where adaptively changing the policy is costly. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3 K \ln(1/\delta)})$ in $K$ episodes using $O(H + \log_2 \log_2(K))$ batches with confidence parameter $\delta$. To the best of our knowledge, it is the first $\tilde{O}(\sqrt{SAH^3 K})$ regret bound with $O(H + \log_2 \log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S, A, H)\sqrt{K})$ regret, the number of batches is at least $\Omega(H/\log_A(K) + \log_2 \log_2(K))$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
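To give some intuition for where a log2 log2(K) batch count can come from, the sketch below builds a fixed update schedule whose batch end points follow t_i = ceil(K^(1 - 2^(-i))), a grid commonly used in the batched-bandit literature. This is an illustrative assumption, not the schedule from the paper: the function name `batch_grid` is hypothetical, and the additive H term in the batch complexity is not modeled. On this grid, t_i / sqrt(t_(i-1)) stays close to sqrt(K) for the interior batches, which is why about log2 log2(K) batches are compatible with sqrt(K)-type regret.

```python
import math
from typing import List


def batch_grid(K: int) -> List[int]:
    """Return increasing batch end points, the last one equal to K.

    Hypothetical helper: uses the grid t_i = ceil(K^(1 - 2^(-i))),
    an assumption borrowed from batched bandits, not the paper's schedule.
    """
    if K < 1:
        raise ValueError("K must be a positive integer")
    if K <= 2:
        # log2(log2(K)) is not positive here; a single batch suffices.
        return [K]

    # Number of batches: about log2(log2(K)), plus one final batch ending at K.
    num_batches = math.ceil(math.log2(math.log2(K))) + 1

    ends = set()
    for i in range(1, num_batches):
        # Small tolerance guards against floating-point overshoot before ceil.
        t = math.ceil(K ** (1.0 - 2.0 ** (-i)) - 1e-9)
        ends.add(min(max(t, 1), K))
    ends.add(K)  # the schedule always ends at episode K

    # The set removes coinciding end points (possible for small K).
    return sorted(ends)


if __name__ == "__main__":
    K = 2 ** 16
    grid = batch_grid(K)
    print("number of batches:", len(grid))
    print("batch end points:", grid)
```

The whole schedule is computed from K alone, before any episode is played, which mirrors the requirement that the agent fixes its policy-update times in advance.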
Pages: 11
Related papers
50 in total
  • [1] Near-optimal Regret Bounds for Reinforcement Learning
    Jaksch, Thomas
    Ortner, Ronald
    Auer, Peter
    JOURNAL OF MACHINE LEARNING RESEARCH, 2010, 11 : 1563 - 1600
  • [2] Near-optimal regret bounds for reinforcement learning
    Jaksch, Thomas
    Ortner, Ronald
    Auer, Peter
    Journal of Machine Learning Research, 2010, 11 : 1563 - 1600
  • [3] Near-Optimal Regret Bounds for Thompson Sampling
    Agrawal, Shipra
    Goyal, Navin
    JOURNAL OF THE ACM, 2017, 64 (05)
  • [4] Model-Free Nonstationary Reinforcement Learning: Near-Optimal Regret and Applications in Multiagent Reinforcement Learning and Inventory Control
    Mao, Weichao
    Zhang, Kaiqing
    Zhu, Ruihao
    Simchi-Levi, David
    Basar, Tamer
    MANAGEMENT SCIENCE, 2024,
  • [5] Near-Optimal No-Regret Learning in General Games
    Daskalakis, Constantinos
    Fishelson, Maxwell
    Golowich, Noah
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [6] Kernelized Reinforcement Learning with Order Optimal Regret Bounds
    Vakili, Sattar
    Olkhovskaya, Julia
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [7] Near-optimal Per-Action Regret Bounds for Sleeping Bandits
    Quan Nguyen
    Mehta, Nishant A.
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [8] Collaborative Linear Bandits with Adversarial Agents: Near-Optimal Regret Bounds
    Mitra, Aritra
    Adibi, Arman
    Pappas, George J.
    Hassani, Hamed
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [9] Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret
    Fei, Yingjie
    Yang, Zhuoran
    Chen, Yudong
    Wang, Zhaoran
    Xie, Qiaomin
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [10] Near-Optimal Reinforcement Learning in Polynomial Time
    Michael Kearns
    Satinder Singh
    Machine Learning, 2002, 49 : 209 - 232