Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

Cited by: 0
Authors
Zhao, Heyang [1 ]
He, Jiafan [1 ]
Zhou, Dongruo [1 ]
Zhang, Tong [2 ,3 ]
Gu, Quanquan [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
[2] Google Res, Mountain View, CA USA
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
US National Science Foundation
Keywords
Linear bandits; reinforcement learning; instance-dependent regret;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolate between the regret for the worst-case regime and that for the deterministic-reward regime. However, these algorithms are either computationally intractable or unable to handle unknown noise variance. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of the noise and achieves an $\tilde{O}\big(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d\big)$ regret, where $\sigma_k^2$ is the variance of the noise at round $k$, $d$ is the dimension of the contexts, and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set, enabled by a new Freedman-type concentration inequality for self-normalized martingales, and a multi-layer structure that stratifies the context vectors into layers with different uniform upper bounds on the uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent, horizon-free regret bound that gracefully reduces to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax-optimal algorithms for linear mixture MDPs, our algorithm requires neither explicit variance estimation of the transition probabilities nor high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper have independent value for general online decision-making problems.
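To see the interpolation claimed in the abstract, one can plug the two extreme noise regimes into the stated bound; the following worked derivation is an illustration added here, not part of the record:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
The stated variance-adaptive bound is
\[
  \tilde{O}\Bigl( d \sqrt{\textstyle\sum_{k=1}^{K} \sigma_k^2} \;+\; d \Bigr).
\]
\emph{Worst-case regime} ($\sigma_k^2 \le 1$ for all $k$):
$\sum_{k=1}^{K} \sigma_k^2 \le K$, so the bound becomes
$\tilde{O}\bigl(d\sqrt{K} + d\bigr) = \tilde{O}\bigl(d\sqrt{K}\bigr)$,
the familiar worst-case rate for linear bandits (up to logarithmic factors).

\emph{Deterministic-reward regime} ($\sigma_k^2 = 0$ for all $k$):
the sum vanishes and the regret collapses to $\tilde{O}(d)$,
independent of $K$, i.e., nearly constant regret.
\end{document}
```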
Pages: 44