Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

Cited: 0
Authors
Zhao, Heyang [1 ]
He, Jiafan [1 ]
Zhou, Dongruo [1 ]
Zhang, Tong [2 ,3 ]
Gu, Quanquan [1 ]
Affiliations
[1] University of California, Los Angeles, CA 90095, USA
[2] Google Research, Mountain View, CA, USA
[3] The Hong Kong University of Science and Technology, Hong Kong, China
Funding
U.S. National Science Foundation
Keywords
Linear bandits; reinforcement learning; instance-dependent regret
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolate between the worst-case regime and the deterministic-reward regime. However, these algorithms are either computationally intractable or unable to handle unknown noise variance. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of the noise and achieves an $\tilde{O}\big(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d\big)$ regret, where $\sigma_k^2$ is the variance of the noise in round $k$, $d$ is the dimension of the contexts, and $K$ is the total number of rounds. Our results are based on an adaptive, variance-aware confidence set, enabled by a new Freedman-type concentration inequality for self-normalized martingales, and a multi-layer structure that stratifies the context vectors into layers with different uniform upper bounds on their uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs that achieves a problem-dependent, horizon-free regret bound, which gracefully reduces to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax-optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transition probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper are of independent value for general online decision-making problems.
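Note that the stated bound interpolates as claimed: when the noise is worst-case bounded ($\sigma_k^2 \le 1$ for all $k$), $\sqrt{\sum_{k=1}^{K}\sigma_k^2} \le \sqrt{K}$ recovers the familiar $\tilde{O}(d\sqrt{K})$ rate, while for deterministic rewards ($\sigma_k = 0$ for all $k$) the bound collapses to $\tilde{O}(d)$.

Below is a minimal sketch, in Python, of the multi-layer stratification idea described in the abstract; it is not the authors' algorithm. The layer count, regularizer, and confidence radius (n_layers, lam, beta), the dyadic routing thresholds, and the arm-selection rule are all illustrative assumptions for exposition.

```python
import numpy as np

class MultiLayerLinearBandit:
    """Sketch: per-layer ridge regression with uncertainty-based routing.

    Each layer keeps its own covariance matrix; an observed context is
    assigned to a layer by its elliptic-norm uncertainty, so contexts
    inside one layer share a uniform upper bound on uncertainty.
    """

    def __init__(self, d, n_layers=8, lam=1.0, beta=1.0):
        self.n_layers, self.beta = n_layers, beta
        self.Sigma = [lam * np.eye(d) for _ in range(n_layers)]  # per-layer covariance
        self.b = [np.zeros(d) for _ in range(n_layers)]          # per-layer response sums

    def _theta(self, ell):
        # Ridge estimate for layer ell: Sigma_ell^{-1} b_ell.
        return np.linalg.solve(self.Sigma[ell], self.b[ell])

    def _uncertainty(self, x, ell):
        # Elliptic norm ||x||_{Sigma_ell^{-1}}, used for the bonus and for routing.
        return float(np.sqrt(x @ np.linalg.solve(self.Sigma[ell], x)))

    def select(self, contexts):
        # Optimistic choice: estimated reward plus a confidence bonus
        # (layer 0 only here, for brevity).
        theta0 = self._theta(0)
        scores = [x @ theta0 + self.beta * self._uncertainty(x, 0) for x in contexts]
        return int(np.argmax(scores))

    def update(self, x, reward):
        # Route x to the first layer whose dyadic uncertainty threshold it
        # exceeds; the last layer absorbs everything smaller.
        for ell in range(self.n_layers):
            if self._uncertainty(x, ell) >= 2.0 ** (-ell - 1) or ell == self.n_layers - 1:
                self.Sigma[ell] += np.outer(x, x)
                self.b[ell] += reward * x
                return ell
```

For example, with d = 4 one would call `bandit = MultiLayerLinearBandit(d=4)`, pick an arm via `i = bandit.select(contexts)`, and feed back `bandit.update(contexts[i], reward)` each round; the variance-adaptive confidence radii of the actual algorithm are replaced here by a fixed beta.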
Pages: 44
Related Papers
50 records in total
  • [31] Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning
    Dann, Chris
    Marinov, Teodor V.
    Mohri, Mehryar
    Zimmert, Julian
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [32] Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions
    Takemura, Kei
    Ito, Shinji
    Hatano, Daisuke
    Sumita, Hanna
    Fukunaga, Takuro
    Kakimura, Naonori
    Kawarabayashi, Ken-ichi
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 9791 - 9798
  • [33] Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection
    Papini, Matteo
    Tirinzoni, Andrea
    Pacchiano, Aldo
Restelli, Marcello
    Lazaric, Alessandro
    Pirotta, Matteo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [34] Distributed Lifelong Reinforcement Learning with Sub-Linear Regret
    Tutunov, Rasul
    El-Zini, Julia
    Bou-Ammar, Haitham
    Jadbabaie, Ali
2017 IEEE 56TH ANNUAL CONFERENCE ON DECISION AND CONTROL (CDC), 2017
  • [35] Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies
    Efroni, Yonathan
    Merlis, Nadav
    Ghavamzadeh, Mohammad
    Mannor, Shie
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NEURIPS 2019), 2019, 32
  • [36] Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds
    Agrawal, Shipra
    Jia, Randy
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [37] Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
    Zhang, Zihan
    Jiang, Yuhang
    Zhou, Yuan
    Ji, Xiangyang
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [38] Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds
    Agrawal, Shipra
    Jia, Randy
    MATHEMATICS OF OPERATIONS RESEARCH, 2023, 48 (01) : 363 - 392
  • [39] Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
    Liang, Hao
    Luo, Zhi-Quan
    JOURNAL OF MACHINE LEARNING RESEARCH, 2024, 25
  • [40] Beyond No Regret: Instance-Dependent PAC Reinforcement Learning
    Wagenmaker, Andrew
    Simchowitz, Max
    Jamieson, Kevin
    CONFERENCE ON LEARNING THEORY, VOL 178, 2022, 178 : 358 - 418