Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

Cited by: 0
Authors
Zhao, Heyang [1]
He, Jiafan [1]
Zhou, Dongruo [1]
Zhang, Tong [2,3]
Gu, Quanquan [1]
Affiliations
[1] University of California, Los Angeles, CA 90095, USA
[2] Google Research, Mountain View, CA, USA
[3] Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Funding
U.S. National Science Foundation
Keywords
Linear bandits; reinforcement learning; instance-dependent regret
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolate between the worst-case regime and the deterministic-reward regime. However, these algorithms are either computationally intractable or unable to handle unknown variance of the noise. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of the noise and achieves an $\tilde{O}\big(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d\big)$ regret, where $\sigma_k^2$ is the variance of the noise at round $k$, $d$ is the dimension of the contexts, and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set, enabled by a new Freedman-type concentration inequality for self-normalized martingales, and a multi-layer structure that stratifies the context vectors into layers with different uniform upper bounds on the uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that gracefully reduces to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transition probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper are of independent value for general online decision-making problems.
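To make the interpolation concrete, the two endpoint regimes of the bound above follow by direct substitution (assuming, as is standard in this literature, that the per-round variances are normalized so that $\sigma_k^2 \le 1$):

    \tilde{O}\Big(d\sqrt{\textstyle\sum_{k=1}^{K}\sigma_k^2} + d\Big) =
    \begin{cases}
        \tilde{O}(d\sqrt{K}) & \text{worst case } (\sigma_k^2 = 1 \text{ for all } k),\\
        \tilde{O}(d)         & \text{deterministic rewards } (\sigma_k^2 = 0 \text{ for all } k).
    \end{cases}

The following minimal Python sketch illustrates the variance-weighted ridge regression idea underlying variance-aware confidence sets in this line of work (cf. Zhou et al., 2021a): rounds with lower noise variance receive larger weight in the estimate, which is what lets the regret scale with $\sum_k \sigma_k^2$ rather than with $K$. This is an illustrative sketch under simplified assumptions (known variances, no layering); the paper's actual algorithm additionally handles unknown variances via a multi-layer stratification of the contexts, and all names below are hypothetical.

    import numpy as np

    def weighted_ridge_estimate(contexts, rewards, variances, lam=1.0):
        # Variance-weighted ridge regression: round k is weighted by
        # 1 / sigma_k^2, so low-noise observations dominate the estimate.
        # Hypothetical helper for illustration; not the paper's algorithm.
        d = contexts.shape[1]
        A = lam * np.eye(d)                # regularized weighted Gram matrix
        b = np.zeros(d)
        for x, r, s2 in zip(contexts, rewards, variances):
            w = 1.0 / max(s2, 1e-6)        # inverse-variance weight, clipped
            A += w * np.outer(x, x)
            b += w * r * x
        theta_hat = np.linalg.solve(A, b)  # weighted least-squares solution
        return theta_hat, A

    # Toy usage: d = 3 contexts, heteroscedastic Gaussian reward noise.
    rng = np.random.default_rng(0)
    K, d = 200, 3
    theta_star = np.array([0.5, -0.3, 0.8])
    X = rng.normal(size=(K, d))
    variances = rng.uniform(0.01, 1.0, size=K)         # per-round sigma_k^2
    y = X @ theta_star + rng.normal(size=K) * np.sqrt(variances)
    theta_hat, _ = weighted_ridge_estimate(X, y, variances)
    print("estimate:", np.round(theta_hat, 3))         # close to theta_star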
Pages: 44
Related Papers
50 records in total
  • [21] Regret Bounds for Reinforcement Learning via Markov Chain Concentration
    Ortner, Ronald
    JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2020, 67 : 115 - 128
  • [22] Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning
    Moradipari, Ahmadreza
    Pedramfar, Mohammad
    Zini, Modjtaba Shokrian
    Aggarwal, Vaneet
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [23] Scalable Representation Learning in Linear Contextual Bandits with Constant Regret Guarantees
    Tirinzoni, Andrea
    Papini, Matteo
    Touati, Ahmed
    Lazaric, Alessandro
    Pirotta, Matteo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [25] Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs
    Kim, Yeoneung
    Yang, Insoon
    Jun, Kwang-Sung
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [26] Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation
    Huang, Jiayi
    Zhong, Han
    Wang, Liwei
    Yang, Lin F.
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [27] Logarithmic Regret for Reinforcement Learning with Linear Function Approximation
    He, Jiafan
    Zhou, Dongruo
    Gu, Quanquan
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [28] On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation
    Nguyen-Tang, Thanh
    Yin, Ming
    Gupta, Sunil
    Venkatesh, Svetha
    Arora, Raman
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9310 - 9318
  • [29] Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
    Dann, Christoph
    Lattimore, Tor
    Brunskill, Emma
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [30] Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems
    Ziemann, Ingvar
    Sandberg, Henrik
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2025, 70 (01) : 159 - 173