Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

Times Cited: 0
Authors
Zhao, Heyang [1 ]
He, Jiafan [1 ]
Zhou, Dongruo [1 ]
Zhang, Tong [2 ,3 ]
Gu, Quanquan [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
[2] Google Res, Mountain View, CA USA
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
US National Science Foundation;
Keywords
Linear bandits; reinforcement learning; instance-dependent regret;
DOI
Not available
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolate between the regret for the worst-case regime and that for the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown variance of the noise. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of the noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d)$ regret, where $\sigma_k^2$ is the variance of the noise at round $k$, $d$ is the dimension of the contexts, and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set enabled by a new Freedman-type concentration inequality for self-normalized martingales, together with a multi-layer structure that stratifies the context vectors into layers with different uniform upper bounds on the uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that gracefully reduces to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transition probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper can have independent value for general online decision-making problems.
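To make the variance-aware idea in the abstract concrete, below is a minimal Python sketch of inverse-variance-weighted ridge regression with optimistic (UCB) arm selection, the general mechanism that variance-adaptive linear bandit algorithms build on. It is an illustrative assumption, not the paper's algorithm: the paper's multi-layer stratification, its Freedman-type confidence radius, and its handling of unknown variance are not reproduced here, and the names and constants (`WeightedRidgeUCB`, `lam`, `beta`, `sigma_min`) are hypothetical.

```python
import numpy as np

class WeightedRidgeUCB:
    """Sketch of variance-weighted ridge regression with UCB selection.

    Simplified illustration only: a fixed confidence radius `beta` stands in
    for the paper's adaptive, Freedman-type confidence set.
    """

    def __init__(self, d, lam=1.0, beta=1.0, sigma_min=1e-2):
        self.A = lam * np.eye(d)    # weighted, regularized Gram matrix
        self.b = np.zeros(d)        # weighted response vector
        self.beta = beta            # confidence radius (assumed constant here)
        self.sigma_min = sigma_min  # floor on the per-round noise scale

    def update(self, x, reward, sigma_hat):
        # Down-weight high-variance rounds: weight = 1 / max(sigma_hat, sigma_min)^2,
        # so low-noise observations shrink the confidence ellipsoid faster.
        w = 1.0 / max(sigma_hat, self.sigma_min) ** 2
        self.A += w * np.outer(x, x)
        self.b += w * reward * x

    def select(self, arms):
        # Optimistic arm selection with the weighted least-squares estimate.
        theta_hat = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        ucb = [x @ theta_hat + self.beta * np.sqrt(x @ A_inv @ x) for x in arms]
        return int(np.argmax(ucb))
```

The inverse-variance weighting is what lets the cumulative confidence width scale with $\sum_{k=1}^{K}\sigma_k^2$ rather than with $K$, which is the source of the variance-dependent shape of the regret bound above.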
Pages: 44