Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Cited by: 1
Authors
Wei, Lai [1 ]
Srivastava, Vaibhav [2 ]
Affiliations
[1] Univ Michigan, Life Sci Inst, Ann Arbor, MI 48109 USA
[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Funding
U.S. National Science Foundation
Keywords
Heavy-tailed distributions; Light-tailed distributions; Stochastic processes; Heuristic algorithms; Control systems; Upper bound; minimax regret; nonstationary multiarmed bandit; upper-confidence bound; variation budget; MULTIARMED BANDIT
DOI
10.1109/OJCSYS.2024.3372929
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem, in which the reward distributions associated with the arms are time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined as the difference between the expected cumulative reward obtained by the policy and that obtained by an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, i.e., the supremum of the regret over the set of reward-distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies using three different approaches, namely periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on the reward distributions and develop robust versions of the proposed policies that handle heavy-tailed rewards while maintaining their performance guarantees.
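For concreteness, the regret and variation-budget notions in the abstract can be written out as follows; the notation (K arms, horizon T, mean rewards \mu_t(i), arm \pi_t chosen by the policy at time t) is assumed here for illustration and may differ from the paper's.

  R^\pi(T) = \sum_{t=1}^{T} \max_{1 \le i \le K} \mu_t(i) - \mathbb{E}\Big[ \sum_{t=1}^{T} \mu_t(\pi_t) \Big],
  \qquad
  \sum_{t=1}^{T-1} \max_{1 \le i \le K} \big| \mu_{t+1}(i) - \mu_t(i) \big| \le V_T.

The worst-case regret is the supremum of R^\pi(T) over all mean-reward sequences satisfying the budget V_T; the minimax regret in this setting is known to scale as (K V_T)^{1/3} T^{2/3} up to constant factors, which is the benchmark the three policies match.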
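As a rough illustration of the sliding-observation-window approach, below is a minimal sketch of a sliding-window UCB policy. The function and parameter names (sliding_window_ucb, reward_fn, window) and the exploration constant are assumptions made for illustration, not the paper's exact algorithm or tuning.

import math
import random

def sliding_window_ucb(reward_fn, K, T, window):
    # Sliding-window UCB sketch: all statistics are computed over the
    # last `window` pulls only, so stale rewards are forgotten as the
    # reward distributions drift over time. Illustrative only.
    history = []  # (arm, reward) pairs, most recent last
    for t in range(1, T + 1):
        recent = history[-window:]
        counts = [0] * K
        sums = [0.0] * K
        for arm, r in recent:
            counts[arm] += 1
            sums[arm] += r
        unseen = [i for i in range(K) if counts[i] == 0]
        if unseen:
            # Play any arm with no observation inside the window.
            arm = random.choice(unseen)
        else:
            # Otherwise play the arm with the largest UCB index,
            # built from the within-window mean and pull count.
            effective_t = min(t, window)
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(effective_t) / counts[i]),
            )
        history.append((arm, reward_fn(t, arm)))
    return history

Periodic resetting would instead clear the history every fixed number of rounds, and a discounted variant would weight past rewards geometrically rather than truncating them; in each case the forgetting parameter (window length, reset period, or discount factor) is tuned to the variation budget to attain the minimax order.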
Pages: 128-142 (15 pages)