Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Cited by: 1

Authors:

Wei, Lai [1]

Srivastava, Vaibhav [2]

Affiliations:

[1] Univ Michigan, Life Sci Inst, Ann Arbor, MI 48109 USA

[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA

Source:

IEEE OPEN JOURNAL OF CONTROL SYSTEMS | 2024 / Vol. 3

Funding:

National Science Foundation (USA);

Keywords:

Heavily-tailed distribution; Stochastic processes; Heuristic algorithms; Control systems; Lightly-tailed distribution; Upper bound; History; Heavy-tailed distributions; minimax regret; nonstationary multiarmed bandit; upper-confidence bound; variation budget; MULTIARMED BANDIT;

DOI:

10.1109/OJCSYS.2024.3372929

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distributions of rewards associated with arms are assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined by the difference in the expected cumulative reward obtained using the policy and using an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies with three different approaches, namely, periodic resetting, sliding observation window, and discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.
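Of the three approaches named in the abstract, the sliding observation window is perhaps the easiest to illustrate. The following is a minimal sketch of a generic sliding-window UCB rule, not the authors' exact policy: the function name `sliding_window_ucb`, the callback `reward_fn`, the exploration constant `c`, and the fixed `window` length are all assumptions made for illustration, and the paper's tuning of the window to the variation budget is not reproduced here.

```python
import math
from collections import deque


def sliding_window_ucb(reward_fn, n_arms, horizon, window, c=2.0):
    """Generic sliding-window UCB sketch (illustrative, not the paper's policy).

    reward_fn(arm, t) is a hypothetical callback returning the reward of
    `arm` at round `t`. Only the most recent `window` observations are used.
    Returns the list of arms chosen at each round.
    """
    history = deque()          # (arm, reward) pairs currently inside the window
    counts = [0] * n_arms      # pulls per arm inside the window
    sums = [0.0] * n_arms      # reward totals per arm inside the window
    choices = []

    for t in range(horizon):
        # Any arm with no observation in the window is sampled first.
        untried = [k for k in range(n_arms) if counts[k] == 0]
        if untried:
            arm = untried[0]
        else:
            n = len(history)   # number of observations in the window

            def index(k):
                # Windowed empirical mean plus an exploration bonus.
                mean = sums[k] / counts[k]
                bonus = math.sqrt(c * math.log(n) / counts[k])
                return mean + bonus

            arm = max(range(n_arms), key=index)

        reward = reward_fn(arm, t)
        history.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward

        # Forget the oldest observation once the window is full.
        if len(history) > window:
            old_arm, old_reward = history.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward

        choices.append(arm)

    return choices
```

Because observations older than `window` rounds are discarded, the index of an arm whose mean has drifted is eventually computed from recent data only. Periodic resetting and discounting pursue the same goal by restarting the statistics or by down-weighting old rewards, respectively.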


Pages: 128-142

Page count: 15

50 entries in total

[11] KL-UCB-Switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints
Garivier, Aurélien
Hadiji, Hédi
Ménard, Pierre
Stoltz, Gilles
Journal of Machine Learning Research, 2022, 23
[12] Thresholding Bandits with Augmented UCB
Mukherjee, Subhojyoti
Purushothama, Naveen Kolar
Sudarsanam, Nandan
Ravindran, Balaraman
PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 2515 - 2521
[13] A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits
Abbasi-Yadkori, Yasin
György, András
Lazic, Nevena
JOURNAL OF MACHINE LEARNING RESEARCH, 2023, 24
[14] CRIMED: Lower and Upper Bounds on Regret for Bandits with Unbounded Stochastic Corruption
Agrawal, Shubhada
Mathieu, Timothee
Basu, Debabrota
Maillard, Odalric-Ambrym
INTERNATIONAL CONFERENCE ON ALGORITHMIC LEARNING THEORY, VOL 237, 2024, 237
[15] Regret of Queueing Bandits
Krishnasamy, Subhashini
Sen, Rajat
Johari, Ramesh
Shakkottai, Sanjay
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29
[16] Stochastic regret minimization for revenue management problems with nonstationary demands
Zhang, Huanan
Shi, Cong
Qin, Chao
Hua, Cheng
NAVAL RESEARCH LOGISTICS, 2016, 63 (06) : 433 - 448
[17] Efficient Kernel UCB for Contextual Bandits
Zenati, Houssam
Bietti, Alberto
Diemert, Eustache
Mairal, Julien
Martin, Matthieu
Gaillard, Pierre
INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151 : 5689 - 5720
[18] Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs
Xue, Bo
Wang, Guanghui
Wang, Yimu
Zhang, Lijun
PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 2936 - 2942
[19] PARADOX OF MINIMAX REGRET
BECK, N
AMERICAN POLITICAL SCIENCE REVIEW, 1975, 69 (03) : 918 - 918
[20] UCB REVISITED: IMPROVED REGRET BOUNDS FOR THE STOCHASTIC MULTI-ARMED BANDIT PROBLEM
Auer, Peter
Ortner, Ronald
PERIODICA MATHEMATICA HUNGARICA, 2010, 61 (1-2) : 55 - 65
