Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Cited by: 1
Authors
Wei, Lai [1]
Srivastava, Vaibhav [2]
Affiliations
[1] Univ Michigan, Life Sci Inst, Ann Arbor, MI 48109 USA
[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Funding
U.S. National Science Foundation;
Keywords
Heavily-tailed distribution; Stochastic processes; Heuristic algorithms; Control systems; Lightly-tailed distribution; Upper bound; History; Heavy-tailed distributions; minimax regret; nonstationary multiarmed bandit; upper-confidence bound; variation budget; MULTIARMED BANDIT;
DOI
10.1109/OJCSYS.2024.3372929
CLC Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem, in which the reward distributions associated with the arms are time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is the difference between the expected cumulative reward obtained by the policy and by an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, i.e., the supremum of the regret over the set of reward-distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies using three different approaches, namely periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achievable by any policy. We also relax the sub-Gaussian assumption on the reward distributions and develop robust versions of the proposed policies that handle heavy-tailed reward distributions while maintaining their performance guarantees.
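The three policy families in the abstract share one template and differ only in how stale observations are discarded: periodic resetting clears all statistics on a fixed schedule, the sliding window keeps only the most recent plays, and discounting geometrically down-weights old rewards. As an illustration, below is a minimal Python sketch of the sliding-window variant, not the authors' exact algorithm: it assumes bounded rewards, and the window length and exploration constant alpha are placeholder parameters (the paper's analysis ties the window to the horizon and the variation budget to obtain order-optimal minimax regret).

import numpy as np

def sliding_window_ucb(reward_fn, n_arms, horizon, window, alpha=2.0):
    """Sliding-window UCB sketch for a nonstationary bandit.

    reward_fn(arm, t) -> float draws a reward for `arm` at time t.
    Only the last `window` plays enter the arm statistics, so the
    policy can track mean rewards that drift over time.
    `alpha` is an illustrative exploration constant.
    """
    history = []                       # (arm, reward) pairs within the window
    picks = np.empty(horizon, dtype=int)
    for t in range(horizon):
        counts = np.zeros(n_arms)
        sums = np.zeros(n_arms)
        for arm, r in history:
            counts[arm] += 1
            sums[arm] += r
        if np.any(counts == 0):
            # Play any arm unseen within the window before trusting the bounds.
            arm = int(np.argmin(counts))
        else:
            means = sums / counts
            bonus = np.sqrt(alpha * np.log(min(t + 1, window)) / counts)
            arm = int(np.argmax(means + bonus))
        r = reward_fn(arm, t)
        history.append((arm, r))
        if len(history) > window:      # forget observations older than the window
            history.pop(0)
        picks[t] = arm
    return picks

# Toy usage: two arms whose mean rewards swap halfway through the horizon.
rng = np.random.default_rng(0)
T = 2000
def reward_fn(arm, t):
    means = (0.8, 0.2) if t < T // 2 else (0.2, 0.8)
    return rng.normal(means[arm], 0.1)
picks = sliding_window_ucb(reward_fn, n_arms=2, horizon=T, window=200)

The periodic-resetting variant would instead clear `history` every fixed number of rounds, and the discounted variant would replace the per-window counts and sums with geometrically discounted running totals.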
Pages: 128-142
Page count: 15
Related papers (50 in total)
  • [11] KL-UCB-Switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints
    Garivier, Aurélien
    Hadiji, Hédi
    Ménard, Pierre
    Stoltz, Gilles
    Journal of Machine Learning Research, 2022, 23
  • [12] Thresholding Bandits with Augmented UCB
    Mukherjee, Subhojyoti
    Purushothama, Naveen Kolar
    Sudarsanam, Nandan
    Ravindran, Balaraman
    Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017: 2515-2521
  • [13] A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits
    Abbasi-Yadkori, Yasin
    Gyorgy, Andras
    Lazic, Nevena
    Journal of Machine Learning Research, 2023, 24
  • [14] CRIMED: Lower and Upper Bounds on Regret for Bandits with Unbounded Stochastic Corruption
    Agrawal, Shubhada
    Mathieu, Timothee
    Basu, Debabrota
    Maillard, Odalric-Ambrym
    International Conference on Algorithmic Learning Theory, Vol. 237, 2024
  • [15] Regret of Queueing Bandits
    Krishnasamy, Subhashini
    Sen, Rajat
    Johari, Ramesh
    Shakkottai, Sanjay
    Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016
  • [16] Stochastic regret minimization for revenue management problems with nonstationary demands
    Zhang, Huanan
    Shi, Cong
    Qin, Chao
    Hua, Cheng
    Naval Research Logistics, 2016, 63(6): 433-448
  • [17] Efficient Kernel UCB for Contextual Bandits
    Zenati, Houssam
    Bietti, Alberto
    Diemert, Eustache
    Mairal, Julien
    Martin, Matthieu
    Gaillard, Pierre
    International Conference on Artificial Intelligence and Statistics, Vol. 151, 2022: 5689-5720
  • [18] Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs
    Xue, Bo
    Wang, Guanghui
    Wang, Yimu
    Zhang, Lijun
    Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020: 2936-2942
  • [19] Paradox of Minimax Regret
    Beck, N.
    American Political Science Review, 1975, 69(3): 918
  • [20] UCB Revisited: Improved Regret Bounds for the Stochastic Multi-Armed Bandit Problem
    Auer, Peter
    Ortner, Ronald
    Periodica Mathematica Hungarica, 2010, 61(1-2): 55-65