Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Cited by: 1
Authors
Wei, Lai [1]
Srivastava, Vaibhav [2]
Affiliations
[1] Univ Michigan, Life Sci Inst, Ann Arbor, MI 48109 USA
[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Source: IEEE Open Journal of Control Systems
Funding: US National Science Foundation (NSF)
Keywords
Heavy-tailed distributions; Light-tailed distributions; Stochastic processes; Heuristic algorithms; Control systems; Upper bound; History; minimax regret; nonstationary multiarmed bandit; upper-confidence bound; variation budget; multiarmed bandit
DOI: 10.1109/OJCSYS.2024.3372929
Chinese Library Classification: TP [Automation Technology, Computer Technology]
Discipline Code: 0812
Abstract
We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem, in which the reward distributions associated with the arms are time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined as the difference between the expected cumulative reward obtained by the policy and that obtained by an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, i.e., the supremum of the regret over the set of reward-distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies using three different approaches, namely, periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achievable by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions while maintaining their performance guarantees.
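To make the abstract's definitions concrete: writing $\mu_i(t)$ for the mean reward of arm $i$ at time $t$ and $\pi(t)$ for the arm the policy selects at time $t$ (notation assumed here, not necessarily the paper's), the regret over horizon $T$ and the variation-budget constraint are

$$
R_T^{\pi} \;=\; \sum_{t=1}^{T} \max_{i} \mu_i(t) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{\pi(t)}(t)\right],
\qquad
\sum_{t=1}^{T-1} \max_{i} \bigl|\mu_i(t+1) - \mu_i(t)\bigr| \;\le\; V_T,
$$

and the worst-case regret of $\pi$ is the supremum of $R_T^{\pi}$ over all reward sequences satisfying the budget $V_T$.

The Python sketch below illustrates the sliding-observation-window approach in its simplest form. It is a minimal sketch under these notational assumptions, not the paper's exact SW-UCB policy: the window length, the exploration constant `c`, and all function names are illustrative.

```python
import numpy as np

def sliding_window_ucb(reward_fn, n_arms, horizon, window, c=2.0):
    """UCB that estimates each arm only from the last `window` plays,
    so stale rewards from an earlier reward regime are forgotten.
    Illustrative sketch; not the paper's exact SW-UCB policy."""
    arms = np.zeros(horizon, dtype=int)
    rewards = np.zeros(horizon)
    for t in range(horizon):
        lo = max(0, t - window)                       # left edge of the window
        recent_arms, recent_rewards = arms[lo:t], rewards[lo:t]
        counts = np.array([(recent_arms == i).sum() for i in range(n_arms)])
        if counts.min() == 0:
            arm = int(np.argmin(counts))              # play an arm unseen in the window
        else:
            means = np.array([recent_rewards[recent_arms == i].mean()
                              for i in range(n_arms)])
            bonus = np.sqrt(c * np.log(min(t, window)) / counts)
            arm = int(np.argmax(means + bonus))       # optimistic index over the window
        arms[t], rewards[t] = arm, reward_fn(arm, t)
    return arms, rewards

# Example: two Bernoulli arms whose means swap halfway through the horizon.
rng = np.random.default_rng(0)
swap_means = lambda t: (0.8, 0.2) if t < 500 else (0.2, 0.8)
reward = lambda i, t: float(rng.random() < swap_means(t)[i])
played, earned = sliding_window_ucb(reward, n_arms=2, horizon=1000, window=200)
print("total reward:", earned.sum())
```

The window length is the key tuning knob in this setting: a window on the order of $(T/V_T)^{2/3}$ (up to logarithmic factors) balances the bias introduced by drifting means against the variance of estimating from fewer samples, which is how sliding-window policies attain worst-case regret of order $V_T^{1/3}\,T^{2/3}$.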
Pages: 128-142 (15 pages)