Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Cited by: 1
Authors
Wei, Lai [1 ]
Srivastava, Vaibhav [2 ]
Affiliations
[1] Univ Michigan, Life Sci Inst, Ann Arbor, MI 48109 USA
[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Source
IEEE Open Journal of Control Systems
Funding
U.S. National Science Foundation
Keywords
Heavily-tailed distribution; Stochastic processes; Heuristic algorithms; Control systems; Lightly-tailed distribution; Upper bound; History; Heavy-tailed distributions; minimax regret; nonstationary multiarmed bandit; upper-confidence bound; variation budget; MULTIARMED BANDIT;
DOI
10.1109/OJCSYS.2024.3372929
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem, in which the reward distributions associated with the arms are time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined as the difference between the expected cumulative reward obtained by the policy and that obtained by an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, i.e., the supremum of the regret over the set of reward-distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies following three different approaches, namely periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achievable by any policy. We also relax the sub-Gaussian assumption on the reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.
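For concreteness, the regret and minimax-regret notions in the abstract can be written out as follows. The notation here is standard for this setting rather than copied from the paper: $\mu_i(t)$ is the mean reward of arm $i$ at time $t$, $I_t$ is the arm selected by policy $\pi$ at time $t$, and $V_T$ is the variation budget over horizon $T$.

$$
R^{\pi}(T) = \sum_{t=1}^{T} \max_{i} \mu_i(t) - \mathbb{E}\left[ \sum_{t=1}^{T} \mu_{I_t}(t) \right],
\qquad
R^{*}(T) = \inf_{\pi} \, \sup_{\mu \in \mathcal{E}(V_T)} R^{\pi}(T),
$$

where $\mathcal{E}(V_T) = \bigl\{ \mu : \sum_{t=1}^{T-1} \max_i \lvert \mu_i(t+1) - \mu_i(t) \rvert \le V_T \bigr\}$ is the set of mean-reward sequences satisfying the variation budget.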
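As a rough illustration of one of the three approaches, the sliding observation window, the Python sketch below computes UCB indices only from the most recent plays, so stale observations age out as the reward distributions drift. This is a minimal sketch under Bernoulli rewards, not the paper's exact policy; the names sw_ucb, arm_means, window, and xi, and the toy example's parameters, are illustrative assumptions.

import math
import random
from collections import deque


def sw_ucb(arm_means, horizon, window, xi=2.0, seed=0):
    # Sliding-window UCB sketch (illustrative, not the paper's policy):
    # indices are computed only from the last `window` (arm, reward)
    # pairs, so observations from drifted distributions age out.
    rng = random.Random(seed)
    history = deque(maxlen=window)  # most recent (arm, reward) pairs
    total = 0.0
    for t in range(1, horizon + 1):
        means = arm_means(t)  # current (unknown to the policy) means
        counts = [0] * len(means)
        sums = [0.0] * len(means)
        for i, r in history:
            counts[i] += 1
            sums[i] += r

        def index(i):
            if counts[i] == 0:  # not seen inside the window: force exploration
                return math.inf
            bonus = math.sqrt(xi * math.log(min(t, window)) / counts[i])
            return sums[i] / counts[i] + bonus

        arm = max(range(len(means)), key=index)
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli draw
        history.append((arm, reward))
        total += reward
    return total


# Toy nonstationary instance: the two arms swap means halfway through.
def drifting_means(t):
    return [0.9, 0.1] if t <= 500 else [0.1, 0.9]


print(sw_ucb(drifting_means, horizon=1000, window=200))

The window length trades off adaptivity against estimation noise; choosing it as a function of the horizon and the variation budget is what underlies the order-optimal worst-case guarantees claimed in the abstract.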
Pages: 128-142
Number of pages: 15