Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Cited by: 1
Authors
Wei, Lai [1 ]
Srivastava, Vaibhav [2 ]
Affiliations
[1] Univ Michigan, Life Sci Inst, Ann Arbor, MI 48109 USA
[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Source
IEEE Open Journal of Control Systems
Funding
U.S. National Science Foundation (NSF)
Keywords
Heavily-tailed distribution; Stochastic processes; Heuristic algorithms; Control systems; Lightly-tailed distribution; Upper bound; History; Heavy-tailed distributions; minimax regret; nonstationary multiarmed bandit; upper-confidence bound; variation budget; MULTIARMED BANDIT;
DOI
10.1109/OJCSYS.2024.3372929
Chinese Library Classification
TP [Automation and Computer Technology]
Discipline Classification Code
0812
Abstract
We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem, in which the reward distributions associated with the arms are time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined as the difference between the expected cumulative reward obtained by the policy and that obtained by an oracle that selects the arm with the maximum mean reward at each time step. We characterize the performance of the proposed policies in terms of the worst-case regret, i.e., the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies using three different approaches, namely periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achievable by any policy. We also relax the sub-Gaussian assumption on the reward distributions and develop robust versions of the proposed policies that handle heavy-tailed rewards while maintaining their performance guarantees.
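To make the quantities in the abstract concrete, the definitions below write them out in LaTeX. The notation (K arms, horizon T, time-varying means mu_k(t), budget V_T, policy pi) is an editorial assumption chosen to match the abstract's prose, not symbols quoted from the paper.

% Assumed notation: K arms, horizon T, mean rewards \mu_k(t), policy \pi.
\begin{align}
  \sum_{t=1}^{T-1} \max_{1 \le k \le K} \bigl|\mu_k(t+1) - \mu_k(t)\bigr|
    &\le V_T && \text{(variation budget)} \\
  R_T(\pi) &= \sum_{t=1}^{T} \max_{1 \le k \le K} \mu_k(t)
    - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{\pi(t)}(t)\right] && \text{(regret vs. the oracle)} \\
  \mathcal{R}_T^{*} &= \inf_{\pi}\, \sup_{\{\mu_k(\cdot)\} \colon \text{budget } V_T} R_T(\pi) && \text{(minimax regret)}
\end{align}

For sub-Gaussian rewards, prior work (Besbes, Gur, and Zeevi) places the minimax regret at order (K V_T)^{1/3} T^{2/3}, so "order-optimal" in the abstract means matching that order up to constant or logarithmic factors.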
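Of the three approaches named in the abstract, the sliding observation window is the easiest to sketch. The Python sketch below follows the usual sliding-window UCB recipe (cf. Garivier and Moulines); the class name SlidingWindowUCB, the window length tau, the exploration constant c, and the drifting-Bernoulli demo are illustrative assumptions, not the paper's exact policy or tuning.

import math
import random
from collections import deque


class SlidingWindowUCB:
    """Sliding-window UCB sketch: statistics use only the last `tau` plays."""

    def __init__(self, n_arms: int, tau: int, c: float = 1.0) -> None:
        self.n_arms = n_arms
        self.tau = tau          # observation-window length
        self.c = c              # exploration constant (illustrative value)
        self.window = deque()   # (arm, reward) pairs inside the window

    def select_arm(self, t: int) -> int:
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.window:
            counts[arm] += 1
            sums[arm] += reward
        # Play any arm that has no observation in the current window.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        effective_t = min(t, self.tau)  # time scale visible to the window
        ucb = [
            sums[arm] / counts[arm]
            + self.c * math.sqrt(math.log(effective_t) / counts[arm])
            for arm in range(self.n_arms)
        ]
        return max(range(self.n_arms), key=ucb.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.window.append((arm, reward))
        if len(self.window) > self.tau:  # forget samples older than tau
            self.window.popleft()


if __name__ == "__main__":
    random.seed(0)
    T, K = 2000, 3
    policy = SlidingWindowUCB(n_arms=K, tau=300)
    for t in range(1, T + 1):
        # Slowly drifting Bernoulli means, staying inside [0.1, 0.9].
        means = [0.5 + 0.4 * math.sin(2 * math.pi * t / T + k) for k in range(K)]
        arm = policy.select_arm(t)
        policy.update(arm, 1.0 if random.random() < means[arm] else 0.0)

The deque keeps window maintenance at O(1) per step; recomputing the per-arm sums on each call is O(tau) and kept for clarity, while incremental per-arm bookkeeping would be the natural optimization.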
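For the heavy-tailed extension in the abstract's last sentence, one standard device (an assumption here; the paper may instead use truncation or another robust estimator) is to replace the empirical mean inside the UCB index with a median-of-means estimate, which concentrates under only a finite-variance or finite (1+epsilon)-moment condition.

import statistics


def median_of_means(samples: list[float], n_blocks: int) -> float:
    """Robust mean estimate: split samples into blocks, average each block,
    and return the median of the block averages."""
    if not samples:
        raise ValueError("samples must be nonempty")
    k = max(1, min(n_blocks, len(samples)))
    blocks = [samples[i::k] for i in range(k)]  # round-robin split into k blocks
    return statistics.median(sum(b) / len(b) for b in blocks)

Swapping this estimate, with an appropriately widened confidence radius, into the index computed in select_arm above gives one plausible shape of a robust sliding-window policy.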
Pages: 128-142 (15 pages)