Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Cited by: 1
Authors
Wei, Lai [1 ]
Srivastava, Vaibhav [2 ]
Affiliations
[1] Univ Michigan, Life Sci Inst, Ann Arbor, MI 48109 USA
[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Funding
U.S. National Science Foundation
Keywords
Heavy-tailed distributions; Light-tailed distributions; Stochastic processes; Heuristic algorithms; Control systems; Upper bound; minimax regret; nonstationary multiarmed bandit; upper-confidence bound; variation budget; MULTIARMED BANDIT
DOI
10.1109/OJCSYS.2024.3372929
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem, in which the reward distributions associated with the arms are time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined as the difference between the expected cumulative reward obtained by the policy and that obtained by an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, i.e., the supremum of the regret over the set of reward-distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies using three different approaches, namely periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on the reward distributions and develop robust versions of the proposed policies that handle heavy-tailed rewards while maintaining their performance guarantees.
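For concreteness, the regret and variation-budget notions in the abstract can be written out as follows; the notation (K arms, horizon T, mean rewards \mu_t(i), arm \pi_t chosen by the policy at time t) is assumed here for illustration and may differ from the paper's.

  R^\pi(T) = \sum_{t=1}^{T} \max_{1 \le i \le K} \mu_t(i) - \mathbb{E}\Big[ \sum_{t=1}^{T} \mu_t(\pi_t) \Big],
  \qquad
  \sum_{t=1}^{T-1} \max_{1 \le i \le K} \big| \mu_{t+1}(i) - \mu_t(i) \big| \le V_T.

The worst-case regret is the supremum of R^\pi(T) over all mean-reward sequences satisfying the budget V_T; the minimax regret in this setting is known to scale as (K V_T)^{1/3} T^{2/3} up to constant factors, which is the benchmark the three policies match.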
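As a rough illustration of the sliding-observation-window approach, below is a minimal sketch of a sliding-window UCB policy. The function and parameter names (sliding_window_ucb, reward_fn, window) and the exploration constant are assumptions made for illustration, not the paper's exact algorithm or tuning.

import math
import random

def sliding_window_ucb(reward_fn, K, T, window):
    # Sliding-window UCB sketch: all statistics are computed over the
    # last `window` pulls only, so stale rewards are forgotten as the
    # reward distributions drift over time. Illustrative only.
    history = []  # (arm, reward) pairs, most recent last
    for t in range(1, T + 1):
        recent = history[-window:]
        counts = [0] * K
        sums = [0.0] * K
        for arm, r in recent:
            counts[arm] += 1
            sums[arm] += r
        unseen = [i for i in range(K) if counts[i] == 0]
        if unseen:
            # Play any arm with no observation inside the window.
            arm = random.choice(unseen)
        else:
            # Otherwise play the arm with the largest UCB index,
            # built from the within-window mean and pull count.
            effective_t = min(t, window)
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(effective_t) / counts[i]),
            )
        history.append((arm, reward_fn(t, arm)))
    return history

Periodic resetting would instead clear the history every fixed number of rounds, and a discounted variant would weight past rewards geometrically rather than truncating them; in each case the forgetting parameter (window length, reset period, or discount factor) is tuned to the variation budget to attain the minimax order.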
Pages: 128-142 (15 pages)