Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Cited by: 1
Authors
Wei, Lai [1 ]
Srivastava, Vaibhav [2 ]
Affiliations
[1] Univ Michigan, Life Sci Inst, Ann Arbor, MI 48109 USA
[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Source
IEEE Open Journal of Control Systems
Funding
U.S. National Science Foundation (NSF)
Keywords
Heavily-tailed distribution; Stochastic processes; Heuristic algorithms; Control systems; Lightly-tailed distribution; Upper bound; History; Heavy-tailed distributions; minimax regret; nonstationary multiarmed bandit; upper-confidence bound; variation budget; MULTIARMED BANDIT;
DOI
10.1109/OJCSYS.2024.3372929
Chinese Library Classification
TP [Automation and Computer Technology]
Discipline Classification Code
0812
Abstract
We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem, in which the reward distributions associated with the arms are time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined as the difference between the expected cumulative reward obtained by the policy and that obtained by an oracle that selects the arm with the maximum mean reward at each time step. We characterize the performance of the proposed policies in terms of the worst-case regret, i.e., the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies using three different approaches, namely periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achievable by any policy. We also relax the sub-Gaussian assumption on the reward distributions and develop robust versions of the proposed policies that handle heavy-tailed rewards while maintaining their performance guarantees.
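To make the quantities in the abstract concrete, the definitions below write them out in LaTeX. The notation (K arms, horizon T, time-varying means mu_k(t), budget V_T, policy pi) is an editorial assumption chosen to match the abstract's prose, not symbols quoted from the paper.

% Assumed notation: K arms, horizon T, mean rewards \mu_k(t), policy \pi.
\begin{align}
  \sum_{t=1}^{T-1} \max_{1 \le k \le K} \bigl|\mu_k(t+1) - \mu_k(t)\bigr|
    &\le V_T && \text{(variation budget)} \\
  R_T(\pi) &= \sum_{t=1}^{T} \max_{1 \le k \le K} \mu_k(t)
    - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{\pi(t)}(t)\right] && \text{(regret vs. the oracle)} \\
  \mathcal{R}_T^{*} &= \inf_{\pi}\, \sup_{\{\mu_k(\cdot)\} \colon \text{budget } V_T} R_T(\pi) && \text{(minimax regret)}
\end{align}

For sub-Gaussian rewards, prior work (Besbes, Gur, and Zeevi) places the minimax regret at order (K V_T)^{1/3} T^{2/3}, so "order-optimal" in the abstract means matching that order up to constant or logarithmic factors.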
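Of the three approaches named in the abstract, the sliding observation window is the easiest to sketch. The Python sketch below follows the usual sliding-window UCB recipe (cf. Garivier and Moulines); the class name SlidingWindowUCB, the window length tau, the exploration constant c, and the drifting-Bernoulli demo are illustrative assumptions, not the paper's exact policy or tuning.

import math
import random
from collections import deque


class SlidingWindowUCB:
    """Sliding-window UCB sketch: statistics use only the last `tau` plays."""

    def __init__(self, n_arms: int, tau: int, c: float = 1.0) -> None:
        self.n_arms = n_arms
        self.tau = tau          # observation-window length
        self.c = c              # exploration constant (illustrative value)
        self.window = deque()   # (arm, reward) pairs inside the window

    def select_arm(self, t: int) -> int:
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.window:
            counts[arm] += 1
            sums[arm] += reward
        # Play any arm that has no observation in the current window.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        effective_t = min(t, self.tau)  # time scale visible to the window
        ucb = [
            sums[arm] / counts[arm]
            + self.c * math.sqrt(math.log(effective_t) / counts[arm])
            for arm in range(self.n_arms)
        ]
        return max(range(self.n_arms), key=ucb.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.window.append((arm, reward))
        if len(self.window) > self.tau:  # forget samples older than tau
            self.window.popleft()


if __name__ == "__main__":
    random.seed(0)
    T, K = 2000, 3
    policy = SlidingWindowUCB(n_arms=K, tau=300)
    for t in range(1, T + 1):
        # Slowly drifting Bernoulli means, staying inside [0.1, 0.9].
        means = [0.5 + 0.4 * math.sin(2 * math.pi * t / T + k) for k in range(K)]
        arm = policy.select_arm(t)
        policy.update(arm, 1.0 if random.random() < means[arm] else 0.0)

The deque keeps window maintenance at O(1) per step; recomputing the per-arm sums on each call is O(tau) and kept for clarity, while incremental per-arm bookkeeping would be the natural optimization.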
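For the heavy-tailed extension in the abstract's last sentence, one standard device (an assumption here; the paper may instead use truncation or another robust estimator) is to replace the empirical mean inside the UCB index with a median-of-means estimate, which concentrates under only a finite-variance or finite (1+epsilon)-moment condition.

import statistics


def median_of_means(samples: list[float], n_blocks: int) -> float:
    """Robust mean estimate: split samples into blocks, average each block,
    and return the median of the block averages."""
    if not samples:
        raise ValueError("samples must be nonempty")
    k = max(1, min(n_blocks, len(samples)))
    blocks = [samples[i::k] for i in range(k)]  # round-robin split into k blocks
    return statistics.median(sum(b) / len(b) for b in blocks)

Swapping this estimate, with an appropriately widened confidence radius, into the index computed in select_arm above gives one plausible shape of a robust sliding-window policy.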
Pages: 128-142 (15 pages)