Exploration-exploitation tradeoff using variance estimates in multi-armed bandits

Cited: 296
Authors
Audibert, Jean-Yves [1 ,2 ]
Munos, Remi [3 ]
Szepesvari, Csaba [4 ]
Affiliations
[1] Univ Paris Est, Ecole Ponts ParisTech, CERTIS, F-77455 Marne La Vallee, France
[2] Willow ENS INRIA, F-75005 Paris, France
[3] INRIA Lille Nord Europe, SequeL Project, F-59650 Villeneuve d'Ascq, France
[4] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2E8, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Exploration-exploitation tradeoff; Multi-armed bandits; Bernstein's inequality; High-probability bound; Risk analysis;
DOI
10.1016/j.tcs.2009.01.016
Chinese Library Classification
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Algorithms based on upper confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. This paper considers a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. We provide the first analysis of the expected regret for such algorithms. As expected, our results show that the algorithm that uses the variance estimates has a major advantage over its alternatives that do not use such estimates provided that the variances of the payoffs of the suboptimal arms are low. We also prove that the regret concentrates only at a polynomial rate. This holds for all the upper confidence bound based algorithms and for all bandit problems except those special ones where with probability one the payoff obtained by pulling the optimal arm is larger than the expected payoff for the second best arm. Hence, although upper confidence bound bandit algorithms achieve logarithmic expected regret rates, they might not be suitable for a risk-averse decision maker. We illustrate some of the results by computer simulations. (C) 2009 Elsevier B.V. All rights reserved.
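The variance-aware index the abstract describes (UCB-V in the paper) replaces the usual Hoeffding-based confidence width with an empirical Bernstein bound: an arm's score is its empirical mean plus sqrt(2*V*E/s) + 3*b*E/s, where V is the arm's empirical variance, s its pull count, b the reward range, and E an exploration function such as zeta*log(t). Below is a minimal Python sketch of this rule; the constants, the zeta*log(t) exploration function, and the two-armed Bernoulli example are common illustrative choices, not the paper's exact tuning or experimental setup.

```python
import math
import random

def ucb_v_index(mean, var, n, t, b=1.0, zeta=1.2):
    """Bernstein-style index: mean + sqrt(2*var*E/n) + 3*b*E/n, with E = zeta*log(t).

    Assumes rewards lie in [0, b]; zeta is an illustrative exploration
    constant, not a value prescribed by the paper.
    """
    e = zeta * math.log(t)
    return mean + math.sqrt(2.0 * var * e / n) + 3.0 * b * e / n

def run_bandit(arms, horizon):
    """Pull each arm once, then always pull the arm with the largest index."""
    k_arms = len(arms)
    counts = [0] * k_arms       # number of pulls per arm
    sums = [0.0] * k_arms       # sum of rewards per arm
    sq_sums = [0.0] * k_arms    # sum of squared rewards per arm
    for t in range(1, horizon + 1):
        if t <= k_arms:
            k = t - 1           # initialization: one pull per arm
        else:
            def index(k):
                m = sums[k] / counts[k]
                v = max(sq_sums[k] / counts[k] - m * m, 0.0)  # empirical variance
                return ucb_v_index(m, v, counts[k], t)
            k = max(range(k_arms), key=index)
        r = arms[k]()           # draw a reward from the chosen arm
        counts[k] += 1
        sums[k] += r
        sq_sums[k] += r * r
    return counts

if __name__ == "__main__":
    random.seed(0)
    # Two illustrative Bernoulli arms with means 0.5 and 0.6.
    arms = [lambda: float(random.random() < 0.5),
            lambda: float(random.random() < 0.6)]
    print(run_bandit(arms, 10_000))  # pull counts; most pulls go to arm 1
```

The variance term shrinks the exploration bonus for low-variance arms, which is exactly the regime where the paper shows the variance-based algorithm improves on alternatives that ignore variance estimates.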
Pages: 1876 - 1902
Page count: 27
Related papers
50 items in total
  • [41] Visualizations for interrogations of multi-armed bandits
    Keaton, Timothy J.
    Sabbaghi, Arman
    STAT, 2019, 8 (01):
  • [42] Multi-armed bandits with dependent arms
    Singh, Rahul
    Liu, Fang
    Sun, Yin
    Shroff, Ness
    MACHINE LEARNING, 2024, 113 (01) : 45 - 71
  • [43] On Kernelized Multi-Armed Bandits with Constraints
    Zhou, Xingyu
    Ji, Bo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [44] Multi-Armed Bandits in Metric Spaces
    Kleinberg, Robert
    Slivkins, Aleksandrs
    Upfal, Eli
    STOC'08: PROCEEDINGS OF THE 2008 ACM INTERNATIONAL SYMPOSIUM ON THEORY OF COMPUTING, 2008, : 681 - +
  • [45] Multi-Armed Bandits With Costly Probes
    Elumar, Eray Can
    Tekin, Cem
    Yagan, Osman
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2025, 71 (01) : 618 - 643
  • [46] On Optimal Foraging and Multi-armed Bandits
    Srivastava, Vaibhav
    Reverdy, Paul
    Leonard, Naomi E.
    2013 51ST ANNUAL ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING (ALLERTON), 2013, : 494 - 499
  • [47] MULTI-ARMED BANDITS AND THE GITTINS INDEX
    Whittle, P.
    JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1980, 42 (02) : 143 - 149
  • [48] Multi-armed bandits with episode context
    Rosin, Christopher D.
    ANNALS OF MATHEMATICS AND ARTIFICIAL INTELLIGENCE, 2011, 61 : 203 - 230
  • [49] Multi-armed bandits with switching penalties
    Asawa, M
    Teneketzis, D
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1996, 41 (03) : 328 - 348
  • [50] Active Learning in Multi-armed Bandits
    Antos, Andras
    Grover, Varun
    Szepesvari, Csaba
    ALGORITHMIC LEARNING THEORY, PROCEEDINGS, 2008, 5254 : 287 - +