The Blinded Bandit: Learning with Adaptive Feedback

Cited by: 0

Authors:

Dekel, Ofer [1]

Hazan, Elad [2]

Koren, Tomer [2]

Affiliations:

[1] Microsoft Res, Redmond, WA 98052 USA

[2] Technion, Haifa, Israel

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 27 (NIPS 2014) | 2014 / Volume 27

Keywords:

REGRET;

DOI:

Not available

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such a model of adaptive feedback naturally occurs in scenarios where the environment reacts to the player's actions and requires some time to recover and stabilize after the algorithm switches actions. This motivates a variant of the multi-armed bandit problem, which we call the blinded multi-armed bandit, in which no feedback is given to the algorithm whenever it switches arms. We develop efficient online learning algorithms for this problem and prove that they guarantee the same asymptotic regret as the optimal algorithms for the standard multi-armed bandit problem. This result stands in stark contrast to another recent result, which states that adding a switching cost to the standard multi-armed bandit makes it substantially harder to learn, and provides a direct comparison of how feedback and loss contribute to the difficulty of an online learning problem. We also extend our results to the general prediction framework of bandit linear optimization, again attaining near-optimal regret bounds.
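The feedback rule described in the abstract can be illustrated with a small sketch. This is only a toy model of the setting, not the algorithm from the paper: the class name `BlindedBanditEnv`, its fields, and the choice to treat the very first round as a switch are assumptions made for illustration.

```python
from typing import List, Optional


class BlindedBanditEnv:
    """Toy blinded bandit: the loss is always incurred, but it is only
    revealed when the player repeats the arm it played in the previous round."""

    def __init__(self, losses: List[List[float]]):
        self.losses = losses              # losses[t][arm], assumed to lie in [0, 1]
        self.t = 0                        # index of the current round
        self.prev_arm: Optional[int] = None
        self.total_loss = 0.0

    def step(self, arm: int) -> Optional[float]:
        loss = self.losses[self.t][arm]
        self.total_loss += loss           # the loss counts even when it is hidden
        # Assumption: the first round is treated like a switch (no feedback).
        blinded = self.prev_arm is None or arm != self.prev_arm
        self.prev_arm = arm
        self.t += 1
        return None if blinded else loss


# Usage: holding an arm for two rounds guarantees feedback on the second one.
env = BlindedBanditEnv([[0.1, 0.9]] * 4)
feedback = [env.step(arm) for arm in (0, 0, 1, 1)]
print(feedback)        # None on each switch, the loss value otherwise
```

The sketch separates the two quantities the abstract contrasts: `total_loss` accumulates every round, while the value returned by `step` is withheld exactly on the rounds where the arm changes.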


Pages: 9

