Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability

Cited: 0
Authors
Saha, Aadirupa [1 ]
Krishnamurthy, Akshay [1 ]
Affiliations
[1] Microsoft Res, New York, NY 10012 USA
Keywords
Contextual; Dueling Bandits; Preference-based learning; Realizability; Function approximation; Regret analysis; Best-Response; Policy regret; Regression oracles; Efficient; Optimal algorithms; Markov games; Linear realizability; Agnostic;
DOI
None available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study the K-armed contextual dueling bandit problem, a sequential decision-making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other. We focus on the regret minimization problem under realizability, where the feedback is generated by a pairwise preference matrix that is well-specified by a given function class F. We provide a new algorithm that achieves the optimal regret rate for a new notion of best-response regret, which is a strictly stronger performance measure than those considered in prior works. The algorithm is also computationally efficient, running in polynomial time assuming access to an online oracle for square loss regression over F. This resolves an open problem of Dudik et al. (2015) on oracle-efficient, regret-optimal algorithms for contextual dueling bandits.
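The interaction protocol summarized in the abstract can be sketched in a few lines: at each round the learner observes a context, selects a pair of arms, and receives only a binary preference between them, generated by an underlying pairwise preference model. The sketch below is purely illustrative; the Bradley-Terry-style link, the score parameters `theta`, and the uniform pair selection are all assumptions for the simulation, not the paper's algorithm (which uses an online square-loss regression oracle over the class F).

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 3, 200  # arms, context dimension, rounds

# Realizability (assumed form): preferences come from a fixed, unknown
# score function; the win probability is a logistic link of score gaps.
theta = rng.normal(size=(K, d))

def win_prob(context, a, b):
    """P(arm a beats arm b | context) under the ground-truth model."""
    diff = (theta[a] - theta[b]) @ context
    return 1.0 / (1.0 + np.exp(-diff))

wins = np.zeros((K, K))  # empirical pairwise win counts

for t in range(T):
    context = rng.normal(size=d)
    # Naive exploration: a uniformly random pair of distinct arms.
    a, b = rng.choice(K, size=2, replace=False)
    # Preference feedback: 1 if a beat b, else 0 -- the only observation.
    feedback = rng.random() < win_prob(context, a, b)
    wins[a, b] += feedback
    wins[b, a] += 1 - feedback

print(int(wins.sum()))  # total comparisons equals the number of rounds T
```

Each round yields exactly one bit of feedback, which is what makes the setting harder than a standard contextual bandit with observed rewards.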
Pages: 27
Related Papers
50 items in total
  • [21] Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits
    Wu, Huasen
    Srikant, R.
    Liu, Xin
    Jiang, Chong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 28 (NIPS 2015), 2015, 28
  • [22] A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free
    Chen, Yifang
    Lee, Chung-Wei
    Luo, Haipeng
    Wei, Chen-Yu
    CONFERENCE ON LEARNING THEORY, VOL 99, 2019, 99
  • [23] EFFICIENT ALGORITHMS FOR LINEAR POLYHEDRAL BANDITS
    Hanawal, Manjesh K.
    Leshem, Amir
    Saligrama, Venkatesh
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 4796 - 4800
  • [24] Sublinear Optimal Policy Value Estimation in Contextual Bandits
    Kong, Weihao
    Valiant, Gregory
    Brunskill, Emma
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108 : 4377 - 4386
  • [25] Best-of-Both-Worlds Algorithms for Linear Contextual Bandits
    Kuroki, Yuko
    Rumi, Alberto
    Tsuchiya, Taira
    Vitale, Fabio
    Cesa-Bianchi, Nicolo
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [26] A Multiplier Bootstrap Approach to Designing Robust Algorithms for Contextual Bandits
    Xie, Hong
    Tang, Qiao
    Zhu, Qingsheng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (12) : 9887 - 9899
  • [27] Communication Efficient Distributed Learning for Kernelized Contextual Bandits
    Li, Chuanhao
    Wang, Huazheng
    Wang, Mengdi
    Wang, Hongning
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [28] Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost
    Amani, Sanae
    Lattimore, Tor
    Gyorgy, Andras
    Yang, Lin F.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023, 202 : 691 - 717
  • [29] Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
    Wang, Yu-Xiang
    Agarwal, Alekh
    Dudik, Miroslav
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70, 2017, 70
  • [30] Optimal Baseline Corrections for Off-Policy Contextual Bandits
    Gupta, Shashank
    Jeunen, Olivier
    Oosterhuis, Harrie
    de Rijke, Maarten
    PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024, : 722 - 732