A central problem in online learning and decision making, spanning bandits and reinforcement learning, is to understand which modeling assumptions lead to sample-efficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result shows, via new upper and lower bounds, that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. [17] in the stochastic counterpart to our setting, is necessary and sufficient for low regret in adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results, both positive and negative. En route to these guarantees, we provide new structural results connecting the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy [47] and the Exploration-by-Optimization objective of Lattimore and Gyorgy [32].
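For context, the Decision-Estimation Coefficient of Foster et al. [17] is commonly stated as follows; this is a sketch of the stochastic-setting definition from that line of work, not reproduced from this abstract. Here $\Pi$ is the decision space, $f^M(\pi)$ is the mean reward of decision $\pi$ under model $M$, $\pi_M$ is its maximizer, $\bar{M}$ is a reference model, and $D_{\mathrm{H}}^{2}$ denotes squared Hellinger distance:

\[
\mathsf{dec}_{\gamma}(\mathcal{M}, \bar{M})
= \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}}
\mathbb{E}_{\pi \sim p}\Big[ f^{M}(\pi_M) - f^{M}(\pi)
- \gamma \cdot D_{\mathrm{H}}^{2}\big(M(\pi), \bar{M}(\pi)\big) \Big],
\qquad
\mathsf{dec}_{\gamma}(\mathcal{M}) = \sup_{\bar{M} \in \mathrm{co}(\mathcal{M})} \mathsf{dec}_{\gamma}(\mathcal{M}, \bar{M}).
\]

On this reading, the abstract's main claim is that regret in the adversarial setting is governed by the coefficient evaluated with the model class replaced by its convex hull, i.e., by $\mathsf{dec}_{\gamma}(\mathrm{co}(\mathcal{M}))$ rather than $\mathsf{dec}_{\gamma}(\mathcal{M})$ itself.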