Near-optimal Reinforcement Learning in Factored MDPs

Cited by: 0
Authors
Osband, Ian [1]
Van Roy, Benjamin [1]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Funding
National Science Foundation (USA)
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where T is the elapsed time and S and A are the cardinalities of the state and action spaces. This implies $T = \Omega(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, S and A can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a factored MDP, it is possible to achieve regret that scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than S or A. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).
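The gap between the size of a flat MDP and the number of parameters in a factored encoding can be illustrated with a small counting sketch. It is not taken from the paper: the function names are hypothetical, and it assumes a simplified factored model in which the state consists of equally sized factors, each factor's next value depends on a fixed number of parent factors, and a single unfactored action appears in every factor's scope.

```python
def flat_transition_params(num_states: int, num_actions: int) -> int:
    """Free parameters of an unstructured transition model.

    Each (state, action) pair has a distribution over num_states next
    states, which has num_states - 1 free parameters.
    """
    return num_states * num_actions * (num_states - 1)


def factored_transition_params(
    num_factors: int, factor_size: int, num_parents: int, num_actions: int
) -> int:
    """Free parameters of a factored transition model (simplifying assumption).

    Each factor has its own conditional table indexed by the joint value of
    its parent factors and the action; each row is a distribution over
    factor_size values, with factor_size - 1 free parameters.
    """
    rows_per_factor = (factor_size ** num_parents) * num_actions
    return num_factors * rows_per_factor * (factor_size - 1)


# Illustrative numbers only: 20 binary factors, 3 parents each, 4 actions.
num_factors, factor_size, num_parents, num_actions = 20, 2, 3, 4
num_states = factor_size ** num_factors  # S grows exponentially in num_factors

flat = flat_transition_params(num_states, num_actions)
factored = factored_transition_params(
    num_factors, factor_size, num_parents, num_actions
)

print(f"S = {num_states}")
print(f"flat transition parameters     = {flat}")
print(f"factored transition parameters = {factored}")
```

Under these assumptions the factored count grows linearly in the number of factors, while S doubles with every binary factor added, which is the sense in which the encoding may be exponentially smaller than S.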
Pages: 9
Related papers
50 in total
  • [1] Near-Optimal Interdiction of Factored MDPs
    Panda, Swetasudha
    Vorobeychik, Yevgeniy
    CONFERENCE ON UNCERTAINTY IN ARTIFICIAL INTELLIGENCE (UAI2017), 2017
  • [2] Efficient reinforcement learning in factored MDPs
    Kearns, M
    Koller, D
    IJCAI-99: PROCEEDINGS OF THE SIXTEENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 & 2, 1999, : 740 - 747
  • [3] TeXDYNA: Hierarchical Reinforcement Learning in Factored MDPs
    Kozlova, Olga
    Sigaud, Olivier
    Meyer, Christophe
    FROM ANIMALS TO ANIMATS 11, 2010, 6226 : 489 - +
  • [4] Near-Optimal Model-Free Reinforcement Learning in Non-Stationary Episodic MDPs
    Mao, Weichao
    Zhang, Kaiqing
    Zhu, Ruihao
    Simchi-Levi, David
    Basar, Tamer
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [5] Exploiting Additive Structure in Factored MDPs for Reinforcement Learning
    Degris, Thomas
    Sigaud, Olivier
    Wuillemin, Pierre-Henri
    RECENT ADVANCES IN REINFORCEMENT LEARNING, 2008, 5323 : 15 - 26
  • [6] Near-Optimal Reinforcement Learning in Polynomial Time
    Michael Kearns
    Satinder Singh
    Machine Learning, 2002, 49 : 209 - 232
  • [7] Near-optimal reinforcement learning in polynomial time
    Kearns, M
    Singh, S
    MACHINE LEARNING, 2002, 49 (2-3) : 209 - 232
  • [8] Near-optimal Regret Bounds for Reinforcement Learning
    Jaksch, Thomas
    Ortner, Ronald
    Auer, Peter
    JOURNAL OF MACHINE LEARNING RESEARCH, 2010, 11 : 1563 - 1600
  • [9] Near-optimal regret bounds for reinforcement learning
    Jaksch, Thomas
    Ortner, Ronald
    Auer, Peter
    Journal of Machine Learning Research, 2010, 11 : 1563 - 1600
  • [10] Near-optimal PAC bounds for discounted MDPs
    Lattimore, Tor
    Hutter, Marcus
    THEORETICAL COMPUTER SCIENCE, 2014, 558 : 125 - 143