Online Planning for Large Markov Decision Processes with Hierarchical Decomposition

Cited by: 25
Authors
Bai, Aijun [1 ]
Wu, Feng [1 ]
Chen, Xiaoping [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230026, Anhui, Peoples R China
Funding
National Research Foundation of Singapore; National Natural Science Foundation of China;
Keywords
Algorithms; Experimentation; MDP; online planning; MAXQ-OP; RoboCup; ROBOCUP SOCCER; REINFORCEMENT; ABSTRACTION; SEARCH;
DOI
10.1145/2717316
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Markov decision processes (MDPs) provide a rich framework for planning under uncertainty. However, exactly solving a large MDP is usually intractable due to the "curse of dimensionality": the state space grows exponentially with the number of state variables. Online algorithms tackle this problem by avoiding computing a policy for the entire state space. On the other hand, since an online algorithm has to find a near-optimal action in almost real time, its computation time is often very limited. In the context of reinforcement learning, MAXQ is a value function decomposition method that exploits the underlying structure of the original MDP and decomposes it into a combination of smaller subproblems arranged over a task hierarchy. In this article, we present MAXQ-OP, a novel online planning algorithm for large MDPs that utilizes MAXQ hierarchical decomposition in online settings. Compared to traditional online planning algorithms, MAXQ-OP is able to reach much deeper states in the search tree with relatively less computation time by exploiting the MAXQ hierarchical decomposition online. We empirically evaluate our algorithm in the standard Taxi domain, a common benchmark for MDPs, to show the effectiveness of our approach. We have also conducted a long-term case study in a highly complex simulated soccer domain and developed a team named WrightEagle that has won five world championships and five runner-up finishes in the recent 10 years of the annual RoboCup Soccer Simulation 2D competitions. The results in the RoboCup domain confirm the scalability of MAXQ-OP to very large domains.
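The core idea the abstract describes, evaluating a task online by recursing over its MAXQ subtask decomposition, can be sketched in miniature. The following is a hedged illustration, not the paper's implementation: the `Task`/`evaluate` names, the one-level hierarchy, and the toy 1-D navigation domain (reach position 3, step cost 1, goal reward +10) are all invented here purely to show the recursive value-plus-completion structure V(task, s) = max_sub [V(sub, s) + C(task, s, sub)].

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class Task:
    """A node in the task hierarchy (illustrative name, not from the paper)."""
    name: str
    # Primitive task: state -> (reward, next_state). None for composite tasks.
    primitive: Optional[Callable[[int], Tuple[float, int]]] = None
    subtasks: List["Task"] = field(default_factory=list)
    terminal: Callable[[int], bool] = lambda s: False

def evaluate(task: Task, state: int, depth: int):
    """Recursively estimate the value of pursuing `task` from `state`.

    Returns (value, state after the first subtask, first primitive action).
    The completion term C(task, s, sub) -- the value of finishing `task`
    after `sub` terminates -- is itself estimated by a recursive call,
    bounded by a depth budget, as in online settings.
    """
    if task.primitive is not None:
        reward, next_state = task.primitive(state)
        return reward, next_state, task.name
    if depth == 0 or task.terminal(state):
        return 0.0, state, None
    best = (float("-inf"), state, None)
    for sub in task.subtasks:
        v_sub, s_after, first = evaluate(sub, state, depth - 1)
        v_completion, _, _ = evaluate(task, s_after, depth - 1)  # C(task, s, sub)
        if v_sub + v_completion > best[0]:
            best = (v_sub + v_completion, s_after, first)
    return best

# Toy domain (made up for illustration): walk along a line to position 3.
right = Task("right", primitive=lambda s: (10.0 if s + 1 == 3 else -1.0, s + 1))
left = Task("left", primitive=lambda s: (-1.0, s - 1))
root = Task("Navigate", subtasks=[right, left], terminal=lambda s: s == 3)
```

From state 0 with a depth budget of 6, `evaluate(root, 0, 6)` selects `"right"` as the first action with value 8.0 (three steps of -1, -1, +10). In the full algorithm, the hierarchy is several levels deep and subtask terminations are stochastic, which is where the depth savings over flat search come from.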
Pages: 28
Related Papers
(50 records in total)
  • [31] Oblivious Markov Decision Processes: Planning and Policy Execution
    Alsayegh, Murtadha
    Fuentes, Jose
    Bobadilla, Leonardo
    Shell, Dylan A.
    2023 62ND IEEE CONFERENCE ON DECISION AND CONTROL, CDC, 2023, : 3850 - 3857
  • [32] Probabilistic Preference Planning Problem for Markov Decision Processes
    Li, Meilun
    Turrini, Andrea
    Hahn, Ernst Moritz
    She, Zhikun
    Zhang, Lijun
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2022, 48 (05) : 1545 - 1559
  • [33] Planning with Hierarchical Temporal Memory for Deterministic Markov Decision Problem
    Kuderov, Petr
    Panov, Aleksandr I.
    ICAART: PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE - VOL 2, 2021, : 1073 - 1081
  • [34] Learning and Planning with Timing Information in Markov Decision Processes
    Bacon, Pierre-Luc
    Balle, Borja
    Precup, Doina
    UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2015, : 111 - 120
  • [35] Actor-critic algorithms for hierarchical Markov decision processes
    Bhatnagar, S
    Panigrahi, JR
    AUTOMATICA, 2006, 42 (04) : 637 - 644
  • [36] Lagrange Dual Decomposition for Finite Horizon Markov Decision Processes
    Furmston, Thomas
    Barber, David
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PT I, 2011, 6911 : 487 - 502
  • [37] Large Scale Markov Decision Processes with Changing Rewards
    Cardoso, Adrian Rivera
    Wang, He
    Xu, Huan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [38] Online Regret Bounds for Markov Decision Processes with Deterministic Transitions
    Ortner, Ronald
    ALGORITHMIC LEARNING THEORY, PROCEEDINGS, 2008, 5254 : 123 - 137
  • [39] Online regret bounds for Markov decision processes with deterministic transitions
    Ortner, Ronald
    THEORETICAL COMPUTER SCIENCE, 2010, 411 (29-30) : 2684 - 2695
  • [40] Online Learning in Markov Decision Processes with Changing Cost Sequences
    Dick, Travis
    Gyorgy, Andras
    Szepesvari, Csaba
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 32 (CYCLE 1), 2014, 32