Bayes-adaptive hierarchical MDPs

Cited by: 2
Authors
Ngo Anh Vien [1 ]
Lee, SeungGwan [2 ]
Chung, TaeChoong [3 ]
Affiliations
[1] Univ Stuttgart, Machine Learning & Robot Lab, D-70174 Stuttgart, Germany
[2] Kyung Hee Univ, Coll Liberal Arts, 1 Seocheon Dong, Yongin 446701, Gyeonggi Do, South Korea
[3] Kyung Hee Univ, Dept Comp Engn, Yongin 446701, Gyeonggi Do, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Reinforcement learning; Bayesian reinforcement learning; Hierarchical reinforcement learning; MDP; POMDP; POSMDP; Monte-Carlo tree search; Hierarchical Monte-Carlo planning; POLICY GRADIENT SMDP; RESOURCE-ALLOCATION;
DOI
10.1007/s10489-015-0742-2
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement learning (RL) is an area of machine learning concerned with how an agent learns to make decisions sequentially in order to optimize a particular performance measure. To achieve this goal, the agent must choose between 1) exploiting previously acquired knowledge, which may lead to a local optimum, and 2) exploring to gather new knowledge that is expected to improve current performance. Among RL algorithms, Bayesian model-based RL (BRL) is well known for its ability to trade off exploitation and exploration optimally via belief planning, i.e., by solving a partially observable Markov decision process (POMDP). However, solving that POMDP often suffers from the curse of dimensionality and the curse of history. In this paper, we make two major contributions: 1) a framework that integrates temporal abstraction into BRL, resulting in a hierarchical POMDP formulation that can be solved online with a hierarchical sample-based planning solver; and 2) a subgoal discovery method for hierarchical BRL that automatically discovers useful macro actions to accelerate learning. In the experiments, we demonstrate that the proposed approach scales up to much larger problems and that the agent is able to discover useful subgoals that speed up Bayesian reinforcement learning.
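
The belief planning the abstract refers to maintains a posterior over the unknown MDP dynamics and plans against it. The following minimal Python sketch illustrates that idea with Dirichlet transition counts, posterior (root) sampling, and random Monte-Carlo rollouts; it is not the paper's hierarchical solver, and the toy reward, state/action sizes, and all parameter values are assumptions chosen only for illustration.

```python
# Illustrative sketch (not the paper's implementation) of Bayes-adaptive planning
# on a small discrete MDP: the belief over unknown transitions is a set of
# Dirichlet counts, and actions are chosen by Monte-Carlo rollouts on MDPs
# sampled from the posterior (root sampling, in the spirit of BAMCP).
import random

N_STATES, N_ACTIONS = 5, 2
GAMMA = 0.95

def make_belief():
    # Dirichlet(1, ..., 1) prior over next-state distributions for every (s, a).
    return [[[1.0] * N_STATES for _ in range(N_ACTIONS)] for _ in range(N_STATES)]

def update_belief(belief, s, a, s_next):
    # Bayesian update: add one pseudo-count to the observed transition.
    belief[s][a][s_next] += 1.0

def sample_mdp(belief):
    # Draw one full transition model from the Dirichlet posterior.
    model = [[None] * N_ACTIONS for _ in range(N_STATES)]
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            draws = [random.gammavariate(alpha, 1.0) for alpha in belief[s][a]]
            total = sum(draws)
            model[s][a] = [d / total for d in draws]
    return model

def reward(s, a):
    # Toy reward (assumption for this sketch): reaching the last state pays off.
    return 1.0 if s == N_STATES - 1 else 0.0

def rollout(model, s, depth):
    # One uniformly random rollout in the sampled MDP; returns the discounted return.
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        a = random.randrange(N_ACTIONS)
        ret += discount * reward(s, a)
        s = random.choices(range(N_STATES), weights=model[s][a])[0]
        discount *= GAMMA
    return ret

def plan(belief, s, n_samples=200, depth=20):
    # Bayes-adaptive action selection: average rollout returns over posterior samples.
    values = [0.0] * N_ACTIONS
    for _ in range(n_samples):
        model = sample_mdp(belief)
        for a in range(N_ACTIONS):
            s_next = random.choices(range(N_STATES), weights=model[s][a])[0]
            values[a] += (reward(s, a) + GAMMA * rollout(model, s_next, depth)) / n_samples
    return max(range(N_ACTIONS), key=lambda a: values[a])

if __name__ == "__main__":
    belief, s = make_belief(), 0
    a = plan(belief, s)
    print("chosen action:", a)
    update_belief(belief, s, a, 1)  # e.g. after observing a transition to state 1
```

Root sampling of this kind avoids enumerating the belief-augmented state space explicitly; the paper's contribution, as described in the abstract, is to combine such sample-based planning with temporal abstraction (macro actions and automatically discovered subgoals) so that it scales to larger problems.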
Pages: 112-126
Number of pages: 15