Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Cited by: 0
Authors
Ishfaq, Haque [1 ,2 ]
Cui, Qiwen [3 ]
Nguyen, Viet [1 ,2 ]
Ayoub, Alex [4 ,5 ]
Yang, Zhuoran [6 ]
Wang, Zhaoran [7 ]
Precup, Doina [1 ,2 ,8 ]
Yang, Lin F. [9 ]
Affiliations
[1] Mila, Montreal, PQ, Canada
[2] McGill Univ, Sch Comp Sci, Montreal, PQ, Canada
[3] Peking Univ, Sch Math Sci, Beijing, Peoples R China
[4] Univ Alberta, Amii, Edmonton, AB, Canada
[5] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
[6] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[7] Northwestern Univ, Ind Engn Management Sci, Evanston, IL 60208 USA
[8] DeepMind, Montreal, PQ, Canada
[9] Univ Calif Los Angeles, Dept Elect & Comp Engn, Los Angeles, CA USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class F, our algorithm achieves a worst-case regret bound of $\widetilde{O}(\mathrm{poly}(d_E H)\sqrt{T})$, where $T$ is the time elapsed, $H$ is the planning horizon and $d_E$ is the eluder dimension of F. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an $\widetilde{O}(\sqrt{d^3 H^3 T})$ regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.
Pages: 10
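As a rough illustration of the mechanism described in the abstract (perturb the regression targets with i.i.d. scalar noise, fit several independently perturbed least-squares solutions, and take their pointwise maximum as an optimistic value estimate), the sketch below shows the idea in the linear setting. It is not code from the paper: the function names (perturbed_lsvi_step, optimistic_q) and the parameters M, sigma, and lam are assumptions made for illustration, and the actual LSVI-PHE algorithm prescribes specific noise scales and further details omitted here.

```python
import numpy as np

def perturbed_lsvi_step(Phi, targets, M=10, sigma=1.0, lam=1.0, rng=None):
    """Hypothetical sketch of one perturbed least-squares regression step.

    Phi     : (n, d) feature matrix of visited state-action pairs
    targets : (n,) regression targets, e.g. r + max_a Q_{h+1}(s', a)
    M       : ensemble size (assumed parameter, not from the record)
    sigma   : std of the i.i.d. scalar noise added to each target (assumed)
    lam     : ridge regularization strength (assumed)
    Returns an (M, d) array of perturbed ridge-regression solutions.
    """
    rng = rng or np.random.default_rng()
    n, d = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(d)               # regularized Gram matrix
    weights = []
    for _ in range(M):
        noise = sigma * rng.standard_normal(n)       # i.i.d. scalar noise per sample
        w = np.linalg.solve(A, Phi.T @ (targets + noise))
        weights.append(w)
    return np.stack(weights)

def optimistic_q(phi_sa, weights):
    """Optimistic Q estimate at one (s, a): max over the perturbed ensemble."""
    return np.max(weights @ phi_sa)
```

Run backward over the planning horizon (h = H, ..., 1) with greedy action selection against the ensemble-max Q, this sketch mirrors the optimistic reward sampling idea without any UCB-style bonus term.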
Related papers
50 results in total
  • [1] Efficient exploration through active learning for value function approximation in reinforcement learning
    Akiyama, Takayuki
    Hachiya, Hirotaka
    Sugiyama, Masashi
    NEURAL NETWORKS, 2010, 23 (05) : 639 - 648
  • [2] Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning
    Akiyama, Takayuki
    Hachiya, Hirotaka
    Sugiyama, Masashi
    21ST INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-09), PROCEEDINGS, 2009, : 980 - 985
  • [3] Toward General Function Approximation in Nonstationary Reinforcement Learning
    Feng, Songtao
    Yin, Ming
    Huang, Ruiquan
    Wang, Yu-Xiang
    Yang, Jing
    Liang, Yingbin
    IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, 2024, 5 : 190 - 206
  • [4] CBR for state value function approximation in reinforcement learning
    Gabel, T
    Riedmiller, M
    CASE-BASED REASONING RESEARCH AND DEVELOPMENT, PROCEEDINGS, 2005, 3620 : 206 - 221
  • [5] Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation
    Foster, Dylan J.
    Krishnamurthy, Akshay
    Simchi-Levi, David
    Xu, Yunzong
    CONFERENCE ON LEARNING THEORY, VOL 178, 2022, 178
  • [6] Distributed Value Function Approximation for Collaborative Multiagent Reinforcement Learning
    Stankovic, Milos S.
    Beko, Marko
    Stankovic, Srdjan S.
    IEEE TRANSACTIONS ON CONTROL OF NETWORK SYSTEMS, 2021, 8 (03): : 1270 - 1280
  • [7] A grey approximation approach to state value function in reinforcement learning
    Hwang, Kao-Shing
    Chen, Yu-Jen
    Lee, Guar-Yuan
    2007 IEEE INTERNATIONAL CONFERENCE ON INTEGRATION TECHNOLOGY, PROCEEDINGS, 2007, : 379 - +
  • [8] Parameterized Indexed Value Function for Efficient Exploration in Reinforcement Learning
    Tan, Tian
    Xiong, Zhihan
    Dwaracherla, Vikranth R.
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 5948 - 5955
  • [9] Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension
    Wang, Ruosong
    Salakhutdinov, Ruslan
    Yang, Lin F.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [10] Corruption-Robust Offline Reinforcement Learning with General Function Approximation
    Ye, Chenlu
    Yang, Rui
    Gu, Quanquan
    Zhang, Tong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,