Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Cited: 0
Authors
Ishfaq, Haque [1 ,2 ]
Cui, Qiwen [3 ]
Viet Nguyen [1 ,2 ]
Ayoub, Alex [4 ,5 ]
Yang, Zhuoran [6 ]
Wang, Zhaoran [7 ]
Precup, Doina [1 ,2 ,8 ]
Yang, Lin F. [9 ]
Affiliations
[1] Mila, Montreal, PQ, Canada
[2] McGill Univ, Sch Comp Sci, Montreal, PQ, Canada
[3] Peking Univ, Sch Math Sci, Beijing, Peoples R China
[4] Univ Alberta, Amii, Edmonton, AB, Canada
[5] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
[6] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[7] Northwestern Univ, Ind Engn Management Sci, Evanston, IL 60208 USA
[8] DeepMind, Montreal, PQ, Canada
[9] Univ Calif Los Angeles, Dept Elect & Comp Engn, Los Angeles, CA USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class $\mathcal{F}$, our algorithm achieves a worst-case regret bound of $\tilde{O}(\mathrm{poly}(d_E H)\sqrt{T})$, where $T$ is the time elapsed, $H$ is the planning horizon and $d_E$ is the eluder dimension of $\mathcal{F}$. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an $\tilde{O}(\sqrt{d^3 H^3 T})$ regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.
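A minimal sketch of the perturbation idea described in the abstract, restricted to the linear setting: several ridge regressions are fit to targets perturbed by i.i.d. scalar noise, and taking the maximum over the resulting Q-estimates gives an optimistic value estimate. The function names, noise scales, and number of samples below are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

def perturbed_q_weights(Phi, targets, lam=1.0, sigma=1.0, num_samples=10, rng=None):
    """Fit several ridge regressions on noise-perturbed targets (illustrative sketch).

    Phi     : (n, d) array of state-action features phi(s, a)
    targets : (n,) array of regression targets, e.g. r + max_a' Q_{h+1}(s', a')
    Returns a (num_samples, d) array of perturbed weight vectors.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(d)                   # regularized Gram matrix
    weights = []
    for _ in range(num_samples):
        noise = sigma * rng.standard_normal(n)          # i.i.d. scalar noise per transition
        prior_noise = sigma * rng.standard_normal(d)    # perturbation of the ridge prior
        w = np.linalg.solve(A, Phi.T @ (targets + noise) + np.sqrt(lam) * prior_noise)
        weights.append(w)
    return np.stack(weights)

def optimistic_q(weights, phi_sa):
    """Optimistic Q-value at one (s, a): max over the perturbed estimates."""
    return np.max(weights @ phi_sa)
```

Under this sketch, no UCB-style bonus is computed; optimism comes solely from maximizing over the randomly perturbed regression solutions.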
Pages: 10
Related Papers
50 records in total
  • [21] The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation
    Winnicki, Anna
    Lubars, Joseph
    Livesay, Michael
    Srikant, R.
    OPERATIONS RESEARCH, 2025, 73 (01)
  • [22] Adaptive importance sampling for value function approximation in off-policy reinforcement learning
    Hachiya, Hirotaka
    Akiyama, Takayuki
    Sugiyama, Masashi
    Peters, Jan
    NEURAL NETWORKS, 2009, 22 (10) : 1399 - 1410
  • [23] Ramp Metering for a Distant Downstream Bottleneck Using Reinforcement Learning with Value Function Approximation
    Zhou, Yue
    Ozbay, Kaan
    Kachroo, Pushkin
    Zuo, Fan
    JOURNAL OF ADVANCED TRANSPORTATION, 2020, 2020 (2020)
  • [24] A Clustering-Based Graph Laplacian Framework for Value Function Approximation in Reinforcement Learning
    Xu, Xin
    Huang, Zhenhua
    Graves, Daniel
    Pedrycz, Witold
    IEEE TRANSACTIONS ON CYBERNETICS, 2014, 44 (12) : 2613 - 2625
  • [25] Restricted gradient-descent algorithm for value-function approximation in reinforcement learning
    Salles Barreto, Andre da Motta
    Anderson, Charles W.
    ARTIFICIAL INTELLIGENCE, 2008, 172 (4-5) : 454 - 482
  • [26] Multiagent reinforcement learning using function approximation
    Abul, O
    Polat, F
    Alhajj, R
    IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS, 2000, 30 (04): : 485 - 497
  • [27] Resilient Multiagent Reinforcement Learning With Function Approximation
    Ye, Lintao
    Figura, Martin
    Lin, Yixuan
    Pal, Mainak
    Das, Pranoy
    Liu, Ji
    Gupta, Vijay
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2024, 69 (12) : 8497 - 8512
  • [28] Ensemble Methods for Reinforcement Learning with Function Approximation
    Fausser, Stefan
    Schwenker, Friedhelm
    MULTIPLE CLASSIFIER SYSTEMS, 2011, 6713 : 56 - 65
  • [29] Distributional reinforcement learning with linear function approximation
    Bellemare, Marc G.
    Le Roux, Nicolas
    Castro, Pablo Samuel
    Moitra, Subhodeep
    22ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 89, 2019, 89
  • [30] Reinforcement learning with function approximation converges to a region
    Gordon, GJ
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 13, 2001, 13 : 1040 - 1046