Loss Dynamics of Temporal Difference Reinforcement Learning

Cited: 0
|
Authors
Bordelon, Blake [1 ]
Masset, Paul [1 ]
Kuo, Henry [1 ]
Pehlevan, Cengiz [1 ]
Affiliations
[1] Harvard Univ, John Paulson Sch Engn & Appl Sci, Ctr Brain Sci, Kempner Inst Study Nat & Artificial Intelligence, Cambridge, MA 02138 USA
Keywords
STATISTICAL-MECHANICS; CONVERGENCE; HIPPOCAMPUS;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement learning has been successful across several applications in which agents must learn to act in environments with sparse feedback. Despite this empirical success, however, there is still a lack of theoretical understanding of how the parameters of reinforcement learning models and the features used to represent states interact to control the dynamics of learning. In this work, we use concepts from statistical physics to study the typical-case learning curves for temporal difference learning of a value function with linear function approximators. Our theory is derived under a Gaussian equivalence hypothesis, in which averages over the random trajectories are replaced with temporally correlated Gaussian feature averages, and we validate our assumptions on small-scale Markov Decision Processes. We find that the stochastic semi-gradient noise due to subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function. We then analyze how strategies such as learning rate annealing and reward shaping can favorably alter the learning dynamics and plateaus. Our work introduces new tools that open a direction toward developing a theory of learning dynamics in reinforcement learning.
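The setting the abstract describes (semi-gradient TD(0) learning a value function with fixed linear features on a small MDP) can be sketched as follows. This is an illustrative example only, not the authors' code: the MDP, the features, and all hyperparameter values are arbitrary assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

S, D = 8, 4                              # number of states, feature dimension
P = rng.dirichlet(np.ones(S), size=S)    # random transition matrix (rows sum to 1)
r = rng.normal(size=S)                   # reward for leaving each state
Phi = rng.normal(size=(S, D))            # fixed linear features, one row per state
gamma = 0.9                              # discount factor

# Ground-truth value function: V = (I - gamma * P)^{-1} r
V_true = np.linalg.solve(np.eye(S) - gamma * P, r)

def td0(lr=0.05, steps=20_000):
    """Semi-gradient TD(0) with linear function approximation.

    Returns the learned weights and the mean-squared value error.
    With a constant learning rate, the error does not go to the best
    achievable value but fluctuates around a noise floor, the kind of
    plateau induced by stochastic semi-gradient noise that the
    abstract discusses.
    """
    w = np.zeros(D)
    s = 0
    for _ in range(steps):
        s_next = rng.choice(S, p=P[s])
        # Semi-gradient TD error: the bootstrap target Phi[s_next] @ w
        # is treated as a constant, so only Phi[s] enters the update.
        delta = r[s] + gamma * Phi[s_next] @ w - Phi[s] @ w
        w += lr * delta * Phi[s]
        s = s_next
    return w, np.mean((Phi @ w - V_true) ** 2)

w, err = td0()
```

Annealing the learning rate (e.g. passing a schedule instead of a constant `lr`) shrinks the semi-gradient noise floor, which is one of the interventions the abstract analyzes.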
Pages: 28
Related Papers
50 records in total
  • [1] Temporal difference coding in reinforcement learning
    Iwata, K
    Ikeda, K
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING, 2003, 2690 : 218 - 227
  • [2] Dynamics of Temporal Difference Learning
    Wendemuth, Andreas
    20TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2007, : 1107 - 1112
  • [3] Off-Policy Reinforcement Learning with Loss Function Weighted by Temporal Difference Error
    Park, Bumgeun
    Kim, Taeyoung
    Moon, Woohyeon
    Nengroo, Sarvar Hussain
    Har, Dongsoo
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT V, 2023, 14090 : 600 - 613
  • [4] STOCHASTIC KERNEL TEMPORAL DIFFERENCE FOR REINFORCEMENT LEARNING
    Bae, Jihye
    Giraldo, Luis Sanchez
    Chhatbar, Pratik
    Francis, Joseph
    Sanchez, Justin
    Principe, Jose
    2011 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2011,
  • [5] Reinforcement Learning via Kernel Temporal Difference
    Bae, Jihye
    Chhatbar, Pratik
    Francis, Joseph T.
    Sanchez, Justin C.
    Principe, Jose C.
    2011 ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC), 2011, : 5662 - 5665
  • [6] Eigensubspace of Temporal-Difference Dynamics and How It Improves Value Approximation in Reinforcement Learning
    He, Qiang
    Zhou, Tianyi
    Fang, Meng
    Maghsudi, Setareh
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT IV, 2023, 14172 : 573 - 589
  • [7] Monte Carlo and Temporal Difference Methods in Reinforcement Learning
    Han, Isaac
    Oh, Seungwon
    Jung, Hoyoun
    Chung, Insik
    Kim, Kyung-Joong
    IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE, 2023, 18 (04) : 64 - 65
  • [8] The role of dopamine in the temporal difference model of reinforcement learning
    Montague, R
    NEUROPSYCHOPHARMACOLOGY, 2005, 30 : S27 - S27
  • [9] A Reinforcement Learning Model Based on Temporal Difference Algorithm
    Li, Xiali
    Lv, Zhengyu
    Wang, Song
    Wei, Zhi
    Wu, Licheng
    IEEE ACCESS, 2019, 7 : 121922 - 121930
  • [10] Basis function adaptation in temporal difference reinforcement learning
    Menache, I
    Mannor, S
    Shimkin, N
    ANNALS OF OPERATIONS RESEARCH, 2005, 134 (01) : 215 - 238