Supported Value Regularization for Offline Reinforcement Learning

Cited by: 0
Authors:
Mao, Yixiu [1]
Zhang, Hongchang [1]
Chen, Chen [1]
Xu, Yi [2]
Ji, Xiangyang [1]
Affiliations:
[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[2] Dalian Univ Technol, Sch Artificial Intelligence, Dalian, Peoples R China
Funding: National Key R&D Program of China; National Natural Science Foundation of China
DOI: not available
CLC Number: TP18 [Theory of Artificial Intelligence]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Offline reinforcement learning suffers from extrapolation error and value overestimation caused by out-of-distribution (OOD) actions. To mitigate this issue, value regularization approaches penalize the learned value function so that it assigns lower values to OOD actions. However, existing value regularization methods do not properly distinguish the regularization effects on in-distribution (ID) and OOD actions, and fail to guarantee optimal convergence results for the policy. To this end, we propose Supported Value Regularization (SVR), which penalizes the Q-values of all OOD actions while maintaining standard Bellman updates for ID ones. Specifically, we exploit the bias of importance sampling to compute the sum of Q-values over the entire OOD region, which serves as the penalty for policy evaluation. This design automatically separates the regularization of ID and OOD actions without manually distinguishing between them. In tabular MDPs, we show that the policy evaluation operator of SVR is a contraction whose fixed point yields unbiased Q-values for ID actions and underestimated Q-values for OOD actions. Furthermore, policy iteration with SVR guarantees strict policy improvement until convergence to the optimal support-constrained policy for the dataset. Empirically, we validate the theoretical properties of SVR in a tabular maze environment and demonstrate its state-of-the-art performance on a range of continuous control tasks in the D4RL benchmark.
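Reading the abstract literally, the critic update combines a standard Bellman backup on dataset actions with a penalty on the total Q-mass over OOD actions, where the latter is obtained from importance-sampling estimates. The sketch below is one possible rendering of that structure in PyTorch, not the authors' implementation: the function svr_critic_loss, the behavior-density estimate log_mu, and all hyperparameters are hypothetical, and the OOD term is approximated as the Q-integral over the whole action box minus an importance-weighted estimate of the Q-integral over the behavior support.

```python
# Minimal sketch (not the authors' code) of an SVR-style critic update for a
# continuous control setting, assuming actions normalized to [-1, 1]^d.
# Hypothetical pieces: q_net/q_target map (state, action) -> Q-value,
# policy maps states to actions, and log_mu(s, a) is a learned estimate of
# the behavior policy's log-density (e.g., from a CVAE), used to form the
# importance weights mentioned in the abstract.
import torch
import torch.nn.functional as F

def svr_critic_loss(q_net, q_target, policy, log_mu, batch,
                    gamma=0.99, alpha=0.1, n_samples=10):
    s, a, r, s_next, done = batch  # tensors drawn from the offline dataset

    # (1) Standard Bellman backup on in-distribution (dataset) actions.
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * q_target(s_next, a_next)
    bellman_loss = F.mse_loss(q_net(s, a), target)

    # (2) Penalty on the Q-mass outside the dataset support, approximated as
    #     (integral of Q over the whole action box, via uniform samples)
    #   - (integral of Q over the behavior support, via importance weights
    #      1 / mu(a|s) applied to dataset actions).
    batch_size, a_dim = a.shape
    vol = 2.0 ** a_dim                                    # volume of [-1, 1]^d
    a_unif = torch.rand(batch_size, n_samples, a_dim, device=s.device) * 2.0 - 1.0
    s_rep = s.unsqueeze(1).expand(-1, n_samples, -1).reshape(-1, s.shape[-1])
    q_all = q_net(s_rep, a_unif.reshape(-1, a_dim)).reshape(batch_size, n_samples)
    total_mass = vol * q_all.mean(dim=1)                  # ~ integral of Q over all actions

    with torch.no_grad():
        iw = torch.exp(-log_mu(s, a)).clamp(max=100.0)    # 1 / mu(a|s), clipped for stability
    id_mass = iw * q_net(s, a).squeeze(-1)                # ~ integral of Q over supported actions

    ood_penalty = (total_mass - id_mass).mean()
    return bellman_loss + alpha * ood_penalty
```

In this reading, only the Bellman term drives supported actions toward their Bellman targets, while the penalty pushes Q-values down wherever the estimated behavior density is negligible, mirroring the separation between ID and OOD regularization described in the abstract.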
Pages: 23
Related Papers
50 records in total
  • [1] Offline Reinforcement Learning With Behavior Value Regularization. Huang, Longyang; Dong, Botao; Xie, Wei; Zhang, Weidong. IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54(06): 3692-3704
  • [2] ORAD: a new framework of offline Reinforcement Learning with Q-value regularization. Zhang, Longfei; Zhang, Yulong; Liu, Shixuan; Chen, Li; Liang, Xingxing; Cheng, Guangquan; Liu, Zhong. EVOLUTIONARY INTELLIGENCE, 2024, 17(01): 339-347
  • [3] Offline Reinforcement Learning with Fisher Divergence Critic Regularization. Kostrikov, Ilya; Tompson, Jonathan; Fergus, Rob; Nachum, Ofir. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021
  • [4] Supported Policy Optimization for Offline Reinforcement Learning. Wu, Jialong; Wu, Haixu; Qiu, Zihan; Wang, Jianmin; Long, Mingsheng. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [5] Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization. Wang, Xiangsen; Xu, Haoran; Zheng, Yinan; Zhan, Xianyuan. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [6] Offline Reinforcement Learning with Uncertainty Critic Regularization Based on Density Estimation. Li, Chao; Wu, Fengge; Zhao, Junsuo. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2023
  • [7] Offline Reinforcement Learning with On-Policy Q-Function Regularization. Shi, Laixi; Dadashi, Robert; Chi, Yuejie; Castro, Pablo Samuel; Geist, Matthieu. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT IV, 2023, 14172: 455-471
  • [8] Towards Offline Reinforcement Learning with Pessimistic Value Priors. Valdettaro, Filippo; Faisal, A. Aldo. EPISTEMIC UNCERTAINTY IN ARTIFICIAL INTELLIGENCE (EPI UAI 2023), 2024, 14523: 89-100
  • [9] Conservative State Value Estimation for Offline Reinforcement Learning. Chen, Liting; Yan, Jie; Shao, Zhengdao; Wang, Lu; Lin, Qingwei; Rajmohan, Saravan; Moscibroda, Thomas; Zhang, Dongmei. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023