Supported Value Regularization for Offline Reinforcement Learning

Cited by: 0
Authors:
Mao, Yixiu [1]
Zhang, Hongchang [1]
Chen, Chen [1]
Xu, Yi [2]
Ji, Xiangyang [1]
Affiliations:
[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[2] Dalian Univ Technol, Sch Artificial Intelligence, Dalian, Peoples R China
Funding: National Key R&D Program of China; National Natural Science Foundation of China
DOI: not available
CLC Number: TP18 [Theory of Artificial Intelligence]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Offline reinforcement learning suffers from extrapolation error and value overestimation caused by out-of-distribution (OOD) actions. To mitigate this issue, value regularization approaches penalize the learned value function so that it assigns lower values to OOD actions. However, existing value regularization methods do not properly distinguish the regularization effects on in-distribution (ID) and OOD actions, and fail to guarantee optimal convergence results for the policy. To this end, we propose Supported Value Regularization (SVR), which penalizes the Q-values of all OOD actions while maintaining standard Bellman updates for ID ones. Specifically, we exploit the bias of importance sampling to compute the sum of Q-values over the entire OOD region, which serves as the penalty for policy evaluation. This design automatically separates the regularization of ID and OOD actions without manually distinguishing between them. In tabular MDPs, we show that the policy evaluation operator of SVR is a contraction whose fixed point yields unbiased Q-values for ID actions and underestimated Q-values for OOD actions. Furthermore, policy iteration with SVR guarantees strict policy improvement until convergence to the optimal support-constrained policy for the dataset. Empirically, we validate the theoretical properties of SVR in a tabular maze environment and demonstrate its state-of-the-art performance on a range of continuous control tasks in the D4RL benchmark.
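Reading the abstract literally, the critic update combines a standard Bellman backup on dataset actions with a penalty on the total Q-mass over OOD actions, where the latter is obtained from importance-sampling estimates. The sketch below is one possible rendering of that structure in PyTorch, not the authors' implementation: the function svr_critic_loss, the behavior-density estimate log_mu, and all hyperparameters are hypothetical, and the OOD term is approximated as the Q-integral over the whole action box minus an importance-weighted estimate of the Q-integral over the behavior support.

```python
# Minimal sketch (not the authors' code) of an SVR-style critic update for a
# continuous control setting, assuming actions normalized to [-1, 1]^d.
# Hypothetical pieces: q_net/q_target map (state, action) -> Q-value,
# policy maps states to actions, and log_mu(s, a) is a learned estimate of
# the behavior policy's log-density (e.g., from a CVAE), used to form the
# importance weights mentioned in the abstract.
import torch
import torch.nn.functional as F

def svr_critic_loss(q_net, q_target, policy, log_mu, batch,
                    gamma=0.99, alpha=0.1, n_samples=10):
    s, a, r, s_next, done = batch  # tensors drawn from the offline dataset

    # (1) Standard Bellman backup on in-distribution (dataset) actions.
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * q_target(s_next, a_next)
    bellman_loss = F.mse_loss(q_net(s, a), target)

    # (2) Penalty on the Q-mass outside the dataset support, approximated as
    #     (integral of Q over the whole action box, via uniform samples)
    #   - (integral of Q over the behavior support, via importance weights
    #      1 / mu(a|s) applied to dataset actions).
    batch_size, a_dim = a.shape
    vol = 2.0 ** a_dim                                    # volume of [-1, 1]^d
    a_unif = torch.rand(batch_size, n_samples, a_dim, device=s.device) * 2.0 - 1.0
    s_rep = s.unsqueeze(1).expand(-1, n_samples, -1).reshape(-1, s.shape[-1])
    q_all = q_net(s_rep, a_unif.reshape(-1, a_dim)).reshape(batch_size, n_samples)
    total_mass = vol * q_all.mean(dim=1)                  # ~ integral of Q over all actions

    with torch.no_grad():
        iw = torch.exp(-log_mu(s, a)).clamp(max=100.0)    # 1 / mu(a|s), clipped for stability
    id_mass = iw * q_net(s, a).squeeze(-1)                # ~ integral of Q over supported actions

    ood_penalty = (total_mass - id_mass).mean()
    return bellman_loss + alpha * ood_penalty
```

In this reading, only the Bellman term drives supported actions toward their Bellman targets, while the penalty pushes Q-values down wherever the estimated behavior density is negligible, mirroring the separation between ID and OOD regularization described in the abstract.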
Pages: 23
Related Papers
50 records in total
  • [1] Offline Reinforcement Learning With Behavior Value Regularization. Huang, Longyang; Dong, Botao; Xie, Wei; Zhang, Weidong. IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54(06): 3692-3704
  • [2] ORAD: a new framework of offline Reinforcement Learning with Q-value regularization. Zhang, Longfei; Zhang, Yulong; Liu, Shixuan; Chen, Li; Liang, Xingxing; Cheng, Guangquan; Liu, Zhong. EVOLUTIONARY INTELLIGENCE, 2024, 17(01): 339-347
  • [3] Offline Reinforcement Learning with Fisher Divergence Critic Regularization. Kostrikov, Ilya; Tompson, Jonathan; Fergus, Rob; Nachum, Ofir. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021
  • [4] Supported Policy Optimization for Offline Reinforcement Learning. Wu, Jialong; Wu, Haixu; Qiu, Zihan; Wang, Jianmin; Long, Mingsheng. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [5] Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization. Wang, Xiangsen; Xu, Haoran; Zheng, Yinan; Zhan, Xianyuan. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [6] Offline Reinforcement Learning with Uncertainty Critic Regularization Based on Density Estimation. Li, Chao; Wu, Fengge; Zhao, Junsuo. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2023
  • [7] Offline Reinforcement Learning with On-Policy Q-Function Regularization. Shi, Laixi; Dadashi, Robert; Chi, Yuejie; Castro, Pablo Samuel; Geist, Matthieu. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT IV, 2023, 14172: 455-471
  • [8] Towards Offline Reinforcement Learning with Pessimistic Value Priors. Valdettaro, Filippo; Faisal, A. Aldo. EPISTEMIC UNCERTAINTY IN ARTIFICIAL INTELLIGENCE (EPI UAI 2023), 2024, 14523: 89-100
  • [9] Conservative State Value Estimation for Offline Reinforcement Learning. Chen, Liting; Yan, Jie; Shao, Zhengdao; Wang, Lu; Lin, Qingwei; Rajmohan, Saravan; Moscibroda, Thomas; Zhang, Dongmei. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023