Reward-free offline reinforcement learning: Optimizing behavior policy via action exploration

Cited by: 0

Authors:

Huang, Zhenbo [1]

Sun, Shiliang [2]

Zhao, Jing [1]

Affiliations:

[1] East China Normal Univ, Sch Comp Sci & Technol, Shanghai 200062, Peoples R China

[2] Shanghai Jiao Tong Univ, Dept Automat, Shanghai 200240, Peoples R China

Source:

KNOWLEDGE-BASED SYSTEMS | 2024 / Vol. 299

Keywords:

Offline reinforcement learning; Reward-free learning; Action exploration

DOI:

10.1016/j.knosys.2024.112018

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Offline reinforcement learning (RL) aims to learn a policy from pre-collected data, avoiding costly or risky interactions with the environment. In the offline setting, distribution shift leads to extrapolation error, which can cause policy learning to fail. Conventional offline RL methods address this by lowering the value estimates of unseen actions or by imposing policy constraints. However, these methods confine the agent's actions to the data manifold, limiting its ability to learn from actions outside the dataset. To address this, we propose EoRL, a novel offline RL method that incorporates action exploration. We partition policy learning into behavior learning and exploration learning: exploration learning enables the agent to discover novel actions, while behavior learning approximates the behavior policy. Specifically, in exploration learning, we define the deviation between decision actions and dataset actions as the action novelty, and replace the traditional reward with an assessment of the policy's cumulative novelty. In addition, the behavior policy restricts actions to the vicinity of the dataset-supported actions, and the two parts of policy learning share parameters. We demonstrate EoRL's ability to explore a larger action space while controlling policy shift; moreover, its reward-free learning model is more compatible with realistic task scenarios. Experimental results demonstrate the outstanding performance of our method on MuJoCo locomotion and 2D maze tasks.
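
The novelty signal described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the deviation is measured or how novelty is accumulated, so the Euclidean distance, the discount factor `gamma`, and the function names `action_novelty` and `cumulative_novelty` below are all assumptions made for illustration, not the authors' formulation.

```python
from typing import Sequence


def action_novelty(policy_action: Sequence[float],
                   dataset_action: Sequence[float]) -> float:
    """Deviation between a decision action and the dataset action.

    Assumption: the deviation is measured as Euclidean distance; the
    abstract only says "deviation" and does not fix a metric.
    """
    if len(policy_action) != len(dataset_action):
        raise ValueError("actions must have the same dimension")
    squared = sum((p - d) ** 2 for p, d in zip(policy_action, dataset_action))
    return squared ** 0.5


def cumulative_novelty(novelties: Sequence[float], gamma: float = 0.99) -> float:
    """Discounted sum of per-step novelty, standing in for a reward return.

    Assumption: novelty is accumulated like a discounted return; gamma is
    a hypothetical hyperparameter, not a value taken from the paper.
    """
    total = 0.0
    # Accumulate backwards so each step adds its novelty plus the
    # discounted novelty of all later steps.
    for novelty in reversed(novelties):
        total = novelty + gamma * total
    return total


# Hypothetical two-step trajectory: actions chosen by the policy and the
# actions recorded in the dataset at the same states.
policy_actions = [[0.5, 0.1], [0.2, -0.3]]
dataset_actions = [[0.4, 0.1], [0.2, 0.0]]
step_novelty = [action_novelty(p, d)
                for p, d in zip(policy_actions, dataset_actions)]
trajectory_score = cumulative_novelty(step_novelty, gamma=0.99)
```

In this sketch the score grows as the policy's actions move away from the dataset actions, which is why the abstract pairs it with a behavior policy that keeps actions near the dataset-supported ones.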

Pages: 13
