Reward-free offline reinforcement learning: Optimizing behavior policy via action exploration

Cited: 0
Authors
Huang, Zhenbo [1 ]
Sun, Shiliang [2 ]
Zhao, Jing [1 ]
Affiliations
[1] East China Normal Univ, Sch Comp Sci & Technol, Shanghai 200062, Peoples R China
[2] Shanghai Jiao Tong Univ, Dept Automat, Shanghai 200240, Peoples R China
Keywords
Offline reinforcement learning; Reward-free learning; Action exploration;
DOI
10.1016/j.knosys.2024.112018
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Offline reinforcement learning (RL) aims to learn a policy from pre-collected data, avoiding costly or risky interactions with the environment. In the offline setting, the inherent problem of distribution shift leads to extrapolation error, causing policy learning to fail. Conventional offline RL methods tackle this by reducing the value estimates of unseen actions or by imposing policy constraints. However, these methods confine the agent's actions to the data manifold, limiting its capacity to gain new insights from actions beyond the dataset's scope. To address this, we propose a novel offline RL method that incorporates action exploration, called EoRL. We partition policy learning into behavior learning and exploration learning: exploration learning empowers the agent to discover novel actions, while behavior learning approximates the behavior policy. Specifically, in exploration learning, we define the deviation between decision actions and dataset actions as the action novelty and replace the traditional reward with an assessment of the policy's cumulative novelty. The behavior policy, in turn, restricts actions to the vicinity of dataset-supported actions, and the two parts of policy learning share parameters. We show that EoRL explores a larger action space while controlling the policy shift, and that its reward-free learning model is better suited to realistic task scenarios. Experimental results demonstrate the strong performance of our method on MuJoCo locomotion and 2D maze tasks.
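To make the abstract's split between behavior learning and exploration learning concrete, the following is a minimal PyTorch sketch of the idea: a shared trunk with a behavior head trained to stay near dataset actions and an exploration head trained on an action-novelty signal in place of a reward. It is an illustration under stated assumptions, not the authors' implementation; the names (SharedPolicy, eorl_losses), the per-step L2 novelty measure, and the radius constraint are all assumptions introduced here.

```python
# Illustrative sketch only: class/function names, the L2 novelty measure, and the
# radius hyperparameter are assumptions, not the paper's released code.
import torch
import torch.nn as nn

class SharedPolicy(nn.Module):
    """Behavior and exploration heads sharing a common trunk (assumed structure)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.behavior_head = nn.Linear(hidden, action_dim)     # approximates the behavior policy
        self.exploration_head = nn.Linear(hidden, action_dim)  # proposes novel actions

    def forward(self, state):
        h = self.trunk(state)
        return torch.tanh(self.behavior_head(h)), torch.tanh(self.exploration_head(h))

def eorl_losses(policy, states, dataset_actions, radius=0.2):
    """Losses for one batch of offline transitions (per-step novelty, not cumulative, for brevity)."""
    b_act, e_act = policy(states)
    # Behavior learning: stay close to dataset-supported actions (behavior cloning).
    behavior_loss = ((b_act - dataset_actions) ** 2).mean()
    # Exploration learning: action novelty = deviation from the dataset action,
    # maximized in place of a reward signal.
    novelty = torch.norm(e_act - dataset_actions, dim=-1)
    exploration_loss = -novelty.mean()
    # Keep exploratory actions within a neighborhood of the behavior action
    # so the policy shift stays controlled (radius is an assumed hyperparameter).
    shift_penalty = torch.relu(torch.norm(e_act - b_act.detach(), dim=-1) - radius).mean()
    return behavior_loss + exploration_loss + shift_penalty

# Minimal usage on random tensors standing in for an offline dataset.
policy = SharedPolicy(state_dim=17, action_dim=6)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
states, actions = torch.randn(64, 17), torch.rand(64, 6) * 2 - 1
loss = eorl_losses(policy, states, actions)
opt.zero_grad(); loss.backward(); opt.step()
```

The shared trunk reflects the abstract's statement that the two parts of policy learning share parameters; how the novelty is accumulated over trajectories and traded off against the behavior term in the actual method would follow the paper, not this sketch.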
Pages: 13