Adaptive trajectory-constrained exploration strategy for deep reinforcement learning

Cited: 0
|
Authors
Wang, Guojian [1 ,4 ]
Wu, Faguo [1 ,2 ,3 ,4 ,5 ]
Zhang, Xiao [1 ,3 ,4 ,5 ]
Guo, Ning [1 ,4 ]
Zheng, Zhiming [2 ,3 ,4 ,5 ]
Affiliations
[1] Beihang Univ, Sch Math Sci, Beijing 100191, Peoples R China
[2] Beihang Univ, Inst Artificial Intelligence, Beijing 100191, Peoples R China
[3] Zhongguancun Lab, Beijing 100194, Peoples R China
[4] Beihang Univ, Key Lab Math Informat & Behav Semant LMIB, Beijing 100191, Peoples R China
[5] Beihang Univ, Beijing Adv Innovat Ctr Future Blockchain & Privac, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep reinforcement learning; Hard-exploration problem; Policy gradient; Offline suboptimal demonstrations;
DOI
10.1016/j.knosys.2023.111334
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Deep reinforcement learning (DRL) faces significant challenges in addressing hard-exploration tasks with sparse or deceptive rewards and large state spaces. These challenges severely limit the practical application of DRL. Most previous exploration methods relied on complex architectures to estimate state novelty or introduced sensitive hyperparameters, resulting in instability. To mitigate these issues, we propose an efficient adaptive trajectory-constrained exploration strategy for DRL. The proposed method guides the agent's policy away from suboptimal solutions by treating previous offline demonstrations as references. Specifically, this approach gradually expands the agent's exploration scope and strives for optimality in a constrained-optimization manner. Additionally, we introduce a novel policy-gradient-based optimization algorithm that utilizes adaptive clipped trajectory-distance rewards for both single- and multi-agent reinforcement learning. We provide a theoretical analysis of our method, including a derivation of worst-case approximation error bounds, highlighting the validity of our approach for enhancing exploration. To evaluate the effectiveness of the proposed method, we conducted experiments on two large 2D grid-world mazes and several MuJoCo tasks. The extensive experimental results demonstrate the significant advantages of our method in achieving temporally extended exploration and avoiding myopic and suboptimal behaviors in both single- and multi-agent settings; quantitative results on these benchmarks further support these findings. The code used in the study is available at https://github.com/buaawgj/TACE.
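
To make the abstract's core mechanism concrete, below is a minimal sketch of how a clipped trajectory-distance bonus of the kind it describes could be computed: the agent's rollout earns an extra reward for staying far from previously collected suboptimal demonstrations, and the bonus is clipped so it stays bounded. The function names, the Euclidean nearest-neighbour distance, and the clip/scale parameters are illustrative assumptions for this sketch, not the authors' TACE implementation (see the linked repository for that).

    import numpy as np

    def trajectory_distance(traj, demo):
        # Mean distance from each state in traj to its nearest state in demo.
        # traj, demo: arrays of shape (T, state_dim); Euclidean metric assumed.
        diffs = traj[:, None, :] - demo[None, :, :]   # (T_traj, T_demo, state_dim)
        dists = np.linalg.norm(diffs, axis=-1)        # (T_traj, T_demo)
        return dists.min(axis=1).mean()               # nearest-neighbour average

    def clipped_exploration_bonus(traj, demos, clip=1.0, scale=0.1):
        # Bonus grows with the distance to the closest demonstration, then is
        # clipped at `clip` to keep the shaping signal bounded and stable.
        d_min = min(trajectory_distance(traj, d) for d in demos)
        return scale * min(d_min, clip)

    # Usage: shape a rollout's return before a policy-gradient update.
    rng = np.random.default_rng(0)
    demos = [rng.normal(size=(100, 4)) for _ in range(5)]  # offline demonstrations
    rollout = rng.normal(size=(100, 4))                    # current agent trajectory
    extrinsic_return = 3.2                                 # environment return (example)
    shaped_return = extrinsic_return + clipped_exploration_bonus(rollout, demos)
    print(f"shaped return: {shaped_return:.3f}")

In a full training loop, the shaped return would feed a standard policy-gradient update; the "adaptive" clipping mentioned in the abstract presumably adjusts the threshold during training, a detail this sketch omits.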
Pages: 20
Related papers
50 records in total
  • [1] A Novel Adaptive Sampling Strategy for Deep Reinforcement Learning
    Liang, Xingxing
    Chen, Li
    Feng, Yanghe
    Liu, Zhong
    Ma, Yang
    Huang, Kuihua
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2021, 20 (02)
  • [2] Diversity-Driven Exploration Strategy for Deep Reinforcement Learning
    Hong, Zhang-Wei
    Shann, Tzu-Yun
    Su, Shih-Yang
    Chang, Yi-Hsiang
    Fu, Tsu-Jui
    Lee, Chun-Yi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [3] Exploration Strategy based on Validity of Actions in Deep Reinforcement Learning
    Yoon, Hyung-Suk
    Lee, Sang-Hyun
    Seo, Seung-Woo
    2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 6134 - 6139
  • [4] A Robust Integration of GPS and MEMS-INS Through Trajectory-constrained Adaptive Kalman Filtering
    Zhou, Zebo
    Li, Yong
    Rizos, Chris
    Shen, Yunzhong
    PROCEEDINGS OF THE 22ND INTERNATIONAL TECHNICAL MEETING OF THE SATELLITE DIVISION OF THE INSTITUTE OF NAVIGATION (ION GNSS 2009), 2009, : 995 - 1003
  • [5] Reward-Based Exploration: Adaptive Control for Deep Reinforcement Learning
    Xu, Zhi-xiong
    Cao, Lei
    Chen, Xi-liang
    Li, Chen-xi
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2018, E101D (09): 2409 - 2412
  • [6] Trajectory-Constrained Deep Latent Visual Attention for Improved Local Planning in Presence of Heterogeneous Terrain
    Wapnick, Stefan
    Manderson, Travis
    Meger, David
    Dudek, Gregory
    2021 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2021, : 460 - 467
  • [7] Trajectory-Constrained Collective Circular Motion With Different Phase Arrangements
    Jain, Anoop
    Ghose, Debasish
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2020, 65 (05): 2237 - 2244
  • [8] Constrained Reinforcement Learning in Hard Exploration Problems
    Pankayaraj, Pathmanathan
    Varakantham, Pradeep
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 12, 2023, : 15055 - 15063
  • [9] Deep Reinforcement Factorization Machines: A Deep Reinforcement Learning Model with Random Exploration Strategy and High Deployment Efficiency
    Yu, Huaidong
    Yin, Jian
    APPLIED SCIENCES-BASEL, 2022, 12 (11)
  • [10] Adaptive Exploration Strategies for Reinforcement Learning
    Hwang, Kao-Shing
    Li, Chih-Wen
    Jiang, Wei-Cheng
    2017 INTERNATIONAL CONFERENCE ON SYSTEM SCIENCE AND ENGINEERING (ICSSE), 2017, : 16 - 19