Accelerating Self-Imitation Learning from Demonstrations via Policy Constraints and Q-Ensemble

Cited by: 1
Authors
Li, Chao [1 ]
Wu, Fengge [1 ]
Zhao, Junsuo [1 ]
Affiliations
[1] Univ Chinese Acad Sci, Chinese Acad Sci, Inst Software, Beijing, Peoples R China
Keywords
Deep Reinforcement Learning; Learning from Demonstrations; Self-Imitation Learning; Sample Efficiency;
DOI
10.1109/IJCNN54540.2023.10191691
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Deep reinforcement learning (DRL) provides a new way to generate robot control policies. However, training a control policy requires lengthy exploration, so reinforcement learning (RL) has low sample efficiency in real-world tasks. Both imitation learning (IL) and learning from demonstrations (LfD) improve the training process by using expert demonstrations, but imperfect expert demonstrations can mislead policy improvement. Offline-to-online reinforcement learning requires a large amount of offline data to initialize the policy, and distribution shift can easily degrade performance during online fine-tuning. To address these problems, we propose a learning-from-demonstrations method named Accelerating Self-Imitation Learning from Demonstrations (A-SILfD), which treats expert demonstrations as the agent's successful experiences and uses these experiences to constrain policy improvement. Furthermore, we use an ensemble of Q-functions to prevent the performance degradation caused by large estimation errors in a single Q-function. Our experiments show that A-SILfD can significantly improve sample efficiency using a small number of expert demonstrations of varying quality. In four MuJoCo continuous control tasks, A-SILfD significantly outperforms baseline methods after 150,000 steps of online training and is not misled by imperfect expert demonstrations. In addition, our ablation experiments demonstrate the effectiveness of each component of the method.
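As a rough illustration of the two ideas the abstract names, not the authors' implementation: demonstrations can share the agent's transition format so they act as ordinary "successful experiences", and an ensemble of Q-estimates can be combined pessimistically (here a REDQ-style min over a random subset, an assumed variant) to damp overestimation. All function names below are hypothetical.

```python
import random

def conservative_q_target(q_values, subset_size=2, rng=None):
    """Min over a random subset of ensemble Q-estimates.

    q_values: per-member estimates of Q(s', a') for one transition.
    Taking the minimum over a sampled subset damps the overestimation
    bias that a single Q-network accumulates during bootstrapping.
    """
    rng = rng or random.Random(0)
    return min(rng.sample(q_values, subset_size))

def td_target(reward, done, next_q_ensemble, gamma=0.99):
    """One-step TD target built on the conservative ensemble estimate."""
    if done:
        return reward
    # Using the full ensemble here makes the target deterministic.
    return reward + gamma * conservative_q_target(
        next_q_ensemble, subset_size=len(next_q_ensemble))

# Demonstrations are stored in the same (s, a, r, s', done) format as the
# agent's own transitions, so expert successes and the agent's own
# successes can live in one shared replay buffer.
demo_buffer = [
    ((0.0,), (1.0,), 1.0, (0.1,), False),  # illustrative placeholder
]
```

The subset-min rule is one common way to trade off pessimism (larger subsets give more conservative targets) against underestimation; the paper's exact aggregation over its Q-ensemble may differ.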
Pages: 8