Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Citations: 0
Authors
Metcalf, Katherine [1 ]
Sarabia, Miguel [1 ]
Mackraz, Natalie [1 ]
Theobald, Barry-John [1 ]
Affiliations
[1] Apple, Cupertino, CA 95014 USA
Keywords
human-in-the-loop learning; preference-based RL; RLHF
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation z(sa) via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from z(sa), which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground-truth reward policy performance versus only 38% and 21% without the dynamics-aware reward. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed.
Pages: 49
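
To make the two-step loop in the abstract concrete, here is a minimal PyTorch-style sketch under our own assumptions: the names `DynamicsAwareEncoder`, `consistency_loss`, and `preference_loss`, the layer sizes, and the BYOL-style cosine objective are illustrative stand-ins, not the authors' implementation (see the linked repo for that). It pairs a self-supervised temporal-consistency task for z(sa) with a Bradley-Terry preference loss on top of that representation.

```python
# Hypothetical sketch of the two-step loop from the abstract; all names and
# architecture choices are our assumptions, not the paper's code
# (see https://github.com/apple/ml-reed for the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsAwareEncoder(nn.Module):
    """Maps (state, action) to a representation z_sa and predicts the
    embedding of the next state, the temporal-consistency target."""
    def __init__(self, state_dim, action_dim, z_dim=64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, z_dim),
        )
        self.state_embed = nn.Linear(state_dim, z_dim)  # target embedding
        self.predict_next = nn.Linear(z_dim, z_dim)     # forward-model head

    def forward(self, state, action):
        return self.encode(torch.cat([state, action], dim=-1))

def consistency_loss(enc, s, a, s_next):
    """Step (1): self-supervised temporal consistency. z_sa should predict
    the next state's embedding (a BYOL-style cosine loss as a stand-in)."""
    pred = F.normalize(enc.predict_next(enc(s, a)), dim=-1)
    target = F.normalize(enc.state_embed(s_next), dim=-1).detach()
    return (2.0 - 2.0 * (pred * target).sum(-1)).mean()

def preference_loss(reward_head, enc, seg_a, seg_b, prefs):
    """Step (2): Bradley-Terry loss over behavior-segment pairs. Each
    seg_* is a (states, actions) pair shaped [batch, time, dim]; prefs
    holds 0 or 1 for whichever segment the human preferred."""
    r_a = reward_head(enc(*seg_a)).sum(dim=1).squeeze(-1)  # segment return
    r_b = reward_head(enc(*seg_b)).sum(dim=1).squeeze(-1)
    return F.cross_entropy(torch.stack([r_a, r_b], dim=-1), prefs)
```

In this sketch, training alternates between fitting `enc` on agent transitions with `consistency_loss` and fitting a small `reward_head` (e.g. `nn.Linear(64, 1)`) on top of it with `preference_loss`; the policy is then optimized against the learned reward, as in standard PbRL pipelines.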