Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Cited by: 0
Authors
Metcalf, Katherine [1 ]
Sarabia, Miguel [1 ]
Mackraz, Natalie [1 ]
Theobald, Barry-John [1 ]
Affiliation
[1] Apple, Cupertino, CA 95014 USA
Source
Keywords
human-in-the-loop learning; preference-based RL; RLHF;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation z_sa via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from z_sa, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground-truth reward policy performance versus only 38% and 21% without the dynamics-aware reward. These performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed.
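To make the two alternating steps above concrete, here is a minimal sketch of how such a loop could be wired up. It is illustrative only and is written in a PyTorch style under our own assumptions: every class and function name below (StateActionEncoder, temporal_consistency_loss, PreferenceReward, preference_loss) is hypothetical and is not taken from the ml-reed repository, and the preference objective shown is the standard Bradley-Terry cross-entropy commonly used in PbRL rather than the authors' exact formulation.

# Minimal sketch of the two alternating steps described in the abstract.
# Written in a PyTorch style under our own assumptions; all names are
# hypothetical and are NOT taken from the ml-reed repository.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateActionEncoder(nn.Module):
    # Maps (s, a) to a dynamics-aware representation z_sa and predicts the
    # next state from z_sa (the self-supervised temporal-consistency head).
    def __init__(self, state_dim, action_dim, z_dim=64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )
        self.predict_next = nn.Linear(z_dim, state_dim)

    def forward(self, state, action):
        return self.encode(torch.cat([state, action], dim=-1))

def temporal_consistency_loss(encoder, state, action, next_state):
    # Step (1): z_sa must be predictive of the next state, which makes the
    # representation aware of the environment dynamics.
    z_sa = encoder(state, action)
    return F.mse_loss(encoder.predict_next(z_sa), next_state)

class PreferenceReward(nn.Module):
    # Step (2): a small reward head bootstrapped from the learned z_sa.
    def __init__(self, encoder, z_dim=64):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(z_dim, 1)

    def forward(self, state, action):
        return self.head(self.encoder(state, action))

def preference_loss(reward_model, seg_a, seg_b, label):
    # Bradley-Terry style cross-entropy on binary preferences over two
    # behavior segments; seg_* are (states, actions) tensors of shape
    # (batch, T, dim) and label is 1.0 when segment A is preferred.
    ret_a = reward_model(*seg_a).sum(dim=1).squeeze(-1)
    ret_b = reward_model(*seg_b).sum(dim=1).squeeze(-1)
    return F.binary_cross_entropy_with_logits(ret_a - ret_b, label)

In a full training loop, step (1) would be optimized on environment transitions gathered by the policy and step (2) on the labelled preference pairs, after which the policy is trained against the learned reward model.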
Pages: 49
Related Papers
50 records in total
  • [41] Mixing corrupted preferences for robust and feedback-efficient preference-based reinforcement learning
    Heo, Jongkook
    Lee, Young Jae
    Kim, Jaehoon
    Kwak, Min Gu
    Park, Young Joon
    Kim, Seoung Bum
    KNOWLEDGE-BASED SYSTEMS, 2025, 309
  • [43] Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm
    Busa-Fekete, Róbert
    Szörényi, Balázs
    Weng, Paul
    Cheng, Weiwei
    Hüllermeier, Eyke
    MACHINE LEARNING, 2014, 97 (03) : 327 - 351
  • [44] Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update
    Lee, Su Young
    Choi, Sungik
    Chung, Sae-Young
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [45] Optimal Operable Power Flow: Sample-Efficient Holomorphic Embedding-Based Reinforcement Learning
    Sayed, Ahmed Rabee
    Zhang, Xian
    Wang, Guibin
    Wang, Cheng
    Qiu, Jing
    IEEE TRANSACTIONS ON POWER SYSTEMS, 2024, 39 (01) : 1739 - 1751
  • [46] Sample-Efficient Reinforcement Learning via Conservative Model-Based Actor-Critic
    Wang, Zhihai
    Wang, Jie
    Zhou, Qi
    Li, Bin
    Li, Houqiang
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022 : 8612 - 8620
  • [47] An Advisor-Based Architecture for a Sample-Efficient Training of Autonomous Navigation Agents with Reinforcement Learning
    Wijesinghe, Rukshan Darshana
    Tissera, Dumindu
    Vithanage, Mihira Kasun
    Xavier, Alex
    Fernando, Subha
    Samarawickrama, Jayathu
    ROBOTICS, 2023, 12 (05)
  • [48] Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model
    Wang, Bingyan
    Yan, Yuling
    Fan, Jianqing
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [49] On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling, and Beyond
    Nguyen-Tang, Thanh
    Arora, Raman
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [50] Sample-Efficient Reinforcement Learning Is Feasible for Linearly Realizable MDPs with Limited Revisiting
    Li, Gen
    Chen, Yuxin
    Chi, Yuejie
    Gu, Yuantao
    Wei, Yuting
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34