Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Cited by: 0
Authors
Metcalf, Katherine [1]
Sarabia, Miguel [1]
Mackraz, Natalie [1]
Theobald, Barry-John [1]
Affiliations
[1] Apple, Cupertino, CA 95014 USA
Keywords
human-in-the-loop learning; preference-based RL; RLHF;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation z_sa via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from z_sa, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground-truth reward policy performance versus only 38% and 21% for existing approaches. These performance gains demonstrate the benefit of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed.
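To make the two-step loop in the abstract concrete, the sketch below shows one plausible PyTorch rendering of it: a self-supervised temporal-consistency objective shapes the state-action embedding z_sa, and a preference-based reward head is bootstrapped from that embedding and trained with a Bradley-Terry (cross-entropy) loss over paired behavior segments. The module names, dimensions, and the SimSiam-style stop-gradient target are illustrative assumptions, not the authors' implementation; see the linked ml-reed repository for the actual code.

```python
# Minimal sketch of the loop described in the abstract (assumptions noted above).
# Alternates between (1) a self-supervised temporal-consistency update that makes
# the state-action encoding z_sa predictive of the next state, and (2) fitting a
# preference-based reward head on z_sa with a Bradley-Terry (cross-entropy)
# objective over paired behavior segments. All names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

S_DIM, A_DIM, Z_DIM = 24, 6, 128  # hypothetical state/action/embedding sizes


class StateActionEncoder(nn.Module):
    """Maps (s, a) to a dynamics-aware embedding z_sa."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(S_DIM + A_DIM, 256), nn.ReLU(), nn.Linear(256, Z_DIM)
        )
        self.predictor = nn.Linear(Z_DIM, Z_DIM)   # predicts next-state embedding
        self.state_proj = nn.Linear(S_DIM, Z_DIM)  # embeds the observed next state

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

    def consistency_loss(self, s, a, s_next):
        # Temporal consistency: z_sa should predict the embedding of the next state.
        pred = self.predictor(self(s, a))
        with torch.no_grad():  # stop-gradient target, SimSiam-style (assumption)
            target = self.state_proj(s_next)
        return -F.cosine_similarity(pred, target, dim=-1).mean()


class RewardModel(nn.Module):
    """Scalar reward bootstrapped from the dynamics-aware embedding."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(Z_DIM, 1)

    def forward(self, s, a):
        return self.head(self.encoder(s, a))


def preference_loss(reward_model, seg_a, seg_b, label):
    """Bradley-Terry loss over two segments; label 0 means seg_a is preferred."""
    r_a = reward_model(*seg_a).sum(dim=1)  # summed reward over each clip
    r_b = reward_model(*seg_b).sum(dim=1)
    return F.cross_entropy(torch.cat([r_a, r_b], dim=-1), label)


encoder = StateActionEncoder()
reward_model = RewardModel(encoder)
opt_enc = torch.optim.Adam(encoder.parameters(), lr=3e-4)
opt_reward = torch.optim.Adam(reward_model.parameters(), lr=3e-4)

for _ in range(2):  # random tensors stand in for replay transitions and human labels
    # (1) self-supervised temporal-consistency update on unlabeled transitions
    s, a, s_next = torch.randn(64, S_DIM), torch.randn(64, A_DIM), torch.randn(64, S_DIM)
    loss_ssl = encoder.consistency_loss(s, a, s_next)
    opt_enc.zero_grad(); loss_ssl.backward(); opt_enc.step()

    # (2) preference update on a small batch of labeled segment pairs (length-50 clips)
    seg_a = (torch.randn(32, 50, S_DIM), torch.randn(32, 50, A_DIM))
    seg_b = (torch.randn(32, 50, S_DIM), torch.randn(32, 50, A_DIM))
    label = torch.randint(0, 2, (32,))
    loss_pref = preference_loss(reward_model, seg_a, seg_b, label)
    opt_reward.zero_grad(); loss_pref.backward(); opt_reward.step()
```

In the full method, the learned reward then stands in for the environment reward during policy learning, and the loop repeats as additional preference labels arrive.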
Pages: 49