Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Cited by: 0

Authors:

Metcalf, Katherine [1]

Sarabia, Miguel [1]

Mackraz, Natalie [1]

Theobald, Barry-John [1]

Affiliations:

[1] Apple, Cupertino, CA 95014 USA

Source:

CONFERENCE ON ROBOT LEARNING, VOL 229 | 2023 / Vol. 229

Keywords:

human-in-the-loop learning; preference-based RL; RLHF;

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Preference-based reinforcement learning (PbRL) aligns a robot's behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation z(sa) via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from z(sa), which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground truth reward policy performance versus only 38% and 21%. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed.
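
The abstract says only that the reward function is learned from binary feedback over agent behaviors; it does not state the loss. A common formulation in the PbRL literature is the Bradley-Terry model, where the probability that one behavior segment is preferred over another depends on the difference of their summed predicted rewards. The sketch below illustrates that formulation under this assumption. It is not taken from the paper or its repository, and all function names are hypothetical.

```python
import math


def segment_return(rewards):
    """Sum of predicted per-step rewards over one behavior segment."""
    return sum(rewards)


def preference_probability(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Assumed formulation: P(A > B) = exp(R_a) / (exp(R_a) + exp(R_b)),
    which equals sigmoid(R_a - R_b).
    """
    diff = segment_return(rewards_a) - segment_return(rewards_b)
    # Numerically stable sigmoid of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a, rewards_b, label):
    """Binary cross-entropy for one labeled pair.

    label is 1.0 if the annotator preferred A and 0.0 if they preferred B.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))
```

In this sketch, a reward model whose predictions agree with the label incurs a lower loss than one whose predictions disagree, which is what drives the reward function toward the annotator's preferences.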


Pages: 49

50 related items in total

[41] Mixing corrupted preferences for robust and feedback-efficient preference-based reinforcement learning
Heo, Jongkook
Lee, Young Jae
Kim, Jaehoon
Kwak, Min Gu
Park, Young Joon
Kim, Seoung Bum
KNOWLEDGE-BASED SYSTEMS, 2025, 309
[42] Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm
Róbert Busa-Fekete
Balázs Szörényi
Paul Weng
Weiwei Cheng
Eyke Hüllermeier
Machine Learning, 2014, 97 : 327 - 351
[43] Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm
Busa-Fekete, Robert
Szoerenyi, Balazs
Weng, Paul
Cheng, Weiwei
Huellermeier, Eyke
MACHINE LEARNING, 2014, 97 (03) : 327 - 351
[44] Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update
Lee, Su Young
Choi, Sungik
Chung, Sae-Young
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
[45] Optimal Operable Power Flow: Sample-Efficient Holomorphic Embedding-Based Reinforcement Learning
Sayed, Ahmed Rabee
Zhang, Xian
Wang, Guibin
Wang, Cheng
Qiu, Jing
IEEE TRANSACTIONS ON POWER SYSTEMS, 2024, 39 (01) : 1739 - 1751
[46] Sample-Efficient Reinforcement Learning via Conservative Model-Based Actor-Critic
Wang, Zhihai
Wang, Jie
Zhou, Qi
Li, Bin
Li, Houqiang
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 8612 - 8620
[47] An Advisor-Based Architecture for a Sample-Efficient Training of Autonomous Navigation Agents with Reinforcement Learning
Wijesinghe, Rukshan Darshana
Tissera, Dumindu
Vithanage, Mihira Kasun
Xavier, Alex
Fernando, Subha
Samarawickrama, Jayathu
ROBOTICS, 2023, 12 (05)
[48] Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model
Wang, Bingyan
Yan, Yuling
Fan, Jianqing
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
[49] On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling, and Beyond
Nguyen-Tang, Thanh
Arora, Raman
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
[50] Sample-Efficient Reinforcement Learning Is Feasible for Linearly Realizable MDPs with Limited Revisiting
Li, Gen
Chen, Yuxin
Chi, Yuejie
Gu, Yuantao
Wei, Yuting
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
