Reinforcement Learning with a Corrupted Reward Channel

Cited: 0
Authors
Everitt, Tom [1 ]
Krakovna, Victoria [2 ]
Orseau, Laurent [2 ]
Legg, Shane [2 ]
Affiliations
[1] Australian Natl Univ, Canberra, ACT, Australia
[2] DeepMind, London, England
Source
PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE | 2017
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
No real-world reward function is perfect. Sensory errors and software bugs may result in agents getting higher (or lower) rewards than they should. For example, a reinforcement learning agent may prefer states where a sensory error gives it the maximum reward, but where the true reward is actually small. We formalise this problem as a generalised Markov Decision Problem called a Corrupt Reward MDP (CRMDP). Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards. Two ways around the problem are investigated. First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be completely managed. Second, by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.
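The second mitigation the abstract mentions, randomising to blunt the agent's optimisation, can be sketched as follows. This is an illustrative sketch of the general idea only; the function name, the states, and the reward values are invented for the example and are not taken from the paper.

```python
import random

def quantilising_choice(observed_reward, delta, rng=random):
    """Choose uniformly among the top delta-fraction of options by
    observed reward, rather than always taking the argmax.  This
    blunts optimisation pressure: a single option with a corrupt,
    inflated reward is picked only some of the time instead of
    being exploited on every step."""
    ranked = sorted(observed_reward, key=observed_reward.get, reverse=True)
    top_k = max(1, int(len(ranked) * delta))
    return rng.choice(ranked[:top_k])

# Hypothetical observed rewards for five states; suppose a sensory
# error has inflated the reward reported for state "s2".
observed = {"s0": 0.7, "s1": 0.6, "s2": 10.0, "s3": 0.5, "s4": 0.4}

greedy = max(observed, key=observed.get)          # always the corrupt "s2"
soft = quantilising_choice(observed, delta=0.4)   # "s2" or "s0", each half the time
```

A greedy agent locks onto the corrupted state every time; the randomised agent spreads its choices over the top 40% of options, capping how much the corruption can cost relative to the true reward.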
Pages: 4705 - 4713
Page count: 9