Reinforcement Learning with a Corrupted Reward Channel

Cited: 0
Authors
Everitt, Tom [1 ]
Krakovna, Victoria [2 ]
Orseau, Laurent [2 ]
Legg, Shane [2 ]
Affiliations
[1] Australian Natl Univ, Canberra, ACT, Australia
[2] DeepMind, London, England
Source
PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE | 2017
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
No real-world reward function is perfect. Sensory errors and software bugs may result in agents getting higher (or lower) rewards than they should. For example, a reinforcement learning agent may prefer states where a sensory error gives it the maximum reward, but where the true reward is actually small. We formalise this problem as a generalised Markov Decision Problem called a Corrupt Reward MDP (CRMDP). Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards. Two ways around the problem are investigated. First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be completely managed. Second, by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.
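The second mitigation the abstract mentions, randomising to blunt the agent's optimisation, can be sketched as follows. This is an illustrative sketch of the general idea only; the function name, the states, and the reward values are invented for the example and are not taken from the paper.

```python
import random

def quantilising_choice(observed_reward, delta, rng=random):
    """Choose uniformly among the top delta-fraction of options by
    observed reward, rather than always taking the argmax.  This
    blunts optimisation pressure: a single option with a corrupt,
    inflated reward is picked only some of the time instead of
    being exploited on every step."""
    ranked = sorted(observed_reward, key=observed_reward.get, reverse=True)
    top_k = max(1, int(len(ranked) * delta))
    return rng.choice(ranked[:top_k])

# Hypothetical observed rewards for five states; suppose a sensory
# error has inflated the reward reported for state "s2".
observed = {"s0": 0.7, "s1": 0.6, "s2": 10.0, "s3": 0.5, "s4": 0.4}

greedy = max(observed, key=observed.get)          # always the corrupt "s2"
soft = quantilising_choice(observed, delta=0.4)   # "s2" or "s0", each half the time
```

A greedy agent locks onto the corrupted state every time; the randomised agent spreads its choices over the top 40% of options, capping how much the corruption can cost relative to the true reward.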
Pages: 4705 - 4713
Page count: 9