Policy Teaching Through Reward Function Learning

Cited by: 0

Authors:

Zhang, Haoqi [1]

Parkes, David C. [1]

Chen, Yiling [1]

Affiliations:

[1] Harvard Univ, Sch Engn & Appl Sci, Cambridge, MA 02138 USA

Source:

10TH ACM CONFERENCE ON ELECTRONIC COMMERCE - EC 2009 | 2009

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Policy teaching considers a Markov Decision Process setting in which an interested party aims to influence an agent's decisions by providing limited incentives. In this paper, we consider the specific objective of inducing a pre-specified desired policy. We examine both the case in which the agent's reward function is known to the interested party and the case in which it is unknown, presenting a linear program for the former and formulating an active, indirect elicitation method for the latter. We provide conditions for logarithmic convergence, and present a polynomial time algorithm that ensures logarithmic convergence with arbitrarily high probability. We also offer practical elicitation heuristics that can be formulated as linear programs, and demonstrate their effectiveness on a policy teaching problem in a simulated ad network setting. We extend our methods to handle partial observations and partial target policies, and provide a game-theoretic interpretation of our methods for handling strategic agents.


Pages: 295-304

Number of pages: 10
