Reward estimation with scheduled knowledge distillation for dialogue policy learning

Cited by: 2
Authors
Qiu, Junyan [1 ]
Zhang, Haidong [2 ]
Yang, Yiping [2 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Keywords
Reinforcement learning; dialogue policy learning; curriculum learning; knowledge distillation
DOI
10.1080/09540091.2023.2174078
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Formulating dialogue policy as a reinforcement learning (RL) task enables a dialogue system to act optimally by interacting with humans. However, typical RL-based methods suffer from sparse and delayed rewards. Moreover, because the user goal is unavailable in real scenarios, the reward estimator cannot generate rewards that reflect action validity and task completion. These issues can significantly slow down and degrade policy learning. In this paper, we present a novel scheduled knowledge distillation framework for dialogue policy learning, which trains a compact student reward estimator by distilling prior knowledge of user goals from a large teacher model. To further improve the stability of dialogue policy learning, we leverage self-paced learning to arrange a meaningful training order for the student reward estimator. Comprehensive experiments on the Microsoft Dialogue Challenge and MultiWOZ datasets indicate that our approach significantly accelerates learning, and the task-completion success rate improves by 0.47% to 9.01% over several strong baselines.
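For orientation only, the sketch below illustrates the two ingredients named in the abstract: a compact student reward estimator trained to mimic a larger teacher model, and self-paced selection that admits easy samples first and harder ones as training proceeds. It is a minimal assumption-laden illustration, not the authors' implementation; the class names, network sizes, loss choice, and pace schedule are all hypothetical.

# Illustrative sketch (not the paper's code): teacher-student reward
# distillation with self-paced sample scheduling. Shapes, thresholds,
# and module names are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardEstimator(nn.Module):
    """Scores a (dialogue state, system action) pair; higher means more useful."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def self_paced_distillation_step(student, teacher, optimizer, states, actions, lam):
    """One update: the student mimics teacher rewards, but only samples whose
    distillation error is below the pace threshold lam contribute, so training
    proceeds from easy to hard as lam is raised."""
    with torch.no_grad():
        # The teacher stands in for a large model trained with user-goal information.
        teacher_reward = teacher(states, actions)
    student_reward = student(states, actions)
    per_sample_loss = F.mse_loss(student_reward, teacher_reward, reduction="none")
    # Binary self-paced weights: include a sample only if it is easy enough.
    weights = (per_sample_loss.detach() < lam).float()
    loss = (weights * per_sample_loss).sum() / weights.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    state_dim, action_dim = 20, 10
    teacher = RewardEstimator(state_dim, action_dim, hidden=256)  # large teacher
    student = RewardEstimator(state_dim, action_dim, hidden=64)   # compact student
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    states, actions = torch.randn(32, state_dim), torch.randn(32, action_dim)
    for epoch, lam in enumerate([0.5, 1.0, 2.0]):  # schedule: raise the pace threshold
        print(epoch, self_paced_distillation_step(student, teacher, optimizer, states, actions, lam))

In this toy form, the "schedule" is simply the increasing sequence of pace thresholds; the paper's actual scheduling and distillation objectives should be taken from the publication itself.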
Pages: 28