Reward estimation with scheduled knowledge distillation for dialogue policy learning

Cited by: 2
Authors
Qiu, Junyan [1 ]
Zhang, Haidong [2 ]
Yang, Yiping [2 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Keywords
Reinforcement learning; dialogue policy learning; curriculum learning; knowledge distillation
DOI
10.1080/09540091.2023.2174078
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Formulating dialogue policy as a reinforcement learning (RL) task enables a dialogue system to act optimally by interacting with humans. However, typical RL-based methods suffer from sparse and delayed rewards. Moreover, because the user goal is unavailable in real scenarios, the reward estimator cannot generate rewards that reflect action validity and task completion. These issues can significantly slow down and degrade policy learning. In this paper, we present a novel scheduled knowledge distillation framework for dialogue policy learning, which trains a compact student reward estimator by distilling prior knowledge of user goals from a large teacher model. To further improve the stability of dialogue policy learning, we leverage self-paced learning to arrange a meaningful training order for the student reward estimator. Comprehensive experiments on the Microsoft Dialogue Challenge and MultiWOZ datasets indicate that our approach significantly accelerates learning, and the task-completion success rate improves by 0.47% to 9.01% over several strong baselines.
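For orientation only, the sketch below illustrates the two ingredients named in the abstract: a compact student reward estimator trained to mimic a larger teacher model, and self-paced selection that admits easy samples first and harder ones as training proceeds. It is a minimal assumption-laden illustration, not the authors' implementation; the class names, network sizes, loss choice, and pace schedule are all hypothetical.

# Illustrative sketch (not the paper's code): teacher-student reward
# distillation with self-paced sample scheduling. Shapes, thresholds,
# and module names are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardEstimator(nn.Module):
    """Scores a (dialogue state, system action) pair; higher means more useful."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def self_paced_distillation_step(student, teacher, optimizer, states, actions, lam):
    """One update: the student mimics teacher rewards, but only samples whose
    distillation error is below the pace threshold lam contribute, so training
    proceeds from easy to hard as lam is raised."""
    with torch.no_grad():
        # The teacher stands in for a large model trained with user-goal information.
        teacher_reward = teacher(states, actions)
    student_reward = student(states, actions)
    per_sample_loss = F.mse_loss(student_reward, teacher_reward, reduction="none")
    # Binary self-paced weights: include a sample only if it is easy enough.
    weights = (per_sample_loss.detach() < lam).float()
    loss = (weights * per_sample_loss).sum() / weights.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    state_dim, action_dim = 20, 10
    teacher = RewardEstimator(state_dim, action_dim, hidden=256)  # large teacher
    student = RewardEstimator(state_dim, action_dim, hidden=64)   # compact student
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    states, actions = torch.randn(32, state_dim), torch.randn(32, action_dim)
    for epoch, lam in enumerate([0.5, 1.0, 2.0]):  # schedule: raise the pace threshold
        print(epoch, self_paced_distillation_step(student, teacher, optimizer, states, actions, lam))

In this toy form, the "schedule" is simply the increasing sequence of pace thresholds; the paper's actual scheduling and distillation objectives should be taken from the publication itself.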
Pages: 28