Reward estimation with scheduled knowledge distillation for dialogue policy learning

Cited by: 2
Authors:
Qiu, Junyan [1 ]
Zhang, Haidong [2 ]
Yang, Yiping [2 ]
Affiliations:
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Keywords:
Reinforcement learning; dialogue policy learning; curriculum learning; knowledge distillation
DOI:
10.1080/09540091.2023.2174078
Chinese Library Classification (CLC):
TP18 [Artificial Intelligence Theory]
Discipline codes:
081104; 0812; 0835; 1405
Abstract:
Formulating dialogue policy as a reinforcement learning (RL) task enables a dialogue system to act optimally by interacting with humans. However, typical RL-based methods suffer from challenges such as sparse and delayed rewards. Moreover, since the user goal is unavailable in real scenarios, the reward estimator cannot generate rewards that reflect action validity and task completion. These issues can significantly slow down and degrade policy learning. In this paper, we present a novel scheduled knowledge distillation framework for dialogue policy learning, which trains a compact student reward estimator by distilling prior knowledge of user goals from a large teacher model. To further stabilise dialogue policy learning, we propose to leverage self-paced learning to arrange a meaningful training order for the student reward estimator. Comprehensive experiments on the Microsoft Dialogue Challenge and MultiWOZ datasets indicate that our approach significantly accelerates learning, and the task-completion success rate improves by 0.47% to 9.01% over several strong baselines.
Pages: 28
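
A minimal PyTorch sketch of the two ideas the abstract describes: a compact student reward estimator distilled from a larger teacher that sees user-goal features, with a hard self-paced weighting that admits easy examples first. Everything here is an illustrative assumption, not the paper's actual method: the class RewardEstimator, the helper self_paced_distillation_loss, the threshold lambda_spl, and all layer sizes are made up for this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardEstimator(nn.Module):
    # MLP mapping a (state, action) feature vector to a scalar reward.
    def __init__(self, in_dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def self_paced_distillation_loss(student_r, teacher_r, lambda_spl: float):
    # Distill teacher reward predictions into the student. Examples whose
    # current distillation error exceeds lambda_spl get zero weight (the
    # classic hard self-paced regulariser); raising lambda_spl over the
    # course of training lets harder examples enter later.
    per_example = F.mse_loss(student_r, teacher_r, reduction="none")
    weights = (per_example.detach() < lambda_spl).float()
    return (weights * per_example).sum() / weights.sum().clamp(min=1.0)

# Usage sketch: the teacher sees user-goal features the student never gets,
# so the student learns goal-aware rewards from state-action input alone.
teacher = RewardEstimator(in_dim=128 + 32, hidden=256)  # state-action + goal
student = RewardEstimator(in_dim=128, hidden=64)        # state-action only
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

sa = torch.randn(16, 128)    # dummy state-action features
goal = torch.randn(16, 32)   # dummy user-goal features
with torch.no_grad():
    teacher_r = teacher(torch.cat([sa, goal], dim=-1))
loss = self_paced_distillation_loss(student(sa), teacher_r, lambda_spl=1.0)
opt.zero_grad(); loss.backward(); opt.step()
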
Related papers (50 in total)
  • [1] Reward estimation for dialogue policy optimisation
    Su, Pei-Hao
    Gasic, Milica
    Young, Steve
    COMPUTER SPEECH AND LANGUAGE, 2018, 51 : 24 - 43
  • [2] WeaSuLπ: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue
    Khandelwal, Anant
    1ST WORKSHOP ON DOCUMENT-GROUNDED DIALOGUE AND CONVERSATIONAL QUESTION ANSWERING (DIALDOC 2021), 2021, : 69 - 80
  • [3] Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation
    Huang, Xinting
    Qi, Jianzhong
    Sun, Yu
    Zhang, Rui
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 660 - 670
  • [4] Domain-independent User Satisfaction Reward Estimation for Dialogue Policy Learning
    Ultes, Stefan
    Budzianowski, Pawel
    Casanueva, Inigo
    Mrksic, Nikola
    Rojas-Barahona, Lina
    Su, Pei-Hao
    Wen, Tsung-Hsien
    Gasic, Milica
    Young, Steve
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1721 - 1725
  • [5] Combining Curriculum Learning and Knowledge Distillation for Dialogue Generation
    Zhu, Qingqing
    Chen, Xiuying
    Wu, Pengfei
    Liu, JunFei
    Zhao, Dongyan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 1284 - 1295
  • [6] On the Applicability of a User Satisfaction-Based Reward for Dialogue Policy Learning
    Ultes, Stefan
    Miehle, Juliana
    Minker, Wolfgang
    ADVANCED SOCIAL INTERACTION WITH AGENTS, 2019, 510 : 211 - 217
  • [7] On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems
    Su, Pei-Hao
    Gasic, Milica
    Mrksic, Nikola
    Rojas-Barahona, Lina
    Ultes, Stefan
    Vandyke, David
    Wen, Tsung-Hsien
    Young, Steve
    PROCEEDINGS OF THE 54TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1, 2016, : 2431 - 2441
  • [8] Reward Function Learning for Dialogue Management
    El Asri, Layla
    Laroche, Romain
    Pietquin, Olivier
PROCEEDINGS OF THE SIXTH STARTING AI RESEARCHERS' SYMPOSIUM (STAIRS 2012), 2012, 241 : 95+
  • [9] HIERARCHICAL KNOWLEDGE DISTILLATION FOR DIALOGUE SEQUENCE LABELING
    Orihashi, Shota
    Yamazaki, Yoshihiro
    Makishima, Naoki
    Ihori, Mana
    Takashima, Akihiko
    Tanaka, Tomohiro
    Masumura, Ryo
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 433 - 440
  • [10] DROID: Learning from Offline Heterogeneous Demonstrations via Reward-Policy Distillation
    Jayanthi, Sravan
    Chen, Letian
    Balabanska, Nadya
    Duong, Van
    Scarlatescu, Erik
    Ameperosa, Ezra
    Zaidi, Zulfiqar
    Martin, Daniel
    Del Matto, Taylor
    Ono, Masahiro
    Gombolay, Matthew
CONFERENCE ON ROBOT LEARNING, VOL 229, 2023