Reward estimation with scheduled knowledge distillation for dialogue policy learning

Cited by: 2

Authors:

Qiu, Junyan [1]

Zhang, Haidong [2]

Yang, Yiping [2]

Affiliations:

[1] University of Chinese Academy of Sciences, Beijing, People's Republic of China

[2] Institute of Automation, Chinese Academy of Sciences, Beijing, People's Republic of China

Source:

CONNECTION SCIENCE | 2023 / Vol. 35 / Issue 01

Keywords:

Reinforcement learning; dialogue policy learning; curriculum learning; knowledge distillation

DOI:

10.1080/09540091.2023.2174078

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Formulating dialogue policy as a reinforcement learning (RL) task enables a dialogue system to act optimally by interacting with humans. However, typical RL-based methods normally suffer from challenges such as sparse and delayed reward problems. Besides, with the user goal unavailable in real scenarios, the reward estimator is unable to generate rewards reflecting action validity and task completion. These issues may slow down and degrade policy learning significantly. In this paper, we present a novel scheduled knowledge distillation framework for dialogue policy learning, which trains a compact student reward estimator by distilling the prior knowledge of user goals from a large teacher model. To further improve the stability of dialogue policy learning, we propose to leverage self-paced learning to arrange a meaningful training order for the student reward estimator. Comprehensive experiments on the Microsoft Dialogue Challenge and MultiWOZ datasets indicate that our approach significantly accelerates learning, and that the task-completion success rate can be improved by 0.47% to 9.01% compared with several strong baselines.

Pages: 28

