Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Cited by: 0

Authors:

Li, Huihan [1]

Gao, Tianyu [1]

Goenka, Manan [1]

Chen, Danqi [1]

Affiliations:

[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08544 USA

Source:

PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS) | 2022

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Conversational question answering aims to provide natural-language answers to users in information-seeking conversations. Existing conversational QA benchmarks compare models with pre-collected human-human conversations, using ground-truth answers provided in conversational history. It remains unclear whether we can rely on this static evaluation for model development and whether current systems can generalize well to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and there is a disagreement between human and gold-history evaluation in terms of model ranking. We further investigate how to improve automatic evaluations, and propose a question rewriting mechanism based on predicted history, which better correlates with human judgments. Finally, we analyze the impact of various modeling strategies and discuss future directions towards building better conversational question answering systems.

Pages: 8074 - 8085

Page count: 12

50 items in total

[21] Re-evaluating Evaluation
Balduzzi, David
Tuyls, Karl
Perolat, Julien
Graepel, Thore
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
[22] Re-evaluating the Anthropocene
Dalby, Simon
ANTIQUITY, 2016, 90 (350) : 514 - 515
[23] Open-Retrieval Conversational Question Answering
Qu, Chen
Yang, Liu
Chen, Cen
Qiu, Minghui
Croft, W. Bruce
Iyyer, Mohit
PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20), 2020 : 539 - 548
[24] Consistency Training by Synthetic Question Generation for Conversational Question Answering
Hemati, Hamed Hematian
Beigy, Hamid
PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2: SHORT PAPERS, 2024 : 630 - 639
[25] Re-Evaluating "Community"
O'Donnell, Kathleen M.
ARCHITECT, 2018, 107 (12) : 59 - 59
[26] RE-EVALUATING THE REVEL
Unknown
GAMING LAW REVIEW-ECONOMICS REGULATION COMPLIANCE AND POLICY, 2012, 16 (11) : 635 - 635
[27] Re-evaluating EMS
Unknown
VETERINARY RECORD, 2009, 164 (22) : 669 - 669
[28] Question Rewriting? Assessing Its Importance for Conversational Question Answering
Raposo, Goncalo
Ribeiro, Rui
Martins, Bruno
Coheur, Luisa
ADVANCES IN INFORMATION RETRIEVAL, PT II, 2022, 13186 : 199 - 206
[29] The Standard of Care in Type 2 Diabetes: Re-evaluating the Treatment Paradigm
Viswanathan Mohan
Mark E. Cooper
David R. Matthews
Kamlesh Khunti
Diabetes Therapy, 2019, 10 : 1 - 13
[30] Connecting Question Answering and Conversational Agents Contextualizing German Questions for Interactive Question Answering Systems
Waltinger, Ulli
Breuing, Alexa
Wachsmuth, Ipke
KUNSTLICHE INTELLIGENZ, 2012, 26 (04) : 381 - 390