Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Cited by: 0
Authors
Li, Huihan [1]
Gao, Tianyu [1]
Goenka, Manan [1]
Chen, Danqi [1]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08544 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Conversational question answering aims to provide natural-language answers to users in information-seeking conversations. Existing conversational QA benchmarks compare models with pre-collected human-human conversations, using ground-truth answers provided in the conversational history. It remains unclear whether we can rely on this static evaluation for model development and whether current systems generalize well to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and that human evaluation and gold-history evaluation disagree on model ranking. We further investigate how to improve automatic evaluations, and propose a question rewriting mechanism based on predicted history, which better correlates with human judgments. Finally, we analyze the impact of various modeling strategies and discuss future directions towards building better conversational question answering systems.
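The difference between evaluating with gold history and with predicted history can be illustrated with a minimal Python sketch. This is not the authors' code: the `evaluate` function, the toy model, the two-turn conversation, and the exact-match scoring are all hypothetical and chosen only for illustration. The sketch also does not implement the question rewriting mechanism the abstract proposes; it only shows how the choice of history changes what the model sees.

```python
from typing import Callable, List, Tuple

# A turn is (question, answer); a model maps (history, question) to an answer.
Turn = Tuple[str, str]
Model = Callable[[List[Turn], str], str]


def evaluate(model: Model, conversation: List[Turn], use_gold_history: bool) -> float:
    """Return exact-match accuracy over one conversation.

    use_gold_history=True  feeds the ground-truth answers back as history
                           (the static benchmark setting).
    use_gold_history=False feeds the model's own predictions back as history.
    Exact match is a simplifying assumption made for this sketch.
    """
    history: List[Turn] = []
    correct = 0
    for question, gold_answer in conversation:
        prediction = model(history, question)
        correct += int(prediction == gold_answer)
        # The only difference between the two settings is what gets remembered.
        history.append((question, gold_answer if use_gold_history else prediction))
    return correct / len(conversation)


# Hypothetical toy model: it gets the first question wrong, then answers the
# follow-up by looking up the author of whatever book is in the history.
AUTHORS = {"Dune": "Frank Herbert", "Emma": "Jane Austen"}


def toy_model(history: List[Turn], question: str) -> str:
    if not history:
        return "Emma"  # wrong first answer (the gold answer is "Dune")
    previous_answer = history[-1][1]
    return AUTHORS.get(previous_answer, "unknown")


conversation = [
    ("Which book is the passage about?", "Dune"),
    ("Who wrote it?", "Frank Herbert"),
]

gold_score = evaluate(toy_model, conversation, use_gold_history=True)
predicted_score = evaluate(toy_model, conversation, use_gold_history=False)
print(gold_score, predicted_score)
```

With gold history, the toy model's first mistake is silently corrected before the follow-up question, so it scores higher than it does when it has to live with its own earlier answer. That gap is the kind of discrepancy between static and interactive evaluation that the abstract describes.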
Pages: 8074-8085
Number of pages: 12