Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Cited by: 0
Authors
Li, Huihan [1]
Gao, Tianyu [1]
Goenka, Manan [1]
Chen, Danqi [1]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08544 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Conversational question answering aims to provide natural-language answers to users in information-seeking conversations. Existing conversational QA benchmarks compare models with pre-collected human-human conversations, using ground-truth answers provided in conversational history. It remains unclear whether we can rely on this static evaluation for model development and whether current systems can well generalize to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and there is a disagreement between human and gold-history evaluation in terms of model ranking. We further investigate how to improve automatic evaluations, and propose a question rewriting mechanism based on predicted history, which better correlates with human judgments. Finally, we analyze the impact of various modeling strategies and discuss future directions towards building better conversational question answering systems.
Pages: 8074 - 8085
Page count: 12
Related papers
50 in total
  • [21] Re-evaluating Evaluation
    Balduzzi, David
    Tuyls, Karl
    Perolat, Julien
    Graepel, Thore
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [22] Re-evaluating the Anthropocene
    Dalby, Simon
    ANTIQUITY, 2016, 90 (350) : 514 - 515
  • [23] Open-Retrieval Conversational Question Answering
    Qu, Chen
    Yang, Liu
    Chen, Cen
    Qiu, Minghui
    Croft, W. Bruce
    Iyyer, Mohit
    PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20), 2020 : 539 - 548
  • [24] Consistency Training by Synthetic Question Generation for Conversational Question Answering
    Hemati, Hamed Hematian
    Beigy, Hamid
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2: SHORT PAPERS, 2024 : 630 - 639
  • [25] Re-Evaluating "Community"
    O'Donnell, Kathleen M.
    ARCHITECT, 2018, 107 (12) : 59 - 59
  • [26] RE-EVALUATING THE REVEL
    Unknown
    GAMING LAW REVIEW-ECONOMICS REGULATION COMPLIANCE AND POLICY, 2012, 16 (11) : 635 - 635
  • [27] Re-evaluating EMS
    Unknown
    VETERINARY RECORD, 2009, 164 (22) : 669 - 669
  • [28] Question Rewriting? Assessing Its Importance for Conversational Question Answering
    Raposo, Goncalo
    Ribeiro, Rui
    Martins, Bruno
    Coheur, Luisa
    ADVANCES IN INFORMATION RETRIEVAL, PT II, 2022, 13186 : 199 - 206
  • [29] The Standard of Care in Type 2 Diabetes: Re-evaluating the Treatment Paradigm
    Viswanathan Mohan
    Mark E. Cooper
    David R. Matthews
    Kamlesh Khunti
    Diabetes Therapy, 2019, 10 : 1 - 13
  • [30] Connecting Question Answering and Conversational Agents: Contextualizing German Questions for Interactive Question Answering Systems
    Waltinger, Ulli
    Breuing, Alexa
    Wachsmuth, Ipke
    KUNSTLICHE INTELLIGENZ, 2012, 26 (04) : 381 - 390