Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Cited by: 0
Authors
Li, Huihan [1]
Gao, Tianyu [1]
Goenka, Manan [1]
Chen, Danqi [1]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08544 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Conversational question answering aims to provide natural-language answers to users in information-seeking conversations. Existing conversational QA benchmarks compare models with pre-collected human-human conversations, using ground-truth answers provided in conversational history. It remains unclear whether we can rely on this static evaluation for model development and whether current systems can generalize well to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and that human evaluation and gold-history evaluation disagree on model ranking. We further investigate how to improve automatic evaluations, and propose a question rewriting mechanism based on predicted history, which better correlates with human judgments. Finally, we analyze the impact of various modeling strategies and discuss future directions towards building better conversational question answering systems.
Pages: 8074-8085
Number of pages: 12
Related papers
50 in total
  • [41] Re-evaluating Web evaluation
    Notess, GR
    ONLINE, 2006, 30 (01) : 45 - 47
  • [42] RE-EVALUATING THE DECODING PRINCIPLE
    Author unknown
    NATURE REVIEWS MOLECULAR CELL BIOLOGY, 2012, 13 (05) : 280 - 280
  • [43] Re-evaluating revalidation and appraisal
    Pringle, M
    BRITISH JOURNAL OF GENERAL PRACTICE, 2003, 53 (491) : 437 - 438
  • [44] Re-evaluating therapeutic neovascularization
    de Muinck, ED
    Simons, M
    JOURNAL OF MOLECULAR AND CELLULAR CARDIOLOGY, 2004, 36 (01) : 25 - 32
  • [45] Re-evaluating police militarization
    Jonathan Mummolo
    Nature Human Behaviour, 2021, 5 : 181 - 182
  • [46] Re-evaluating prokaryotic species
    Dirk Gevers
    Frederick M. Cohan
    Jeffrey G. Lawrence
    Brian G. Spratt
    Tom Coenye
    Edward J. Feil
    Erko Stackebrandt
    Yves Van de Peer
    Peter Vandamme
    Fabiano L. Thompson
    Jean Swings
    Nature Reviews Microbiology, 2005, 3 : 733 - 739
  • [47] Re-evaluating the Introduction of the Adoratio
    Lindholmer, Mads Ortving
    HISTORIA-ZEITSCHRIFT FUR ALTE GESCHICHTE, 2024, 73 (03) : 362 - 383
  • [48] Re-evaluating base closings
    Browne, J
    MICROWAVES & RF, 2005, 44 (09) : 17 - 17
  • [49] Re-evaluating primate monogamy
    Fuentes, A
    AMERICAN ANTHROPOLOGIST, 1998, 100 (04) : 890 - 907
  • [50] Re-evaluating the ENT procedures
    Maw, R
    PRACTITIONER, 2000, 244 (1612) : 608 - +