Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension

Cited by: 44
Authors
Sun, Kai [1]
Yu, Dian [2]
Yu, Dong [2]
Cardie, Claire [1]
Affiliations
[1] Cornell Univ, Ithaca, NY 14850 USA
[2] Tencent AI Lab, Bellevue, WA USA
Keywords
Computational linguistics;
DOI
10.1162/tacl_a_00305
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Machine reading comprehension tasks require a machine reader to answer questions relevant to the given document. In this paper, we present the first free-form multiple-Choice Chinese machine reading Comprehension dataset (C³), containing 13,369 documents (dialogues or more formally written mixed-genre texts) and their associated 19,577 multiple-choice free-form questions collected from Chinese-as-a-second-language examinations. We present a comprehensive analysis of the prior knowledge (i.e., linguistic, domain-specific, and general world knowledge) needed for these real-world problems. We implement rule-based and popular neural methods and find that there is still a significant performance gap between the best performing model (68.5%) and human readers (96.0%), especially on problems that require prior knowledge. We further study the effects of distractor plausibility and data augmentation based on translated relevant datasets for English on model performance. We expect C³ to present great challenges to existing systems as answering 86.8% of questions requires both knowledge within and beyond the accompanying document, and we hope that C³ can serve as a platform to study how to leverage various kinds of prior knowledge to better understand a given written or orally oriented text. C³ is available at https://dataset.org/c3/.
Pages: 141-155
Page count: 15