Corpus Construction for Chinese Zero Anaphora from Discourse Perspective

Authors
Kong F. [1, 2]
Ge H.-Z. [1]
Zhou G.-D. [1, 2]
Affiliations
[1] Laboratory for Natural Language Processing, School of Computer Science and Technology, Soochow University, Suzhou
[2] Jiangsu Key Laboratory of Computer Information Processing Technology, Suzhou
Source
Ruan Jian Xue Bao/Journal of Software | 2021 / Vol. 32 / No. 12
Funding
National Natural Science Foundation of China
Keywords
Corpus construction; Discourse analysis; Elementary discourse unit; Zero anaphora; Zero pronouns
DOI
10.13328/j.cnki.jos.006119
Abstract
Zero anaphora is common in Chinese and plays an important role in many natural language processing tasks, such as machine translation, text summarization, and machine reading comprehension, and it has become an active research topic in the field. To support better discourse analysis, this study proposes a representation architecture for Chinese zero anaphora from the discourse perspective. First, the elementary discourse unit (EDU) is taken as the object of investigation to determine whether it contains zero elements. Second, according to their roles in the EDU, zero elements are divided into two categories: the core type and the modifier type. Third, the discourse rhetorical tree of a paragraph is used as the basic unit for evaluating Chinese zero coreference relations. According to the positional relationship between the antecedent and the zero element, coreference relations are classified into two types: Intra-EDU and Inter-EDU. For the Inter-EDU type, the relation is further divided into four categories according to the status of the antecedent: entity, event, union, and others. Finally, the 325 texts shared by the Chinese Treebank (CTB), the Connective-driven Chinese Discourse Treebank (CDTB), and the OntoNotes corpus are selected for annotating Chinese zero anaphora. Evaluation shows that the constructed corpus is of high quality. Moreover, a complete zero anaphora resolution baseline system is built to demonstrate, from the perspective of computability, that the proposed representation architecture is appropriate and effective. © Copyright 2021, Institute of Software, the Chinese Academy of Sciences. All rights reserved.
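The categories named in the abstract can be sketched as a small data model. This is only an illustrative reading of the abstract, not the annotation format the authors used: the class names, field names, and the choice to index EDUs by integer position are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ZeroRole(Enum):
    """Role of a zero element inside its EDU (the two categories in the abstract)."""
    CORE = "core"
    MODIFIER = "modifier"


class AntecedentStatus(Enum):
    """Status of the antecedent; the abstract applies this split to Inter-EDU only."""
    ENTITY = "entity"
    EVENT = "event"
    UNION = "union"
    OTHERS = "others"


@dataclass
class ZeroElement:
    # Hypothetical fields: EDUs are referred to by their position in the paragraph.
    edu_index: int                              # EDU that contains the zero element
    role: ZeroRole
    antecedent_edu: Optional[int] = None        # EDU that contains the antecedent, if annotated
    status: Optional[AntecedentStatus] = None   # only meaningful for Inter-EDU relations

    def __post_init__(self) -> None:
        # Reject a status on anything that is not an Inter-EDU relation.
        if self.status is not None and self.relation_type() != "Inter-EDU":
            raise ValueError("antecedent status applies only to Inter-EDU relations")

    def relation_type(self) -> Optional[str]:
        """Return 'Intra-EDU', 'Inter-EDU', or None when no antecedent is recorded."""
        if self.antecedent_edu is None:
            return None
        if self.antecedent_edu == self.edu_index:
            return "Intra-EDU"
        return "Inter-EDU"
```

For instance, `ZeroElement(edu_index=2, role=ZeroRole.CORE, antecedent_edu=0, status=AntecedentStatus.ENTITY)` would describe a core-type zero element whose entity antecedent sits in an earlier EDU.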
Pages: 3782-3801
Page count: 19