Corpus Construction for Chinese Zero Anaphora from Discourse Perspective

Authors
Kong F. [1, 2]
Ge H.-Z. [1]
Zhou G.-D. [1, 2]
Affiliations
[1] Laboratory for Natural Language Processing, School of Computer Science and Technology, Soochow University, Suzhou
[2] Jiangsu Key Laboratory of Computer Information Processing Technology, Suzhou
Source
Ruan Jian Xue Bao/Journal of Software | 2021 / Vol. 32 / No. 12
Funding
National Natural Science Foundation of China
Keywords
Corpus construction; Discourse analysis; Elementary discourse unit; Zero anaphora; Zero pronouns
DOI
10.13328/j.cnki.jos.006119
Abstract
As a common phenomenon in Chinese, zero anaphora plays an important role in many natural language processing tasks, such as machine translation, text summarization, and machine reading comprehension, and it has become a research hotspot in the field. Towards better discourse analysis, this study proposes a representation architecture for Chinese zero anaphora from the discourse perspective. First, the elementary discourse unit (EDU) is taken as the object of investigation to determine whether it contains zero elements. Second, according to the roles that zero elements play in the EDU, they are divided into two categories: the core type and the modifier type. Third, the discourse rhetorical tree of the paragraph is used as the basic unit for evaluating Chinese zero coreferential relationships. According to the positional relationship between the antecedent and the zero element, the coreferential relationship is classified into two types, Intra-EDU and Inter-EDU. For the Inter-EDU type, the coreferential relationship is further divided into four categories according to the status of the antecedent: entity, event, union, and others. Finally, this study selects the 325 texts shared by the Chinese Treebank (CTB), the Connective-driven Chinese Discourse Treebank (CDTB), and the OntoNotes corpus to annotate Chinese zero anaphora. System evaluation shows that the constructed corpus is of high quality. Moreover, a complete zero anaphor resolution baseline system is built to demonstrate, from a computability perspective, the appropriateness and effectiveness of the proposed representation architecture. © Copyright 2021, Institute of Software, the Chinese Academy of Sciences. All rights reserved.
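The annotation categories named in the abstract can be pictured as a small data model. The following is only an illustrative sketch, not the authors' annotation format: the class names, field names, and integer EDU indexing are all hypothetical, and the rule that an Intra-EDU link carries no antecedent status is an assumption drawn from the abstract's statement that the four-way status split applies to the Inter-EDU type.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ZeroElementRole(Enum):
    """Role of the zero element inside its EDU (two categories in the abstract)."""
    CORE = "core"
    MODIFIER = "modifier"


class CorefScope(Enum):
    """Position of the antecedent relative to the zero element."""
    INTRA_EDU = "Intra-EDU"
    INTER_EDU = "Inter-EDU"


class AntecedentStatus(Enum):
    """Status of the antecedent, used for Inter-EDU links."""
    ENTITY = "entity"
    EVENT = "event"
    UNION = "union"
    OTHERS = "others"


@dataclass(frozen=True)
class ZeroAnaphoraAnnotation:
    # Hypothetical fields: EDUs are identified by their index within one
    # paragraph-level discourse tree.
    edu_index: int
    role: ZeroElementRole
    antecedent_edu_index: int
    status: Optional[AntecedentStatus] = None

    @property
    def scope(self) -> CorefScope:
        # Same EDU means Intra-EDU; any other EDU means Inter-EDU.
        if self.edu_index == self.antecedent_edu_index:
            return CorefScope.INTRA_EDU
        return CorefScope.INTER_EDU

    def __post_init__(self) -> None:
        # Assumption: status is required for Inter-EDU and absent for Intra-EDU.
        is_inter = self.scope is CorefScope.INTER_EDU
        if is_inter and self.status is None:
            raise ValueError("Inter-EDU links need an antecedent status")
        if not is_inter and self.status is not None:
            raise ValueError("Intra-EDU links carry no antecedent status")


# Usage: a core-type zero element in EDU 3 whose antecedent is an entity in EDU 1.
link = ZeroAnaphoraAnnotation(3, ZeroElementRole.CORE, 1, AntecedentStatus.ENTITY)
```

Deriving the scope from the two EDU indices, rather than storing it, keeps the Intra-EDU/Inter-EDU label from contradicting the positions it is meant to describe.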
Pages: 3782-3801
Page count: 19