Complex Entity Recognition Based on Prior Semantic Knowledge and Type Embedding

Authors
Jiang X.-B. [1]
He K. [1]
Yan G.-Y. [1]
Affiliation
[1] School of Electronic and Information Engineering, South China University of Technology, Guangzhou
Source
Ruan Jian Xue Bao/Journal of Software | 2023, Vol. 34, No. 12
Keywords
2D probability encoding; complex entity recognition; gated interactive attention; information extraction
DOI
10.13328/j.cnki.jos.006750
Abstract
Entity recognition is a key task in information extraction. As information extraction technology has developed, researchers have shifted their focus from recognizing simple entities to recognizing complex ones. Complex entities usually lack explicit features and are more complicated in syntactic construction and part-of-speech composition, which makes recognizing them a great challenge. In addition, existing models widely use span-based methods to identify nested entities. As a result, they often suffer from ambiguity when detecting entity boundaries, which degrades recognition performance. To address this challenge and this problem, this study proposes GIA-2DPE, an entity recognition model based on prior semantic knowledge and type embedding. The model uses keyword sequences of entity categories as prior semantic knowledge to improve its understanding of entities, uses type embedding to capture latent features of different entity types, and then combines the prior knowledge with the entity-type features through a gated interactive attention mechanism to assist in recognizing complex entities. Moreover, the model uses 2D probability encoding to predict entity boundaries and combines boundary features with contextual features to make boundary detection more accurate, thereby improving nested entity recognition. The study conducts extensive experiments on seven English datasets and two Chinese datasets. The results show that GIA-2DPE outperforms state-of-the-art models and achieves a 10.4% F1 improvement over the baseline in entity recognition on the ScienceIE dataset. © 2023 Chinese Academy of Sciences. All rights reserved.
Pages: 5649-5669
Page count: 20
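
Illustrative sketch
The abstract mentions predicting entity boundaries with a 2D probability encoding but gives no implementation details. The Python below is only a generic illustration of how a token-by-token table of span probabilities could be decoded into possibly nested spans; it is not the paper's method. The function name `decode_spans`, the 0.5 threshold, the example tokens, and the probability values are all assumptions made up for illustration.

```python
from typing import List, Tuple


def decode_spans(
    span_probs: List[List[float]], threshold: float = 0.5
) -> List[Tuple[int, int, float]]:
    """Decode a square table where span_probs[i][j] is the (assumed)
    probability that tokens i..j, inclusive, form an entity.

    Only the upper triangle (start <= end) is read, and every cell at or
    above the threshold is kept, so nested spans can coexist.
    """
    n = len(span_probs)
    spans = []
    for start in range(n):
        for end in range(start, n):
            prob = span_probs[start][end]
            if prob >= threshold:
                spans.append((start, end, prob))
    return spans


# Hypothetical example: an outer entity with a nested inner entity.
tokens = ["South", "China", "University", "of", "Technology"]
probs = [[0.0] * len(tokens) for _ in tokens]
probs[0][4] = 0.9   # whole phrase
probs[0][1] = 0.8   # nested span "South China"
probs[3][2] = 0.99  # lower triangle (start > end), never read

for start, end, prob in decode_spans(probs):
    print(" ".join(tokens[start:end + 1]), prob)
```

Because each cell is scored independently, an outer span and a span nested inside it can both be returned, which is the property that makes a 2D table convenient for nested entity recognition.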