Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning

Cited by: 22

Authors:

Zhang, Yaoyun [1]

Xu, Jun [1]

Chen, Hui [2]

Wang, Jingqi [1]

Wu, Yonghui [1]

Prakasam, Manu [3]

Xu, Hua [1]

Affiliations:

[1] Univ Texas Hlth Sci Ctr Houston, Sch Biomed Informat, Houston, TX 77030 USA

[2] Capital Med Univ, Sch Biomed Engn, Beijing 100069, Peoples R China

[3] Mira Loma High Sch, Sacramento, CA 95821 USA

Source:

DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION | 2016

Funding:

National Institutes of Health (USA);

Keywords:

HYBRID SYSTEM; INFORMATION; EXTRACTION; TEXT; DATABASE;

DOI:

10.1093/database/baw049

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification code:

07 ; 0710 ; 09 ;

Abstract:

Medicinal chemistry patents contain rich information about chemical compounds. Although much effort has been devoted to extracting chemical entities from scientific literature, only a limited number of patent mining systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of information extraction systems for medicinal chemistry patents, the 2015 BioCreative V challenge organized a track on Chemical and Drug Named Entity Recognition from patent text (CHEMDNER patents). This track included three individual subtasks: (i) Chemical Entity Mention Recognition in Patents (CEMP), (ii) Chemical Passage Detection (CPD) and (iii) Gene and Protein Related Object task (GPRO). We participated in the two subtasks of CEMP and CPD using machine learning-based systems. Our machine learning-based systems employed the algorithms of conditional random fields (CRF) and structured support vector machines (SSVMs), respectively. To improve the performance of the NER systems, two strategies were proposed for feature engineering: (i) domain knowledge features of dictionaries, chemical structural patterns and semantic type information present in the context of the candidate chemical and (ii) unsupervised feature learning algorithms to generate word representation features by Brown clustering and a novel binarized word embedding to enhance the generalizability of the system. Further, the system output for the CPD task was generated from the patent titles and abstracts together with the chemicals recognized in the CEMP task. The effects of the proposed feature strategies on both the machine learning-based systems were investigated.
Our best system achieved the second best performance among 21 participating teams in CEMP with a precision of 87.18%, a recall of 90.78% and an F-measure of 88.94% and was the top performing system among nine participating teams in CPD with a sensitivity of 98.60%, a specificity of 87.21%, an accuracy of 94.75%, a Matthews correlation coefficient (MCC) of 88.24%, a precision at full recall (P_full_R) of 66.57% and an area under the precision-recall curve (AUC_PR) of 0.9347. The SSVM-based CEMP systems outperformed the CRF-based CEMP systems when using the same features. Features generated from both the domain knowledge and unsupervised learning algorithms significantly improved the chemical NER task on patents.
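
As a quick consistency check on the CEMP figures above, the sketch below recomputes the F-measure from the reported precision and recall. It assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: beta = 1, i.e. precision and recall are weighted equally.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CEMP precision and recall as reported in the abstract, in percent.
cemp_f = f_measure(87.18, 90.78)
print(f"{cemp_f:.2f}")  # prints 88.94
```

Under that assumption the result agrees with the 88.94% reported for the best CEMP system.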

Pages: 10
