A dictionary-based approach to normalizing gene names in one domain of knowledge from the biomedical literature

Cited by: 9

Authors:

Galvez, Carmen [1]

de Moya-Anegon, Felix [2]

Affiliations:

[1] Univ Granada, Dept Informat Sci, Commun & Documentat Fac, Granada, Spain

[2] Inst Publ Goods & Policies IPP, SCImago Res Grp CSIC, Madrid, Spain

Source:

JOURNAL OF DOCUMENTATION | 2012 / Vol. 68 / Issue 01

Keywords:

Linguistics; Dictionary; Gene name normalization; Genes; LITERATURE-BASED DISCOVERY; MOLECULAR-BIOLOGY; MEDICAL LITERATURES; PROTEIN NAMES; FISH OIL; TEXT; INFORMATION; NOMENCLATURE; GUIDELINES; ONTOLOGY;

DOI:

10.1108/00220411211200301

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Purpose - Gene term variation is a shortcoming in text-mining applications based on biomedical literature-based knowledge discovery. The purpose of this paper is to propose a technique for normalizing gene names in biomedical literature. Design/methodology/approach - Under this proposal, the normalized forms can be characterized as a unique gene symbol, defined as the official symbol or normalized name. The unification method involves five stages: collection of the gene term, using the resources provided by the Entrez Gene database; encoding of gene-naming terms in a table or binary matrix; design of a parametrized finite-state graph (P-FSG); automatic generation of a dictionary; and matching based on dictionary look-up to transform the gene mentions into the corresponding unified form. Findings - The findings show that the approach yields a high percentage of recall. Precision is only moderately high, basically due to ambiguity problems between gene-naming terms and words and abbreviations in general English. Research limitations/implications - The major limitation of this study is that biomedical abstracts were analyzed instead of full-text documents. The number of under-normalization and over-normalization errors is reduced considerably by limiting the realm of application to biomedical abstracts in a well-defined domain. Practical implications - The system can be used for practical tasks in biomedical literature mining. Normalized gene terms can be used as input to literature-based gene clustering algorithms, for identifying hidden gene-to-disease, gene-to-gene and gene-to-literature relationships. Originality/value - Few systems for gene term variation handling have been developed to date. The technique described performs gene name normalization by dictionary look-up.
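The last two stages named in the abstract (automatic generation of a dictionary, then matching by dictionary look-up) can be illustrated with a minimal sketch. This is not the authors' implementation: it does not model the Entrez Gene collection step, the binary matrix, or the P-FSG, and the variant table, the function names `build_dictionary` and `normalize`, and the example gene entries are all hypothetical, chosen only to show the general idea of mapping gene mentions to one official symbol.

```python
import re

# Hypothetical variant table: official symbol -> known name variants.
# Illustrative entries only; they are not taken from the paper.
VARIANTS = {
    "TP53": ["TP53", "p53", "tumor protein p53"],
    "BRCA1": ["BRCA1", "breast cancer 1", "BRCA-1"],
}

def build_dictionary(variants):
    """Invert the variant table: lowercased variant -> official symbol."""
    lookup = {}
    for symbol, names in variants.items():
        for name in names:
            lookup[name.lower()] = symbol
    return lookup

def normalize(text, lookup):
    """Replace each known gene mention in text with its official symbol."""
    # Try longer variants first so multi-word names win over shorter ones.
    names = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    # Require non-alphanumeric boundaries so variants embedded in
    # longer tokens are left alone.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + alternation + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)

LOOKUP = build_dictionary(VARIANTS)
print(normalize("Mutations in p53 and BRCA-1 were found.", LOOKUP))
```

Case-insensitive matching of this kind favours recall, and it also hints at why the abstract reports only moderate precision: a short variant that coincides with an ordinary English word or abbreviation would be rewritten as well unless the domain is restricted.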

Pages: 5 - 30

Page count: 26
