A dictionary-based approach to normalizing gene names in one domain of knowledge from the biomedical literature

Cited by: 9
Authors
Galvez, Carmen [1]
de Moya-Anegon, Felix [2]
Affiliations
[1] Univ Granada, Dept Informat Sci, Commun & Documentat Fac, Granada, Spain
[2] Inst Publ Goods & Policies IPP, SCImago Res Grp CSIC, Madrid, Spain
Keywords
Linguistics; Dictionary; Gene name normalization; Genes; LITERATURE-BASED DISCOVERY; MOLECULAR-BIOLOGY; MEDICAL LITERATURES; PROTEIN NAMES; FISH OIL; TEXT; INFORMATION; NOMENCLATURE; GUIDELINES; ONTOLOGY
DOI
10.1108/00220411211200301
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Purpose - Gene term variation is a shortcoming in text-mining applications based on biomedical literature-based knowledge discovery. The purpose of this paper is to propose a technique for normalizing gene names in biomedical literature.
Design/methodology/approach - Under this proposal, the normalized forms can be characterized as a unique gene symbol, defined as the official symbol or normalized name. The unification method involves five stages: collection of the gene term, using the resources provided by the Entrez Gene database; encoding of gene-naming terms in a table or binary matrix; design of a parametrized finite-state graph (P-FSG); automatic generation of a dictionary; and matching based on dictionary look-up to transform the gene mentions into the corresponding unified form.
Findings - The findings show that the approach yields a high percentage of recall. Precision is only moderately high, basically due to ambiguity problems between gene-naming terms and words and abbreviations in general English.
Research limitations/implications - The major limitation of this study is that biomedical abstracts were analyzed instead of full-text documents. The number of under-normalization and over-normalization errors is reduced considerably by limiting the realm of application to biomedical abstracts in a well-defined domain.
Practical implications - The system can be used for practical tasks in biomedical literature mining. Normalized gene terms can be used as input to literature-based gene clustering algorithms, for identifying hidden gene-to-disease, gene-to-gene and gene-to-literature relationships.
Originality/value - Few systems for gene term variation handling have been developed to date. The technique described performs gene name normalization by dictionary look-up.
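The final stage named in the abstract, matching by dictionary look-up, might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the variant table, the function names `build_dictionary` and `normalize`, and the regular-expression matching are all assumptions made for the example, and the P-FSG stage used in the paper to generate the dictionary is replaced here by a small hand-written table.

```python
import re

# Hypothetical variant table (illustrative only): official symbol -> name variants.
# In the paper the variants come from Entrez Gene and a generated dictionary.
VARIANTS = {
    "TP53": ["TP53", "p53", "tumor protein p53"],
    "BRCA1": ["BRCA1", "BRCA-1", "breast cancer 1"],
}


def build_dictionary(variants):
    """Map each lower-cased variant to its official (normalized) symbol."""
    lookup = {}
    for symbol, names in variants.items():
        for name in names:
            lookup[name.lower()] = symbol
    return lookup


def normalize(text, lookup):
    """Replace every known gene mention in text with its official symbol."""
    # Try longer variants first so multi-word names win over their substrings.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])("
        + "|".join(re.escape(v) for v in alternatives)
        + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


lookup = build_dictionary(VARIANTS)
print(normalize("Mutations in p53 and BRCA-1 were found.", lookup))
```

Case-insensitive matching of this kind favors recall, and it also hints at the precision problem the abstract reports: a short variant that coincides with an ordinary English word or abbreviation would be rewritten as a gene symbol as well.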
Pages: 5-30
Page count: 26