Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification

Cited by: 23
Authors
Schuemie, Martijn J. [1]
Mons, Barend [1]
Weeber, Marc [1]
Kors, Jan A. [1]
Affiliations
[1] Erasmus Univ, Med Ctr, Dept Med Informat, NL-3000 DR Rotterdam, Netherlands
Keywords
gene name identification; information extraction; dictionary; thesaurus; spelling variations
DOI
10.1016/j.jbi.2006.09.002
Chinese Library Classification number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
Gene and protein name identification in text requires a dictionary approach to relate synonyms to the same gene or protein, and to link names to external databases. However, existing dictionaries are incomplete. We investigate two complementary methods for automatic generation of a comprehensive dictionary: combination of information from existing gene and protein databases and rule-based generation of spelling variations. Both methods have been reported in the literature before, but have hitherto not been combined and evaluated systematically. We combined gene and protein names from several existing databases of four different organisms. The combined dictionaries showed a substantial increase in recall on three different test sets, as compared to any single database. Application of 23 spelling variation rules to the combined dictionaries further increased recall. However, many rules appeared to have no effect, and some appeared to have a detrimental effect on precision. © 2006 Elsevier Inc. All rights reserved.
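The two methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the 23 rules, so the single rule below (hyphen, space, or nothing between a letter stem and a trailing number) is only an assumed example of a spelling variation rule. The function names, the toy databases, and the shared identifiers `GENE1` and `GENE2` are hypothetical; real databases would first need their identifiers mapped to one another.

```python
import re
from collections import defaultdict


def combine_dictionaries(*sources):
    """Merge several {gene_id: [names]} mappings into one {gene_id: set of names}.

    Assumes all sources already use the same gene identifiers.
    """
    combined = defaultdict(set)
    for source in sources:
        for gene_id, names in source.items():
            combined[gene_id].update(names)
    return dict(combined)


def spelling_variants(name):
    """Return the name plus variants from one illustrative rule.

    Assumed rule (not taken from the paper): a letter stem followed by a
    number may be written with a hyphen, a space, or no separator.
    """
    variants = {name}
    match = re.fullmatch(r"([A-Za-z]+)[- ]?(\d+)", name)
    if match:
        stem, number = match.groups()
        variants.update({stem + number, stem + "-" + number, stem + " " + number})
    return variants


def expand(dictionary):
    """Apply the spelling variation rule to every name in the dictionary."""
    return {
        gene_id: set().union(*(spelling_variants(n) for n in names))
        for gene_id, names in dictionary.items()
    }


# Toy data, invented for illustration only.
db_a = {"GENE1": ["IL2"]}
db_b = {"GENE1": ["interleukin 2"], "GENE2": ["p53"]}

combined = combine_dictionaries(db_a, db_b)
expanded = expand(combined)
```

In this sketch, `expanded["GENE1"]` contains `IL-2` and `IL 2` in addition to the two original names, which is how added variants can raise recall. The same rule also turns `p53` into `p 53`, a hint of how an overly general rule could cost precision.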
Pages: 316-324
Number of pages: 9