Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification

Cited by: 23
Authors
Schuemie, Martijn J. [1]
Mons, Barend [1]
Weeber, Marc [1]
Kors, Jan A. [1]
Affiliations
[1] Erasmus Univ, Med Ctr, Dept Med Informat, NL-3000 DR Rotterdam, Netherlands
Keywords
gene name identification; information extraction; dictionary; thesaurus; spelling variations
DOI
10.1016/j.jbi.2006.09.002
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Gene and protein name identification in text requires a dictionary approach to relate synonyms to the same gene or protein, and to link names to external databases. However, existing dictionaries are incomplete. We investigate two complementary methods for automatic generation of a comprehensive dictionary: combination of information from existing gene and protein databases, and rule-based generation of spelling variations. Both methods have been reported in the literature before, but have hitherto not been combined and evaluated systematically. We combined gene and protein names from several existing databases of four different organisms. The combined dictionaries showed a substantial increase in recall on three different test sets, as compared to any single database. Application of 23 spelling variation rules to the combined dictionaries further increased recall. However, many rules appeared to have no effect, and some appeared to have a detrimental effect on precision. © 2006 Elsevier Inc. All rights reserved.
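The two methods named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two rules shown are hypothetical examples and not the 23 rules evaluated in the paper, the database contents are invented toy data, and merging by a shared gene identifier is an assumption made for the illustration.

```python
import re
from collections import defaultdict


def merge_dictionaries(*sources):
    """Union the synonym sets of several {gene_id: names} dictionaries.

    Assumes the sources already share gene identifiers (an assumption of
    this sketch, not something stated in the abstract).
    """
    merged = defaultdict(set)
    for source in sources:
        for gene_id, names in source.items():
            merged[gene_id].update(names)
    return dict(merged)


# Hypothetical spelling variation rules, for illustration only.
def hyphen_to_space(name):
    return name.replace("-", " ")


def split_letter_digit(name):
    # Insert a space at each letter-to-digit boundary: "ABC1" -> "ABC 1".
    return re.sub(r"(?<=[A-Za-z])(?=\d)", " ", name)


RULES = [hyphen_to_space, split_letter_digit]


def expand(names, rules=RULES):
    """Return the names plus every variant from one application of each rule."""
    variants = set(names)
    for name in names:
        for rule in rules:
            variants.add(rule(name))
    return variants


def recall(dictionary_names, gold_mentions):
    """Fraction of gold mentions found by exact, case-insensitive lookup."""
    lookup = {n.lower() for n in dictionary_names}
    hits = sum(1 for m in gold_mentions if m.lower() in lookup)
    return hits / len(gold_mentions)


# Toy data (invented): two databases and four annotated mentions.
db_a = {"G1": {"ABC1"}}
db_b = {"G1": {"abc-transporter 1"}, "G2": {"XYZ"}}
gold = ["ABC 1", "xyz", "abc transporter 1", "foo"]

merged = merge_dictionaries(db_a, db_b)
merged_names = set().union(*merged.values())
expanded_names = expand(merged_names)

single_recall = recall(db_a["G1"], gold)
merged_recall = recall(merged_names, gold)
expanded_recall = recall(expanded_names, gold)
```

On this toy data, recall rises from the single database to the merged dictionary and again after rule expansion. Each generated variant can also match unrelated text, which is one way a rule could lower precision; measuring that would need a test set annotated with false positives, which the sketch does not model.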
Pages: 316-324
Page count: 9