A simple approach for protein name identification: prospects and limits

Cited by: 30
Authors
Fundel, K [1]
Güttler, D [1]
Zimmer, R [1]
Apostolakis, J [1]
Affiliation
[1] Univ Munich, Inst Informat, D-80333 Munich, Germany
DOI
10.1186/1471-2105-6-S1-S15
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Significant parts of biological knowledge are available only as unstructured text in articles of biomedical journals. By automatically identifying gene and gene product (protein) names and mapping these to unique database identifiers, it becomes possible to extract and integrate information from articles and various data sources. We present a simple and efficient approach that identifies gene and protein names in texts and returns database identifiers for matches. It has been evaluated in the recent BioCreAtIvE entity extraction and mention normalization task by an independent jury.

Methods: Our approach is based on the use of synonym lists that map the unique database identifier of each gene/protein to its different synonym names. For yeast and mouse, synonym lists were used as provided by the organizers, who generated them from public model organism databases. The synonym list for fly was generated directly from the corresponding organism database. The lists were then extensively curated in a largely automated procedure and matched against MEDLINE abstracts by exact text matching. Rule-based and support vector machine-based post filters were designed and applied to improve precision.

Results: Our procedure showed high recall and precision, with F-measures of 0.897 for yeast and 0.764/0.773 for mouse in the BioCreAtIvE assessment (Task 1B) and 0.768 for fly in a post-evaluation.

Conclusion: The results were close to the best across all submissions. Depending on the synonym properties, it can be crucial to consider context and to filter out erroneous matches. This is especially important for fly, which has a very challenging nomenclature for the protein name identification task. Here, the support vector machine-based post filter proved to be very effective.
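The synonym-list matching described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the identifiers, synonym names, and the `build_index` and `find_mentions` helpers are hypothetical, and a plain stop-word set stands in for the rule-based and support vector machine-based post filters, which the abstract does not specify.

```python
import re
from collections import defaultdict

# Hypothetical toy synonym list: database identifier -> synonym names.
# Real lists would come from model organism databases.
SYNONYMS = {
    "ID:0001": ["ABC1", "abc one protein"],
    "ID:0002": ["XYZ2"],
    "ID:0003": ["was"],  # a synonym that collides with a common English word
}


def build_index(synonyms):
    """Invert the synonym list: synonym name -> set of database identifiers."""
    index = defaultdict(set)
    for db_id, names in synonyms.items():
        for name in names:
            index[name].add(db_id)
    return index


def find_mentions(text, index, stopwords=frozenset()):
    """Return sorted (offset, matched text, identifier) tuples.

    Matching is exact and case-sensitive, and a synonym only matches when it
    is not embedded in a longer word. Synonyms whose lowercase form is in
    `stopwords` are skipped; this is an assumed, much simpler stand-in for
    the post filters mentioned in the abstract.
    """
    hits = []
    for name, ids in index.items():
        if name.lower() in stopwords:
            continue
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            for db_id in sorted(ids):
                hits.append((match.start(), match.group(0), db_id))
    return sorted(hits)


if __name__ == "__main__":
    index = build_index(SYNONYMS)
    sentence = "ABC1 was shown to bind XYZ2."
    for hit in find_mentions(sentence, index, stopwords={"was"}):
        print(hit)
```

Without the stop-word set, the common word "was" in the sample sentence is reported as a mention of `ID:0003`, which is the kind of erroneous match the Conclusion says must be filtered out.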
Pages: 10