Extraction of gene/protein names involved in each stage of spermatogenesis based on literature mining

Cited by: 0

Authors:

Zhu, Jun [1,3]

Yin, Jianping [1]

Zhao, Zhiheng [1]

Zhu, En [1]

Ban, Rongjun [2]

Source:

Zhu, J. (cqzhujun@126.com) | Year: 1600 / Vol.: Science Press / Issue: 51

Keywords:

Classification (of information); Text processing; Extraction; Statistical tests

DOI:

10.7544/issn1000-1239.2014.20121057

Abstract:

Spermatogenesis is an important biological process in the life of male mammals and strongly affects mammalian reproduction. Abnormal spermatogenesis is a major cause of male infertility, yet treatments for it are limited. Characterizing the genes/proteins involved in spermatogenesis is fundamental to understanding the mechanisms underlying this process and to developing treatments for spermatogenic disorders. However, most of the crucial information about spermatogenesis-related genes/proteins is scattered across a vast number of research articles, so manual curation of these genes/proteins is time-consuming. This paper proposes a novel literature-mining strategy that automatically extracts the names of spermatogenesis-related genes/proteins that function in different stages of spermatogenesis. First, it compares the performance of three different algorithms on different terms and applies an SVM classifier, trained on a manually prepared dataset, to classify spermatogenesis-related texts into three classes corresponding to the three stages of spermatogenesis. Then, integrating expert knowledge and grammar rules, it recognizes and extracts the gene/protein names of each spermatogenesis stage with high confidence. Finally, a manually curated test dataset is used to evaluate the strategy, which achieves an accuracy of 71.9%, supporting the reliability of the proposed method and its practical value.
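The rule-based extraction step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the stage labels (`mitotic`, `meiotic`, `post-meiotic`), the cue words, the trigger phrases, the stoplist, the symbol pattern, and the placeholder names `TOY1` and `TOY2`. The SVM classification step is not shown; simple keyword cues stand in for it.

```python
import re
from collections import defaultdict

# ASSUMPTION: stage labels and cue words are hypothetical stand-ins for the
# paper's stage classification; the abstract does not name the three stages.
STAGE_CUES = {
    "mitotic": ("spermatogonia", "mitosis", "mitotic"),
    "meiotic": ("spermatocyte", "meiosis", "meiotic"),
    "post-meiotic": ("spermatid", "spermiogenesis"),
}

# ASSUMPTION: toy "grammar rule": a sentence must contain one of these phrases.
TRIGGERS = ("is required for", "is expressed in", "regulates")

# ASSUMPTION: toy symbol pattern: an uppercase letter followed by at least
# two more uppercase letters or digits (e.g. TOY1).
GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{2,}\b")

# Common acronyms that match the pattern but are not gene/protein names.
STOPLIST = {"DNA", "RNA", "PCR", "SVM"}


def extract_by_stage(sentences):
    """Map each stage label to the sorted gene-like symbols found for it."""
    found = defaultdict(set)
    for sentence in sentences:
        lowered = sentence.lower()
        # Skip sentences that contain no trigger phrase.
        if not any(trigger in lowered for trigger in TRIGGERS):
            continue
        genes = {g for g in GENE_PATTERN.findall(sentence) if g not in STOPLIST}
        if not genes:
            continue
        # Assign the symbols to every stage whose cue words appear.
        for stage, cues in STAGE_CUES.items():
            if any(cue in lowered for cue in cues):
                found[stage] |= genes
    return {stage: sorted(names) for stage, names in found.items()}


if __name__ == "__main__":
    # Made-up sentences with placeholder symbols, for demonstration only.
    demo = [
        "TOY1 is required for synapsis in spermatocytes.",
        "TOY2 is expressed in spermatogonia.",
        "DNA repair is required for spermatid maturation.",
    ]
    print(extract_by_stage(demo))
```

A production system would replace the keyword cues with the trained classifier's stage prediction and the regular expression with a curated gene/protein dictionary; this sketch only shows how stage assignment and name filtering can be combined.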

Pages: 1352-1358
