Allerdictor: fast allergen prediction using text classification techniques

Cited by: 35
Authors
Dang, Ha X. [1]
Lawrence, Christopher B. [1,2]
Affiliations
[1] Virginia Tech, Virginia Bioinformat Inst, Blacksburg, VA 24061 USA
[2] Virginia Tech, Dept Biol Sci, Blacksburg, VA 24061 USA
Keywords
WEB SERVER; PROTEINS; ALGORITHM; DATABASE; ASTHMA;
DOI
10.1093/bioinformatics/btu004
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Motivation: Accurately identifying and eliminating allergens from biotechnology-derived products is important for human health. From a biomedical research perspective, it is also important to identify allergens in sequenced genomes. Many allergen prediction tools have been developed in recent years. Although these tools have achieved certain levels of specificity, when applied to large-scale allergen discovery (e.g. at a whole-genome scale), they still yield many false positives and thus low precision (even at low recall) because the data are extremely skewed (allergens are rare). Moreover, the most accurate tools are relatively slow because they use protein sequence alignment to build feature vectors for allergen classifiers. Additionally, the current allergen prediction tools are publicly available only as web servers, which do not support large batch submissions. These weaknesses make large-scale allergen discovery ineffective and inefficient in the public domain.
Results: We developed Allerdictor, a fast and accurate sequence-based allergen prediction tool that models protein sequences as text documents and uses a support vector machine in text classification for allergen prediction. Test results on multiple highly skewed datasets demonstrated that Allerdictor predicted allergens with high precision over high recall at fast speed. For example, Allerdictor took only ~6 min on a single-core PC to scan the whole Swiss-Prot database of ~540 000 sequences and identified <1% of them as allergens.
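The idea of treating a protein sequence as a text document can be illustrated with a minimal sketch. It assumes the "words" are overlapping k-mers and that a document is summarized by its word counts. The abstract does not state the word length, vocabulary, or SVM settings actually used, so `k=6`, the function names, and the toy sequence below are hypothetical, and the classifier step is omitted.

```python
from collections import Counter


def kmer_words(sequence, k=6):
    """Split a protein sequence into overlapping k-mers ("words").

    k=6 is an illustrative assumption, not a value taken from the abstract.
    Sequences shorter than k yield an empty list.
    """
    seq = sequence.strip().upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_words(sequence, k=6):
    """Count k-mer occurrences, giving a sparse document-style feature vector."""
    return Counter(kmer_words(sequence, k))


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real allergen.
    toy = "MKTAYIAKQRMKTAYI"
    for word, count in sorted(bag_of_words(toy).items()):
        print(word, count)
```

Count vectors of this kind could then be passed to any standard text classifier such as a linear support vector machine. Because the features come from a single pass over each sequence rather than from alignments, this style of representation is cheap to compute.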
Pages: 1120-1128
Page count: 9