Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models

Cited by: 3
Authors
Sadad, Tariq [1 ]
Aurangzeb, Raja Atif [2 ]
Imran [4 ]
Safran, Mejdl [3 ]
Alfarhood, Sultan [3 ]
Kim, Jungsuk [5 ]
Affiliations
[1] Univ Engn & Technol, Dept Comp Sci, Mardan 23200, Pakistan
[2] Int Islamic Univ Islamabad, Dept Comp Sci & Software Engn, Islamabad 44000, Pakistan
[3] King Saud Univ, Coll Comp & Informat Sci, Dept Comp Sci, Riyadh 11543, Saudi Arabia
[4] Gachon Univ, Dept Biomed Engn, Incheon 21936, South Korea
[5] Gachon Univ, Dept Biomed Engn, Seongnam 13120, South Korea
Funding
National Research Foundation, Singapore
Keywords
BERT; deep learning; DNA; RNA sequence; k-mers
DOI
10.3390/biomedicines11051323
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010; 081704
Abstract
Viruses infect millions of people worldwide each year, and some cause cancer or increase the risk of developing it. Because viral genomes mutate rapidly, new viruses comparable to those behind COVID-19 and influenza may emerge in the future. Traditional virology relies on predefined rules to identify viruses, but a new virus may diverge completely or partially from the reference genome, so statistical methods and similarity calculations are not sufficient for all genome sequences. Identifying DNA/RNA-based viral sequences is a crucial step in distinguishing types of lethal pathogens, including their variants and strains. Although various bioinformatics tools can align such sequences, expert biologists are still required to interpret the results. Computational virology is a scientific field that studies viruses, their origins, and drug discovery; within it, machine learning plays a crucial role in extracting domain- and task-specific features. This paper proposes a genome analysis system that uses deep learning to identify dozens of viruses. The system takes nucleotide sequences from the NCBI GenBank database and uses a BERT tokenizer to extract features by breaking the sequences into tokens. We also generated synthetic data for viruses with small sample sizes. The proposed system has two components: a BERT architecture trained from scratch and designed specifically for DNA analysis, which learns to predict the next codons in an unsupervised manner, and a classifier that identifies important features and captures the relationship between genotype and phenotype. Our system achieved an accuracy of 97.69% in identifying viral sequences.
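The tokenization step mentioned in the abstract (breaking nucleotide sequences into k-mer tokens before they reach a BERT-style model) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of k = 3 (codon-sized), the stride of 1, the mapping of RNA "U" to "T", and the special-token set are all assumptions made for illustration, since the abstract does not specify them.

```python
# Minimal k-mer tokenization sketch (illustrative; not the paper's code).
# Assumptions: k=3, stride=1, RNA 'U' mapped to 'T', BERT-style special tokens.

SPECIAL_TOKENS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")


def kmer_tokens(sequence, k=3, stride=1):
    """Split a nucleotide sequence into k-mers taken every `stride` bases."""
    seq = sequence.upper().replace("U", "T")  # assumption: treat RNA as DNA
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(token_lists, specials=SPECIAL_TOKENS):
    """Assign an integer id to each special token, then to each new k-mer."""
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, wrapping them in [CLS] ... [SEP]; unknowns map to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab["[CLS]"]] + [vocab.get(t, unk) for t in tokens] + [vocab["[SEP]"]]


if __name__ == "__main__":
    tokens = kmer_tokens("ATGCGT", k=3)
    vocab = build_vocab([tokens])
    print(tokens)
    print(encode(tokens, vocab))
```

With a stride of 1 the k-mers overlap, which preserves local context at the cost of longer token sequences; a stride equal to k would instead yield non-overlapping, codon-like tokens. Which variant the paper uses is not stated in the abstract.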
Pages: 12