Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models

Cited by: 3

Authors:

Sadad, Tariq [1]

Aurangzeb, Raja Atif [2]

Imran [4]

Safran, Mejdl [3]

Alfarhood, Sultan [3]

Kim, Jungsuk [5]

Affiliations:

[1] Univ Engn & Technol, Dept Comp Sci, Mardan 23200, Pakistan

[2] Int Islamic Univ Islamabad, Dept Comp Sci & Software Engn, Islamabad 44000, Pakistan

[3] King Saud Univ, Coll Comp & Informat Sci, Dept Comp Sci, Riyadh 11543, Saudi Arabia

[4] Gachon Univ, Dept Biomed Engn, Incheon 21936, South Korea

[5] Gachon Univ, Dept Biomed Engn, Seongnam 13120, South Korea

Source:

BIOMEDICINES | 2023 / Vol. 11 / Issue 05

Funding:

National Research Foundation of Singapore;

Keywords:

BERT; deep learning; DNA; RNA sequence; K-MERS;

DOI:

10.3390/biomedicines11051323

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Viruses infect millions of people worldwide each year, and some can cause cancer or increase the risk of cancer. Because viruses have highly mutable genomes, new viruses, such as those behind COVID-19 and influenza, may emerge in the future. Traditional virology relies on predefined rules to identify viruses, but new viruses may be completely or partially divergent from the reference genome, rendering statistical methods and similarity calculations insufficient for all genome sequences. Identifying DNA/RNA-based viral sequences is a crucial step in distinguishing types of lethal pathogens, including their variants and strains. While various bioinformatics tools can align these sequences, expert biologists are required to interpret the results. Computational virology is a scientific field that studies viruses, their origins, and drug discovery, where machine learning plays a crucial role in extracting domain- and task-specific features to tackle this challenge. This paper proposes a genome analysis system that uses advanced deep learning to identify dozens of viruses. The system takes nucleotide sequences from the NCBI GenBank database and uses a BERT tokenizer to extract features from the sequences by breaking them down into tokens. We also generated synthetic data for viruses with small sample sizes. The proposed system has two components: a BERT architecture trained from scratch and designed specifically for DNA analysis, which learns the next codons in an unsupervised manner, and a classifier that identifies important features and captures the relationship between genotype and phenotype. Our system achieved an accuracy of 97.69% in identifying viral sequences.


Pages: 12

