Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models

Cited by: 3
Authors
Sadad, Tariq [1]
Aurangzeb, Raja Atif [2]
Imran [4]
Safran, Mejdl [3]
Alfarhood, Sultan [3]
Kim, Jungsuk [5]
Affiliations
[1] Univ Engn & Technol, Dept Comp Sci, Mardan 23200, Pakistan
[2] Int Islamic Univ Islamabad, Dept Comp Sci & Software Engn, Islamabad 44000, Pakistan
[3] King Saud Univ, Coll Comp & Informat Sci, Dept Comp Sci, Riyadh 11543, Saudi Arabia
[4] Gachon Univ, Dept Biomed Engn, Incheon 21936, South Korea
[5] Gachon Univ, Dept Biomed Engn, Seongnam 13120, South Korea
Funding
National Research Foundation, Singapore;
Keywords
BERT; deep learning; DNA; RNA sequence; K-MERS;
DOI
10.3390/biomedicines11051323
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification code
071010 ; 081704 ;
Abstract
Viruses infect millions of people worldwide each year, and some cause cancer or increase cancer risk. Because viral genomes are highly mutable, new viruses can be expected to emerge, as happened with COVID-19 and influenza. Traditional virology relies on predefined rules to identify viruses, but a new virus may diverge partially or completely from the reference genome, so statistical methods and similarity calculations do not suffice for every genome sequence. Identifying DNA/RNA-based viral sequences is a crucial step in distinguishing lethal pathogens, including their variants and strains. Although various bioinformatics tools can align such sequences, expert biologists are needed to interpret the results. Computational virology is the field that studies viruses, their origins, and drug discovery, and machine learning plays a crucial role in it by extracting domain- and task-specific features. This paper proposes a genome analysis system that uses deep learning to identify dozens of viruses. The system takes nucleotide sequences from the NCBI GenBank database and uses a BERT tokenizer to break them into tokens from which features are extracted. We also generated synthetic data for viruses with small sample sizes. The proposed system has two components: a BERT architecture trained from scratch and designed specifically for DNA analysis, which learns to predict the next codons in an unsupervised manner, and a classifier that identifies important features and captures the relationship between genotype and phenotype. Our system achieved an accuracy of 97.69% in identifying viral sequences.
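The abstract describes breaking nucleotide sequences into tokens, and the keywords mention k-mers. The following is a minimal sketch of what overlapping k-mer tokenization with a BERT-style vocabulary could look like. It is not the authors' implementation: the function names (`kmer_tokenize`, `build_vocab`, `encode`), the choice of k = 3, the stride, and the special-token list are all assumptions made for illustration.

```python
"""Illustrative k-mer tokenization of nucleotide sequences (hypothetical helpers)."""

# BERT-style special tokens; the exact set used in the paper is an assumption.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def kmer_tokenize(sequence, k=3, stride=1):
    """Split a sequence into k-mers; stride=1 gives overlapping k-mers."""
    if k < 1 or stride < 1:
        raise ValueError("k and stride must be positive integers")
    seq = sequence.upper()
    # Sequences shorter than k yield an empty list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(sequences, k=3, stride=1):
    """Map special tokens, then every observed k-mer (sorted), to integer ids."""
    kmers = sorted({t for s in sequences for t in kmer_tokenize(s, k, stride)})
    return {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + kmers)}


def encode(sequence, vocab, k=3, stride=1):
    """Convert a sequence to ids, wrapped in [CLS] ... [SEP]; unseen k-mers map to [UNK]."""
    unk = vocab["[UNK]"]
    body = [vocab.get(t, unk) for t in kmer_tokenize(sequence, k, stride)]
    return [vocab["[CLS]"]] + body + [vocab["[SEP]"]]


if __name__ == "__main__":
    toy_sequences = ["ATGCGT", "ATGAAA"]  # toy data, not from GenBank
    vocab = build_vocab(toy_sequences)
    print(kmer_tokenize("ATGCGT"))
    print(encode("ATGCGT", vocab))
```

With `stride=1` adjacent tokens share k-1 nucleotides, which keeps local context at the cost of longer token sequences; setting `stride=k` gives non-overlapping, codon-like tokens instead.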
Pages: 12
Related papers
50 in total
  • [21] Prediction of Marine Shaft Centerline Trajectories Using Transformer-Based Models
    Han, Jialin
    Zhu, Qingbo
    Yang, Sheng
    Xia, Wan
    Yao, Yongjun
    SYMMETRY-BASEL, 2025, 17 (01)
  • [22] Generating Fake Cyber Threat Intelligence Using Transformer-Based Models
    Ranade, Priyanka
    Piplai, Aritran
    Mittal, Sudip
    Joshi, Anupam
    Finin, Tim
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [23] Pre-Trained Transformer-Based Models for Text Classification Using Low-Resourced Ewe Language
    Agbesi, Victor Kwaku
    Chen, Wenyu
    Yussif, Sophyani Banaamwini
    Hossin, Md Altab
    Ukwuoma, Chiagoziem C.
    Kuadey, Noble A.
    Agbesi, Colin Collinson
    Samee, Nagwan Abdel
    Jamjoom, Mona M.
    Al-antari, Mugahed A.
    SYSTEMS, 2024, 12 (01)
  • [24] On the Use of Transformer-Based Models for Intent Detection Using Clustering Algorithms
    Moura, Andre
    Lima, Pedro
    Mendonca, Fabio
    Mostafa, Sheikh Shanawaz
    Morgado-Dias, Fernando
    APPLIED SCIENCES-BASEL, 2023, 13 (08)
  • [25] Diabetic Foot Ulcer Segmentation Using Convolutional and Transformer-Based Models
    Hassib, Mariam
    Ali, Maram
    Mohamed, Amina
    Torki, Marwan
    Hussein, Mohamed
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2023, 13797 LNCS: 83-91
  • [26] Enhancing Address Data Integrity using Transformer-Based Language Models
    Kurklu, Omer Faruk
    Akagiunduz, Erdem
    32ND IEEE SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU 2024, 2024
  • [27] Improving transformer-based acoustic model performance using sequence discriminative training
    Lee, Chae-Won
    Chang, Joon-Hyuk
    JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2022, 41 (03): 335-341
  • [28] A Performance Comparison of Convolutional Neural Networks and Transformer-Based Models for Classification of the Spread of Bushfires
    Tang, Taylor
    Jayaputera, Glenn T.
    Sinnott, Richard O.
    2024 IEEE 20TH INTERNATIONAL CONFERENCE ON E-SCIENCE, E-SCIENCE 2024, 2024
  • [29] Privacy-Aware Human Activity Classification using a Transformer-based Model
    Thipprachak, Khirakorn
    Tangamchit, Poj
    Lerspalungsanti, Sarawut
    2022 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2022: 528-534
  • [30] Vision Transformer-based Classification for Lung and Colon Cancer using Histopathology Images
    Hasan, Munjur
    Rahman, Md Saifur
    Islam, Sabrina
    Ahmed, Tanvir
    Rifat, Nafiz
    Ahsan, Mostofa
    Gomes, Rahul
    Chowdhury, Md.
    Proceedings - 22nd IEEE International Conference on Machine Learning and Applications, ICMLA 2023, 2023: 1300-1304