PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning

Cited by: 1
Authors
Ramachandran, Anand [1]
Lumetta, Steven S. [1]
Chen, Deming [1]
Affiliations
[1] Univ Illinois, Urbana, IL 61820 USA
Funding
U.S. National Science Foundation;
Keywords
LANGUAGE;
DOI
10.1371/journal.pcbi.1011790
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
One of the challenges in a viral pandemic is the emergence of novel variants with different phenotypical characteristics. An ability to forecast future viral individuals at the sequence level enables advance preparation by characterizing the sequences and closing vulnerabilities in current preventative and therapeutic methods. In this article, we explore, in the context of a viral pandemic, the problem of using protein language models to generate complete instances of undiscovered viral protein sequences that have a high likelihood of being discovered in the future. Current approaches to training these models fit model parameters to a known sequence set, which does not suit pandemic forecasting because future sequences differ from known sequences in some respects. To address this, we develop a novel method, called PandoGen, to train protein language models for the pandemic protein forecasting task. PandoGen combines techniques such as synthetic data generation, conditional sequence generation, and reward-based learning, enabling the model to forecast future sequences with a high propensity to spread. Applying our method to modeling the SARS-CoV-2 Spike protein sequence, we find empirically that our model forecasts twice as many novel sequences with five times the case counts compared to a model that is 30x larger. Our method forecasts unseen lineages months in advance, whereas models 4x and 30x larger forecast almost no new lineages. When trained on data available up to a month before the onset of important Variants of Concern, our method consistently forecasts sequences belonging to those variants within tight sequence budgets.
Viral protein sequences play a pivotal role in the spread of a pandemic. As the virus evolves, so do the viral proteins, increasing the potency of the virus. Knowledge of future viral protein sequences can be invaluable because it allows us to test the efficacy of preventative and treatment methods against future changes to the virus, and to tailor them to such changes early. We attempt to forecast viral proteins ahead of time. Making such predictions is challenging because the prediction target is a sequence with thousands of positions, and a single mispredicted position may invalidate the entire prediction. Also, as the virus continues to evolve, the data available to train models becomes obsolete. To address these challenges, we create a novel approach to training models of the SARS-CoV-2 Spike protein that are specifically tailored to forecasting future sequences. Models trained using this approach outperform models trained with existing approaches. In addition, our method can train models to forecast important pandemic variants ahead of time.
Pages: 31