PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning

Cited by: 1
Authors
Ramachandran, Anand [1]
Lumetta, Steven S. [1]
Chen, Deming [1]
Affiliations
[1] Univ Illinois, Urbana, IL 61820 USA
Funding
National Science Foundation (USA);
Keywords
LANGUAGE;
DOI
10.1371/journal.pcbi.1011790
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
One of the challenges in a viral pandemic is the emergence of novel variants with different phenotypical characteristics. An ability to forecast future viral individuals at the sequence level enables advance preparation by characterizing the sequences and closing vulnerabilities in current preventative and therapeutic methods. In this article, we explore, in the context of a viral pandemic, the problem of using protein language models to generate complete instances of undiscovered viral protein sequences that have a high likelihood of being discovered in the future. Current approaches to training these models fit model parameters to a known sequence set, which does not suit pandemic forecasting because future sequences differ from known sequences in some respects. To address this, we develop a novel method, called PandoGen, to train protein language models for the pandemic protein forecasting task. PandoGen combines techniques such as synthetic data generation, conditional sequence generation, and reward-based learning, enabling the model to forecast future sequences with a high propensity to spread. Applying our method to modeling the SARS-CoV-2 Spike protein sequence, we find empirically that our model forecasts twice as many novel sequences with five times the case counts compared to a model that is 30x larger. Our method forecasts unseen lineages months in advance, whereas models 4x and 30x larger forecast almost no new lineages. When trained on data available up to a month before the onset of important Variants of Concern, our method consistently forecasts sequences belonging to those variants within tight sequence budgets.

Viral protein sequences play a pivotal role in the spread of a pandemic. As the virus evolves, so do the viral proteins, increasing the potency of the virus. Knowledge of future viral protein sequences can be invaluable because it allows us to test the efficacy of preventative and treatment methods against future changes to the virus, and to tailor them to such changes early. We attempt to forecast viral proteins ahead of time. Making such predictions is challenging because the prediction target is a sequence with thousands of positions, and a single mispredicted sequence position may invalidate the entire prediction. Also, as the virus continues to evolve, the data available to train models becomes obsolete. Addressing these challenges, we create a novel approach to training models of the SARS-CoV-2 Spike protein that are specifically tailored to forecasting future sequences. Models trained using this approach outperform existing approaches in their effectiveness. In addition, our method can train models to forecast important pandemic variants ahead of time.
Pages: 31