PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning

Cited by: 1
Authors
Ramachandran, Anand [1 ]
Lumetta, Steven S. [1 ]
Chen, Deming [1 ]
Affiliations
[1] Univ Illinois, Urbana, IL 61820 USA
Funding
National Science Foundation (USA);
Keywords
LANGUAGE;
DOI
10.1371/journal.pcbi.1011790
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
One of the challenges in a viral pandemic is the emergence of novel variants with different phenotypical characteristics. An ability to forecast future viral individuals at the sequence level enables advance preparation by characterizing the sequences and closing vulnerabilities in current preventative and therapeutic methods. In this article, we explore, in the context of a viral pandemic, the problem of using protein language models to generate complete instances of undiscovered viral protein sequences that have a high likelihood of being discovered in the future. Current approaches to training these models fit model parameters to a known sequence set, which does not suit pandemic forecasting because future sequences differ from known sequences in some respects. To address this, we develop a novel method, called PandoGen, to train protein language models for the pandemic protein forecasting task. PandoGen combines techniques such as synthetic data generation, conditional sequence generation, and reward-based learning, enabling the model to forecast future sequences with a high propensity to spread. Applying our method to modeling the SARS-CoV-2 Spike protein sequence, we find empirically that our model forecasts twice as many novel sequences with five times the case counts compared to a model that is 30x larger. Our method forecasts unseen lineages months in advance, whereas models 4x and 30x larger forecast almost no new lineages. When trained on data available up to a month before the onset of important Variants of Concern, our method consistently forecasts sequences belonging to those variants within tight sequence budgets.

Viral protein sequences play a pivotal role in the spread of a pandemic. As the virus evolves, so do the viral proteins, increasing the potency of the virus. Knowledge of future viral protein sequences can be invaluable because it allows us to test the efficacy of preventative and treatment methods against future changes to the virus, and to tailor them to such changes early. We attempt to forecast viral proteins ahead of time. Making such predictions is challenging because the prediction target is a sequence with thousands of positions, and a single mispredicted sequence position may invalidate the entire prediction. Also, as the virus continues to evolve, the data available to train models becomes obsolete. Addressing these challenges, we create a novel approach to training models of the SARS-CoV-2 Spike protein that are tailored specifically to forecasting future sequences. Models trained using this approach outperform models trained with existing approaches. In addition, our method can train models to forecast important pandemic variants ahead of time.
Pages: 31