Multiple sequence alignment-based RNA language model and its application to structural inference

Cited by: 18
Authors
Zhang, Yikun [1, 2]
Lang, Mei [3 ]
Jiang, Jiuhong [3 ]
Gao, Zhiqiang [4, 5]
Xu, Fan [5 ]
Litfin, Thomas [6 ]
Chen, Ke [3 ]
Singh, Jaswinder [3 ]
Huang, Xiansong [5 ]
Song, Guoli [5 ]
Tian, Yonghong [5 ]
Zhan, Jian [3 ]
Chen, Jie [1, 5]
Zhou, Yaoqi [3, 6]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn, Shenzhen 518055, Peoples R China
[2] Peking Univ, AI Sci AI4S Preferred Program, Shenzhen Grad Sch, Shenzhen 518055, Peoples R China
[3] Inst Syst & Phys Biol, Shenzhen Bay Lab, Shenzhen 518107, Peoples R China
[4] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China
[5] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[6] Griffith Univ, Inst Glycom, Parklands Dr, Southport, Qld 4215, Australia
Funding
National Key Research and Development Program of China;
Keywords
PROTEIN; SECONDARY; SEARCH; GENERATION; PREDICTION;
DOI
10.1093/nar/gkad1031
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.
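The abstract states that the attention maps can be mapped directly to 2D base-pairing probabilities but does not say how. The sketch below shows one plausible form of such a mapping, assuming a recipe often used with protein language models: symmetrize each attention head, apply the average product correction (APC), and combine the heads with a logistic regression. The function names, the array shapes, and the choice of this recipe are illustrative assumptions, not the authors' confirmed procedure.

```python
import numpy as np

def symmetrize(attn):
    """Make each head's L x L map symmetric. attn has shape (H, L, L)."""
    return attn + attn.transpose(0, 2, 1)

def apc(attn):
    """Average product correction, applied independently to each head."""
    row = attn.sum(axis=2, keepdims=True)         # (H, L, 1)
    col = attn.sum(axis=1, keepdims=True)         # (H, 1, L)
    total = attn.sum(axis=(1, 2), keepdims=True)  # (H, 1, 1)
    return attn - row * col / total

def base_pair_probs(attn, weights, bias):
    """Hypothetical head-wise logistic regression (an assumption).

    attn:    (H, L, L) attention maps, heads from all layers flattened
    weights: (H,) one regression coefficient per head
    bias:    scalar intercept
    Returns an (L, L) matrix of pairing probabilities.
    """
    feats = apc(symmetrize(attn))                         # (H, L, L)
    logits = np.tensordot(weights, feats, axes=1) + bias  # (L, L)
    return 1.0 / (1.0 + np.exp(-logits))

# Usage with random stand-in data (not real RNA-MSM output)
rng = np.random.default_rng(0)
attn = rng.random((4, 6, 6))
probs = base_pair_probs(attn, rng.normal(size=4), 0.0)
print(probs.shape)
```

In this setup only the per-head weights and the bias would be fitted on known structures, so any predictive signal has to come from the unsupervised attention maps themselves.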
Pages: 13