Fast and Accurate Algorithms for Mapping and Aligning Long Reads

Cited by: 2
Authors
Yang, Wen [1, 2]
Wang, Lusheng [2]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, 83 Tat Chee Ave, Hong Kong, Peoples R China
[2] City Univ Hong Kong, Shenzhen Res Inst, Shenzhen, Peoples R China
Funding
National Science Foundation (United States)
Keywords
long-read mapping and long-read local alignment; longest common subsequence with distance constraints; k-mer-based local alignment with variable value of k; BASIC LOCAL ALIGNMENT; SEARCH; SEQUENCES; GENOME;
DOI
10.1089/cmb.2020.0603
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
DNA sequence analysis faces challenging tasks such as the identification of structural variants, the sequencing of repetitive regions, and the phasing of alleles. These tasks suffer from the short length of sequencing reads, where each read may cover fewer than two single nucleotide polymorphisms (SNPs) or fewer than two occurrences of a repeated region. It is believed that long reads can help to solve them. In this study, we have designed new algorithms for mapping long reads to reference genomes. We have also designed efficient and effective heuristic algorithms for local alignment of long reads against the corresponding segments of the reference genome. To design the new mapping algorithm, we formulate the problem as the longest common subsequence with distance constraints. The local alignment heuristic is based on the idea of recursive alignment of k-mers, where the value of k differs in each round. We have implemented all the algorithms in C++ and produced a software package named mapAlign. Experiments on real data sets showed that the newly proposed approach generates better alignments, in terms of both identity and alignment scores, for both Nanopore and single molecule real time sequencing (SMRT) data sets. For the human Nanopore and SMRT data sets, the new method successfully matches/aligns 91.53% and 85.36% of the letters in reads to identical letters on the reference genomes, respectively. In comparison, the best known method aligns only 88.44% and 79.08% of the letters in reads for the Nanopore and SMRT data sets, respectively. Our method is also faster than the best known method.
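The abstract does not spell out the formulation, so the following is only an illustrative sketch of one way a "longest common subsequence with distance constraints" could be read for mapping: shared k-mers between a read and the reference act as matched elements, and a longest chain of them is chosen such that consecutive matches are increasing in both sequences and the read gap and reference gap differ by at most a bound. The function names, the `max_gap_diff` parameter, and the quadratic dynamic program are assumptions made for illustration; they are not taken from mapAlign, which is implemented in C++.

```python
from collections import defaultdict


def kmer_anchors(read, ref, k):
    """Return sorted (read_pos, ref_pos) pairs where the same k-mer starts."""
    index = defaultdict(list)
    for j in range(len(ref) - k + 1):
        index[ref[j:j + k]].append(j)
    anchors = []
    for i in range(len(read) - k + 1):
        for j in index.get(read[i:i + k], ()):
            anchors.append((i, j))
    return sorted(anchors)


def chain_with_distance_constraint(anchors, k, max_gap_diff):
    """Longest chain of non-overlapping anchors, increasing in both
    coordinates, whose consecutive read and reference offsets differ by
    at most max_gap_diff (hypothetical parameter). Simple O(n^2) DP."""
    n = len(anchors)
    if n == 0:
        return []
    best = [1] * n    # best[b]: length of the longest chain ending at anchor b
    prev = [-1] * n   # back-pointers used to recover the chain
    for b in range(n):
        ib, jb = anchors[b]
        for a in range(b):
            ia, ja = anchors[a]
            dq, dr = ib - ia, jb - ja
            if dq >= k and dr >= k and abs(dq - dr) <= max_gap_diff:
                if best[a] + 1 > best[b]:
                    best[b] = best[a] + 1
                    prev[b] = a
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]


# Toy example: the reference carries two extra bases between the two matches.
read = "ACGTTGCA"
ref = "GGACGTCTTGCAGG"
anchors = kmer_anchors(read, ref, 4)
print(chain_with_distance_constraint(anchors, 4, 2))  # [(0, 2), (4, 8)]
```

In this toy case, tightening `max_gap_diff` to 1 rejects the second match because the reference gap exceeds the read gap by 2. The recursive idea mentioned in the abstract could plausibly be layered on top by re-running the same procedure with a smaller k inside the gaps between chained matches, but that step is not shown here.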
Pages: 789-803
Number of pages: 15