A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases

Cited by: 33
Authors
Jain, Chirag [1, 2]
Dilthey, Alexander [2]
Koren, Sergey [2]
Aluru, Srinivas [1]
Phillippy, Adam M. [2]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] NIH, Bethesda, MD 20894 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA);
Keywords
Long read mapping; Jaccard; MinHash; Winnowing; Minimizers; Sketching; Nanopore; PacBio; GENOME; ALIGNMENT; GENERATION; TIME;
DOI
10.1007/978-3-319-56970-3_5
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥ 5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and > 60,000 genomes.
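
The two building blocks named in the abstract, minimizer sampling and MinHash-based Jaccard estimation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of BLAKE2b as the k-mer hash, and the parameters `k`, `w`, and `s` are assumptions made for illustration, and the sketch omits positions, reverse complements, and the conversion from Jaccard similarity to an identity estimate.

```python
import hashlib


def kmer_hash(kmer):
    # Stable 64-bit hash of a k-mer (hash choice is an assumption, not from the paper).
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Return the set of minimizer hashes: the smallest k-mer hash
    in every window of w consecutive k-mers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < w:
        # Sequence shorter than one full window: treat it as a single window.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def minhash_jaccard(a, b, s):
    """Bottom-s MinHash estimate of the Jaccard similarity of hash sets a and b:
    the fraction of the s smallest hashes in (a | b) that occur in both sets."""
    union_sketch = sorted(a | b)[:s]
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in a and h in b)
    return shared / len(union_sketch)
```

For example, calling `minhash_jaccard(minimizers(read, 5, 4), minimizers(region, 5, 4), 10)` on a hypothetical read and a candidate reference region gives a value between 0 and 1; higher values indicate more shared minimizers. Sampling minimizers first keeps the sets small, which is what makes this kind of comparison cheap relative to base-level alignment.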
Pages: 66-81
Page count: 16
Related papers
50 in total
  • [21] A fast algorithm for mining sequential patterns from large databases
    Chen, N
    Chen, A
    Zhou, LX
    Liu, L
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2001, 16 (04) : 359 - 370
  • [22] Fast algorithm to discovering sequential patterns from large databases
    Hu Huirong
    PROCEEDINGS OF THE 24TH CHINESE CONTROL CONFERENCE, VOLS 1 AND 2, 2005, : 1352 - 1355
  • [23] A fast descriptor matching algorithm for exhaustive search in large databases
    Song, BC
    Kim, MJ
    Ra, JB
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2001, PROCEEDINGS, 2001, 2195 : 732 - 739
  • [24] Long Reads and the Return of Reference Genomes
    Boles, T. Chris
    Korlach, Jonas
    AMERICAN LABORATORY, 2016, 48 (02) : 42 - 43
  • [25] An Efficient Parallel Sketch-based Algorithm for Mapping Long Reads to Contigs
    Rahman, Tazin
    Bhowmik, Oieswarya
    Kalyanaraman, Ananth
    2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW, 2023, : 157 - 166
  • [26] A spectral algorithm for fast de novo layout of uncorrected long nanopore reads
    Recanati, Antoine
    Bruls, Thomas
    d'Aspremont, Alexandre
    BIOINFORMATICS, 2017, 33 (20) : 3188 - 3194
  • [27] Fast, Ungapped Reads Mapping Using Squid
    Riccardi, Christopher
    Innocenti, Gabriel
    Fondi, Marco
    Bacci, Giovanni
    INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH, 2022, 19 (09)
  • [28] Fast and accurate mapping of Complete Genomics reads
    Lee, Donghyuk
    Hormozdiari, Farhad
    Xin, Hongyi
    Hach, Faraz
    Mutlu, Onur
    Alkan, Can
    METHODS, 2015, 79-80 : 3 - 10
  • [29] Search for approximate matches in large databases
    Fink, E
    Goldstein, A
    Hayes, P
    Carbonell, JG
    2004 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN & CYBERNETICS, VOLS 1-7, 2004, : 1431 - 1435
  • [30] A Fast and Efficient Algorithm for Mapping Short Sequences to a Reference Genome
    Antoniou, Pavlos
    Iliopoulos, Costas S.
    Mouchard, Laurent
    Pissis, Solon P.
    ADVANCES IN COMPUTATIONAL BIOLOGY, 2010, 680 : 399 - 403