A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases

Cited by: 33
Authors
Jain, Chirag [1, 2]
Dilthey, Alexander [2]
Koren, Sergey [2]
Aluru, Srinivas [1]
Phillippy, Adam M. [2]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] NIH, Bethesda, MD 20894 USA
Source
RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2017 | 2017 / Vol. 10229
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Long read mapping; Jaccard; MinHash; Winnowing; Minimizers; Sketching; Nanopore; PacBio; GENOME; ALIGNMENT; GENERATION; TIME;
DOI
10.1007/978-3-319-56970-3_5
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
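The abstract names two building blocks, minimizer sampling and MinHash-based identity estimation. The following is a minimal sketch of how they might fit together, not the authors' implementation. The function names, the BLAKE2b hash, the bottom-s sketch, and the parameters `k`, `w`, and `s` are all assumptions chosen for illustration. The Jaccard-to-identity conversion uses the Mash-style distance formula, which is likewise an assumption about the error model rather than something stated in this record.

```python
import hashlib
import math


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is hypothetical)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Return the set of minimizer hashes: the smallest k-mer hash in
    every window of w consecutive k-mers (naive O(n*w) winnowing)."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def minhash_jaccard(a, b, s):
    """Bottom-s MinHash estimate of the Jaccard similarity of two hash sets."""
    sketch_a = set(sorted(a)[:s])
    sketch_b = set(sorted(b)[:s])
    # The s smallest hashes of the union form the sketch of the union.
    union_sketch = sorted(sketch_a | sketch_b)[:s]
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in sketch_a and h in sketch_b)
    return shared / len(union_sketch)


def jaccard_to_identity(j, k):
    """Convert a Jaccard estimate to an identity estimate using the
    Mash-style distance d = -ln(2j / (1 + j)) / k (assumed error model)."""
    if j <= 0.0:
        return 0.0
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - d)


def estimate_identity(read, region, k=16, w=5, s=200):
    """Estimate identity between a read and a candidate reference region."""
    j = minhash_jaccard(minimizers(read, k, w), minimizers(region, k, w), s)
    return jaccard_to_identity(j, k)
```

In this sketch, a read would be compared against each candidate reference window with `estimate_identity`, and windows whose estimate falls below a chosen identity threshold would be discarded. A practical mapper would index reference minimizers by position instead of recomputing them per window.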
Pages: 66-81
Number of pages: 16