A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases

Cited by: 33
Authors
Jain, Chirag [1, 2]
Dilthey, Alexander [2]
Koren, Sergey [2]
Aluru, Srinivas [1]
Phillippy, Adam M. [2]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] NIH, Bethesda, MD 20894 USA
Source
RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2017 | 2017 / Vol. 10229
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Long read mapping; Jaccard; MinHash; Winnowing; Minimizers; Sketching; Nanopore; PacBio; GENOME; ALIGNMENT; GENERATION; TIME;
DOI
10.1007/978-3-319-56970-3_5
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and a recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
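The two ingredients the abstract names, minimizer sampling by winnowing and MinHash-based Jaccard estimation, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the BLAKE2b-based hash, the bottom-s sketch, and the Mash-style conversion from Jaccard to identity (which assumes a Poisson model of sequencing errors) are all assumptions chosen for illustration.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Return a stable 64-bit hash for every k-mer of seq, in sequence order."""
    return [
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    ]


def minimizers(seq, k, w):
    """Winnowing: keep the smallest k-mer hash in each window of w consecutive k-mers.

    Returns a sorted list of (k-mer position, hash) pairs; ties go to the leftmost k-mer.
    """
    hashes = kmer_hashes(seq, k)
    picked = set()
    for start in range(max(len(hashes) - w + 1, 0)):
        window = hashes[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


def bottom_sketch(hashes, s):
    """Bottom-s MinHash sketch: the s smallest distinct hash values."""
    return sorted(set(hashes))[:s]


def jaccard_estimate(hashes_a, hashes_b, s):
    """Estimate the Jaccard similarity of two hash sets from their bottom-s sketches."""
    sketch_a = set(bottom_sketch(hashes_a, s))
    sketch_b = set(bottom_sketch(hashes_b, s))
    union_sketch = bottom_sketch(sketch_a | sketch_b, s)
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in sketch_a and h in sketch_b)
    return shared / len(union_sketch)


def identity_from_jaccard(j, k):
    """Mash-style conversion (assumed here): identity = 1 + ln(2j / (1 + j)) / k."""
    if j <= 0.0:
        return 0.0
    distance = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - distance)
```

In a mapping setting, one would presumably pass the minimizer hashes of a read and of a candidate reference window to `jaccard_estimate`, for example `[h for _, h in minimizers(read, k, w)]`, and then convert the result with `identity_from_jaccard`; the values of `k`, `w`, and `s` are left to the caller here.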
Pages: 66-81
Page count: 16