A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases

Cited by: 33

Authors:

Jain, Chirag [1,2]

Dilthey, Alexander [2]

Koren, Sergey [2]

Aluru, Srinivas [1]

Phillippy, Adam M. [2]

Affiliations:

[1] Georgia Inst Technol, Atlanta, GA 30332 USA

[2] NIH, Bethesda, MD 20894 USA

Source:

RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2017 | 2017 / Vol. 10229

Funding:

National Science Foundation (USA); National Institutes of Health (USA)

Keywords:

Long read mapping; Jaccard; MinHash; Winnowing; Minimizers; Sketching; Nanopore; PacBio; GENOME; ALIGNMENT; GENERATION; TIME;

DOI:

10.1007/978-3-319-56970-3_5

Chinese Library Classification:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy but face limited scalability, while faster alignment-free methods typically sacrifice precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM, with a lower memory footprint and a recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each >= 5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
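The abstract names two building blocks, minimizer sampling (winnowing) and MinHash-based identity estimation. The following is a minimal illustrative sketch of how those two ideas can fit together, not the authors' implementation. All function names, the BLAKE2b hash, the bottom-s sketch estimator, and the Mash-style Jaccard-to-identity conversion are assumptions made for illustration; this record does not specify the paper's exact estimator or parameters.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (the hash choice is an assumption)."""
    return [
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    ]


def minimizers(seq, k, w):
    """Winnowing: keep the smallest hash in each window of w consecutive k-mers."""
    hashes = kmer_hashes(seq, k)
    picked = set()
    for start in range(max(len(hashes) - w + 1, 0)):
        picked.add(min(hashes[start:start + w]))
    return picked


def minhash_jaccard(a, b, s):
    """Estimate Jaccard similarity of hash sets a and b from bottom-s sketches."""
    sketch_a = set(sorted(a)[:s])
    sketch_b = set(sorted(b)[:s])
    # The s smallest hashes of the union are a uniform sample of the union.
    union_sketch = sorted(sketch_a | sketch_b)[:s]
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in sketch_a and h in sketch_b)
    return shared / len(union_sketch)


def identity_from_jaccard(j, k):
    """Assumed Mash-style conversion: identity = 1 + ln(2j / (1 + j)) / k, clamped to [0, 1]."""
    if j <= 0.0:
        return 0.0
    return min(1.0, max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k))
```

In such a scheme, a read and a candidate reference window would each be reduced to a minimizer set, and the estimated Jaccard similarity of the two sets would be converted into an approximate identity that can be compared against a user-chosen threshold.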


Pages: 66-81

Page count: 16
