Motivation: Many methods and applications for high-throughput sequence analysis share a fundamental need to associate information with words. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to address this challenge. Yet existing data structures are either unable to associate information with k-mers or are not lightweight enough.

Results: We present BLight, a static and exact data structure that associates unique identifiers with k-mers and determines their membership in a set without false positives, and that scales to huge k-mer sets at a low memory cost. The index combines an extremely compact representation with very fast queries. In addition, its construction is efficient and needs no additional memory. Our implementation indexes the k-mers of the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min, and the k-mers of the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides very high throughput: 1.4 million queries per second on a single CPU, or 16.1 million using 12 cores. Finally, we show how BLight can practically represent metagenomic and transcriptomic sequencing data, to highlight its wide range of applications.
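To make the interface described above concrete, the following is a minimal sketch of an exact associative k-mer index: each distinct k-mer receives a unique identifier, and a query for an absent k-mer returns nothing rather than a false positive. It is not BLight's method. It uses a plain Python dictionary, which is exactly the kind of memory-expensive exact index the abstract contrasts against, and the names `build_index` and `query` are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, Optional


def build_index(sequences: Iterable[str], k: int) -> Dict[str, int]:
    """Assign a unique identifier to every distinct k-mer in the input.

    Illustrative only: a hash map stores each k-mer in full, so it is far
    less compact than the structure described in the abstract.
    """
    index: Dict[str, int] = {}
    for seq in sequences:
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer not in index:
                # Identifiers are assigned in order of first appearance.
                index[kmer] = len(index)
    return index


def query(index: Dict[str, int], kmer: str) -> Optional[int]:
    """Return the k-mer's identifier, or None if it is not in the set.

    The lookup is exact, so an absent k-mer never yields an identifier.
    """
    return index.get(kmer)


if __name__ == "__main__":
    idx = build_index(["ACGTAC"], k=3)
    print(query(idx, "GTA"))  # identifier of a present k-mer
    print(query(idx, "TTT"))  # None: the k-mer is absent
```

The identifiers returned by such an index can then be used as offsets into a separate array holding whatever information is to be associated with each k-mer, such as counts or colors.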