BLight: efficient exact associative structure for k-mers

Cited by: 15

Authors:

Marchet, Camille [1]

Kerbiriou, Mael [1]

Limasset, Antoine [1]

Affiliations:

[1] Univ Lille, CRIStAL CNRS, UMR 9189, F-59000 Lille, France

Source:

BIOINFORMATICS | 2021 / Volume 37 / Issue 18

Keywords:

ALGORITHM; GENOME;

DOI:

10.1093/bioinformatics/btab217

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Motivation: A plethora of methods and applications share the fundamental need to associate information with words for high-throughput sequence analysis. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to address this challenge. Yet, existing data structures are either unable to associate information with k-mers or are not lightweight enough.

Results: We present BLight, a static and exact data structure that associates unique identifiers with k-mers and determines their membership in a set without false positives, and that scales to huge k-mer sets at a low memory cost. This index combines an extremely compact representation with very fast queries. Moreover, its construction is efficient and needs no additional memory. Our implementation indexes the k-mers of the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min, and the k-mers of the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides very high throughput: 1.4 million queries per second on a single CPU, or 16.1 million using 12 cores. Finally, we show how BLight can practically represent metagenomic and transcriptomic sequencing data, highlighting its wide range of applications.

Pages: 2858-2865

Page count: 8
