BLight: efficient exact associative structure for k-mers

Cited by: 15

Authors:

Marchet, Camille [1]

Kerbiriou, Mael [1]

Limasset, Antoine [1]

Affiliations:

[1] Univ Lille, CRIStAL CNRS, UMR 9189, F-59000 Lille, France

Source:

BIOINFORMATICS | 2021 / Volume 37 / Issue 18

Keywords:

ALGORITHM; GENOME;

DOI:

10.1093/bioinformatics/btab217

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Motivation: A plethora of methods and applications for high-throughput sequence analysis share the fundamental need to associate information with words. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to address this challenge. Yet, existing data structures are either unable to associate information with k-mers or are not lightweight enough. Results: We present BLight, a static and exact data structure that associates unique identifiers with k-mers and determines their membership in a set without false positives, and that scales to huge k-mer sets at a low memory cost. This index combines an extremely compact representation with very fast queries. Besides, its construction is efficient and needs no additional memory. Our implementation indexes the k-mers from the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min and the k-mers from the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides a very high throughput: 1.4 million queries per second on a single CPU or 16.1 million using 12 cores. Finally, we also show how BLight can represent metagenomic and transcriptomic sequencing data in practice, highlighting its wide range of applications.

Pages: 2858-2865

Number of pages: 8