BLight: efficient exact associative structure for k-mers

Cited by: 15
Authors
Marchet, Camille [1]
Kerbiriou, Mael [1]
Limasset, Antoine [1]
Affiliations
[1] Univ Lille, CRIStAL CNRS, UMR 9189, F-59000 Lille, France
Keywords
ALGORITHM; GENOME
DOI
10.1093/bioinformatics/btab217
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Many methods and applications in high-throughput sequence analysis share a fundamental need to associate information with words. Doing so for billions of k-mers is commonly a scalability problem, because exact associative indexes can be memory expensive. Recent works exploit the overlaps between k-mers to address this challenge. Yet existing data structures are either unable to associate information with k-mers or are not lightweight enough.
Results: We present BLight, a static and exact data structure that associates unique identifiers with k-mers and determines their membership in a set without false positives, and that scales to huge k-mer sets at a low memory cost. The index combines an extremely compact representation with very fast queries. In addition, its construction is efficient and needs no additional memory. Our implementation indexes the k-mers of the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min, and the k-mers of the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides very high throughput: 1.4 million queries per second on a single CPU, or 16.1 million using 12 cores. Finally, we show how BLight can represent metagenomic and transcriptomic sequencing data in practice, to highlight its wide range of applications.
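To make the interface described in the abstract concrete, the following is a minimal sketch of an exact associative k-mer index: each distinct k-mer receives a unique identifier, and a query either returns that identifier or reports absence, with no false positives. It is not BLight's implementation. It uses a plain Python dictionary, which is far less compact than the structure the abstract describes. The function names (`canonical`, `build_index`, `query`, `bits_per_kmer`) are hypothetical, and treating a k-mer and its reverse complement as the same key is an assumption borrowed from common practice, not something stated in the abstract.

```python
from typing import Dict, Iterable, Optional

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Assumption: strand-agnostic (canonical) k-mers, a common convention.
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_index(sequences: Iterable[str], k: int) -> Dict[str, int]:
    """Assign a unique identifier (0, 1, 2, ...) to every distinct canonical k-mer."""
    index: Dict[str, int] = {}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            key = canonical(seq[i:i + k])
            if key not in index:
                index[key] = len(index)  # next unused identifier
    return index


def query(index: Dict[str, int], kmer: str) -> Optional[int]:
    """Return the identifier of a k-mer, or None if it is not in the set (exact)."""
    return index.get(canonical(kmer))


def bits_per_kmer(total_bytes: float, n_kmers: int) -> float:
    """Memory cost per indexed k-mer, in bits."""
    return total_bytes * 8 / n_kmers


if __name__ == "__main__":
    idx = build_index(["ACGTAC"], k=3)
    print(query(idx, "CGT"))  # identifier shared with its reverse complement "ACG"
    print(query(idx, "AAA"))  # None: absent k-mers are never reported as present
```

The identifier returned by `query` can serve as an offset into any external array (counts, colors, annotations), which is the sense in which such an index "associates information" with k-mers. `bits_per_kmer` shows how a figure such as "23 bits per k-mer" relates total index size to the number of indexed k-mers.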
Pages: 2858-2865
Page count: 8