BLight: efficient exact associative structure for k-mers

Cited by: 15
Authors
Marchet, Camille [1]
Kerbiriou, Mael [1]
Limasset, Antoine [1]
Affiliations
[1] Univ Lille, CRIStAL CNRS, UMR 9189, F-59000 Lille, France
Keywords
ALGORITHM; GENOME
DOI
10.1093/bioinformatics/btab217
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Many methods and applications in high-throughput sequence analysis share the fundamental need to associate information with words. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to address this challenge. Yet existing data structures are either unable to associate information with k-mers or are not lightweight enough.
Results: We present BLight, a static and exact data structure that associates unique identifiers with k-mers and determines their membership in a set without false positives, and that scales to huge k-mer sets at a low memory cost. This index combines an extremely compact representation with very fast queries. In addition, its construction is efficient and needs no additional memory. Our implementation indexes the k-mers of the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min, and the k-mers of the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides very high throughput: 1.4 million queries per second on a single CPU, or 16.1 million using 12 cores. Finally, we show how BLight can represent metagenomic and transcriptomic sequencing data in practice, highlighting its wide range of applications.
Pages: 2858-2865
Page count: 8
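
To make concrete what the abstract means by an exact associative structure (one that gives each k-mer of a set a unique identifier and answers membership without false positives), here is a minimal Python sketch. It is not BLight's method: it uses a plain hash table as a naive baseline, and the names `NaiveKmerIndex`, `kmers` and `lookup` are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator, Optional


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every length-k substring of seq, in order of appearance."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class NaiveKmerIndex:
    """Hypothetical baseline, NOT BLight: an exact k-mer -> identifier map.

    Built once from a fixed set of sequences and then only queried,
    which mirrors the "static" and "exact" properties in the abstract.
    """

    def __init__(self, sequences: Iterable[str], k: int) -> None:
        self.k = k
        self._ids: Dict[str, int] = {}
        for seq in sequences:
            for kmer in kmers(seq, k):
                # A k-mer seen for the first time gets the next free
                # identifier; a repeated k-mer keeps its existing one.
                self._ids.setdefault(kmer, len(self._ids))

    def __len__(self) -> int:
        return len(self._ids)

    def __contains__(self, kmer: str) -> bool:
        # Exact membership: a k-mer absent from the set is never reported.
        return kmer in self._ids

    def lookup(self, kmer: str) -> Optional[int]:
        """Return the identifier of kmer, or None if it is not indexed."""
        return self._ids.get(kmer)


if __name__ == "__main__":
    index = NaiveKmerIndex(["ACGTACG"], k=3)
    print(len(index), index.lookup("GTA"), "TTT" in index)
```

The identifiers returned by `lookup` can serve as positions in an external array that holds counts, colours or any other per-k-mer payload. A hash table like this one stores every k-mer in full, so it needs far more memory than the 23 to 27 bits per k-mer reported in the abstract.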