BLight: efficient exact associative structure for k-mers

Cited by: 15
Authors
Marchet, Camille [1]
Kerbiriou, Mael [1]
Limasset, Antoine [1]
Affiliation
[1] Univ Lille, CRIStAL CNRS, UMR 9189, F-59000 Lille, France
Keywords
ALGORITHM; GENOME
DOI
10.1093/bioinformatics/btab217
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Many methods and applications in high-throughput sequence analysis share the fundamental need to associate information with words. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to address this challenge. Yet existing data structures are either unable to associate information with k-mers or are not lightweight enough.

Results: We present BLight, a static and exact data structure that associates unique identifiers with k-mers and determines their membership in a set without false positives, and that scales to huge k-mer sets at a low memory cost. This index combines an extremely compact representation with very fast queries. In addition, its construction is efficient and needs no additional memory. Our implementation indexes the k-mers of the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min, and the k-mers of the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides very high throughput: 1.4 million queries per second on a single CPU, or 16.1 million using 12 cores. Finally, we also show how BLight can represent metagenomic and transcriptomic sequencing data in practice, to highlight its wide range of applications.
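To make the interface described in the abstract concrete, the following is a minimal sketch of a static, exact associative index over k-mers. It is not BLight's method: it simply stores the distinct k-mers in a sorted list and uses binary search, which gives exact membership and a unique identifier per k-mer but none of the compactness the abstract reports. The names `ToyKmerIndex`, `kmers`, and `identifier` are hypothetical and chosen for illustration only.

```python
from bisect import bisect_left


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class ToyKmerIndex:
    """Static, exact k-mer -> identifier map (illustrative only, NOT BLight).

    Assumption: the distinct k-mers are kept in a sorted Python list, so
    memory use is far higher than the bits-per-k-mer figures in the abstract.
    """

    def __init__(self, sequences, k):
        self.k = k
        # Built once and never modified afterwards: the index is static.
        self._sorted = sorted({km for s in sequences for km in kmers(s, k)})

    def identifier(self, kmer):
        """Return a unique identifier in [0, n) for kmer, or None if absent."""
        i = bisect_left(self._sorted, kmer)
        if i < len(self._sorted) and self._sorted[i] == kmer:
            return i
        return None  # exact answer: no false positives

    def __contains__(self, kmer):
        return self.identifier(kmer) is not None

    def __len__(self):
        return len(self._sorted)


index = ToyKmerIndex(["ACGTAC"], k=3)
values = [0] * len(index)            # external array addressed by identifier
values[index.identifier("GTA")] = 7  # associate information with a k-mer
```

Because each indexed k-mer receives a distinct identifier in `[0, n)`, any per-k-mer information can live in a plain array of length `n`, as the `values` list does here.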
Pages: 2858-2865
Number of pages: 8