MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata

Cited by: 3
Authors
Shokrof, Moustafa [1]
Brown, C. Titus [2]
Mansour, Tamer A. [2, 3]
Affiliations
[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
[2] Univ Calif Davis, Sch Vet Med, Dept Populat Hlth & Reprod, Davis, CA 95616 USA
[3] Univ Mansoura, Sch Med, Dept Clin Pathol, Mansoura, Egypt
Keywords
Compact hash tables; k-mers; De Bruijn graphs; NGS; Inexact data structures
DOI
10.1186/s12859-021-03996-x
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hash table, can efficiently store k-mers with a skewed distribution. Results: Here, we present the mixed-counters quotient filter (MQF), a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading insertion and query speed for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are approximately tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions. Conclusions: The MQF is a flexible and efficient data structure that extends our ability to work with high-throughput sequencing data.
Pages: 14
Source: BMC Bioinformatics, 22
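
The abstract describes quotient filters that store k-mers together with counts and labels. As a rough illustration of the underlying quotienting idea only, the sketch below splits a 2-bit-encoded k-mer into a quotient (the bucket index) and a remainder (the value kept in the bucket), and attaches a count and an optional label to each stored remainder. It is a toy under stated assumptions, not the MQF implementation: the class name ToyQuotientTable, its methods, and the dictionary-based storage are all hypothetical, and it does not reproduce the MQF's compact slot layout, its mixed counter encoding, or the buffered on-disk variant.

```python
from collections import defaultdict

# 2-bit encoding of DNA bases (an assumption made for this illustration).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


class ToyQuotientTable:
    """Hypothetical toy table: quotient selects a bucket, remainder is stored.

    Each stored remainder carries [count, label]. Real quotient filters pack
    remainders into a flat array of slots; a dict is used here for clarity.
    """

    def __init__(self, k, quotient_bits):
        self.remainder_bits = 2 * k - quotient_bits
        if quotient_bits <= 0 or self.remainder_bits <= 0:
            raise ValueError("quotient_bits must be between 1 and 2*k - 1")
        self.buckets = defaultdict(dict)  # quotient -> {remainder: [count, label]}

    def _split(self, kmer):
        fingerprint = encode_kmer(kmer)
        quotient = fingerprint >> self.remainder_bits
        remainder = fingerprint & ((1 << self.remainder_bits) - 1)
        return quotient, remainder

    def _find(self, kmer):
        quotient, remainder = self._split(kmer)
        # .get avoids creating empty buckets on lookups.
        return self.buckets.get(quotient, {}).get(remainder)

    def insert(self, kmer, count=1):
        quotient, remainder = self._split(kmer)
        entry = self.buckets[quotient].setdefault(remainder, [0, None])
        entry[0] += count

    def count(self, kmer):
        entry = self._find(kmer)
        return entry[0] if entry else 0

    def set_label(self, kmer, label):
        entry = self._find(kmer)
        if entry is None:
            raise KeyError(kmer)
        entry[1] = label

    def get_label(self, kmer):
        entry = self._find(kmer)
        return entry[1] if entry else None


def count_kmers(sequence, k, quotient_bits):
    """Insert every k-mer of `sequence` into a fresh toy table."""
    table = ToyQuotientTable(k, quotient_bits)
    for i in range(len(sequence) - k + 1):
        table.insert(sequence[i:i + k])
    return table


table = count_kmers("ACGTACGTAC", k=4, quotient_bits=4)
table.set_label("ACGT", "sample-1")
print(table.count("ACGT"), table.count("TACG"), table.get_label("ACGT"))
```

Because the 2-bit encoding is invertible, this toy stores k-mers exactly; a filter that hashes items to shorter fingerprints would instead accept a small false-positive rate in exchange for less space.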