MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata

Cited by: 3

Authors:

Shokrof, Moustafa [1]

Brown, C. Titus [2]

Mansour, Tamer A. [2,3]

Affiliations:

[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA

[2] Univ Calif Davis, Sch Vet Med, Dept Populat Hlth & Reprod, Davis, CA 95616 USA

[3] Univ Mansoura, Sch Med, Dept Clin Pathol, Mansoura, Egypt

Source:

BMC Bioinformatics | 2021 / Vol. 22 / Issue 01

Keywords:

Compact hash tables; k-mers; De Bruijn graphs; NGS; Inexact data structures

DOI:

10.1186/s12859-021-03996-x

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]

Discipline classification codes:

071010; 081704

Abstract:

Background: Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hash table, can efficiently store k-mers with a skewed distribution. Results: Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are approximately tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions. Conclusions: The MQF is a flexible and efficient data structure that extends our ability to work with high-throughput sequencing data.

Pages: 14